Sample records for statistical cluster analysis

  1. Spatiotemporal Analysis of the Ebola Hemorrhagic Fever in West Africa in 2014

    NASA Astrophysics Data System (ADS)

    Xu, M.; Cao, C. X.; Guo, H. F.

    2017-09-01

    Ebola hemorrhagic fever (EHF) is an acute hemorrhagic diseases caused by the Ebola virus, which is highly contagious. This paper aimed to explore the possible gathering area of EHF cases in West Africa in 2014, and identify endemic areas and their tendency by means of time-space analysis. We mapped distribution of EHF incidences and explored statistically significant space, time and space-time disease clusters. We utilized hotspot analysis to find the spatial clustering pattern on the basis of the actual outbreak cases. spatial-temporal cluster analysis is used to analyze the spatial or temporal distribution of agglomeration disease, examine whether its distribution is statistically significant. Local clusters were investigated using Kulldorff's scan statistic approach. The result reveals that the epidemic mainly gathered in the western part of Africa near north Atlantic with obvious regional distribution. For the current epidemic, we have found areas in high incidence of EVD by means of spatial cluster analysis.

  2. DICON: interactive visual analysis of multidimensional clusters.

    PubMed

    Cao, Nan; Gotz, David; Sun, Jimeng; Qu, Huamin

    2011-12-01

    Clustering as a fundamental data analysis technique has been widely used in many analytic applications. However, it is often difficult for users to understand and evaluate multidimensional clustering results, especially the quality of clusters and their semantics. For large and complex data, high-level statistical information about the clusters is often needed for users to evaluate cluster quality while a detailed display of multidimensional attributes of the data is necessary to understand the meaning of clusters. In this paper, we introduce DICON, an icon-based cluster visualization that embeds statistical information into a multi-attribute display to facilitate cluster interpretation, evaluation, and comparison. We design a treemap-like icon to represent a multidimensional cluster, and the quality of the cluster can be conveniently evaluated with the embedded statistical information. We further develop a novel layout algorithm which can generate similar icons for similar clusters, making comparisons of clusters easier. User interaction and clutter reduction are integrated into the system to help users more effectively analyze and refine clustering results for large datasets. We demonstrate the power of DICON through a user study and a case study in the healthcare domain. Our evaluation shows the benefits of the technique, especially in support of complex multidimensional cluster analysis. © 2011 IEEE

  3. Comparisons of non-Gaussian statistical models in DNA methylation analysis.

    PubMed

    Ma, Zhanyu; Teschendorff, Andrew E; Yu, Hong; Taghia, Jalil; Guo, Jun

    2014-06-16

    As a key regulatory mechanism of gene expression, DNA methylation patterns are widely altered in many complex genetic diseases, including cancer. DNA methylation is naturally quantified by bounded support data; therefore, it is non-Gaussian distributed. In order to capture such properties, we introduce some non-Gaussian statistical models to perform dimension reduction on DNA methylation data. Afterwards, non-Gaussian statistical model-based unsupervised clustering strategies are applied to cluster the data. Comparisons and analysis of different dimension reduction strategies and unsupervised clustering methods are presented. Experimental results show that the non-Gaussian statistical model-based methods are superior to the conventional Gaussian distribution-based method. They are meaningful tools for DNA methylation analysis. Moreover, among several non-Gaussian methods, the one that captures the bounded nature of DNA methylation data reveals the best clustering performance.

  4. Comparisons of Non-Gaussian Statistical Models in DNA Methylation Analysis

    PubMed Central

    Ma, Zhanyu; Teschendorff, Andrew E.; Yu, Hong; Taghia, Jalil; Guo, Jun

    2014-01-01

    As a key regulatory mechanism of gene expression, DNA methylation patterns are widely altered in many complex genetic diseases, including cancer. DNA methylation is naturally quantified by bounded support data; therefore, it is non-Gaussian distributed. In order to capture such properties, we introduce some non-Gaussian statistical models to perform dimension reduction on DNA methylation data. Afterwards, non-Gaussian statistical model-based unsupervised clustering strategies are applied to cluster the data. Comparisons and analysis of different dimension reduction strategies and unsupervised clustering methods are presented. Experimental results show that the non-Gaussian statistical model-based methods are superior to the conventional Gaussian distribution-based method. They are meaningful tools for DNA methylation analysis. Moreover, among several non-Gaussian methods, the one that captures the bounded nature of DNA methylation data reveals the best clustering performance. PMID:24937687

  5. ICAP - An Interactive Cluster Analysis Procedure for analyzing remotely sensed data

    NASA Technical Reports Server (NTRS)

    Wharton, S. W.; Turner, B. J.

    1981-01-01

    An Interactive Cluster Analysis Procedure (ICAP) was developed to derive classifier training statistics from remotely sensed data. ICAP differs from conventional clustering algorithms by allowing the analyst to optimize the cluster configuration by inspection, rather than by manipulating process parameters. Control of the clustering process alternates between the algorithm, which creates new centroids and forms clusters, and the analyst, who can evaluate and elect to modify the cluster structure. Clusters can be deleted, or lumped together pairwise, or new centroids can be added. A summary of the cluster statistics can be requested to facilitate cluster manipulation. The principal advantage of this approach is that it allows prior information (when available) to be used directly in the analysis, since the analyst interacts with ICAP in a straightforward manner, using basic terms with which he is more likely to be familiar. Results from testing ICAP showed that an informed use of ICAP can improve classification, as compared to an existing cluster analysis procedure.

  6. Symptom Clusters in Advanced Cancer Patients: An Empirical Comparison of Statistical Methods and the Impact on Quality of Life.

    PubMed

    Dong, Skye T; Costa, Daniel S J; Butow, Phyllis N; Lovell, Melanie R; Agar, Meera; Velikova, Galina; Teckle, Paulos; Tong, Allison; Tebbutt, Niall C; Clarke, Stephen J; van der Hoek, Kim; King, Madeleine T; Fayers, Peter M

    2016-01-01

    Symptom clusters in advanced cancer can influence patient outcomes. There is large heterogeneity in the methods used to identify symptom clusters. To investigate the consistency of symptom cluster composition in advanced cancer patients using different statistical methodologies for all patients across five primary cancer sites, and to examine which clusters predict functional status, a global assessment of health and global quality of life. Principal component analysis and exploratory factor analysis (with different rotation and factor selection methods) and hierarchical cluster analysis (with different linkage and similarity measures) were used on a data set of 1562 advanced cancer patients who completed the European Organization for the Research and Treatment of Cancer Quality of Life Questionnaire-Core 30. Four clusters consistently formed for many of the methods and cancer sites: tense-worry-irritable-depressed (emotional cluster), fatigue-pain, nausea-vomiting, and concentration-memory (cognitive cluster). The emotional cluster was a stronger predictor of overall quality of life than the other clusters. Fatigue-pain was a stronger predictor of overall health than the other clusters. The cognitive cluster and fatigue-pain predicted physical functioning, role functioning, and social functioning. The four identified symptom clusters were consistent across statistical methods and cancer types, although there were some noteworthy differences. Statistical derivation of symptom clusters is in need of greater methodological guidance. A psychosocial pathway in the management of symptom clusters may improve quality of life. Biological mechanisms underpinning symptom clusters need to be delineated by future research. A framework for evidence-based screening, assessment, treatment, and follow-up of symptom clusters in advanced cancer is essential. Copyright © 2016 American Academy of Hospice and Palliative Medicine. Published by Elsevier Inc. All rights reserved.

  7. Application of microarray analysis on computer cluster and cloud platforms.

    PubMed

    Bernau, C; Boulesteix, A-L; Knaus, J

    2013-01-01

    Analysis of recent high-dimensional biological data tends to be computationally intensive as many common approaches such as resampling or permutation tests require the basic statistical analysis to be repeated many times. A crucial advantage of these methods is that they can be easily parallelized due to the computational independence of the resampling or permutation iterations, which has induced many statistics departments to establish their own computer clusters. An alternative is to rent computing resources in the cloud, e.g. at Amazon Web Services. In this article we analyze whether a selection of statistical projects, recently implemented at our department, can be efficiently realized on these cloud resources. Moreover, we illustrate an opportunity to combine computer cluster and cloud resources. In order to compare the efficiency of computer cluster and cloud implementations and their respective parallelizations we use microarray analysis procedures and compare their runtimes on the different platforms. Amazon Web Services provide various instance types which meet the particular needs of the different statistical projects we analyzed in this paper. Moreover, the network capacity is sufficient and the parallelization is comparable in efficiency to standard computer cluster implementations. Our results suggest that many statistical projects can be efficiently realized on cloud resources. It is important to mention, however, that workflows can change substantially as a result of a shift from computer cluster to cloud computing.

  8. Multivariate statistical analysis: Principles and applications to coorbital streams of meteorite falls

    NASA Technical Reports Server (NTRS)

    Wolf, S. F.; Lipschutz, M. E.

    1993-01-01

    Multivariate statistical analysis techniques (linear discriminant analysis and logistic regression) can provide powerful discrimination tools which are generally unfamiliar to the planetary science community. Fall parameters were used to identify a group of 17 H chondrites (Cluster 1) that were part of a coorbital stream which intersected Earth's orbit in May, from 1855 - 1895, and can be distinguished from all other H chondrite falls. Using multivariate statistical techniques, it was demonstrated that a totally different criterion, labile trace element contents - hence thermal histories - or 13 Cluster 1 meteorites are distinguishable from those of 45 non-Cluster 1 H chondrites. Here, we focus upon the principles of multivariate statistical techniques and illustrate their application using non-meteoritic and meteoritic examples.

  9. Visualizing statistical significance of disease clusters using cartograms.

    PubMed

    Kronenfeld, Barry J; Wong, David W S

    2017-05-15

    Health officials and epidemiological researchers often use maps of disease rates to identify potential disease clusters. Because these maps exaggerate the prominence of low-density districts and hide potential clusters in urban (high-density) areas, many researchers have used density-equalizing maps (cartograms) as a basis for epidemiological mapping. However, we do not have existing guidelines for visual assessment of statistical uncertainty. To address this shortcoming, we develop techniques for visual determination of statistical significance of clusters spanning one or more districts on a cartogram. We developed the techniques within a geovisual analytics framework that does not rely on automated significance testing, and can therefore facilitate visual analysis to detect clusters that automated techniques might miss. On a cartogram of the at-risk population, the statistical significance of a disease cluster is determinate from the rate, area and shape of the cluster under standard hypothesis testing scenarios. We develop formulae to determine, for a given rate, the area required for statistical significance of a priori and a posteriori designated regions under certain test assumptions. Uniquely, our approach enables dynamic inference of aggregate regions formed by combining individual districts. The method is implemented in interactive tools that provide choropleth mapping, automated legend construction and dynamic search tools to facilitate cluster detection and assessment of the validity of tested assumptions. A case study of leukemia incidence analysis in California demonstrates the ability to visually distinguish between statistically significant and insignificant regions. The proposed geovisual analytics approach enables intuitive visual assessment of statistical significance of arbitrarily defined regions on a cartogram. Our research prompts a broader discussion of the role of geovisual exploratory analyses in disease mapping and the appropriate framework for visually assessing the statistical significance of spatial clusters.

  10. On the blind use of statistical tools in the analysis of globular cluster stars

    NASA Astrophysics Data System (ADS)

    D'Antona, Francesca; Caloi, Vittoria; Tailo, Marco

    2018-04-01

    As with most data analysis methods, the Bayesian method must be handled with care. We show that its application to determine stellar evolution parameters within globular clusters can lead to paradoxical results if used without the necessary precautions. This is a cautionary tale on the use of statistical tools for big data analysis.

  11. K-means cluster analysis of tourist destination in special region of Yogyakarta using spatial approach and social network analysis (a case study: post of @explorejogja instagram account in 2016)

    NASA Astrophysics Data System (ADS)

    Iswandhani, N.; Muhajir, M.

    2018-03-01

    This research was conducted in Department of Statistics Islamic University of Indonesia. The data used are primary data obtained by post @explorejogja instagram account from January until December 2016. In the @explorejogja instagram account found many tourist destinations that can be visited by tourists both in the country and abroad, Therefore it is necessary to form a cluster of existing tourist destinations based on the number of likes from user instagram assumed as the most popular. The purpose of this research is to know the most popular distribution of tourist spot, the cluster formation of tourist destinations, and central popularity of tourist destinations based on @explorejogja instagram account in 2016. Statistical analysis used is descriptive statistics, k-means clustering, and social network analysis. The results of this research were obtained the top 10 most popular destinations in Yogyakarta, map of html-based tourist destination distribution consisting of 121 tourist destination points, formed 3 clusters each consisting of cluster 1 with 52 destinations, cluster 2 with 9 destinations and cluster 3 with 60 destinations, and Central popularity of tourist destinations in the special region of Yogyakarta by district.

  12. Manipulating measurement scales in medical statistical analysis and data mining: A review of methodologies

    PubMed Central

    Marateb, Hamid Reza; Mansourian, Marjan; Adibi, Peyman; Farina, Dario

    2014-01-01

    Background: selecting the correct statistical test and data mining method depends highly on the measurement scale of data, type of variables, and purpose of the analysis. Different measurement scales are studied in details and statistical comparison, modeling, and data mining methods are studied based upon using several medical examples. We have presented two ordinal–variables clustering examples, as more challenging variable in analysis, using Wisconsin Breast Cancer Data (WBCD). Ordinal-to-Interval scale conversion example: a breast cancer database of nine 10-level ordinal variables for 683 patients was analyzed by two ordinal-scale clustering methods. The performance of the clustering methods was assessed by comparison with the gold standard groups of malignant and benign cases that had been identified by clinical tests. Results: the sensitivity and accuracy of the two clustering methods were 98% and 96%, respectively. Their specificity was comparable. Conclusion: by using appropriate clustering algorithm based on the measurement scale of the variables in the study, high performance is granted. Moreover, descriptive and inferential statistics in addition to modeling approach must be selected based on the scale of the variables. PMID:24672565

  13. Sulfur in Cometary Dust

    NASA Technical Reports Server (NTRS)

    Fomenkova, M. N.

    1997-01-01

    The computer-intensive project consisted of the analysis and synthesis of existing data on composition of comet Halley dust particles. The main objective was to obtain a complete inventory of sulfur containing compounds in the comet Halley dust by building upon the existing classification of organic and inorganic compounds and applying a variety of statistical techniques for cluster and cross-correlational analyses. A student hired for this project wrote and tested the software to perform cluster analysis. The following tasks were carried out: (1) selecting the data from existing database for the proposed project; (2) finding access to a standard library of statistical routines for cluster analysis; (3) reformatting the data as necessary for input into the library routines; (4) performing cluster analysis and constructing hierarchical cluster trees using three methods to define the proximity of clusters; (5) presenting the output results in different formats to facilitate the interpretation of the obtained cluster trees; (6) selecting groups of data points common for all three trees as stable clusters. We have also considered the chemistry of sulfur in inorganic compounds.

  14. cluster trials v. 1.0

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Mitchell, John; Castillo, Andrew

    2016-09-21

    This software contains a set of python modules – input, search, cluster, analysis; these modules read input files containing spatial coordinates and associated attributes which can be used to perform nearest neighbor search (spatial indexing via kdtree), cluster analysis/identification, and calculation of spatial statistics for analysis.

  15. Application of multivariable statistical techniques in plant-wide WWTP control strategies analysis.

    PubMed

    Flores, X; Comas, J; Roda, I R; Jiménez, L; Gernaey, K V

    2007-01-01

    The main objective of this paper is to present the application of selected multivariable statistical techniques in plant-wide wastewater treatment plant (WWTP) control strategies analysis. In this study, cluster analysis (CA), principal component analysis/factor analysis (PCA/FA) and discriminant analysis (DA) are applied to the evaluation matrix data set obtained by simulation of several control strategies applied to the plant-wide IWA Benchmark Simulation Model No 2 (BSM2). These techniques allow i) to determine natural groups or clusters of control strategies with a similar behaviour, ii) to find and interpret hidden, complex and casual relation features in the data set and iii) to identify important discriminant variables within the groups found by the cluster analysis. This study illustrates the usefulness of multivariable statistical techniques for both analysis and interpretation of the complex multicriteria data sets and allows an improved use of information for effective evaluation of control strategies.

  16. Text grouping in patent analysis using adaptive K-means clustering algorithm

    NASA Astrophysics Data System (ADS)

    Shanie, Tiara; Suprijadi, Jadi; Zulhanif

    2017-03-01

    Patents are one of the Intellectual Property. Analyzing patent is one requirement in knowing well the development of technology in each country and in the world now. This study uses the patent document coming from the Espacenet server about Green Tea. Patent documents related to the technology in the field of tea is still widespread, so it will be difficult for users to information retrieval (IR). Therefore, it is necessary efforts to categorize documents in a specific group of related terms contained therein. This study uses titles patent text data with the proposed Green Tea in Statistical Text Mining methods consists of two phases: data preparation and data analysis stage. The data preparation phase uses Text Mining methods and data analysis stage is done by statistics. Statistical analysis in this study using a cluster analysis algorithm, the Adaptive K-Means Clustering Algorithm. Results from this study showed that based on the maximum value Silhouette, generate 87 clusters associated fifteen terms therein that can be utilized in the process of information retrieval needs.

  17. Statistical analysis of activation and reaction energies with quasi-variational coupled-cluster theory

    NASA Astrophysics Data System (ADS)

    Black, Joshua A.; Knowles, Peter J.

    2018-06-01

    The performance of quasi-variational coupled-cluster (QV) theory applied to the calculation of activation and reaction energies has been investigated. A statistical analysis of results obtained for six different sets of reactions has been carried out, and the results have been compared to those from standard single-reference methods. In general, the QV methods lead to increased activation energies and larger absolute reaction energies compared to those obtained with traditional coupled-cluster theory.

  18. Water quality analysis of the Rapur area, Andhra Pradesh, South India using multivariate techniques

    NASA Astrophysics Data System (ADS)

    Nagaraju, A.; Sreedhar, Y.; Thejaswi, A.; Sayadi, Mohammad Hossein

    2017-10-01

    The groundwater samples from Rapur area were collected from different sites to evaluate the major ion chemistry. The large number of data can lead to difficulties in the integration, interpretation, and representation of the results. Two multivariate statistical methods, hierarchical cluster analysis (HCA) and factor analysis (FA), were applied to evaluate their usefulness to classify and identify geochemical processes controlling groundwater geochemistry. Four statistically significant clusters were obtained from 30 sampling stations. This has resulted two important clusters viz., cluster 1 (pH, Si, CO3, Mg, SO4, Ca, K, HCO3, alkalinity, Na, Na + K, Cl, and hardness) and cluster 2 (EC and TDS) which are released to the study area from different sources. The application of different multivariate statistical techniques, such as principal component analysis (PCA), assists in the interpretation of complex data matrices for a better understanding of water quality of a study area. From PCA, it is clear that the first factor (factor 1), accounted for 36.2% of the total variance, was high positive loading in EC, Mg, Cl, TDS, and hardness. Based on the PCA scores, four significant cluster groups of sampling locations were detected on the basis of similarity of their water quality.

  19. Cluster Analysis of Minnesota School Districts. A Research Report.

    ERIC Educational Resources Information Center

    Cleary, James

    The term "cluster analysis" refers to a set of statistical methods that classify entities with similar profiles of scores on a number of measured dimensions, in order to create empirically based typologies. A 1980 Minnesota House Research Report employed cluster analysis to categorize school districts according to their relative mixtures…

  20. Coagulation-fragmentation for a finite number of particles and application to telomere clustering in the yeast nucleus

    NASA Astrophysics Data System (ADS)

    Hozé, Nathanaël; Holcman, David

    2012-01-01

    We develop a coagulation-fragmentation model to study a system composed of a small number of stochastic objects moving in a confined domain, that can aggregate upon binding to form local clusters of arbitrary sizes. A cluster can also dissociate into two subclusters with a uniform probability. To study the statistics of clusters, we combine a Markov chain analysis with a partition number approach. Interestingly, we obtain explicit formulas for the size and the number of clusters in terms of hypergeometric functions. Finally, we apply our analysis to study the statistical physics of telomeres (ends of chromosomes) clustering in the yeast nucleus and show that the diffusion-coagulation-fragmentation process can predict the organization of telomeres.

  1. Statistical Significance for Hierarchical Clustering

    PubMed Central

    Kimes, Patrick K.; Liu, Yufeng; Hayes, D. Neil; Marron, J. S.

    2017-01-01

    Summary Cluster analysis has proved to be an invaluable tool for the exploratory and unsupervised analysis of high dimensional datasets. Among methods for clustering, hierarchical approaches have enjoyed substantial popularity in genomics and other fields for their ability to simultaneously uncover multiple layers of clustering structure. A critical and challenging question in cluster analysis is whether the identified clusters represent important underlying structure or are artifacts of natural sampling variation. Few approaches have been proposed for addressing this problem in the context of hierarchical clustering, for which the problem is further complicated by the natural tree structure of the partition, and the multiplicity of tests required to parse the layers of nested clusters. In this paper, we propose a Monte Carlo based approach for testing statistical significance in hierarchical clustering which addresses these issues. The approach is implemented as a sequential testing procedure guaranteeing control of the family-wise error rate. Theoretical justification is provided for our approach, and its power to detect true clustering structure is illustrated through several simulation studies and applications to two cancer gene expression datasets. PMID:28099990

  2. Clustering, randomness and regularity in cloud fields. I - Theoretical considerations. II - Cumulus cloud fields

    NASA Technical Reports Server (NTRS)

    Weger, R. C.; Lee, J.; Zhu, Tianri; Welch, R. M.

    1992-01-01

    The current controversy existing in reference to the regularity vs. clustering in cloud fields is examined by means of analysis and simulation studies based upon nearest-neighbor cumulative distribution statistics. It is shown that the Poisson representation of random point processes is superior to pseudorandom-number-generated models and that pseudorandom-number-generated models bias the observed nearest-neighbor statistics towards regularity. Interpretation of this nearest-neighbor statistics is discussed for many cases of superpositions of clustering, randomness, and regularity. A detailed analysis is carried out of cumulus cloud field spatial distributions based upon Landsat, AVHRR, and Skylab data, showing that, when both large and small clouds are included in the cloud field distributions, the cloud field always has a strong clustering signal.

  3. A Survey of Popular R Packages for Cluster Analysis

    ERIC Educational Resources Information Center

    Flynt, Abby; Dean, Nema

    2016-01-01

    Cluster analysis is a set of statistical methods for discovering new group/class structure when exploring data sets. This article reviews the following popular libraries/commands in the R software language for applying different types of cluster analysis: from the stats library, the kmeans, and hclust functions; the mclust library; the poLCA…

  4. Using Cluster Analysis for Data Mining in Educational Technology Research

    ERIC Educational Resources Information Center

    Antonenko, Pavlo D.; Toy, Serkan; Niederhauser, Dale S.

    2012-01-01

    Cluster analysis is a group of statistical methods that has great potential for analyzing the vast amounts of web server-log data to understand student learning from hyperlinked information resources. In this methodological paper we provide an introduction to cluster analysis for educational technology researchers and illustrate its use through…

  5. Variable Screening for Cluster Analysis.

    ERIC Educational Resources Information Center

    Donoghue, John R.

    Inclusion of irrelevant variables in a cluster analysis adversely affects subgroup recovery. This paper examines using moment-based statistics to screen variables; only variables that pass the screening are then used in clustering. Normal mixtures are analytically shown often to possess negative kurtosis. Two related measures, "m" and…

  6. Investigating Faculty Familiarity with Assessment Terminology by Applying Cluster Analysis to Interpret Survey Data

    ERIC Educational Resources Information Center

    Raker, Jeffrey R.; Holme, Thomas A.

    2014-01-01

    A cluster analysis was conducted with a set of survey data on chemistry faculty familiarity with 13 assessment terms. Cluster groupings suggest a high, middle, and low overall familiarity with the terminology and an independent high and low familiarity with terms related to fundamental statistics. The six resultant clusters were found to be…

  7. Clustered Stomates in "Begonia": An Exercise in Data Collection & Statistical Analysis of Biological Space

    ERIC Educational Resources Information Center

    Lau, Joann M.; Korn, Robert W.

    2007-01-01

    In this article, the authors present a laboratory exercise in data collection and statistical analysis in biological space using clustered stomates on leaves of "Begonia" plants. The exercise can be done in middle school classes by students making their own slides and seeing imprints of cells, or at the high school level through collecting data of…

  8. Coordinate based random effect size meta-analysis of neuroimaging studies.

    PubMed

    Tench, C R; Tanasescu, Radu; Constantinescu, C S; Auer, D P; Cottam, W J

    2017-06-01

    Low power in neuroimaging studies can make them difficult to interpret, and Coordinate based meta-analysis (CBMA) may go some way to mitigating this issue. CBMA has been used in many analyses to detect where published functional MRI or voxel-based morphometry studies testing similar hypotheses report significant summary results (coordinates) consistently. Only the reported coordinates and possibly t statistics are analysed, and statistical significance of clusters is determined by coordinate density. Here a method of performing coordinate based random effect size meta-analysis and meta-regression is introduced. The algorithm (ClusterZ) analyses both coordinates and reported t statistic or Z score, standardised by the number of subjects. Statistical significance is determined not by coordinate density, but by a random effects meta-analyses of reported effects performed cluster-wise using standard statistical methods and taking account of censoring inherent in the published summary results. Type 1 error control is achieved using the false cluster discovery rate (FCDR), which is based on the false discovery rate. This controls both the family wise error rate under the null hypothesis that coordinates are randomly drawn from a standard stereotaxic space, and the proportion of significant clusters that are expected under the null. Such control is necessary to avoid propagating and even amplifying the very issues motivating the meta-analysis in the first place. ClusterZ is demonstrated on both numerically simulated data and on real data from reports of grey matter loss in multiple sclerosis (MS) and syndromes suggestive of MS, and of painful stimulus in healthy controls. The software implementation is available to download and use freely. Copyright © 2017 Elsevier Inc. All rights reserved.

  9. [Cluster analysis applicability to fitness evaluation of cosmonauts on long-term missions of the International space station].

    PubMed

    Egorov, A D; Stepantsov, V I; Nosovskiĭ, A M; Shipov, A A

    2009-01-01

    Cluster analysis was applied to evaluate locomotion training (running and running intermingled with walking) of 13 cosmonauts on long-term ISS missions by the parameters of duration (min), distance (m) and intensity (km/h). Based on the results of analyses, the cosmonauts were distributed into three steady groups of 2, 5 and 6 persons. Distance and speed showed a statistical rise (p < 0.03) from group 1 to group 3. Duration of physical locomotion training was not statistically different in the groups (p = 0.125). Therefore, cluster analysis is an adequate method of evaluating fitness of cosmonauts on long-term missions.

  10. Finding Groups Using Model-Based Cluster Analysis: Heterogeneous Emotional Self-Regulatory Processes and Heavy Alcohol Use Risk

    ERIC Educational Resources Information Center

    Mun, Eun Young; von Eye, Alexander; Bates, Marsha E.; Vaschillo, Evgeny G.

    2008-01-01

    Model-based cluster analysis is a new clustering procedure to investigate population heterogeneity utilizing finite mixture multivariate normal densities. It is an inferentially based, statistically principled procedure that allows comparison of nonnested models using the Bayesian information criterion to compare multiple models and identify the…

  11. Constraining the mass–richness relationship of redMaPPer clusters with angular clustering

    DOE PAGES

    Baxter, Eric J.; Rozo, Eduardo; Jain, Bhuvnesh; ...

    2016-08-04

    The potential of using cluster clustering for calibrating the mass–richness relation of galaxy clusters has been recognized theoretically for over a decade. In this paper, we demonstrate the feasibility of this technique to achieve high-precision mass calibration using redMaPPer clusters in the Sloan Digital Sky Survey North Galactic Cap. By including cross-correlations between several richness bins in our analysis, we significantly improve the statistical precision of our mass constraints. The amplitude of the mass–richness relation is constrained to 7 per cent statistical precision by our analysis. However, the error budget is systematics dominated, reaching a 19 per cent total errormore » that is dominated by theoretical uncertainty in the bias–mass relation for dark matter haloes. We confirm the result from Miyatake et al. that the clustering amplitude of redMaPPer clusters depends on galaxy concentration as defined therein, and we provide additional evidence that this dependence cannot be sourced by mass dependences: some other effect must account for the observed variation in clustering amplitude with galaxy concentration. Assuming that the observed dependence of redMaPPer clustering on galaxy concentration is a form of assembly bias, we find that such effects introduce a systematic error on the amplitude of the mass–richness relation that is comparable to the error bar from statistical noise. Finally, the results presented here demonstrate the power of cluster clustering for mass calibration and cosmology provided the current theoretical systematics can be ameliorated.« less

  12. Quantification and statistical significance analysis of group separation in NMR-based metabonomics studies

    PubMed Central

    Goodpaster, Aaron M.; Kennedy, Michael A.

    2015-01-01

    Currently, no standard metrics are used to quantify cluster separation in PCA or PLS-DA scores plots for metabonomics studies or to determine if cluster separation is statistically significant. Lack of such measures makes it virtually impossible to compare independent or inter-laboratory studies and can lead to confusion in the metabonomics literature when authors putatively identify metabolites distinguishing classes of samples based on visual and qualitative inspection of scores plots that exhibit marginal separation. While previous papers have addressed quantification of cluster separation in PCA scores plots, none have advocated routine use of a quantitative measure of separation that is supported by a standard and rigorous assessment of whether or not the cluster separation is statistically significant. Here quantification and statistical significance of separation of group centroids in PCA and PLS-DA scores plots are considered. The Mahalanobis distance is used to quantify the distance between group centroids, and the two-sample Hotelling's T2 test is computed for the data, related to an F-statistic, and then an F-test is applied to determine if the cluster separation is statistically significant. We demonstrate the value of this approach using four datasets containing various degrees of separation, ranging from groups that had no apparent visual cluster separation to groups that had no visual cluster overlap. Widespread adoption of such concrete metrics to quantify and evaluate the statistical significance of PCA and PLS-DA cluster separation would help standardize reporting of metabonomics data. PMID:26246647

  13. Analysis of basic clustering algorithms for numerical estimation of statistical averages in biomolecules.

    PubMed

    Anandakrishnan, Ramu; Onufriev, Alexey

    2008-03-01

    In statistical mechanics, the equilibrium properties of a physical system of particles can be calculated as the statistical average over accessible microstates of the system. In general, these calculations are computationally intractable since they involve summations over an exponentially large number of microstates. Clustering algorithms are one of the methods used to numerically approximate these sums. The most basic clustering algorithms first sub-divide the system into a set of smaller subsets (clusters). Then, interactions between particles within each cluster are treated exactly, while all interactions between different clusters are ignored. These smaller clusters have far fewer microstates, making the summation over these microstates, tractable. These algorithms have been previously used for biomolecular computations, but remain relatively unexplored in this context. Presented here, is a theoretical analysis of the error and computational complexity for the two most basic clustering algorithms that were previously applied in the context of biomolecular electrostatics. We derive a tight, computationally inexpensive, error bound for the equilibrium state of a particle computed via these clustering algorithms. For some practical applications, it is the root mean square error, which can be significantly lower than the error bound, that may be more important. We how that there is a strong empirical relationship between error bound and root mean square error, suggesting that the error bound could be used as a computationally inexpensive metric for predicting the accuracy of clustering algorithms for practical applications. An example of error analysis for such an application-computation of average charge of ionizable amino-acids in proteins-is given, demonstrating that the clustering algorithm can be accurate enough for practical purposes.

  14. Ion induced electron emission statistics under Agm- cluster bombardment of Ag

    NASA Astrophysics Data System (ADS)

    Breuers, A.; Penning, R.; Wucher, A.

    2018-05-01

    The electron emission from a polycrystalline silver surface under bombardment with Agm- cluster ions (m = 1, 2, 3) is investigated in terms of ion induced kinetic excitation. The electron yield γ is determined directly by a current measurement method on the one hand and implicitly by the analysis of the electron emission statistics on the other hand. Successful measurements of the electron emission spectra ensure a deeper understanding of the ion induced kinetic electron emission process, with particular emphasis on the effect of the projectile cluster size to the yield as well as to emission statistics. The results allow a quantitative comparison to computer simulations performed for silver atoms and clusters impinging onto a silver surface.

  15. On the Distribution of Orbital Poles of Milky Way Satellites

    NASA Astrophysics Data System (ADS)

    Palma, Christopher; Majewski, Steven R.; Johnston, Kathryn V.

    2002-01-01

    In numerous studies of the outer Galactic halo some evidence for accretion has been found. If the outer halo did form in part or wholly through merger events, we might expect to find coherent streams of stars and globular clusters following orbits similar to those of their parent objects, which are assumed to be present or former Milky Way dwarf satellite galaxies. We present a study of this phenomenon by assessing the likelihood of potential descendant ``dynamical families'' in the outer halo. We conduct two analyses: one that involves a statistical analysis of the spatial distribution of all known Galactic dwarf satellite galaxies (DSGs) and globular clusters, and a second, more specific analysis of those globular clusters and DSGs for which full phase space dynamical data exist. In both cases our methodology is appropriate only to members of descendant dynamical families that retain nearly aligned orbital poles today. Since the Sagittarius dwarf (Sgr) is considered a paradigm for the type of merger/tidal interaction event for which we are searching, we also undertake a case study of the Sgr system and identify several globular clusters that may be members of its extended dynamical family. In our first analysis, the distribution of possible orbital poles for the entire sample of outer (Rgc>8 kpc) halo globular clusters is tested for statistically significant associations among globular clusters and DSGs. Our methodology for identifying possible associations is similar to that used by Lynden-Bell & Lynden-Bell, but we put the associations on a more statistical foundation. Moreover, we study the degree of possible dynamical clustering among various interesting ensembles of globular clusters and satellite galaxies. Among the ensembles studied, we find the globular cluster subpopulation with the highest statistical likelihood of association with one or more of the Galactic DSGs to be the distant, outer halo (Rgc>25 kpc), second-parameter globular clusters. The results of our orbital pole analysis are supported by the great circle cell count methodology of Johnston, Hernquist, & Bolte. The space motions of the clusters Pal 4, NGC 6229, NGC 7006, and Pyxis are predicted to be among those most likely to show the clusters to be following stream orbits, since these clusters are responsible for the majority of the statistical significance of the association between outer halo, second-parameter globular clusters and the Milky Way DSGs. In our second analysis, we study the orbits of the 41 globular clusters and six Milky Way-bound DSGs having measured proper motions to look for objects with both coplanar orbits and similar angular momenta. Unfortunately, the majority of globular clusters with measured proper motions are inner halo clusters that are less likely to retain memory of their original orbit. Although four potential globular cluster/DSG associations are found, we believe three of these associations involving inner halo clusters to be coincidental. While the present sample of objects with complete dynamical data is small and does not include many of the globular clusters that are more likely to have been captured by the Milky Way, the methodology we adopt will become increasingly powerful as more proper motions are measured for distant Galactic satellites and globular clusters, and especially as results from the Space Interferometry Mission (SIM) become available.

  16. Micro-heterogeneity versus clustering in binary mixtures of ethanol with water or alkanes.

    PubMed

    Požar, Martina; Lovrinčević, Bernarda; Zoranić, Larisa; Primorać, Tomislav; Sokolić, Franjo; Perera, Aurélien

    2016-08-24

    Ethanol is a hydrogen bonding liquid. When mixed in small concentrations with water or alkanes, it forms aggregate structures reminiscent of, respectively, the direct and inverse micellar aggregates found in emulsions, albeit at much smaller sizes. At higher concentrations, micro-heterogeneous mixing with segregated domains is found. We examine how different statistical methods, namely correlation function analysis, structure factor analysis and cluster distribution analysis, can describe efficiently these morphological changes in these mixtures. In particular, we explain how the neat alcohol pre-peak of the structure factor evolves into the domain pre-peak under mixing conditions, and how this evolution differs whether the co-solvent is water or alkane. This study clearly establishes the heuristic superiority of the correlation function/structure factor analysis to study the micro-heterogeneity, since cluster distribution analysis is insensitive to domain segregation. Correlation functions detect the domains, with a clear structure factor pre-peak signature, while the cluster techniques detect the cluster hierarchy within domains. The main conclusion is that, in micro-segregated mixtures, the domain structure is a more fundamental statistical entity than the underlying cluster structures. These findings could help better understand comparatively the radiation scattering experiments, which are sensitive to domains, versus the spectroscopy-NMR experiments, which are sensitive to clusters.

  17. Comparison of a non-stationary voxelation-corrected cluster-size test with TFCE for group-Level MRI inference.

    PubMed

    Li, Huanjie; Nickerson, Lisa D; Nichols, Thomas E; Gao, Jia-Hong

    2017-03-01

    Two powerful methods for statistical inference on MRI brain images have been proposed recently, a non-stationary voxelation-corrected cluster-size test (CST) based on random field theory and threshold-free cluster enhancement (TFCE) based on calculating the level of local support for a cluster, then using permutation testing for inference. Unlike other statistical approaches, these two methods do not rest on the assumptions of a uniform and high degree of spatial smoothness of the statistic image. Thus, they are strongly recommended for group-level fMRI analysis compared to other statistical methods. In this work, the non-stationary voxelation-corrected CST and TFCE methods for group-level analysis were evaluated for both stationary and non-stationary images under varying smoothness levels, degrees of freedom and signal to noise ratios. Our results suggest that, both methods provide adequate control for the number of voxel-wise statistical tests being performed during inference on fMRI data and they are both superior to current CSTs implemented in popular MRI data analysis software packages. However, TFCE is more sensitive and stable for group-level analysis of VBM data. Thus, the voxelation-corrected CST approach may confer some advantages by being computationally less demanding for fMRI data analysis than TFCE with permutation testing and by also being applicable for single-subject fMRI analyses, while the TFCE approach is advantageous for VBM data. Hum Brain Mapp 38:1269-1280, 2017. © 2016 Wiley Periodicals, Inc. © 2016 Wiley Periodicals, Inc.

  18. Temporal and spatial assessment of river surface water quality using multivariate statistical techniques: a study in Can Tho City, a Mekong Delta area, Vietnam.

    PubMed

    Phung, Dung; Huang, Cunrui; Rutherford, Shannon; Dwirahmadi, Febi; Chu, Cordia; Wang, Xiaoming; Nguyen, Minh; Nguyen, Nga Huy; Do, Cuong Manh; Nguyen, Trung Hieu; Dinh, Tuan Anh Diep

    2015-05-01

    The present study is an evaluation of temporal/spatial variations of surface water quality using multivariate statistical techniques, comprising cluster analysis (CA), principal component analysis (PCA), factor analysis (FA) and discriminant analysis (DA). Eleven water quality parameters were monitored at 38 different sites in Can Tho City, a Mekong Delta area of Vietnam from 2008 to 2012. Hierarchical cluster analysis grouped the 38 sampling sites into three clusters, representing mixed urban-rural areas, agricultural areas and industrial zone. FA/PCA resulted in three latent factors for the entire research location, three for cluster 1, four for cluster 2, and four for cluster 3 explaining 60, 60.2, 80.9, and 70% of the total variance in the respective water quality. The varifactors from FA indicated that the parameters responsible for water quality variations are related to erosion from disturbed land or inflow of effluent from sewage plants and industry, discharges from wastewater treatment plants and domestic wastewater, agricultural activities and industrial effluents, and contamination by sewage waste with faecal coliform bacteria through sewer and septic systems. Discriminant analysis (DA) revealed that nephelometric turbidity units (NTU), chemical oxygen demand (COD) and NH₃ are the discriminating parameters in space, affording 67% correct assignation in spatial analysis; pH and NO₂ are the discriminating parameters according to season, assigning approximately 60% of cases correctly. The findings suggest a possible revised sampling strategy that can reduce the number of sampling sites and the indicator parameters responsible for large variations in water quality. This study demonstrates the usefulness of multivariate statistical techniques for evaluation of temporal/spatial variations in water quality assessment and management.

  19. Comparison of Salmonella enteritidis phage types isolated from layers and humans in Belgium in 2005.

    PubMed

    Welby, Sarah; Imberechts, Hein; Riocreux, Flavien; Bertrand, Sophie; Dierick, Katelijne; Wildemauwe, Christa; Hooyberghs, Jozef; Van der Stede, Yves

    2011-08-01

    The aim of this study was to investigate the available results for Belgium of the European Union coordinated monitoring program (2004/665 EC) on Salmonella in layers in 2005, as well as the results of the monthly outbreak reports of Salmonella Enteritidis in humans in 2005 to identify a possible statistical significant trend in both populations. Separate descriptive statistics and univariate analysis were carried out and the parametric and/or non-parametric hypothesis tests were conducted. A time cluster analysis was performed for all Salmonella Enteritidis phage types (PTs) isolated. The proportions of each Salmonella Enteritidis PT in layers and in humans were compared and the monthly distribution of the most common PT, isolated in both populations, was evaluated. The time cluster analysis revealed significant clusters during the months May and June for layers and May, July, August, and September for humans. PT21, the most frequently isolated PT in both populations in 2005, seemed to be responsible of these significant clusters. PT4 was the second most frequently isolated PT. No significant difference was found for the monthly trend evolution of both PT in both populations based on parametric and non-parametric methods. A similar monthly trend of PT distribution in humans and layers during the year 2005 was observed. The time cluster analysis and the statistical significance testing confirmed these results. Moreover, the time cluster analysis showed significant clusters during the summer time and slightly delayed in time (humans after layers). These results suggest a common link between the prevalence of Salmonella Enteritidis in layers and the occurrence of the pathogen in humans. Phage typing was confirmed to be a useful tool for identifying temporal trends.

  20. ICAP: An Interactive Cluster Analysis Procedure for analyzing remotely sensed data. [to classify the radiance data to produce a thematic map

    NASA Technical Reports Server (NTRS)

    Wharton, S. W.

    1980-01-01

    An Interactive Cluster Analysis Procedure (ICAP) was developed to derive classifier training statistics from remotely sensed data. The algorithm interfaces the rapid numerical processing capacity of a computer with the human ability to integrate qualitative information. Control of the clustering process alternates between the algorithm, which creates new centroids and forms clusters and the analyst, who evaluate and elect to modify the cluster structure. Clusters can be deleted or lumped pairwise, or new centroids can be added. A summary of the cluster statistics can be requested to facilitate cluster manipulation. The ICAP was implemented in APL (A Programming Language), an interactive computer language. The flexibility of the algorithm was evaluated using data from different LANDSAT scenes to simulate two situations: one in which the analyst is assumed to have no prior knowledge about the data and wishes to have the clusters formed more or less automatically; and the other in which the analyst is assumed to have some knowledge about the data structure and wishes to use that information to closely supervise the clustering process. For comparison, an existing clustering method was also applied to the two data sets.

  1. Retrospective space-time cluster analysis of whooping cough, re-emergence in Barcelona, Spain, 2000-2011.

    PubMed

    Solano, Rubén; Gómez-Barroso, Diana; Simón, Fernando; Lafuente, Sarah; Simón, Pere; Rius, Cristina; Gorrindo, Pilar; Toledo, Diana; Caylà, Joan A

    2014-05-01

    A retrospective, space-time study of whooping cough cases reported to the Public Health Agency of Barcelona, Spain between the years 2000 and 2011 is presented. It is based on 633 individual whooping cough cases and the 2006 population census from the Spanish National Statistics Institute, stratified by age and sex at the census tract level. Cluster identification was attempted using space-time scan statistic assuming a Poisson distribution and restricting temporal extent to 7 days and spatial distance to 500 m. Statistical calculations were performed with Stata 11 and SatScan and mapping was performed with ArcGis 10.0. Only clusters showing statistical significance (P <0.05) were mapped. The most likely cluster identified included five census tracts located in three neighbourhoods in central Barcelona during the week from 17 to 23 August 2011. This cluster included five cases compared with the expected level of 0.0021 (relative risk = 2436, P <0.001). In addition, 11 secondary significant space-time clusters were detected with secondary clusters occurring at different times and localizations. Spatial statistics is felt to be useful by complementing epidemiological surveillance systems through visualizing excess in the number of cases in space and time and thus increase the possibility of identifying outbreaks not reported by the surveillance system.

  2. Genome-scale cluster analysis of replicated microarrays using shrinkage correlation coefficient.

    PubMed

    Yao, Jianchao; Chang, Chunqi; Salmi, Mari L; Hung, Yeung Sam; Loraine, Ann; Roux, Stanley J

    2008-06-18

    Currently, clustering with some form of correlation coefficient as the gene similarity metric has become a popular method for profiling genomic data. The Pearson correlation coefficient and the standard deviation (SD)-weighted correlation coefficient are the two most widely-used correlations as the similarity metrics in clustering microarray data. However, these two correlations are not optimal for analyzing replicated microarray data generated by most laboratories. An effective correlation coefficient is needed to provide statistically sufficient analysis of replicated microarray data. In this study, we describe a novel correlation coefficient, shrinkage correlation coefficient (SCC), that fully exploits the similarity between the replicated microarray experimental samples. The methodology considers both the number of replicates and the variance within each experimental group in clustering expression data, and provides a robust statistical estimation of the error of replicated microarray data. The value of SCC is revealed by its comparison with two other correlation coefficients that are currently the most widely-used (Pearson correlation coefficient and SD-weighted correlation coefficient) using statistical measures on both synthetic expression data as well as real gene expression data from Saccharomyces cerevisiae. Two leading clustering methods, hierarchical and k-means clustering were applied for the comparison. The comparison indicated that using SCC achieves better clustering performance. Applying SCC-based hierarchical clustering to the replicated microarray data obtained from germinating spores of the fern Ceratopteris richardii, we discovered two clusters of genes with shared expression patterns during spore germination. Functional analysis suggested that some of the genetic mechanisms that control germination in such diverse plant lineages as mosses and angiosperms are also conserved among ferns. This study shows that SCC is an alternative to the Pearson correlation coefficient and the SD-weighted correlation coefficient, and is particularly useful for clustering replicated microarray data. This computational approach should be generally useful for proteomic data or other high-throughput analysis methodology.

  3. Cluster Analysis in Nursing Research: An Introduction, Historical Perspective, and Future Directions.

    PubMed

    Dunn, Heather; Quinn, Laurie; Corbridge, Susan J; Eldeirawi, Kamal; Kapella, Mary; Collins, Eileen G

    2017-05-01

    The use of cluster analysis in the nursing literature is limited to the creation of classifications of homogeneous groups and the discovery of new relationships. As such, it is important to provide clarity regarding its use and potential. The purpose of this article is to provide an introduction to distance-based, partitioning-based, and model-based cluster analysis methods commonly utilized in the nursing literature, provide a brief historical overview on the use of cluster analysis in nursing literature, and provide suggestions for future research. An electronic search included three bibliographic databases, PubMed, CINAHL and Web of Science. Key terms were cluster analysis and nursing. The use of cluster analysis in the nursing literature is increasing and expanding. The increased use of cluster analysis in the nursing literature is positioning this statistical method to result in insights that have the potential to change clinical practice.

  4. Extracting Galaxy Cluster Gas Inhomogeneity from X-Ray Surface Brightness: A Statistical Approach and Application to Abell 3667

    NASA Astrophysics Data System (ADS)

    Kawahara, Hajime; Reese, Erik D.; Kitayama, Tetsu; Sasaki, Shin; Suto, Yasushi

    2008-11-01

    Our previous analysis indicates that small-scale fluctuations in the intracluster medium (ICM) from cosmological hydrodynamic simulations follow the lognormal probability density function. In order to test the lognormal nature of the ICM directly against X-ray observations of galaxy clusters, we develop a method of extracting statistical information about the three-dimensional properties of the fluctuations from the two-dimensional X-ray surface brightness. We first create a set of synthetic clusters with lognormal fluctuations around their mean profile given by spherical isothermal β-models, later considering polytropic temperature profiles as well. Performing mock observations of these synthetic clusters, we find that the resulting X-ray surface brightness fluctuations also follow the lognormal distribution fairly well. Systematic analysis of the synthetic clusters provides an empirical relation between the three-dimensional density fluctuations and the two-dimensional X-ray surface brightness. We analyze Chandra observations of the galaxy cluster Abell 3667, and find that its X-ray surface brightness fluctuations follow the lognormal distribution. While the lognormal model was originally motivated by cosmological hydrodynamic simulations, this is the first observational confirmation of the lognormal signature in a real cluster. Finally we check the synthetic cluster results against clusters from cosmological hydrodynamic simulations. As a result of the complex structure exhibited by simulated clusters, the empirical relation between the two- and three-dimensional fluctuation properties calibrated with synthetic clusters when applied to simulated clusters shows large scatter. Nevertheless we are able to reproduce the true value of the fluctuation amplitude of simulated clusters within a factor of 2 from their two-dimensional X-ray surface brightness alone. Our current methodology combined with existing observational data is useful in describing and inferring the statistical properties of the three-dimensional inhomogeneity in galaxy clusters.

  5. Comparative analysis on the selection of number of clusters in community detection

    NASA Astrophysics Data System (ADS)

    Kawamoto, Tatsuro; Kabashima, Yoshiyuki

    2018-02-01

    We conduct a comparative analysis on various estimates of the number of clusters in community detection. An exhaustive comparison requires testing of all possible combinations of frameworks, algorithms, and assessment criteria. In this paper we focus on the framework based on a stochastic block model, and investigate the performance of greedy algorithms, statistical inference, and spectral methods. For the assessment criteria, we consider modularity, map equation, Bethe free energy, prediction errors, and isolated eigenvalues. From the analysis, the tendency of overfit and underfit that the assessment criteria and algorithms have becomes apparent. In addition, we propose that the alluvial diagram is a suitable tool to visualize statistical inference results and can be useful to determine the number of clusters.

  6. Cluster analysis and subgrouping to investigate inter-individual variability to non-invasive brain stimulation: a systematic review.

    PubMed

    Pellegrini, Michael; Zoghi, Maryam; Jaberzadeh, Shapour

    2018-01-12

    Cluster analysis and other subgrouping techniques have risen in popularity in recent years in non-invasive brain stimulation research in the attempt to investigate the issue of inter-individual variability - the issue of why some individuals respond, as traditionally expected, to non-invasive brain stimulation protocols and others do not. Cluster analysis and subgrouping techniques have been used to categorise individuals, based on their response patterns, as responder or non-responders. There is, however, a lack of consensus and consistency on the most appropriate technique to use. This systematic review aimed to provide a systematic summary of the cluster analysis and subgrouping techniques used to date and suggest recommendations moving forward. Twenty studies were included that utilised subgrouping techniques, while seven of these additionally utilised cluster analysis techniques. The results of this systematic review appear to indicate that statistical cluster analysis techniques are effective in identifying subgroups of individuals based on response patterns to non-invasive brain stimulation. This systematic review also reports a lack of consensus amongst researchers on the most effective subgrouping technique and the criteria used to determine whether an individual is categorised as a responder or a non-responder. This systematic review provides a step-by-step guide to carrying out statistical cluster analyses and subgrouping techniques to provide a framework for analysis when developing further insights into the contributing factors of inter-individual variability in response to non-invasive brain stimulation.

  7. Descriptive Statistics and Cluster Analysis for Extreme Rainfall in Java Island

    NASA Astrophysics Data System (ADS)

    E Komalasari, K.; Pawitan, H.; Faqih, A.

    2017-03-01

    This study aims to describe regional pattern of extreme rainfall based on maximum daily rainfall for period 1983 to 2012 in Java Island. Descriptive statistics analysis was performed to obtain centralization, variation and distribution of maximum precipitation data. Mean and median are utilized to measure central tendency data while Inter Quartile Range (IQR) and standard deviation are utilized to measure variation of data. In addition, skewness and kurtosis used to obtain shape the distribution of rainfall data. Cluster analysis using squared euclidean distance and ward method is applied to perform regional grouping. Result of this study show that mean (average) of maximum daily rainfall in Java Region during period 1983-2012 is around 80-181mm with median between 75-160mm and standard deviation between 17 to 82. Cluster analysis produces four clusters and show that western area of Java tent to have a higher annual maxima of daily rainfall than northern area, and have more variety of annual maximum value.

  8. A method of using cluster analysis to study statistical dependence in multivariate data

    NASA Technical Reports Server (NTRS)

    Borucki, W. J.; Card, D. H.; Lyle, G. C.

    1975-01-01

    A technique is presented that uses both cluster analysis and a Monte Carlo significance test of clusters to discover associations between variables in multidimensional data. The method is applied to an example of a noisy function in three-dimensional space, to a sample from a mixture of three bivariate normal distributions, and to the well-known Fisher's Iris data.

  9. Multivariate Statistical Analysis: a tool for groundwater quality assessment in the hidrogeologic region of the Ring of Cenotes, Yucatan, Mexico.

    NASA Astrophysics Data System (ADS)

    Ye, M.; Pacheco Castro, R. B.; Pacheco Avila, J.; Cabrera Sansores, A.

    2014-12-01

    The karstic aquifer of Yucatan is a vulnerable and complex system. The first fifteen meters of this aquifer have been polluted, due to this the protection of this resource is important because is the only source of potable water of the entire State. Through the assessment of groundwater quality we can gain some knowledge about the main processes governing water chemistry as well as spatial patterns which are important to establish protection zones. In this work multivariate statistical techniques are used to assess the groundwater quality of the supply wells (30 to 40 meters deep) in the hidrogeologic region of the Ring of Cenotes, located in Yucatan, Mexico. Cluster analysis and principal component analysis are applied in groundwater chemistry data of the study area. Results of principal component analysis show that the main sources of variation in the data are due sea water intrusion and the interaction of the water with the carbonate rocks of the system and some pollution processes. The cluster analysis shows that the data can be divided in four clusters. The spatial distribution of the clusters seems to be random, but is consistent with sea water intrusion and pollution with nitrates. The overall results show that multivariate statistical analysis can be successfully applied in the groundwater quality assessment of this karstic aquifer.

  10. Cluster-level statistical inference in fMRI datasets: The unexpected behavior of random fields in high dimensions.

    PubMed

    Bansal, Ravi; Peterson, Bradley S

    2018-06-01

    Identifying regional effects of interest in MRI datasets usually entails testing a priori hypotheses across many thousands of brain voxels, requiring control for false positive findings in these multiple hypotheses testing. Recent studies have suggested that parametric statistical methods may have incorrectly modeled functional MRI data, thereby leading to higher false positive rates than their nominal rates. Nonparametric methods for statistical inference when conducting multiple statistical tests, in contrast, are thought to produce false positives at the nominal rate, which has thus led to the suggestion that previously reported studies should reanalyze their fMRI data using nonparametric tools. To understand better why parametric methods may yield excessive false positives, we assessed their performance when applied both to simulated datasets of 1D, 2D, and 3D Gaussian Random Fields (GRFs) and to 710 real-world, resting-state fMRI datasets. We showed that both the simulated 2D and 3D GRFs and the real-world data contain a small percentage (<6%) of very large clusters (on average 60 times larger than the average cluster size), which were not present in 1D GRFs. These unexpectedly large clusters were deemed statistically significant using parametric methods, leading to empirical familywise error rates (FWERs) as high as 65%: the high empirical FWERs were not a consequence of parametric methods failing to model spatial smoothness accurately, but rather of these very large clusters that are inherently present in smooth, high-dimensional random fields. In fact, when discounting these very large clusters, the empirical FWER for parametric methods was 3.24%. Furthermore, even an empirical FWER of 65% would yield on average less than one of those very large clusters in each brain-wide analysis. Nonparametric methods, in contrast, estimated distributions from those large clusters, and therefore, by construct rejected the large clusters as false positives at the nominal FWERs. Those rejected clusters were outlying values in the distribution of cluster size but cannot be distinguished from true positive findings without further analyses, including assessing whether fMRI signal in those regions correlates with other clinical, behavioral, or cognitive measures. Rejecting the large clusters, however, significantly reduced the statistical power of nonparametric methods in detecting true findings compared with parametric methods, which would have detected most true findings that are essential for making valid biological inferences in MRI data. Parametric analyses, in contrast, detected most true findings while generating relatively few false positives: on average, less than one of those very large clusters would be deemed a true finding in each brain-wide analysis. We therefore recommend the continued use of parametric methods that model nonstationary smoothness for cluster-level, familywise control of false positives, particularly when using a Cluster Defining Threshold of 2.5 or higher, and subsequently assessing rigorously the biological plausibility of the findings, even for large clusters. Finally, because nonparametric methods yielded a large reduction in statistical power to detect true positive findings, we conclude that the modest reduction in false positive findings that nonparametric analyses afford does not warrant a re-analysis of previously published fMRI studies using nonparametric techniques. Copyright © 2018 Elsevier Inc. All rights reserved.

  11. A General Framework for Power Analysis to Detect the Moderator Effects in Two- and Three-Level Cluster Randomized Trials

    ERIC Educational Resources Information Center

    Dong, Nianbo; Spybrook, Jessaca; Kelcey, Ben

    2016-01-01

    The purpose of this study is to propose a general framework for power analyses to detect the moderator effects in two- and three-level cluster randomized trials (CRTs). The study specifically aims to: (1) develop the statistical formulations for calculating statistical power, minimum detectable effect size (MDES) and its confidence interval to…

  12. Network Data: Statistical Theory and New Models

    DTIC Science & Technology

    2016-02-17

    SECURITY CLASSIFICATION OF: During this period of review, Bin Yu worked on many thrusts of high-dimensional statistical theory and methodologies. Her...research covered a wide range of topics in statistics including analysis and methods for spectral clustering for sparse and structured networks...2,7,8,21], sparse modeling (e.g. Lasso) [4,10,11,17,18,19], statistical guarantees for the EM algorithm [3], statistical analysis of algorithm leveraging

  13. The Equivalence of Three Statistical Packages for Performing Hierarchical Cluster Analysis

    ERIC Educational Resources Information Center

    Blashfield, Roger

    1977-01-01

    Three different software programs which contain hierarchical agglomerative cluster analysis procedures were shown to generate different solutions on the same data set using apparently the same options. The basis for the differences in the solutions was the formulae used to calculate Euclidean distance. (Author/JKS)

  14. Circulation Clusters--An Empirical Approach to Decentralization of Academic Libraries.

    ERIC Educational Resources Information Center

    McGrath, William E.

    1986-01-01

    Discusses the issue of centralization or decentralization of academic library collections, and describes a statistical analysis of book circulation at the University of Southwestern Louisiana that yielded subject area clusters as a compromise solution to the problem. Applications of the cluster model for all types of library catalogs are…

  15. A Dimensionally Reduced Clustering Methodology for Heterogeneous Occupational Medicine Data Mining.

    PubMed

    Saâdaoui, Foued; Bertrand, Pierre R; Boudet, Gil; Rouffiac, Karine; Dutheil, Frédéric; Chamoux, Alain

    2015-10-01

    Clustering is a set of techniques of the statistical learning aimed at finding structures of heterogeneous partitions grouping homogenous data called clusters. There are several fields in which clustering was successfully applied, such as medicine, biology, finance, economics, etc. In this paper, we introduce the notion of clustering in multifactorial data analysis problems. A case study is conducted for an occupational medicine problem with the purpose of analyzing patterns in a population of 813 individuals. To reduce the data set dimensionality, we base our approach on the Principal Component Analysis (PCA), which is the statistical tool most commonly used in factorial analysis. However, the problems in nature, especially in medicine, are often based on heterogeneous-type qualitative-quantitative measurements, whereas PCA only processes quantitative ones. Besides, qualitative data are originally unobservable quantitative responses that are usually binary-coded. Hence, we propose a new set of strategies allowing to simultaneously handle quantitative and qualitative data. The principle of this approach is to perform a projection of the qualitative variables on the subspaces spanned by quantitative ones. Subsequently, an optimal model is allocated to the resulting PCA-regressed subspaces.

  16. Aftershock identification problem via the nearest-neighbor analysis for marked point processes

    NASA Astrophysics Data System (ADS)

    Gabrielov, A.; Zaliapin, I.; Wong, H.; Keilis-Borok, V.

    2007-12-01

    The centennial observations on the world seismicity have revealed a wide variety of clustering phenomena that unfold in the space-time-energy domain and provide most reliable information about the earthquake dynamics. However, there is neither a unifying theory nor a convenient statistical apparatus that would naturally account for the different types of seismic clustering. In this talk we present a theoretical framework for nearest-neighbor analysis of marked processes and obtain new results on hierarchical approach to studying seismic clustering introduced by Baiesi and Paczuski (2004). Recall that under this approach one defines an asymmetric distance D in space-time-energy domain such that the nearest-neighbor spanning graph with respect to D becomes a time- oriented tree. We demonstrate how this approach can be used to detect earthquake clustering. We apply our analysis to the observed seismicity of California and synthetic catalogs from ETAS model and show that the earthquake clustering part is statistically different from the homogeneous part. This finding may serve as a basis for an objective aftershock identification procedure.

  17. Two worlds collide: Image analysis methods for quantifying structural variation in cluster molecular dynamics

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Steenbergen, K. G., E-mail: kgsteen@gmail.com; Gaston, N.

    2014-02-14

    Inspired by methods of remote sensing image analysis, we analyze structural variation in cluster molecular dynamics (MD) simulations through a unique application of the principal component analysis (PCA) and Pearson Correlation Coefficient (PCC). The PCA analysis characterizes the geometric shape of the cluster structure at each time step, yielding a detailed and quantitative measure of structural stability and variation at finite temperature. Our PCC analysis captures bond structure variation in MD, which can be used to both supplement the PCA analysis as well as compare bond patterns between different cluster sizes. Relying only on atomic position data, without requirement formore » a priori structural input, PCA and PCC can be used to analyze both classical and ab initio MD simulations for any cluster composition or electronic configuration. Taken together, these statistical tools represent powerful new techniques for quantitative structural characterization and isomer identification in cluster MD.« less

  18. Two worlds collide: image analysis methods for quantifying structural variation in cluster molecular dynamics.

    PubMed

    Steenbergen, K G; Gaston, N

    2014-02-14

    Inspired by methods of remote sensing image analysis, we analyze structural variation in cluster molecular dynamics (MD) simulations through a unique application of the principal component analysis (PCA) and Pearson Correlation Coefficient (PCC). The PCA analysis characterizes the geometric shape of the cluster structure at each time step, yielding a detailed and quantitative measure of structural stability and variation at finite temperature. Our PCC analysis captures bond structure variation in MD, which can be used to both supplement the PCA analysis as well as compare bond patterns between different cluster sizes. Relying only on atomic position data, without requirement for a priori structural input, PCA and PCC can be used to analyze both classical and ab initio MD simulations for any cluster composition or electronic configuration. Taken together, these statistical tools represent powerful new techniques for quantitative structural characterization and isomer identification in cluster MD.

  19. Spatial distribution and cluster analysis of retail drug shop characteristics and antimalarial behaviors as reported by private medicine retailers in western Kenya: informing future interventions.

    PubMed

    Rusk, Andria; Highfield, Linda; Wilkerson, J Michael; Harrell, Melissa; Obala, Andrew; Amick, Benjamin

    2016-02-19

    Efforts to improve malaria case management in sub-Saharan Africa have shifted focus to private antimalarial retailers to increase access to appropriate treatment. Demands to decrease intervention cost while increasing efficacy requires interventions tailored to geographic regions with demonstrated need. Cluster analysis presents an opportunity to meet this demand, but has not been applied to the retail sector or antimalarial retailer behaviors. This research conducted cluster analysis on medicine retailer behaviors in Kenya, to improve malaria case management and inform future interventions. Ninety-seven surveys were collected from medicine retailers working in the Webuye Health and Demographic Surveillance Site. Survey items included retailer training, education, antimalarial drug knowledge, recommending behavior, sales, and shop characteristics, and were analyzed using Kulldorff's spatial scan statistic. The Bernoulli purely spatial model for binomial data was used, comparing cases to controls. Statistical significance of found clusters was tested with a likelihood ratio test, using the null hypothesis of no clustering, and a p value based on 999 Monte Carlo simulations. The null hypothesis was rejected with p values of 0.05 or less. A statistically significant cluster of fewer than expected pharmacy-trained retailers was found (RR = .09, p = .001) when compared to the expected random distribution. Drug recommending behavior also yielded a statistically significant cluster, with fewer than expected retailers recommending the correct antimalarial medication to adults (RR = .018, p = .01), and fewer than expected shops selling that medication more often than outdated antimalarials when compared to random distribution (RR = 0.23, p = .007). All three of these clusters were co-located, overlapping in the northwest of the study area. Spatial clustering was found in the data. A concerning amount of correlation was found in one specific region in the study area where multiple behaviors converged in space, highlighting a prime target for interventions. These results also demonstrate the utility of applying geospatial methods in the study of medicine retailer behaviors, making the case for expanding this approach to other regions.

  20. Spatio-Temporal Analysis of Smear-Positive Tuberculosis in the Sidama Zone, Southern Ethiopia

    PubMed Central

    Dangisso, Mesay Hailu; Datiko, Daniel Gemechu; Lindtjørn, Bernt

    2015-01-01

    Background Tuberculosis (TB) is a disease of public health concern, with a varying distribution across settings depending on socio-economic status, HIV burden, availability and performance of the health system. Ethiopia is a country with a high burden of TB, with regional variations in TB case notification rates (CNRs). However, TB program reports are often compiled and reported at higher administrative units that do not show the burden at lower units, so there is limited information about the spatial distribution of the disease. We therefore aim to assess the spatial distribution and presence of the spatio-temporal clustering of the disease in different geographic settings over 10 years in the Sidama Zone in southern Ethiopia. Methods A retrospective space–time and spatial analysis were carried out at the kebele level (the lowest administrative unit within a district) to identify spatial and space-time clusters of smear-positive pulmonary TB (PTB). Scan statistics, Global Moran’s I, and Getis and Ordi (Gi*) statistics were all used to help analyze the spatial distribution and clusters of the disease across settings. Results A total of 22,545 smear-positive PTB cases notified over 10 years were used for spatial analysis. In a purely spatial analysis, we identified the most likely cluster of smear-positive PTB in 192 kebeles in eight districts (RR= 2, p<0.001), with 12,155 observed and 8,668 expected cases. The Gi* statistic also identified the clusters in the same areas, and the spatial clusters showed stability in most areas in each year during the study period. The space-time analysis also detected the most likely cluster in 193 kebeles in the same eight districts (RR= 1.92, p<0.001), with 7,584 observed and 4,738 expected cases in 2003-2012. Conclusion The study found variations in CNRs and significant spatio-temporal clusters of smear-positive PTB in the Sidama Zone. The findings can be used to guide TB control programs to devise effective TB control strategies for the geographic areas characterized by the highest CNRs. Further studies are required to understand the factors associated with clustering based on individual level locations and investigation of cases. PMID:26030162

  1. The smart cluster method. Adaptive earthquake cluster identification and analysis in strong seismic regions

    NASA Astrophysics Data System (ADS)

    Schaefer, Andreas M.; Daniell, James E.; Wenzel, Friedemann

    2017-07-01

    Earthquake clustering is an essential part of almost any statistical analysis of spatial and temporal properties of seismic activity. The nature of earthquake clusters and subsequent declustering of earthquake catalogues plays a crucial role in determining the magnitude-dependent earthquake return period and its respective spatial variation for probabilistic seismic hazard assessment. This study introduces the Smart Cluster Method (SCM), a new methodology to identify earthquake clusters, which uses an adaptive point process for spatio-temporal cluster identification. It utilises the magnitude-dependent spatio-temporal earthquake density to adjust the search properties, subsequently analyses the identified clusters to determine directional variation and adjusts its search space with respect to directional properties. In the case of rapid subsequent ruptures like the 1992 Landers sequence or the 2010-2011 Darfield-Christchurch sequence, a reclassification procedure is applied to disassemble subsequent ruptures using near-field searches, nearest neighbour classification and temporal splitting. The method is capable of identifying and classifying earthquake clusters in space and time. It has been tested and validated using earthquake data from California and New Zealand. A total of more than 1500 clusters have been found in both regions since 1980 with M m i n = 2.0. Utilising the knowledge of cluster classification, the method has been adjusted to provide an earthquake declustering algorithm, which has been compared to existing methods. Its performance is comparable to established methodologies. The analysis of earthquake clustering statistics lead to various new and updated correlation functions, e.g. for ratios between mainshock and strongest aftershock and general aftershock activity metrics.

  2. Ankle plantarflexion strength in rearfoot and forefoot runners: a novel clusteranalytic approach.

    PubMed

    Liebl, Dominik; Willwacher, Steffen; Hamill, Joseph; Brüggemann, Gert-Peter

    2014-06-01

    The purpose of the present study was to test for differences in ankle plantarflexion strengths of habitually rearfoot and forefoot runners. In order to approach this issue, we revisit the problem of classifying different footfall patterns in human runners. A dataset of 119 subjects running shod and barefoot (speed 3.5m/s) was analyzed. The footfall patterns were clustered by a novel statistical approach, which is motivated by advances in the statistical literature on functional data analysis. We explain the novel statistical approach in detail and compare it to the classically used strike index of Cavanagh and Lafortune (1980). The two groups found by the new cluster approach are well interpretable as a forefoot and a rearfoot footfall groups. The subsequent comparison study of the clustered subjects reveals that runners with a forefoot footfall pattern are capable of producing significantly higher joint moments in a maximum voluntary contraction (MVC) of their ankle plantarflexor muscles tendon units; difference in means: 0.28Nm/kg. This effect remains significant after controlling for an additional gender effect and for differences in training levels. Our analysis confirms the hypothesis that forefoot runners have a higher mean MVC plantarflexion strength than rearfoot runners. Furthermore, we demonstrate that our proposed stochastic cluster analysis provides a robust and useful framework for clustering foot strikes. Copyright © 2014 Elsevier B.V. All rights reserved.

  3. The composite sequential clustering technique for analysis of multispectral scanner data

    NASA Technical Reports Server (NTRS)

    Su, M. Y.

    1972-01-01

    The clustering technique consists of two parts: (1) a sequential statistical clustering which is essentially a sequential variance analysis, and (2) a generalized K-means clustering. In this composite clustering technique, the output of (1) is a set of initial clusters which are input to (2) for further improvement by an iterative scheme. This unsupervised composite technique was employed for automatic classification of two sets of remote multispectral earth resource observations. The classification accuracy by the unsupervised technique is found to be comparable to that by traditional supervised maximum likelihood classification techniques. The mathematical algorithms for the composite sequential clustering program and a detailed computer program description with job setup are given.

  4. Complex networks as a unified framework for descriptive analysis and predictive modeling in climate

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Steinhaeuser, Karsten J K; Chawla, Nitesh; Ganguly, Auroop R

    The analysis of climate data has relied heavily on hypothesis-driven statistical methods, while projections of future climate are based primarily on physics-based computational models. However, in recent years a wealth of new datasets has become available. Therefore, we take a more data-centric approach and propose a unified framework for studying climate, with an aim towards characterizing observed phenomena as well as discovering new knowledge in the climate domain. Specifically, we posit that complex networks are well-suited for both descriptive analysis and predictive modeling tasks. We show that the structural properties of climate networks have useful interpretation within the domain. Further,more » we extract clusters from these networks and demonstrate their predictive power as climate indices. Our experimental results establish that the network clusters are statistically significantly better predictors than clusters derived using a more traditional clustering approach. Using complex networks as data representation thus enables the unique opportunity for descriptive and predictive modeling to inform each other.« less

  5. Statistical analysis and handling of missing data in cluster randomized trials: a systematic review.

    PubMed

    Fiero, Mallorie H; Huang, Shuang; Oren, Eyal; Bell, Melanie L

    2016-02-09

    Cluster randomized trials (CRTs) randomize participants in groups, rather than as individuals and are key tools used to assess interventions in health research where treatment contamination is likely or if individual randomization is not feasible. Two potential major pitfalls exist regarding CRTs, namely handling missing data and not accounting for clustering in the primary analysis. The aim of this review was to evaluate approaches for handling missing data and statistical analysis with respect to the primary outcome in CRTs. We systematically searched for CRTs published between August 2013 and July 2014 using PubMed, Web of Science, and PsycINFO. For each trial, two independent reviewers assessed the extent of the missing data and method(s) used for handling missing data in the primary and sensitivity analyses. We evaluated the primary analysis and determined whether it was at the cluster or individual level. Of the 86 included CRTs, 80 (93%) trials reported some missing outcome data. Of those reporting missing data, the median percent of individuals with a missing outcome was 19% (range 0.5 to 90%). The most common way to handle missing data in the primary analysis was complete case analysis (44, 55%), whereas 18 (22%) used mixed models, six (8%) used single imputation, four (5%) used unweighted generalized estimating equations, and two (2%) used multiple imputation. Fourteen (16%) trials reported a sensitivity analysis for missing data, but most assumed the same missing data mechanism as in the primary analysis. Overall, 67 (78%) trials accounted for clustering in the primary analysis. High rates of missing outcome data are present in the majority of CRTs, yet handling missing data in practice remains suboptimal. Researchers and applied statisticians should carry out appropriate missing data methods, which are valid under plausible assumptions in order to increase statistical power in trials and reduce the possibility of bias. Sensitivity analysis should be performed, with weakened assumptions regarding the missing data mechanism to explore the robustness of results reported in the primary analysis.

  6. Use of multivariate statistics to identify unreliable data obtained using CASA.

    PubMed

    Martínez, Luis Becerril; Crispín, Rubén Huerta; Mendoza, Maximino Méndez; Gallegos, Oswaldo Hernández; Martínez, Andrés Aragón

    2013-06-01

    In order to identify unreliable data in a dataset of motility parameters obtained from a pilot study acquired by a veterinarian with experience in boar semen handling, but without experience in the operation of a computer assisted sperm analysis (CASA) system, a multivariate graphical and statistical analysis was performed. Sixteen boar semen samples were aliquoted then incubated with varying concentrations of progesterone from 0 to 3.33 µg/ml and analyzed in a CASA system. After standardization of the data, Chernoff faces were pictured for each measurement, and a principal component analysis (PCA) was used to reduce the dimensionality and pre-process the data before hierarchical clustering. The first twelve individual measurements showed abnormal features when Chernoff faces were drawn. PCA revealed that principal components 1 and 2 explained 63.08% of the variance in the dataset. Values of principal components for each individual measurement of semen samples were mapped to identify differences among treatment or among boars. Twelve individual measurements presented low values of principal component 1. Confidence ellipses on the map of principal components showed no statistically significant effects for treatment or boar. Hierarchical clustering realized on two first principal components produced three clusters. Cluster 1 contained evaluations of the two first samples in each treatment, each one of a different boar. With the exception of one individual measurement, all other measurements in cluster 1 were the same as observed in abnormal Chernoff faces. Unreliable data in cluster 1 are probably related to the operator inexperience with a CASA system. These findings could be used to objectively evaluate the skill level of an operator of a CASA system. This may be particularly useful in the quality control of semen analysis using CASA systems.

  7. The writer independent online handwriting recognition system frog on hand and cluster generative statistical dynamic time warping.

    PubMed

    Bahlmann, Claus; Burkhardt, Hans

    2004-03-01

    In this paper, we give a comprehensive description of our writer-independent online handwriting recognition system frog on hand. The focus of this work concerns the presentation of the classification/training approach, which we call cluster generative statistical dynamic time warping (CSDTW). CSDTW is a general, scalable, HMM-based method for variable-sized, sequential data that holistically combines cluster analysis and statistical sequence modeling. It can handle general classification problems that rely on this sequential type of data, e.g., speech recognition, genome processing, robotics, etc. Contrary to previous attempts, clustering and statistical sequence modeling are embedded in a single feature space and use a closely related distance measure. We show character recognition experiments of frog on hand using CSDTW on the UNIPEN online handwriting database. The recognition accuracy is significantly higher than reported results of other handwriting recognition systems. Finally, we describe the real-time implementation of frog on hand on a Linux Compaq iPAQ embedded device.

  8. Statistical analysis of atom probe data: detecting the early stages of solute clustering and/or co-segregation.

    PubMed

    Hyde, J M; Cerezo, A; Williams, T J

    2009-04-01

    Statistical analysis of atom probe data has improved dramatically in the last decade and it is now possible to determine the size, the number density and the composition of individual clusters or precipitates such as those formed in reactor pressure vessel (RPV) steels during irradiation. However, the characterisation of the onset of clustering or co-segregation is more difficult and has traditionally focused on the use of composition frequency distributions (for detecting clustering) and contingency tables (for detecting co-segregation). In this work, the authors investigate the possibility of directly examining the neighbourhood of each individual solute atom as a means of identifying the onset of solute clustering and/or co-segregation. The methodology involves comparing the mean observed composition around a particular type of solute with that expected from the overall composition of the material. The methodology has been applied to atom probe data obtained from several irradiated RPV steels. The results show that the new approach is more sensitive to fine scale clustering and co-segregation than that achievable using composition frequency distribution and contingency table analyses.

  9. Spatio-Temporal Clustering of Monitoring Network

    NASA Astrophysics Data System (ADS)

    Hussain, I.; Pilz, J.

    2009-04-01

    Pakistan has much diversity in seasonal variation of different locations. Some areas are in desserts and remain very hot and waterless, for example coastal areas are situated along the Arabian Sea and have very warm season and a little rainfall. Some areas are covered with mountains, have very low temperature and heavy rainfall; for instance Karakoram ranges. The most important variables that have an impact on the climate are temperature, precipitation, humidity, wind speed and elevation. Furthermore, it is hard to find homogeneous regions in Pakistan with respect to climate variation. Identification of homogeneous regions in Pakistan can be useful in many aspects. It can be helpful for prediction of the climate in the sub-regions and for optimizing the number of monitoring sites. In the earlier literature no one tried to identify homogeneous regions of Pakistan with respect to climate variation. There are only a few papers about spatio-temporal clustering of monitoring network. Steinhaus (1956) presented the well-known K-means clustering method. It can identify a predefined number of clusters by iteratively assigning centriods to clusters based. Castro et al. (1997) developed a genetic heuristic algorithm to solve medoids based clustering. Their method is based on genetic recombination upon random assorting recombination. The suggested method is appropriate for clustering the attributes which have genetic characteristics. Sap and Awan (2005) presented a robust weighted kernel K-means algorithm incorporating spatial constraints for clustering climate data. The proposed algorithm can effectively handle noise, outliers and auto-correlation in the spatial data, for effective and efficient data analysis by exploring patterns and structures in the data. Soltani and Modarres (2006) used hierarchical and divisive cluster analysis to categorize patterns of rainfall in Iran. They only considered rainfall at twenty-eight monitoring sites and concluded that eight clusters existed. Soltani and Modarres (2006) classified the sites by using only average rainfall of sites, they did not consider time replications and spatial coordinates. Kerby et.al (2007) purposed spatial clustering method based on likelihood. They took account of the geographic locations through the variance covariance matrix. Their purposed method works like hierarchical clustering methods. Moreovere, it is inappropiriate for time replication data and could not perform well for large number of sites. Tuia.et.al (2008) used scan statistics for identifying spatio-temporal clusters for fire sequences in the Tuscany region in Italy. The scan statistics clustering method was developed by Kulldorff et al. (1997) to detect spatio-temporal clusters in epidemiology and assessing their significance. The purposed scan statistics method is used only for univariate discrete stochastic random variables. In this paper we make use of a very simple approach for spatio-temporal clustering which can create separable and homogeneous clusters. Most of the clustering methods are based on Euclidean distances. It is well known that geographic coordinates are spherical coordinates and estimating Euclidean distances from spherical coordinates is inappropriate. As a transformation from geographic coordinates to rectangular (D-plane) coordinates we use the Lambert projection method. The partition around medoids clustering method is incorporated on the data including D-plane coordinates. Ordinary kriging is taken as validity measure for the precipitation data. The kriging results for clusters are more accurate and have less variation compared to complete monitoring network precipitation data. References Casto.V.E and Murray.A.T (1997). Spatial Clustering with Data Mining with Genetic Algorithms. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.8573 Kaufman.L and Rousseeuw.P.J (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley series of Probability and Mathematical Statistics, New York. Kulldorf.M (1997). A spatial scan statistic. Commun. Stat.-Theor. Math. 26(6), 1481-1496 Kerby. A , Marx. D, Samal. A and Adamchuck. V. (2007). Spatial Clustering Using the Likelihood Function. Seventh IEEE International Conference on Data Mining - Workshops Steinhaus.H (1956). Sur la division des corp materiels en parties. Bull. Acad. Polon. Sci., C1. III vol IV:801- 804 Snyder, J. P. (1987). Map Projection: A Working Manual. U. S. Geological Survey Professional Paper 1395. Washington, DC: U. S. Government Printing Office, pp. 104-110 Sap.M.N and Awan. A.M (2005). Finding Spatio-Temporal Patterns in Climate Data Using Clustering. Proceedings of the International Conference on Cyberworlds (CW'05) Soltani.S and Modarres.R (2006). Classification of Spatio -Temporal Pattern of Rainfall in Iran: Using Hierarchical and Divisive Cluster Analysis. Journal of Spatial Hydrology Vol.6, No.2 Tuia.D, Ratle.F, Lasaponara.R, Telesca.L and Kanevski.M (2008). Scan Statistics Analysis for Forest Fire Clusters. Commun. in Nonlinear science and numerical simulation 13,1689-1694.

  10. Cluster size statistic and cluster mass statistic: two novel methods for identifying changes in functional connectivity between groups or conditions.

    PubMed

    Ing, Alex; Schwarzbauer, Christian

    2014-01-01

    Functional connectivity has become an increasingly important area of research in recent years. At a typical spatial resolution, approximately 300 million connections link each voxel in the brain with every other. This pattern of connectivity is known as the functional connectome. Connectivity is often compared between experimental groups and conditions. Standard methods used to control the type 1 error rate are likely to be insensitive when comparisons are carried out across the whole connectome, due to the huge number of statistical tests involved. To address this problem, two new cluster based methods--the cluster size statistic (CSS) and cluster mass statistic (CMS)--are introduced to control the family wise error rate across all connectivity values. These methods operate within a statistical framework similar to the cluster based methods used in conventional task based fMRI. Both methods are data driven, permutation based and require minimal statistical assumptions. Here, the performance of each procedure is evaluated in a receiver operator characteristic (ROC) analysis, utilising a simulated dataset. The relative sensitivity of each method is also tested on real data: BOLD (blood oxygen level dependent) fMRI scans were carried out on twelve subjects under normal conditions and during the hypercapnic state (induced through the inhalation of 6% CO2 in 21% O2 and 73%N2). Both CSS and CMS detected significant changes in connectivity between normal and hypercapnic states. A family wise error correction carried out at the individual connection level exhibited no significant changes in connectivity.

  11. Cluster Size Statistic and Cluster Mass Statistic: Two Novel Methods for Identifying Changes in Functional Connectivity Between Groups or Conditions

    PubMed Central

    Ing, Alex; Schwarzbauer, Christian

    2014-01-01

    Functional connectivity has become an increasingly important area of research in recent years. At a typical spatial resolution, approximately 300 million connections link each voxel in the brain with every other. This pattern of connectivity is known as the functional connectome. Connectivity is often compared between experimental groups and conditions. Standard methods used to control the type 1 error rate are likely to be insensitive when comparisons are carried out across the whole connectome, due to the huge number of statistical tests involved. To address this problem, two new cluster based methods – the cluster size statistic (CSS) and cluster mass statistic (CMS) – are introduced to control the family wise error rate across all connectivity values. These methods operate within a statistical framework similar to the cluster based methods used in conventional task based fMRI. Both methods are data driven, permutation based and require minimal statistical assumptions. Here, the performance of each procedure is evaluated in a receiver operator characteristic (ROC) analysis, utilising a simulated dataset. The relative sensitivity of each method is also tested on real data: BOLD (blood oxygen level dependent) fMRI scans were carried out on twelve subjects under normal conditions and during the hypercapnic state (induced through the inhalation of 6% CO2 in 21% O2 and 73%N2). Both CSS and CMS detected significant changes in connectivity between normal and hypercapnic states. A family wise error correction carried out at the individual connection level exhibited no significant changes in connectivity. PMID:24906136

  12. WordCluster: detecting clusters of DNA words and genomic elements

    PubMed Central

    2011-01-01

    Background Many k-mers (or DNA words) and genomic elements are known to be spatially clustered in the genome. Well established examples are the genes, TFBSs, CpG dinucleotides, microRNA genes and ultra-conserved non-coding regions. Currently, no algorithm exists to find these clusters in a statistically comprehensible way. The detection of clustering often relies on densities and sliding-window approaches or arbitrarily chosen distance thresholds. Results We introduce here an algorithm to detect clusters of DNA words (k-mers), or any other genomic element, based on the distance between consecutive copies and an assigned statistical significance. We implemented the method into a web server connected to a MySQL backend, which also determines the co-localization with gene annotations. We demonstrate the usefulness of this approach by detecting the clusters of CAG/CTG (cytosine contexts that can be methylated in undifferentiated cells), showing that the degree of methylation vary drastically between inside and outside of the clusters. As another example, we used WordCluster to search for statistically significant clusters of olfactory receptor (OR) genes in the human genome. Conclusions WordCluster seems to predict biological meaningful clusters of DNA words (k-mers) and genomic entities. The implementation of the method into a web server is available at http://bioinfo2.ugr.es/wordCluster/wordCluster.php including additional features like the detection of co-localization with gene regions or the annotation enrichment tool for functional analysis of overlapped genes. PMID:21261981

  13. Groundwater quality assessment of urban Bengaluru using multivariate statistical techniques

    NASA Astrophysics Data System (ADS)

    Gulgundi, Mohammad Shahid; Shetty, Amba

    2018-03-01

    Groundwater quality deterioration due to anthropogenic activities has become a subject of prime concern. The objective of the study was to assess the spatial and temporal variations in groundwater quality and to identify the sources in the western half of the Bengaluru city using multivariate statistical techniques. Water quality index rating was calculated for pre and post monsoon seasons to quantify overall water quality for human consumption. The post-monsoon samples show signs of poor quality in drinking purpose compared to pre-monsoon. Cluster analysis (CA), principal component analysis (PCA) and discriminant analysis (DA) were applied to the groundwater quality data measured on 14 parameters from 67 sites distributed across the city. Hierarchical cluster analysis (CA) grouped the 67 sampling stations into two groups, cluster 1 having high pollution and cluster 2 having lesser pollution. Discriminant analysis (DA) was applied to delineate the most meaningful parameters accounting for temporal and spatial variations in groundwater quality of the study area. Temporal DA identified pH as the most important parameter, which discriminates between water quality in the pre-monsoon and post-monsoon seasons and accounts for 72% seasonal assignation of cases. Spatial DA identified Mg, Cl and NO3 as the three most important parameters discriminating between two clusters and accounting for 89% spatial assignation of cases. Principal component analysis was applied to the dataset obtained from the two clusters, which evolved three factors in each cluster, explaining 85.4 and 84% of the total variance, respectively. Varifactors obtained from principal component analysis showed that groundwater quality variation is mainly explained by dissolution of minerals from rock water interactions in the aquifer, effect of anthropogenic activities and ion exchange processes in water.

  14. Identification and characterization of earthquake clusters: a comparative analysis for selected sequences in Italy

    NASA Astrophysics Data System (ADS)

    Peresan, Antonella; Gentili, Stefania

    2017-04-01

    Identification and statistical characterization of seismic clusters may provide useful insights about the features of seismic energy release and their relation to physical properties of the crust within a given region. Moreover, a number of studies based on spatio-temporal analysis of main-shocks occurrence require preliminary declustering of the earthquake catalogs. Since various methods, relying on different physical/statistical assumptions, may lead to diverse classifications of earthquakes into main events and related events, we aim to investigate the classification differences among different declustering techniques. Accordingly, a formal selection and comparative analysis of earthquake clusters is carried out for the most relevant earthquakes in North-Eastern Italy, as reported in the local OGS-CRS bulletins, compiled at the National Institute of Oceanography and Experimental Geophysics since 1977. The comparison is then extended to selected earthquake sequences associated with a different seismotectonic setting, namely to events that occurred in the region struck by the recent Central Italy destructive earthquakes, making use of INGV data. Various techniques, ranging from classical space-time windows methods to ad hoc manual identification of aftershocks, are applied for detection of earthquake clusters. In particular, a statistical method based on nearest-neighbor distances of events in space-time-energy domain, is considered. Results from clusters identification by the nearest-neighbor method turn out quite robust with respect to the time span of the input catalogue, as well as to minimum magnitude cutoff. The identified clusters for the largest events reported in North-Eastern Italy since 1977 are well consistent with those reported in earlier studies, which were aimed at detailed manual aftershocks identification. The study shows that the data-driven approach, based on the nearest-neighbor distances, can be satisfactorily applied to decompose the seismic catalog into background seismicity and individual sequences of earthquake clusters, also in areas characterized by moderate seismic activity, where the standard declustering techniques may turn out rather gross approximations. With these results acquired, the main statistical features of seismic clusters are explored, including complex interdependence of related events, with the aim to characterize the space-time patterns of earthquakes occurrence in North-Eastern Italy and capture their basic differences with Central Italy sequences.

  15. A spatial cluster analysis of tractor overturns in Kentucky from 1960 to 2002

    USGS Publications Warehouse

    Saman, D.M.; Cole, H.P.; Odoi, A.; Myers, M.L.; Carey, D.I.; Westneat, S.C.

    2012-01-01

    Background: Agricultural tractor overturns without rollover protective structures are the leading cause of farm fatalities in the United States. To our knowledge, no studies have incorporated the spatial scan statistic in identifying high-risk areas for tractor overturns. The aim of this study was to determine whether tractor overturns cluster in certain parts of Kentucky and identify factors associated with tractor overturns. Methods: A spatial statistical analysis using Kulldorff's spatial scan statistic was performed to identify county clusters at greatest risk for tractor overturns. A regression analysis was then performed to identify factors associated with tractor overturns. Results: The spatial analysis revealed a cluster of higher than expected tractor overturns in four counties in northern Kentucky (RR = 2.55) and 10 counties in eastern Kentucky (RR = 1.97). Higher rates of tractor overturns were associated with steeper average percent slope of pasture land by county (p = 0.0002) and a greater percent of total tractors with less than 40 horsepower by county (p<0.0001). Conclusions: This study reveals that geographic hotspots of tractor overturns exist in Kentucky and identifies factors associated with overturns. This study provides policymakers a guide to targeted county-level interventions (e.g., roll-over protective structures promotion interventions) with the intention of reducing tractor overturns in the highest risk counties in Kentucky. ?? 2012 Saman et al.

  16. Does objective cluster analysis serve as a useful precursor to seasonal precipitation prediction at local scale? Application to western Ethiopia

    NASA Astrophysics Data System (ADS)

    Zhang, Ying; Moges, Semu; Block, Paul

    2018-01-01

    Prediction of seasonal precipitation can provide actionable information to guide management of various sectoral activities. For instance, it is often translated into hydrological forecasts for better water resources management. However, many studies assume homogeneity in precipitation across an entire study region, which may prove ineffective for operational and local-level decisions, particularly for locations with high spatial variability. This study proposes advancing local-level seasonal precipitation predictions by first conditioning on regional-level predictions, as defined through objective cluster analysis, for western Ethiopia. To our knowledge, this is the first study predicting seasonal precipitation at high resolution in this region, where lives and livelihoods are vulnerable to precipitation variability given the high reliance on rain-fed agriculture and limited water resources infrastructure. The combination of objective cluster analysis, spatially high-resolution prediction of seasonal precipitation, and a modeling structure spanning statistical and dynamical approaches makes clear advances in prediction skill and resolution, as compared with previous studies. The statistical model improves versus the non-clustered case or dynamical models for a number of specific clusters in northwestern Ethiopia, with clusters having regional average correlation and ranked probability skill score (RPSS) values of up to 0.5 and 33 %, respectively. The general skill (after bias correction) of the two best-performing dynamical models over the entire study region is superior to that of the statistical models, although the dynamical models issue predictions at a lower resolution and the raw predictions require bias correction to guarantee comparable skills.

  17. Zonation in the deep benthic megafauna : Application of a general test.

    PubMed

    Gardiner, Frederick P; Haedrich, Richard L

    1978-01-01

    A test based on Maxwell-Boltzman statistics, instead of the formerly suggested but inappropriate Bose-Einstein statistics (Pielou and Routledge, 1976), examines the distribution of the boundaries of species' ranges distributed along a gradient, and indicates whether they are random or clustered (zoned). The test is most useful as a preliminary to the application of more instructive but less statistically rigorous methods such as cluster analysis. The test indicates zonation is marked in the deep benthic megafauna living between 200 and 3000 m, but below 3000 m little zonation may be found.

  18. Network Analysis Tools: from biological networks to clusters and pathways.

    PubMed

    Brohée, Sylvain; Faust, Karoline; Lima-Mendez, Gipsi; Vanderstocken, Gilles; van Helden, Jacques

    2008-01-01

    Network Analysis Tools (NeAT) is a suite of computer tools that integrate various algorithms for the analysis of biological networks: comparison between graphs, between clusters, or between graphs and clusters; network randomization; analysis of degree distribution; network-based clustering and path finding. The tools are interconnected to enable a stepwise analysis of the network through a complete analytical workflow. In this protocol, we present a typical case of utilization, where the tasks above are combined to decipher a protein-protein interaction network retrieved from the STRING database. The results returned by NeAT are typically subnetworks, networks enriched with additional information (i.e., clusters or paths) or tables displaying statistics. Typical networks comprising several thousands of nodes and arcs can be analyzed within a few minutes. The complete protocol can be read and executed in approximately 1 h.

  19. Clustering analysis for muon tomography data elaboration in the Muon Portal project

    NASA Astrophysics Data System (ADS)

    Bandieramonte, M.; Antonuccio-Delogu, V.; Becciani, U.; Costa, A.; La Rocca, P.; Massimino, P.; Petta, C.; Pistagna, C.; Riggi, F.; Riggi, S.; Sciacca, E.; Vitello, F.

    2015-05-01

    Clustering analysis is one of multivariate data analysis techniques which allows to gather statistical data units into groups, in order to minimize the logical distance within each group and to maximize the one between different groups. In these proceedings, the authors present a novel approach to the muontomography data analysis based on clustering algorithms. As a case study we present the Muon Portal project that aims to build and operate a dedicated particle detector for the inspection of harbor containers to hinder the smuggling of nuclear materials. Clustering techniques, working directly on scattering points, help to detect the presence of suspicious items inside the container, acting, as it will be shown, as a filter for a preliminary analysis of the data.

  20. Performance analysis of clustering techniques over microarray data: A case study

    NASA Astrophysics Data System (ADS)

    Dash, Rasmita; Misra, Bijan Bihari

    2018-03-01

    Handling big data is one of the major issues in the field of statistical data analysis. In such investigation cluster analysis plays a vital role to deal with the large scale data. There are many clustering techniques with different cluster analysis approach. But which approach suits a particular dataset is difficult to predict. To deal with this problem a grading approach is introduced over many clustering techniques to identify a stable technique. But the grading approach depends on the characteristic of dataset as well as on the validity indices. So a two stage grading approach is implemented. In this study the grading approach is implemented over five clustering techniques like hybrid swarm based clustering (HSC), k-means, partitioning around medoids (PAM), vector quantization (VQ) and agglomerative nesting (AGNES). The experimentation is conducted over five microarray datasets with seven validity indices. The finding of grading approach that a cluster technique is significant is also established by Nemenyi post-hoc hypothetical test.

  1. An application of cluster analysis for determining homogeneous subregions: The agroclimatological point of view. [Rio Grande do Sul, Brazil

    NASA Technical Reports Server (NTRS)

    Parada, N. D. J. (Principal Investigator); Cappelletti, C. A.

    1982-01-01

    A stratification oriented to crop area and yield estimation problems was performed using an algorithm of clustering. The variables used were a set of agroclimatological characteristics measured in each one of the 232 municipalities of the State of Rio Grande do Sul, Brazil. A nonhierarchical cluster analysis was used and the pseudo F-statistics criterion was implemented for determining the "cut point" in the number of strata.

  2. Applications of modern statistical methods to analysis of data in physical science

    NASA Astrophysics Data System (ADS)

    Wicker, James Eric

    Modern methods of statistical and computational analysis offer solutions to dilemmas confronting researchers in physical science. Although the ideas behind modern statistical and computational analysis methods were originally introduced in the 1970's, most scientists still rely on methods written during the early era of computing. These researchers, who analyze increasingly voluminous and multivariate data sets, need modern analysis methods to extract the best results from their studies. The first section of this work showcases applications of modern linear regression. Since the 1960's, many researchers in spectroscopy have used classical stepwise regression techniques to derive molecular constants. However, problems with thresholds of entry and exit for model variables plagues this analysis method. Other criticisms of this kind of stepwise procedure include its inefficient searching method, the order in which variables enter or leave the model and problems with overfitting data. We implement an information scoring technique that overcomes the assumptions inherent in the stepwise regression process to calculate molecular model parameters. We believe that this kind of information based model evaluation can be applied to more general analysis situations in physical science. The second section proposes new methods of multivariate cluster analysis. The K-means algorithm and the EM algorithm, introduced in the 1960's and 1970's respectively, formed the basis of multivariate cluster analysis methodology for many years. However, several shortcomings of these methods include strong dependence on initial seed values and inaccurate results when the data seriously depart from hypersphericity. We propose new cluster analysis methods based on genetic algorithms that overcomes the strong dependence on initial seed values. In addition, we propose a generalization of the Genetic K-means algorithm which can accurately identify clusters with complex hyperellipsoidal covariance structures. We then use this new algorithm in a genetic algorithm based Expectation-Maximization process that can accurately calculate parameters describing complex clusters in a mixture model routine. Using the accuracy of this GEM algorithm, we assign information scores to cluster calculations in order to best identify the number of mixture components in a multivariate data set. We will showcase how these algorithms can be used to process multivariate data from astronomical observations.

  3. Cluster analysis of phytoplankton data collected from the National Stream Quality Accounting Network in the Tennessee River basin, 1974-81

    USGS Publications Warehouse

    Stephens, D.W.; Wangsgard, J.B.

    1988-01-01

    A computer program, Numerical Taxonomy System of Multivariate Statistical Programs (NTSYS), was used with interfacing software to perform cluster analyses of phytoplankton data stored in the biological files of the U.S. Geological Survey. The NTSYS software performs various types of statistical analyses and is capable of handling a large matrix of data. Cluster analyses were done on phytoplankton data collected from 1974 to 1981 at four national Stream Quality Accounting Network stations in the Tennessee River basin. Analysis of the changes in clusters of phytoplankton genera indicated possible changes in the water quality of the French Broad River near Knoxville, Tennessee. At this station, the most common diatom groups indicated a shift in dominant forms with some of the less common diatoms being replaced by green and blue-green algae. There was a reduction in genera variability between 1974-77 and 1979-81 sampling periods. Statistical analysis of chloride and dissolved solids confirmed that concentrations of these substances were smaller in 1974-77 than in 1979-81. At Pickwick Landing Dam, the furthest downstream station used in the study, there was an increase in the number of genera of ' rare ' organisms with time. The appearance of two groups of green and blue-green algae indicated that an increase in temperature or nutrient concentrations occurred from 1974 to 1981, but this could not be confirmed using available water quality data. Associations of genera forming the phytoplankton communities at three stations on the Tennessee River were found to be seasonal. Nodal analysis of combined data from all four stations used in the study did not identify any seasonal or temporal patterns during 1974-81. Cluster analysis using the NYSYS programs was effective in reducing the large phytoplankton data set to a manageable size and provided considerable insight into the structure of phytoplankton communities in the Tennessee River basin. Problems encountered using cluster analysis were the subjectivity introduced in the definition of meaningful clusters, and the lack of taxonomic identification to the species level. (Author 's abstract)

  4. Analysis of temporal gene expression profiles: clustering by simulated annealing and determining the optimal number of clusters.

    PubMed

    Lukashin, A V; Fuchs, R

    2001-05-01

    Cluster analysis of genome-wide expression data from DNA microarray hybridization studies has proved to be a useful tool for identifying biologically relevant groupings of genes and samples. In the present paper, we focus on several important issues related to clustering algorithms that have not yet been fully studied. We describe a simple and robust algorithm for the clustering of temporal gene expression profiles that is based on the simulated annealing procedure. In general, this algorithm guarantees to eventually find the globally optimal distribution of genes over clusters. We introduce an iterative scheme that serves to evaluate quantitatively the optimal number of clusters for each specific data set. The scheme is based on standard approaches used in regular statistical tests. The basic idea is to organize the search of the optimal number of clusters simultaneously with the optimization of the distribution of genes over clusters. The efficiency of the proposed algorithm has been evaluated by means of a reverse engineering experiment, that is, a situation in which the correct distribution of genes over clusters is known a priori. The employment of this statistically rigorous test has shown that our algorithm places greater than 90% genes into correct clusters. Finally, the algorithm has been tested on real gene expression data (expression changes during yeast cell cycle) for which the fundamental patterns of gene expression and the assignment of genes to clusters are well understood from numerous previous studies.

  5. Statistical analysis of catalogs of extragalactic objects. II - The Abell catalog of rich clusters

    NASA Technical Reports Server (NTRS)

    Hauser, M. G.; Peebles, P. J. E.

    1973-01-01

    The results of a power-spectrum analysis are presented for the distribution of clusters in the Abell catalog. Clear and direct evidence is found for superclusters with small angular scale, in agreement with the recent study of Bogart and Wagoner (1973). It is also found that the degree and angular scale of the apparent superclustering varies with distance in the manner expected if the clustering is intrinsic to the spatial distribution rather than a consequence of patchy local obscuration.

  6. Cluster analysis as a prediction tool for pregnancy outcomes.

    PubMed

    Banjari, Ines; Kenjerić, Daniela; Šolić, Krešimir; Mandić, Milena L

    2015-03-01

    Considering specific physiology changes during gestation and thinking of pregnancy as a "critical window", classification of pregnant women at early pregnancy can be considered as crucial. The paper demonstrates the use of a method based on an approach from intelligent data mining, cluster analysis. Cluster analysis method is a statistical method which makes possible to group individuals based on sets of identifying variables. The method was chosen in order to determine possibility for classification of pregnant women at early pregnancy to analyze unknown correlations between different variables so that the certain outcomes could be predicted. 222 pregnant women from two general obstetric offices' were recruited. The main orient was set on characteristics of these pregnant women: their age, pre-pregnancy body mass index (BMI) and haemoglobin value. Cluster analysis gained a 94.1% classification accuracy rate with three branch- es or groups of pregnant women showing statistically significant correlations with pregnancy outcomes. The results are showing that pregnant women both of older age and higher pre-pregnancy BMI have a significantly higher incidence of delivering baby of higher birth weight but they gain significantly less weight during pregnancy. Their babies are also longer, and these women have significantly higher probability for complications during pregnancy (gestosis) and higher probability of induced or caesarean delivery. We can conclude that the cluster analysis method can appropriately classify pregnant women at early pregnancy to predict certain outcomes.

  7. Transforming Graph Data for Statistical Relational Learning

    DTIC Science & Technology

    2012-10-01

    Jordan, 2003), PLSA (Hofmann, 1999), ? Classification via RMN (Taskar et al., 2003) or SVM (Hasan, Chaoji, Salem , & Zaki, 2006) ? Hierarchical...dimensionality reduction methods such as Principal 407 Rossi, McDowell, Aha, & Neville Component Analysis (PCA), Principal Factor Analysis ( PFA ), and...clustering algorithm. Journal of the Royal Statistical Society. Series C, Applied statistics, 28, 100–108. Hasan, M. A., Chaoji, V., Salem , S., & Zaki, M

  8. Clustering performance comparison using K-means and expectation maximization algorithms.

    PubMed

    Jung, Yong Gyu; Kang, Min Soo; Heo, Jun

    2014-11-14

    Clustering is an important means of data mining based on separating data categories by similar features. Unlike the classification algorithm, clustering belongs to the unsupervised type of algorithms. Two representatives of the clustering algorithms are the K -means and the expectation maximization (EM) algorithm. Linear regression analysis was extended to the category-type dependent variable, while logistic regression was achieved using a linear combination of independent variables. To predict the possibility of occurrence of an event, a statistical approach is used. However, the classification of all data by means of logistic regression analysis cannot guarantee the accuracy of the results. In this paper, the logistic regression analysis is applied to EM clusters and the K -means clustering method for quality assessment of red wine, and a method is proposed for ensuring the accuracy of the classification results.

  9. Assessment of trace elements levels in patients with Type 2 diabetes using multivariate statistical analysis.

    PubMed

    Badran, M; Morsy, R; Soliman, H; Elnimr, T

    2016-01-01

    The trace elements metabolism has been reported to possess specific roles in the pathogenesis and progress of diabetes mellitus. Due to the continuous increase in the population of patients with Type 2 diabetes (T2D), this study aims to assess the levels and inter-relationships of fast blood glucose (FBG) and serum trace elements in Type 2 diabetic patients. This study was conducted on 40 Egyptian Type 2 diabetic patients and 36 healthy volunteers (Hospital of Tanta University, Tanta, Egypt). The blood serum was digested and then used to determine the levels of 24 trace elements using an inductive coupled plasma mass spectroscopy (ICP-MS). Multivariate statistical analysis depended on correlation coefficient, cluster analysis (CA) and principal component analysis (PCA), were used to analysis the data. The results exhibited significant changes in FBG and eight of trace elements, Zn, Cu, Se, Fe, Mn, Cr, Mg, and As, levels in the blood serum of Type 2 diabetic patients relative to those of healthy controls. The statistical analyses using multivariate statistical techniques were obvious in the reduction of the experimental variables, and grouping the trace elements in patients into three clusters. The application of PCA revealed a distinct difference in associations of trace elements and their clustering patterns in control and patients group in particular for Mg, Fe, Cu, and Zn that appeared to be the most crucial factors which related with Type 2 diabetes. Therefore, on the basis of this study, the contributors of trace elements content in Type 2 diabetic patients can be determine and specify with correlation relationship and multivariate statistical analysis, which confirm that the alteration of some essential trace metals may play a role in the development of diabetes mellitus. Copyright © 2015 Elsevier GmbH. All rights reserved.

  10. Percolation Analysis as a Tool to Describe the Topology of the Large Scale Structure of the Universe

    NASA Astrophysics Data System (ADS)

    Yess, Capp D.

    1997-09-01

    Percolation analysis is the study of the properties of clusters. In cosmology, it is the statistics of the size and number of clusters. This thesis presents a refinement of percolation analysis and its application to astronomical data. An overview of the standard model of the universe and the development of large scale structure is presented in order to place the study in historical and scientific context. Then using percolation statistics we, for the first time, demonstrate the universal character of a network pattern in the real space, mass distributions resulting from nonlinear gravitational instability of initial Gaussian fluctuations. We also find that the maximum of the number of clusters statistic in the evolved, nonlinear distributions is determined by the effective slope of the power spectrum. Next, we present percolation analyses of Wiener Reconstructions of the IRAS 1.2 Jy Redshift Survey. There are ten reconstructions of galaxy density fields in real space spanning the range β = 0.1 to 1.0, where β=Ω0.6/b,/ Ω is the present dimensionless density and b is the linear bias factor. Our method uses the growth of the largest cluster statistic to characterize the topology of a density field, where Gaussian randomized versions of the reconstructions are used as standards for analysis. For the reconstruction volume of radius, R≈100h-1 Mpc, percolation analysis reveals a slight 'meatball' topology for the real space, galaxy distribution of the IRAS survey. Finally, we employ a percolation technique developed for pointwise distributions to analyze two-dimensional projections of the three northern and three southern slices in the Las Campanas Redshift Survey and then give consideration to further study of the methodology, errors and application of percolation. We track the growth of the largest cluster as a topological indicator to a depth of 400 h-1 Mpc, and report an unambiguous signal, with high signal-to-noise ratio, indicating a network topology which in two dimensions is indicative of a filamentary distribution. It is hoped that one day percolation analysis can characterize the structure of the universe to a degree that will aid theorists in confidently describing the nature of our world.

  11. Detection of Anomalies in Hydrometric Data Using Artificial Intelligence Techniques

    NASA Astrophysics Data System (ADS)

    Lauzon, N.; Lence, B. J.

    2002-12-01

    This work focuses on the detection of anomalies in hydrometric data sequences, such as 1) outliers, which are individual data having statistical properties that differ from those of the overall population; 2) shifts, which are sudden changes over time in the statistical properties of the historical records of data; and 3) trends, which are systematic changes over time in the statistical properties. For the purpose of the design and management of water resources systems, it is important to be aware of these anomalies in hydrometric data, for they can induce a bias in the estimation of water quantity and quality parameters. These anomalies may be viewed as specific patterns affecting the data, and therefore pattern recognition techniques can be used for identifying them. However, the number of possible patterns is very large for each type of anomaly and consequently large computing capacities are required to account for all possibilities using the standard statistical techniques, such as cluster analysis. Artificial intelligence techniques, such as the Kohonen neural network and fuzzy c-means, are clustering techniques commonly used for pattern recognition in several areas of engineering and have recently begun to be used for the analysis of natural systems. They require much less computing capacity than the standard statistical techniques, and therefore are well suited for the identification of outliers, shifts and trends in hydrometric data. This work constitutes a preliminary study, using synthetic data representing hydrometric data that can be found in Canada. The analysis of the results obtained shows that the Kohonen neural network and fuzzy c-means are reasonably successful in identifying anomalies. This work also addresses the problem of uncertainties inherent to the calibration procedures that fit the clusters to the possible patterns for both the Kohonen neural network and fuzzy c-means. Indeed, for the same database, different sets of clusters can be established with these calibration procedures. A simple method for analyzing uncertainties associated with the Kohonen neural network and fuzzy c-means is developed here. The method combines the results from several sets of clusters, either from the Kohonen neural network or fuzzy c-means, so as to provide an overall diagnosis as to the identification of outliers, shifts and trends. The results indicate an improvement in the performance for identifying anomalies when the method of combining cluster sets is used, compared with when only one cluster set is used.

  12. Estimating global distribution of boreal, temperate, and tropical tree plant functional types using clustering techniques

    NASA Astrophysics Data System (ADS)

    Wang, Audrey; Price, David T.

    2007-03-01

    A simple integrated algorithm was developed to relate global climatology to distributions of tree plant functional types (PFT). Multivariate cluster analysis was performed to analyze the statistical homogeneity of the climate space occupied by individual tree PFTs. Forested regions identified from the satellite-based GLC2000 classification were separated into tropical, temperate, and boreal sub-PFTs for use in the Canadian Terrestrial Ecosystem Model (CTEM). Global data sets of monthly minimum temperature, growing degree days, an index of climatic moisture, and estimated PFT cover fractions were then used as variables in the cluster analysis. The statistical results for individual PFT clusters were found consistent with other global-scale classifications of dominant vegetation. As an improvement of the quantification of the climatic limitations on PFT distributions, the results also demonstrated overlapping of PFT cluster boundaries that reflected vegetation transitions, for example, between tropical and temperate biomes. The resulting global database should provide a better basis for simulating the interaction of climate change and terrestrial ecosystem dynamics using global vegetation models.

  13. An Empirical Taxonomy of Hospital Governing Board Roles

    PubMed Central

    Lee, Shoou-Yih D; Alexander, Jeffrey A; Wang, Virginia; Margolin, Frances S; Combes, John R

    2008-01-01

    Objective To develop a taxonomy of governing board roles in U.S. hospitals. Data Sources 2005 AHA Hospital Governance Survey, 2004 AHA Annual Survey of Hospitals, and Area Resource File. Study Design A governing board taxonomy was developed using cluster analysis. Results were validated and reviewed by industry experts. Differences in hospital and environmental characteristics across clusters were examined. Data Extraction Methods One-thousand three-hundred thirty-four hospitals with complete information on the study variables were included in the analysis. Principal Findings Five distinct clusters of hospital governing boards were identified. Statistical tests showed that the five clusters had high internal reliability and high internal validity. Statistically significant differences in hospital and environmental conditions were found among clusters. Conclusions The developed taxonomy provides policy makers, health care executives, and researchers a useful way to describe and understand hospital governing board roles. The taxonomy may also facilitate valid and systematic assessment of governance performance. Further, the taxonomy could be used as a framework for governing boards themselves to identify areas for improvement and direction for change. PMID:18355260

  14. Aircraft noise effects: An interdisciplinary study of the effects of aircraft noise on man. Part 1: Basic report

    NASA Technical Reports Server (NTRS)

    1980-01-01

    An area around the Munich-Riem airport was divided into 32 clusters of different noise exposure and subjects were drawn from each cluster for a social survey and for psychological, medical, and physiological testing. Extensive acoustical measurements were also carried out in each cluster. The results were then subjected to detailed statistical analysis.

  15. Use of a spatial scan statistic to identify clusters of births occurring outside Ghanaian health facilities for targeted intervention.

    PubMed

    Bosomprah, Samuel; Dotse-Gborgbortsi, Winfred; Aboagye, Patrick; Matthews, Zoe

    2016-11-01

    To identify and evaluate clusters of births that occurred outside health facilities in Ghana for targeted intervention. A retrospective study was conducted using a convenience sample of live births registered in Ghanaian health facilities from January 1 to December 31, 2014. Data were extracted from the district health information system. A spatial scan statistic was used to investigate clusters of home births through a discrete Poisson probability model. Scanning with a circular spatial window was conducted only for clusters with high rates of such deliveries. The district was used as the geographic unit of analysis. The likelihood P value was estimated using Monte Carlo simulations. Ten statistically significant clusters with a high rate of home birth were identified. The relative risks ranged from 1.43 ("least likely" cluster; P=0.001) to 1.95 ("most likely" cluster; P=0.001). The relative risks of the top five "most likely" clusters ranged from 1.68 to 1.95; these clusters were located in Ashanti, Brong Ahafo, and the Western, Eastern, and Greater regions of Accra. Health facility records, geospatial techniques, and geographic information systems provided locally relevant information to assist policy makers in delivering targeted interventions to small geographic areas. Copyright © 2016 International Federation of Gynecology and Obstetrics. Published by Elsevier Ireland Ltd. All rights reserved.

  16. Degree-based statistic and center persistency for brain connectivity analysis.

    PubMed

    Yoo, Kwangsun; Lee, Peter; Chung, Moo K; Sohn, William S; Chung, Sun Ju; Na, Duk L; Ju, Daheen; Jeong, Yong

    2017-01-01

    Brain connectivity analyses have been widely performed to investigate the organization and functioning of the brain, or to observe changes in neurological or psychiatric conditions. However, connectivity analysis inevitably introduces the problem of mass-univariate hypothesis testing. Although, several cluster-wise correction methods have been suggested to address this problem and shown to provide high sensitivity, these approaches fundamentally have two drawbacks: the lack of spatial specificity (localization power) and the arbitrariness of an initial cluster-forming threshold. In this study, we propose a novel method, degree-based statistic (DBS), performing cluster-wise inference. DBS is designed to overcome the above-mentioned two shortcomings. From a network perspective, a few brain regions are of critical importance and considered to play pivotal roles in network integration. Regarding this notion, DBS defines a cluster as a set of edges of which one ending node is shared. This definition enables the efficient detection of clusters and their center nodes. Furthermore, a new measure of a cluster, center persistency (CP) was introduced. The efficiency of DBS with a known "ground truth" simulation was demonstrated. Then they applied DBS to two experimental datasets and showed that DBS successfully detects the persistent clusters. In conclusion, by adopting a graph theoretical concept of degrees and borrowing the concept of persistence from algebraic topology, DBS could sensitively identify clusters with centric nodes that would play pivotal roles in an effect of interest. DBS is potentially widely applicable to variable cognitive or clinical situations and allows us to obtain statistically reliable and easily interpretable results. Hum Brain Mapp 38:165-181, 2017. © 2016 Wiley Periodicals, Inc. © 2016 Wiley Periodicals, Inc.

  17. Modest validity and fair reproducibility of dietary patterns derived by cluster analysis.

    PubMed

    Funtikova, Anna N; Benítez-Arciniega, Alejandra A; Fitó, Montserrat; Schröder, Helmut

    2015-03-01

    Cluster analysis is widely used to analyze dietary patterns. We aimed to analyze the validity and reproducibility of the dietary patterns defined by cluster analysis derived from a food frequency questionnaire (FFQ). We hypothesized that the dietary patterns derived by cluster analysis have fair to modest reproducibility and validity. Dietary data were collected from 107 individuals from population-based survey, by an FFQ at baseline (FFQ1) and after 1 year (FFQ2), and by twelve 24-hour dietary recalls (24-HDR). Repeatability and validity were measured by comparing clusters obtained by the FFQ1 and FFQ2 and by the FFQ2 and 24-HDR (reference method), respectively. Cluster analysis identified a "fruits & vegetables" and a "meat" pattern in each dietary data source. Cluster membership was concordant for 66.7% of participants in FFQ1 and FFQ2 (reproducibility), and for 67.0% in FFQ2 and 24-HDR (validity). Spearman correlation analysis showed reasonable reproducibility, especially in the "fruits & vegetables" pattern, and lower validity also especially in the "fruits & vegetables" pattern. κ statistic revealed a fair validity and reproducibility of clusters. Our findings indicate a reasonable reproducibility and fair to modest validity of dietary patterns derived by cluster analysis. Copyright © 2015 Elsevier Inc. All rights reserved.

  18. Common Scientific and Statistical Errors in Obesity Research

    PubMed Central

    George, Brandon J.; Beasley, T. Mark; Brown, Andrew W.; Dawson, John; Dimova, Rositsa; Divers, Jasmin; Goldsby, TaShauna U.; Heo, Moonseong; Kaiser, Kathryn A.; Keith, Scott; Kim, Mimi Y.; Li, Peng; Mehta, Tapan; Oakes, J. Michael; Skinner, Asheley; Stuart, Elizabeth; Allison, David B.

    2015-01-01

    We identify 10 common errors and problems in the statistical analysis, design, interpretation, and reporting of obesity research and discuss how they can be avoided. The 10 topics are: 1) misinterpretation of statistical significance, 2) inappropriate testing against baseline values, 3) excessive and undisclosed multiple testing and “p-value hacking,” 4) mishandling of clustering in cluster randomized trials, 5) misconceptions about nonparametric tests, 6) mishandling of missing data, 7) miscalculation of effect sizes, 8) ignoring regression to the mean, 9) ignoring confirmation bias, and 10) insufficient statistical reporting. We hope that discussion of these errors can improve the quality of obesity research by helping researchers to implement proper statistical practice and to know when to seek the help of a statistician. PMID:27028280

  19. Geographic Clusters of Basal Cell Carcinoma in a Northern California Health Plan Population.

    PubMed

    Ray, G Thomas; Kulldorff, Martin; Asgari, Maryam M

    2016-11-01

    Rates of skin cancer, including basal cell carcinoma (BCC), the most common cancer, have been increasing over the past 3 decades. A better understanding of geographic clustering of BCCs can help target screening and prevention efforts. Present a methodology to identify spatial clusters of BCC and identify such clusters in a northern California population. This retrospective study used a BCC registry to determine rates of BCC by census block group, and used spatial scan statistics to identify statistically significant geographic clusters of BCCs, adjusting for age, sex, and socioeconomic status. The study population consisted of white, non-Hispanic members of Kaiser Permanente Northern California during years 2011 and 2012. Statistically significant geographic clusters of BCC as determined by spatial scan statistics. Spatial analysis of 28 408 individuals who received a diagnosis of at least 1 BCC in 2011 or 2012 revealed distinct geographic areas with elevated BCC rates. Among the 14 counties studied, BCC incidence ranged from 661 to 1598 per 100 000 person-years. After adjustment for age, sex, and neighborhood socioeconomic status, a pattern of 5 discrete geographic clusters emerged, with a relative risk ranging from 1.12 (95% CI, 1.03-1.21; P = .006) for a cluster in eastern Sonoma and northern Napa Counties to 1.40 (95% CI, 1.15-1.71; P < .001) for a cluster in east Contra Costa and west San Joaquin Counties, compared with persons residing outside that cluster. In this study of a northern California population, we identified several geographic clusters with modestly elevated incidence of BCC. Knowledge of geographic clusters can help inform future research on the underlying etiology of the clustering including factors related to the environment, health care access, or other characteristics of the resident population, and can help target screening efforts to areas of highest yield.

  20. Examining the effectiveness of discriminant function analysis and cluster analysis in species identification of male field crickets based on their calling songs.

    PubMed

    Jaiswara, Ranjana; Nandi, Diptarup; Balakrishnan, Rohini

    2013-01-01

    Traditional taxonomy based on morphology has often failed in accurate species identification owing to the occurrence of cryptic species, which are reproductively isolated but morphologically identical. Molecular data have thus been used to complement morphology in species identification. The sexual advertisement calls in several groups of acoustically communicating animals are species-specific and can thus complement molecular data as non-invasive tools for identification. Several statistical tools and automated identifier algorithms have been used to investigate the efficiency of acoustic signals in species identification. Despite a plethora of such methods, there is a general lack of knowledge regarding the appropriate usage of these methods in specific taxa. In this study, we investigated the performance of two commonly used statistical methods, discriminant function analysis (DFA) and cluster analysis, in identification and classification based on acoustic signals of field cricket species belonging to the subfamily Gryllinae. Using a comparative approach we evaluated the optimal number of species and calling song characteristics for both the methods that lead to most accurate classification and identification. The accuracy of classification using DFA was high and was not affected by the number of taxa used. However, a constraint in using discriminant function analysis is the need for a priori classification of songs. Accuracy of classification using cluster analysis, which does not require a priori knowledge, was maximum for 6-7 taxa and decreased significantly when more than ten taxa were analysed together. We also investigated the efficacy of two novel derived acoustic features in improving the accuracy of identification. Our results show that DFA is a reliable statistical tool for species identification using acoustic signals. Our results also show that cluster analysis of acoustic signals in crickets works effectively for species classification and identification.

  1. Extracting Aggregation Free Energies of Mixed Clusters from Simulations of Small Systems: Application to Ionic Surfactant Micelles.

    PubMed

    Zhang, X; Patel, L A; Beckwith, O; Schneider, R; Weeden, C J; Kindt, J T

    2017-11-14

    Micelle cluster distributions from molecular dynamics simulations of a solvent-free coarse-grained model of sodium octyl sulfate (SOS) were analyzed using an improved method to extract equilibrium association constants from small-system simulations containing one or two micelle clusters at equilibrium with free surfactants and counterions. The statistical-thermodynamic and mathematical foundations of this partition-enabled analysis of cluster histograms (PEACH) approach are presented. A dramatic reduction in computational time for analysis was achieved through a strategy similar to the selector variable method to circumvent the need for exhaustive enumeration of the possible partitions of surfactants and counterions into clusters. Using statistics from a set of small-system (up to 60 SOS molecules) simulations as input, equilibrium association constants for micelle clusters were obtained as a function of both number of surfactants and number of associated counterions through a global fitting procedure. The resulting free energies were able to accurately predict micelle size and charge distributions in a large (560 molecule) system. The evolution of micelle size and charge with SOS concentration as predicted by the PEACH-derived free energies and by a phenomenological four-parameter model fit, along with the sensitivity of these predictions to variations in cluster definitions, are analyzed and discussed.

  2. Statistical detection of geographic clusters of resistant Escherichia coli in a regional network with WHONET and SaTScan.

    PubMed

    Park, Rachel; O'Brien, Thomas F; Huang, Susan S; Baker, Meghan A; Yokoe, Deborah S; Kulldorff, Martin; Barrett, Craig; Swift, Jamie; Stelling, John

    2016-11-01

    While antimicrobial resistance threatens the prevention, treatment, and control of infectious diseases, systematic analysis of routine microbiology laboratory test results worldwide can alert new threats and promote timely response. This study explores statistical algorithms for recognizing geographic clustering of multi-resistant microbes within a healthcare network and monitoring the dissemination of new strains over time. Escherichia coli antimicrobial susceptibility data from a three-year period stored in WHONET were analyzed across ten facilities in a healthcare network utilizing SaTScan's spatial multinomial model with two models for defining geographic proximity. We explored geographic clustering of multi-resistance phenotypes within the network and changes in clustering over time. Geographic clustering identified from both latitude/longitude and non-parametric facility groupings geographic models were similar, while the latter was offers greater flexibility and generalizability. Iterative application of the clustering algorithms suggested the possible recognition of the initial appearance of invasive E. coli ST131 in the clinical database of a single hospital and subsequent dissemination to others. Systematic analysis of routine antimicrobial resistance susceptibility test results supports the recognition of geographic clustering of microbial phenotypic subpopulations with WHONET and SaTScan, and iterative application of these algorithms can detect the initial appearance in and dissemination across a region prompting early investigation, response, and containment measures.

  3. A statistical study of EMIC waves observed by Cluster: 1. Wave properties

    NASA Astrophysics Data System (ADS)

    Allen, R. C.; Zhang, J.-C.; Kistler, L. M.; Spence, H. E.; Lin, R.-L.; Klecker, B.; Dunlop, M. W.; André, M.; Jordanova, V. K.

    2015-07-01

    Electromagnetic ion cyclotron (EMIC) waves are an important mechanism for particle energization and losses inside the magnetosphere. In order to better understand the effects of these waves on particle dynamics, detailed information about the occurrence rate, wave power, ellipticity, normal angle, energy propagation angle distributions, and local plasma parameters are required. Previous statistical studies have used in situ observations to investigate the distribution of these parameters in the magnetic local time versus L-shell (MLT-L) frame within a limited magnetic latitude (MLAT) range. In this study, we present a statistical analysis of EMIC wave properties using 10 years (2001-2010) of data from Cluster, totaling 25,431 min of wave activity. Due to the polar orbit of Cluster, we are able to investigate EMIC waves at all MLATs and MLTs. This allows us to further investigate the MLAT dependence of various wave properties inside different MLT sectors and further explore the effects of Shabansky orbits on EMIC wave generation and propagation. The statistical analysis is presented in two papers. This paper focuses on the wave occurrence distribution as well as the distribution of wave properties. The companion paper focuses on local plasma parameters during wave observations as well as wave generation proxies.

  4. Clinical Study of the 3D-Master Color System among the Spanish Population.

    PubMed

    Gómez-Polo, Cristina; Gómez-Polo, Miguel; Martínez Vázquez de Parga, Juan Antonio; Celemín-Viñuela, Alicia

    2017-01-12

    To study whether the shades of the 3D-Master System were grouped and represented in the chromatic space according to the three-color coordinates of value, chroma, and hue. Maxillary central incisor color was measured on tooth surfaces through the Easyshade Compact spectrophotometer using 1361 participants aged between 16 and 89. The natural (not bleached teeth) color of the middle thirds was registered in the 3D-Master System nomenclature and in the CIELCh system. Principal component analysis and cluster analysis were applied. 75 colors of the 3D-Master System were found. The statistical analysis revealed the existence of 5 cluster groups. The centroid, the average of the 75 samples, in relation to lightness (L*) was 74.64, 22.87 for chroma (C*), and 88.85 for hue (h*). All of the clusters, except cluster 3, showed significant statistical differences with the centroid for the three-color coordinates (p <0.001). The results of this study indicated that 75 shades in the 3D-Master System were grouped into 5 clusters following coordinates L*, C*, and h* resulting from the dental spectrophotometer Vita Easyshade compact. The shades that composed each cluster did not belong to the same lightness color dimension groups. There was no special uniform chromatic distribution among the colors of the 3D-Master System. © 2017 by the American College of Prosthodontists.

  5. Improved Test Planning and Analysis Through the Use of Advanced Statistical Methods

    NASA Technical Reports Server (NTRS)

    Green, Lawrence L.; Maxwell, Katherine A.; Glass, David E.; Vaughn, Wallace L.; Barger, Weston; Cook, Mylan

    2016-01-01

    The goal of this work is, through computational simulations, to provide statistically-based evidence to convince the testing community that a distributed testing approach is superior to a clustered testing approach for most situations. For clustered testing, numerous, repeated test points are acquired at a limited number of test conditions. For distributed testing, only one or a few test points are requested at many different conditions. The statistical techniques of Analysis of Variance (ANOVA), Design of Experiments (DOE) and Response Surface Methods (RSM) are applied to enable distributed test planning, data analysis and test augmentation. The D-Optimal class of DOE is used to plan an optimally efficient single- and multi-factor test. The resulting simulated test data are analyzed via ANOVA and a parametric model is constructed using RSM. Finally, ANOVA can be used to plan a second round of testing to augment the existing data set with new data points. The use of these techniques is demonstrated through several illustrative examples. To date, many thousands of comparisons have been performed and the results strongly support the conclusion that the distributed testing approach outperforms the clustered testing approach.

  6. Application of the Linux cluster for exhaustive window haplotype analysis using the FBAT and Unphased programs.

    PubMed

    Mishima, Hiroyuki; Lidral, Andrew C; Ni, Jun

    2008-05-28

    Genetic association studies have been used to map disease-causing genes. A newly introduced statistical method, called exhaustive haplotype association study, analyzes genetic information consisting of different numbers and combinations of DNA sequence variations along a chromosome. Such studies involve a large number of statistical calculations and subsequently high computing power. It is possible to develop parallel algorithms and codes to perform the calculations on a high performance computing (HPC) system. However, most existing commonly-used statistic packages for genetic studies are non-parallel versions. Alternatively, one may use the cutting-edge technology of grid computing and its packages to conduct non-parallel genetic statistical packages on a centralized HPC system or distributed computing systems. In this paper, we report the utilization of a queuing scheduler built on the Grid Engine and run on a Rocks Linux cluster for our genetic statistical studies. Analysis of both consecutive and combinational window haplotypes was conducted by the FBAT (Laird et al., 2000) and Unphased (Dudbridge, 2003) programs. The dataset consisted of 26 loci from 277 extended families (1484 persons). Using the Rocks Linux cluster with 22 compute-nodes, FBAT jobs performed about 14.4-15.9 times faster, while Unphased jobs performed 1.1-18.6 times faster compared to the accumulated computation duration. Execution of exhaustive haplotype analysis using non-parallel software packages on a Linux-based system is an effective and efficient approach in terms of cost and performance.

  7. Application of the Linux cluster for exhaustive window haplotype analysis using the FBAT and Unphased programs

    PubMed Central

    Mishima, Hiroyuki; Lidral, Andrew C; Ni, Jun

    2008-01-01

    Background Genetic association studies have been used to map disease-causing genes. A newly introduced statistical method, called exhaustive haplotype association study, analyzes genetic information consisting of different numbers and combinations of DNA sequence variations along a chromosome. Such studies involve a large number of statistical calculations and subsequently high computing power. It is possible to develop parallel algorithms and codes to perform the calculations on a high performance computing (HPC) system. However, most existing commonly-used statistic packages for genetic studies are non-parallel versions. Alternatively, one may use the cutting-edge technology of grid computing and its packages to conduct non-parallel genetic statistical packages on a centralized HPC system or distributed computing systems. In this paper, we report the utilization of a queuing scheduler built on the Grid Engine and run on a Rocks Linux cluster for our genetic statistical studies. Results Analysis of both consecutive and combinational window haplotypes was conducted by the FBAT (Laird et al., 2000) and Unphased (Dudbridge, 2003) programs. The dataset consisted of 26 loci from 277 extended families (1484 persons). Using the Rocks Linux cluster with 22 compute-nodes, FBAT jobs performed about 14.4–15.9 times faster, while Unphased jobs performed 1.1–18.6 times faster compared to the accumulated computation duration. Conclusion Execution of exhaustive haplotype analysis using non-parallel software packages on a Linux-based system is an effective and efficient approach in terms of cost and performance. PMID:18541045

  8. Ecological tolerances of Miocene larger benthic foraminifera from Indonesia

    NASA Astrophysics Data System (ADS)

    Novak, Vibor; Renema, Willem

    2018-01-01

    To provide a comprehensive palaeoenvironmental reconstruction based on larger benthic foraminifera (LBF), a quantitative analysis of their assemblage composition is needed. Besides microfacies analysis which includes environmental preferences of foraminiferal taxa, statistical analyses should also be employed. Therefore, detrended correspondence analysis and cluster analysis were performed on relative abundance data of identified LBF assemblages deposited in mixed carbonate-siliciclastic (MCS) systems and blue-water (BW) settings. Studied MCS system localities include ten sections from the central part of the Kutai Basin in East Kalimantan, ranging from late Burdigalian to Serravallian age. The BW samples were collected from eleven sections of the Bulu Formation on Central Java, dated as Serravallian. Results from detrended correspondence analysis reveal significant differences between these two environmental settings. Cluster analysis produced five clusters of samples; clusters 1 and 2 comprise dominantly MCS samples, clusters 3 and 4 with dominance of BW samples, and cluster 5 showing a mixed composition with both MCS and BW samples. The results of cluster analysis were afterwards subjected to indicator species analysis resulting in the interpretation that generated three groups among LBF taxa: typical assemblage indicators, regularly occurring taxa and rare taxa. By interpreting the results of detrended correspondence analysis, cluster analysis and indicator species analysis, along with environmental preferences of identified LBF taxa, a palaeoenvironmental model is proposed for the distribution of LBF in Miocene MCS systems and adjacent BW settings of Indonesia.

  9. Statistical design and analysis plan for an impact evaluation of an HIV treatment and prevention intervention for female sex workers in Zimbabwe: a study protocol for a cluster randomised controlled trial.

    PubMed

    Hargreaves, James R; Fearon, Elizabeth; Davey, Calum; Phillips, Andrew; Cambiano, Valentina; Cowan, Frances M

    2016-01-05

    Pragmatic cluster-randomised trials should seek to make unbiased estimates of effect and be reported according to CONSORT principles, and the study population should be representative of the target population. This is challenging when conducting trials amongst 'hidden' populations without a sample frame. We describe a pair-matched cluster-randomised trial of a combination HIV-prevention intervention to reduce the proportion of female sex workers (FSW) with a detectable HIV viral load in Zimbabwe, recruiting via respondent driven sampling (RDS). We will cross-sectionally survey approximately 200 FSW at baseline and at endline to characterise each of 14 sites. RDS is a variant of chain referral sampling and has been adapted to approximate random sampling. Primary analysis will use the 'RDS-2' method to estimate cluster summaries and will adapt Hayes and Moulton's '2-step' method to adjust effect estimates for individual-level confounders and further adjust for cluster baseline prevalence. We will adapt CONSORT to accommodate RDS. In the absence of observable refusal rates, we will compare the recruitment process between matched pairs. We will need to investigate whether cluster-specific recruitment or the intervention itself affects the accuracy of the RDS estimation process, potentially causing differential biases. To do this, we will calculate RDS-diagnostic statistics for each cluster at each time point and compare these statistics within matched pairs and time points. Sensitivity analyses will assess the impact of potential biases arising from assumptions made by the RDS-2 estimation. We are not aware of any other completed pragmatic cluster RCTs that are recruiting participants using RDS. Our statistical design and analysis approach seeks to transparently document participant recruitment and allow an assessment of the representativeness of the study to the target population, a key aspect of pragmatic trials. The challenges we have faced in the design of this trial are likely to be shared in other contexts aiming to serve the needs of legally and/or socially marginalised populations for which no sampling frame exists and especially when the social networks of participants are both the target of intervention and the means of recruitment. The trial was registered at Pan African Clinical Trials Registry (PACTR201312000722390) on 9 December 2013.

  10. Wildfire cluster detection using space-time scan statistics

    NASA Astrophysics Data System (ADS)

    Tonini, M.; Tuia, D.; Ratle, F.; Kanevski, M.

    2009-04-01

    The aim of the present study is to identify spatio-temporal clusters of fires sequences using space-time scan statistics. These statistical methods are specifically designed to detect clusters and assess their significance. Basically, scan statistics work by comparing a set of events occurring inside a scanning window (or a space-time cylinder for spatio-temporal data) with those that lie outside. Windows of increasing size scan the zone across space and time: the likelihood ratio is calculated for each window (comparing the ratio "observed cases over expected" inside and outside): the window with the maximum value is assumed to be the most probable cluster, and so on. Under the null hypothesis of spatial and temporal randomness, these events are distributed according to a known discrete-state random process (Poisson or Bernoulli), which parameters can be estimated. Given this assumption, it is possible to test whether or not the null hypothesis holds in a specific area. In order to deal with fires data, the space-time permutation scan statistic has been applied since it does not require the explicit specification of the population-at risk in each cylinder. The case study is represented by Florida daily fire detection using the Moderate Resolution Imaging Spectroradiometer (MODIS) active fire product during the period 2003-2006. As result, statistically significant clusters have been identified. Performing the analyses over the entire frame period, three out of the five most likely clusters have been identified in the forest areas, on the North of the country; the other two clusters cover a large zone in the South, corresponding to agricultural land and the prairies in the Everglades. Furthermore, the analyses have been performed separately for the four years to analyze if the wildfires recur each year during the same period. It emerges that clusters of forest fires are more frequent in hot seasons (spring and summer), while in the South areas they are widely present along the whole year. The analysis of fires distribution to evaluate if they are statistically more frequent in some area or/and in some period of the year, can be useful to support fire management and to focus on prevention measures.

  11. Structural parameters of young star clusters: fractal analysis

    NASA Astrophysics Data System (ADS)

    Hetem, A.

    2017-07-01

    A unified view of star formation in the Universe demand detailed and in-depth studies of young star clusters. This work is related to our previous study of fractal statistics estimated for a sample of young stellar clusters (Gregorio-Hetem et al. 2015, MNRAS 448, 2504). The structural properties can lead to significant conclusions about the early stages of cluster formation: 1) virial conditions can be used to distinguish warm collapsed; 2) bound or unbound behaviour can lead to conclusions about expansion; and 3) fractal statistics are correlated to the dynamical evolution and age. The technique of error bars estimation most used in the literature is to adopt inferential methods (like bootstrap) to estimate deviation and variance, which are valid only for an artificially generated cluster. In this paper, we expanded the number of studied clusters, in order to enhance the investigation of the cluster properties and dynamic evolution. The structural parameters were compared with fractal statistics and reveal that the clusters radial density profile show a tendency of the mean separation of the stars increase with the average surface density. The sample can be divided into two groups showing different dynamic behaviour, but they have the same dynamic evolution, since the entire sample was revealed as being expanding objects, for which the substructures do not seem to have been completely erased. These results are in agreement with the simulations adopting low surface densities and supervirial conditions.

  12. An unsupervised classification technique for multispectral remote sensing data.

    NASA Technical Reports Server (NTRS)

    Su, M. Y.; Cummings, R. E.

    1973-01-01

    Description of a two-part clustering technique consisting of (a) a sequential statistical clustering, which is essentially a sequential variance analysis, and (b) a generalized K-means clustering. In this composite clustering technique, the output of (a) is a set of initial clusters which are input to (b) for further improvement by an iterative scheme. This unsupervised composite technique was employed for automatic classification of two sets of remote multispectral earth resource observations. The classification accuracy by the unsupervised technique is found to be comparable to that by traditional supervised maximum-likelihood classification techniques.

  13. Unsupervised classification of earth resources data.

    NASA Technical Reports Server (NTRS)

    Su, M. Y.; Jayroe, R. R., Jr.; Cummings, R. E.

    1972-01-01

    A new clustering technique is presented. It consists of two parts: (a) a sequential statistical clustering which is essentially a sequential variance analysis and (b) a generalized K-means clustering. In this composite clustering technique, the output of (a) is a set of initial clusters which are input to (b) for further improvement by an iterative scheme. This unsupervised composite technique was employed for automatic classification of two sets of remote multispectral earth resource observations. The classification accuracy by the unsupervised technique is found to be comparable to that by existing supervised maximum liklihood classification technique.

  14. A new clustering strategy

    NASA Astrophysics Data System (ADS)

    Feng, Jian-xin; Tang, Jia-fu; Wang, Guang-xing

    2007-04-01

    On the basis of the analysis of clustering algorithm that had been proposed for MANET, a novel clustering strategy was proposed in this paper. With the trust defined by statistical hypothesis in probability theory and the cluster head selected by node trust and node mobility, this strategy can realize the function of the malicious nodes detection which was neglected by other clustering algorithms and overcome the deficiency of being incapable of implementing the relative mobility metric of corresponding nodes in the MOBIC algorithm caused by the fact that the receiving power of two consecutive HELLO packet cannot be measured. It's an effective solution to cluster MANET securely.

  15. Symptom clusters in patients with nasopharyngeal carcinoma during radiotherapy.

    PubMed

    Xiao, Wenli; Chan, Carmen W H; Fan, Yuying; Leung, Doris Y P; Xia, Weixiong; He, Yan; Tang, Linquan

    2017-06-01

    Despite the improvement in radiotherapy (RT) technology, patients with nasopharyngeal carcinoma (NPC) still suffer from numerous distressing symptoms simultaneously during RT. The purpose of the study was to investigate the symptom clusters experienced by NPC patients during RT. First-treated Chinese NPC patients (n = 130) undergoing late-period RT (from week 4 till the end) were recruited for this cross-sectional study. They completed a sociodemographic and clinical data questionnaire, the Chinese version of the M. D. Anderson Symptom Inventory - Head and Neck Module (MDASI-HN-C) and the Chinese version of the Functional Assessment of Cancer Therapy - Head and Neck Scale (FACT-H&N-C). Principal axis factor analysis with oblimin rotation, independent t-test, one-way analysis of variance (ANOVA) and Pearson product-moment correlation were used to analyze the data. Four symptom clusters were identified, and labelled general, gastrointestinal, nutrition impact and social interaction impact. Of these 4 types, the nutrition impact symptom cluster was the most severe. Statistically positive correlations were found between severity of all 4 symptom clusters and symptom interference, as well as weight loss. Statistically negative correlations were detected between the cluster severity and the QOL total score and 3 out of 5 subscale scores. The four clusters identified reveal the symptom patterns experienced by NPC patients during RT. Future intervention studies on managing these symptom clusters are warranted, especially for the nutrition impact symptom cluster. Copyright © 2017 Elsevier Ltd. All rights reserved.

  16. The use of the temporal scan statistic to detect methicillin-resistant Staphylococcus aureus clusters in a community hospital.

    PubMed

    Faires, Meredith C; Pearl, David L; Ciccotelli, William A; Berke, Olaf; Reid-Smith, Richard J; Weese, J Scott

    2014-07-08

    In healthcare facilities, conventional surveillance techniques using rule-based guidelines may result in under- or over-reporting of methicillin-resistant Staphylococcus aureus (MRSA) outbreaks, as these guidelines are generally unvalidated. The objectives of this study were to investigate the utility of the temporal scan statistic for detecting MRSA clusters, validate clusters using molecular techniques and hospital records, and determine significant differences in the rate of MRSA cases using regression models. Patients admitted to a community hospital between August 2006 and February 2011, and identified with MRSA>48 hours following hospital admission, were included in this study. Between March 2010 and February 2011, MRSA specimens were obtained for spa typing. MRSA clusters were investigated using a retrospective temporal scan statistic. Tests were conducted on a monthly scale and significant clusters were compared to MRSA outbreaks identified by hospital personnel. Associations between the rate of MRSA cases and the variables year, month, and season were investigated using a negative binomial regression model. During the study period, 735 MRSA cases were identified and 167 MRSA isolates were spa typed. Nine different spa types were identified with spa type 2/t002 (88.6%) the most prevalent. The temporal scan statistic identified significant MRSA clusters at the hospital (n=2), service (n=16), and ward (n=10) levels (P ≤ 0.05). Seven clusters were concordant with nine MRSA outbreaks identified by hospital staff. For the remaining clusters, seven events may have been equivalent to true outbreaks and six clusters demonstrated possible transmission events. The regression analysis indicated years 2009-2011, compared to 2006, and months March and April, compared to January, were associated with an increase in the rate of MRSA cases (P ≤ 0.05). The application of the temporal scan statistic identified several MRSA clusters that were not detected by hospital personnel. The identification of specific years and months with increased MRSA rates may be attributable to several hospital level factors including the presence of other pathogens. Within hospitals, the incorporation of the temporal scan statistic to standard surveillance techniques is a valuable tool for healthcare workers to evaluate surveillance strategies and aid in the identification of MRSA clusters.

  17. Developing appropriate methods for cost-effectiveness analysis of cluster randomized trials.

    PubMed

    Gomes, Manuel; Ng, Edmond S-W; Grieve, Richard; Nixon, Richard; Carpenter, James; Thompson, Simon G

    2012-01-01

    Cost-effectiveness analyses (CEAs) may use data from cluster randomized trials (CRTs), where the unit of randomization is the cluster, not the individual. However, most studies use analytical methods that ignore clustering. This article compares alternative statistical methods for accommodating clustering in CEAs of CRTs. Our simulation study compared the performance of statistical methods for CEAs of CRTs with 2 treatment arms. The study considered a method that ignored clustering--seemingly unrelated regression (SUR) without a robust standard error (SE)--and 4 methods that recognized clustering--SUR and generalized estimating equations (GEEs), both with robust SE, a "2-stage" nonparametric bootstrap (TSB) with shrinkage correction, and a multilevel model (MLM). The base case assumed CRTs with moderate numbers of balanced clusters (20 per arm) and normally distributed costs. Other scenarios included CRTs with few clusters, imbalanced cluster sizes, and skewed costs. Performance was reported as bias, root mean squared error (rMSE), and confidence interval (CI) coverage for estimating incremental net benefits (INBs). We also compared the methods in a case study. Each method reported low levels of bias. Without the robust SE, SUR gave poor CI coverage (base case: 0.89 v. nominal level: 0.95). The MLM and TSB performed well in each scenario (CI coverage, 0.92-0.95). With few clusters, the GEE and SUR (with robust SE) had coverage below 0.90. In the case study, the mean INBs were similar across all methods, but ignoring clustering underestimated statistical uncertainty and the value of further research. MLMs and the TSB are appropriate analytical methods for CEAs of CRTs with the characteristics described. SUR and GEE are not recommended for studies with few clusters.

  18. Dynamics of cD Clusters of Galaxies. 4; Conclusion of a Survey of 25 Abell Clusters

    NASA Technical Reports Server (NTRS)

    Oegerle, William R.; Hill, John M.; Fisher, Richard R. (Technical Monitor)

    2001-01-01

    We present the final results of a spectroscopic study of a sample of cD galaxy clusters. The goal of this program has been to study the dynamics of the clusters, with emphasis on determining the nature and frequency of cD galaxies with peculiar velocities. Redshifts measured with the MX Spectrometer have been combined with those obtained from the literature to obtain typically 50 - 150 observed velocities in each of 25 galaxy clusters containing a central cD galaxy. We present a dynamical analysis of the final 11 clusters to be observed in this sample. All 25 clusters are analyzed in a uniform manner to test for the presence of substructure, and to determine peculiar velocities and their statistical significance for the central cD galaxy. These peculiar velocities were used to determine whether or not the central cD galaxy is at rest in the cluster potential well. We find that 30 - 50% of the clusters in our sample possess significant subclustering (depending on the cluster radius used in the analysis), which is in agreement with other studies of non-cD clusters. Hence, the dynamical state of cD clusters is not different than other present-day clusters. After careful study, four of the clusters appear to have a cD galaxy with a significant peculiar velocity. Dressler-Shectman tests indicate that three of these four clusters have statistically significant substructure within 1.5/h(sub 75) Mpc of the cluster center. The dispersion 75 of the cD peculiar velocities is 164 +41/-34 km/s around the mean cluster velocity. This represents a significant detection of peculiar cD velocities, but at a level which is far below the mean velocity dispersion for this sample of clusters. The picture that emerges is one in which cD galaxies are nearly at rest with respect to the cluster potential well, but have small residual velocities due to subcluster mergers.

  19. Could the clinical interpretability of subgroups detected using clustering methods be improved by using a novel two-stage approach?

    PubMed

    Kent, Peter; Stochkendahl, Mette Jensen; Christensen, Henrik Wulff; Kongsted, Alice

    2015-01-01

    Recognition of homogeneous subgroups of patients can usefully improve prediction of their outcomes and the targeting of treatment. There are a number of research approaches that have been used to recognise homogeneity in such subgroups and to test their implications. One approach is to use statistical clustering techniques, such as Cluster Analysis or Latent Class Analysis, to detect latent relationships between patient characteristics. Influential patient characteristics can come from diverse domains of health, such as pain, activity limitation, physical impairment, social role participation, psychological factors, biomarkers and imaging. However, such 'whole person' research may result in data-driven subgroups that are complex, difficult to interpret and challenging to recognise clinically. This paper describes a novel approach to applying statistical clustering techniques that may improve the clinical interpretability of derived subgroups and reduce sample size requirements. This approach involves clustering in two sequential stages. The first stage involves clustering within health domains and therefore requires creating as many clustering models as there are health domains in the available data. This first stage produces scoring patterns within each domain. The second stage involves clustering using the scoring patterns from each health domain (from the first stage) to identify subgroups across all domains. We illustrate this using chest pain data from the baseline presentation of 580 patients. The new two-stage clustering resulted in two subgroups that approximated the classic textbook descriptions of musculoskeletal chest pain and atypical angina chest pain. The traditional single-stage clustering resulted in five clusters that were also clinically recognisable but displayed less distinct differences. In this paper, a new approach to using clustering techniques to identify clinically useful subgroups of patients is suggested. Research designs, statistical methods and outcome metrics suitable for performing that testing are also described. This approach has potential benefits but requires broad testing, in multiple patient samples, to determine its clinical value. The usefulness of the approach is likely to be context-specific, depending on the characteristics of the available data and the research question being asked of it.

  20. [Space-time suicide clustering in the community of Antequera (Spain)].

    PubMed

    Pérez-Costillas, Lucía; Blasco-Fontecilla, Hilario; Benítez, Nicolás; Comino, Raquel; Antón, José Miguel; Ramos-Medina, Valentín; Lopez, Amalia; Palomo, José Luis; Madrigal, Lucía; Alcalde, Javier; Perea-Millá, Emilio; Artieda-Urrutia, Paula; de León-Martínez, Victoria; de Diego Otero, Yolanda

    2015-01-01

    Approximately 3,500 people commit suicide every year in Spain. The main aim of this study is to explore if a spatial and temporal clustering of suicide exists in the region of Antequera (Málaga, España). Sample and procedure: All suicides from January 1, 2004 to December 31, 2008 were identified using data from the Forensic Pathology Department of the Institute of Legal Medicine, Málaga (España). Geolocalisation. Google Earth was used to calculate the coordinates for each suicide decedent's address. Statistical analysis. A spatiotemporal permutation scan statistic and the Ripley's K function were used to explore spatiotemporal clustering. Pearson's chi-squared was used to determine whether there were differences between suicides inside and outside the spatiotemporal clusters. A total of 120 individuals committed suicide within the region of Antequera, of which 96 (80%) were included in our analyses. Statistically significant evidence for 7 spatiotemporal suicide clusters emerged within critical limits for the 0-2.5 km distance and for the first and second semanas (P<.05 in both cases) after suicide. There was not a single subject diagnosed with a current psychotic disorder, among suicides within clusters, whereas outside clusters, 20% had this diagnosis (X2=4.13; df=1; P<.05). There are spatiotemporal suicide clusters in the area surrounding Antequera. Patients diagnosed with current psychotic disorder are less likely to be influenced by the factors explaining suicide clustering. Copyright © 2013 SEP y SEPB. Published by Elsevier España. All rights reserved.

  1. Selection of the Maximum Spatial Cluster Size of the Spatial Scan Statistic by Using the Maximum Clustering Set-Proportion Statistic.

    PubMed

    Ma, Yue; Yin, Fei; Zhang, Tao; Zhou, Xiaohua Andrew; Li, Xiaosong

    2016-01-01

    Spatial scan statistics are widely used in various fields. The performance of these statistics is influenced by parameters, such as maximum spatial cluster size, and can be improved by parameter selection using performance measures. Current performance measures are based on the presence of clusters and are thus inapplicable to data sets without known clusters. In this work, we propose a novel overall performance measure called maximum clustering set-proportion (MCS-P), which is based on the likelihood of the union of detected clusters and the applied dataset. MCS-P was compared with existing performance measures in a simulation study to select the maximum spatial cluster size. Results of other performance measures, such as sensitivity and misclassification, suggest that the spatial scan statistic achieves accurate results in most scenarios with the maximum spatial cluster sizes selected using MCS-P. Given that previously known clusters are not required in the proposed strategy, selection of the optimal maximum cluster size with MCS-P can improve the performance of the scan statistic in applications without identified clusters.

  2. Selection of the Maximum Spatial Cluster Size of the Spatial Scan Statistic by Using the Maximum Clustering Set-Proportion Statistic

    PubMed Central

    Ma, Yue; Yin, Fei; Zhang, Tao; Zhou, Xiaohua Andrew; Li, Xiaosong

    2016-01-01

    Spatial scan statistics are widely used in various fields. The performance of these statistics is influenced by parameters, such as maximum spatial cluster size, and can be improved by parameter selection using performance measures. Current performance measures are based on the presence of clusters and are thus inapplicable to data sets without known clusters. In this work, we propose a novel overall performance measure called maximum clustering set–proportion (MCS-P), which is based on the likelihood of the union of detected clusters and the applied dataset. MCS-P was compared with existing performance measures in a simulation study to select the maximum spatial cluster size. Results of other performance measures, such as sensitivity and misclassification, suggest that the spatial scan statistic achieves accurate results in most scenarios with the maximum spatial cluster sizes selected using MCS-P. Given that previously known clusters are not required in the proposed strategy, selection of the optimal maximum cluster size with MCS-P can improve the performance of the scan statistic in applications without identified clusters. PMID:26820646

  3. A system for learning statistical motion patterns.

    PubMed

    Hu, Weiming; Xiao, Xuejuan; Fu, Zhouyu; Xie, Dan; Tan, Tieniu; Maybank, Steve

    2006-09-01

    Analysis of motion patterns is an effective approach for anomaly detection and behavior prediction. Current approaches for the analysis of motion patterns depend on known scenes, where objects move in predefined ways. It is highly desirable to automatically construct object motion patterns which reflect the knowledge of the scene. In this paper, we present a system for automatically learning motion patterns for anomaly detection and behavior prediction based on a proposed algorithm for robustly tracking multiple objects. In the tracking algorithm, foreground pixels are clustered using a fast accurate fuzzy K-means algorithm. Growing and prediction of the cluster centroids of foreground pixels ensure that each cluster centroid is associated with a moving object in the scene. In the algorithm for learning motion patterns, trajectories are clustered hierarchically using spatial and temporal information and then each motion pattern is represented with a chain of Gaussian distributions. Based on the learned statistical motion patterns, statistical methods are used to detect anomalies and predict behaviors. Our system is tested using image sequences acquired, respectively, from a crowded real traffic scene and a model traffic scene. Experimental results show the robustness of the tracking algorithm, the efficiency of the algorithm for learning motion patterns, and the encouraging performance of algorithms for anomaly detection and behavior prediction.

  4. Nonlinear dynamics of the cellular-automaton ``game of Life''

    NASA Astrophysics Data System (ADS)

    Garcia, J. B. C.; Gomes, M. A. F.; Jyh, T. I.; Ren, T. I.; Sales, T. R. M.

    1993-11-01

    A statistical analysis of the ``game of Life'' due to Conway [Berlekamp, Conway, and Guy, Winning Ways for Your Mathematical Plays (Academic, New York, 1982), Vol. 2] is reported. The results are based on extensive computer simulations starting with uncorrelated distributions of live sites at t=0. The number n(s,t) of clusters of s live sites at time t, the mean cluster size s¯(t), and the diversity of sizes among other statistical functions are obtained. The dependence of the statistical functions with the initial density of live sites is examined. Several scaling relations as well as static and dynamic critical exponents are found.

  5. Cluster analysis of particulate matter (PM10) and black carbon (BC) concentrations

    NASA Astrophysics Data System (ADS)

    Žibert, Janez; Pražnikar, Jure

    2012-09-01

    The monitoring of air-pollution constituents like particulate matter (PM10) and black carbon (BC) can provide information about air quality and the dynamics of emissions. Air quality depends on natural and anthropogenic sources of emissions as well as the weather conditions. For a one-year period the diurnal concentrations of PM10 and BC in the Port of Koper were analysed by clustering days into similar groups according to the similarity of the BC and PM10 hourly derived day-profiles without any prior assumptions about working and non-working days, weather conditions or hot and cold seasons. The analysis was performed by using k-means clustering with the squared Euclidean distance as the similarity measure. The analysis showed that 10 clusters in the BC case produced 3 clusters with just one member day and 7 clusters that encompasses more than one day with similar BC profiles. Similar results were found in the PM10 case, where one cluster has a single-member day, while 7 clusters contain several member days. The clustering analysis revealed that the clusters with less pronounced bimodal patterns and low hourly and average daily concentrations for both types of measurements include the most days in the one-year analysis. A typical day profile of the BC measurements includes a bimodal pattern with morning and evening peaks, while the PM10 measurements reveal a less pronounced bimodality. There are also clusters with single-peak day-profiles. The BC data in such cases exhibit morning peaks, while the PM10 data consist of noon or afternoon single peaks. Single pronounced peaks can be explained by appropriate cluster wind speed profiles. The analysis also revealed some special day-profiles. The BC cluster with a high midnight peak at 30/04/2010 and the PM10 cluster with the highest observed concentration of PM10 at 01/05/2010 (208.0 μg m-3) coincide with 1 May, which is a national holiday in Slovenia and has very strong tradition of bonfire parties. The clustering of the diurnal concentration showed that various different day-profiles are presented in a cold period, while this is not the case for the hot season. Additional analysis of ship traffic and rain fall data showed that there is no statistically significant difference between the ship gross (bruto) registered tonnage (BRT) values in the case of BC and PM10 clusters, but that there is statistically significant differences between the rain fall in the BC and PM10 clusters. The wind-rose for clusters which included most days in the sampling period indicating that emitted PM10 and BC from Port of Koper were manly transported in the west direction over the sea and in the east direction, where there is in no populated area. Presented analysis showed that both BC and PM10 concentrations were driven by rain intensity and wind speed.

  6. Hydrogeochemistry and water quality of the Kordkandi-Duzduzan plain, NW Iran: application of multivariate statistical analysis and PoS index.

    PubMed

    Soltani, Shahla; Asghari Moghaddam, Asghar; Barzegar, Rahim; Kazemian, Naeimeh; Tziritis, Evangelos

    2017-08-18

    Kordkandi-Duzduzan plain is one of the fertile plains of East Azarbaijan Province, NW of Iran. Groundwater is an important resource for drinking and agricultural purposes due to the lack of surface water resources in the region. The main objectives of the present study are to identify the hydrogeochemical processes and the potential sources of major, minor, and trace metals and metalloids such as Cr, Mn, Cd, Fe, Al, and As by using joint hydrogeochemical techniques and multivariate statistical analysis and to evaluate groundwater quality deterioration with the use of PoS environmental index. To achieve these objectives, 23 groundwater samples were collected in September 2015. Piper diagram shows that the mixed Ca-Mg-Cl is the dominant groundwater type, and some of the samples have Ca-HCO 3 , Ca-Cl, and Na-Cl types. Multivariate statistical analyses indicate that weathering and dissolution of different rocks and minerals, e.g., silicates, gypsum, and halite, ion exchange, and agricultural activities influence the hydrogeochemistry of the study area. The cluster analysis divides the samples into two distinct clusters which are completely different in EC (and its dependent variables such as Na + , K + , Ca 2+ , Mg 2+ , SO 4 2- , and Cl - ), Cd, and Cr variables according to the ANOVA statistical test. Based on the median values, the concentrations of pH, NO 3 - , SiO 2 , and As in cluster 1 are elevated compared with those of cluster 2, while their maximum values occur in cluster 2. According to the PoS index, the dominant parameter that controls quality deterioration is As, with 60% of contribution. Samples of lowest PoS values are located in the southern and northern parts (recharge area) while samples of the highest values are located in the discharge area and the eastern part.

  7. Clustering, randomness, and regularity in cloud fields. 4. Stratocumulus cloud fields

    NASA Astrophysics Data System (ADS)

    Lee, J.; Chou, J.; Weger, R. C.; Welch, R. M.

    1994-07-01

    To complete the analysis of the spatial distribution of boundary layer cloudiness, the present study focuses on nine stratocumulus Landsat scenes. The results indicate many similarities between stratocumulus and cumulus spatial distributions. Most notably, at full spatial resolution all scenes exhibit a decidedly clustered distribution. The strength of the clustering signal decreases with increasing cloud size; the clusters themselves consist of a few clouds (less than 10), occupy a small percentage of the cloud field area (less than 5%), contain between 20% and 60% of the cloud field population, and are randomly located within the scene. In contrast, stratocumulus in almost every respect are more strongly clustered than are cumulus cloud fields. For instance, stratocumulus clusters contain more clouds per cluster, occupy a larger percentage of the total area, and have a larger percentage of clouds participating in clusters than the corresponding cumulus examples. To investigate clustering at intermediate spatial scales, the local dimensionality statistic is introduced. Results obtained from this statistic provide the first direct evidence for regularity among large (>900 m in diameter) clouds in stratocumulus and cumulus cloud fields, in support of the inhibition hypothesis of Ramirez and Bras (1990). Also, the size compensated point-to-cloud cumulative distribution function statistic is found to be necessary to obtain a consistent description of stratocumulus cloud distributions. A hypothesis regarding the underlying physical mechanisms responsible for cloud clustering is presented. It is suggested that cloud clusters often arise from 4 to 10 triggering events localized within regions less than 2 km in diameter and randomly distributed within the cloud field. As the size of the cloud surpasses the scale of the triggering region, the clustering signal weakens and the larger cloud locations become more random.

  8. Clustering, randomness, and regularity in cloud fields. 4: Stratocumulus cloud fields

    NASA Technical Reports Server (NTRS)

    Lee, J.; Chou, J.; Weger, R. C.; Welch, R. M.

    1994-01-01

    To complete the analysis of the spatial distribution of boundary layer cloudiness, the present study focuses on nine stratocumulus Landsat scenes. The results indicate many similarities between stratocumulus and cumulus spatial distributions. Most notably, at full spatial resolution all scenes exhibit a decidedly clustered distribution. The strength of the clustering signal decreases with increasing cloud size; the clusters themselves consist of a few clouds (less than 10), occupy a small percentage of the cloud field area (less than 5%), contain between 20% and 60% of the cloud field population, and are randomly located within the scene. In contrast, stratocumulus in almost every respect are more strongly clustered than are cumulus cloud fields. For instance, stratocumulus clusters contain more clouds per cluster, occupy a larger percentage of the total area, and have a larger percentage of clouds participating in clusters than the corresponding cumulus examples. To investigate clustering at intermediate spatial scales, the local dimensionality statistic is introduced. Results obtained from this statistic provide the first direct evidence for regularity among large (more than 900 m in diameter) clouds in stratocumulus and cumulus cloud fields, in support of the inhibition hypothesis of Ramirez and Bras (1990). Also, the size compensated point-to-cloud cumulative distribution function statistic is found to be necessary to obtain a consistent description of stratocumulus cloud distributions. A hypothesis regarding the underlying physical mechanisms responsible for cloud clustering is presented. It is suggested that cloud clusters often arise from 4 to 10 triggering events localized within regions less than 2 km in diameter and randomly distributed within the cloud field. As the size of the cloud surpasses the scale of the triggering region, the clustering signal weakens and the larger cloud locations become more random.

  9. [Visual field progression in glaucoma: cluster analysis].

    PubMed

    Bresson-Dumont, H; Hatton, J; Foucher, J; Fonteneau, M

    2012-11-01

    Visual field progression analysis is one of the key points in glaucoma monitoring, but distinction between true progression and random fluctuation is sometimes difficult. There are several different algorithms but no real consensus for detecting visual field progression. The trend analysis of global indices (MD, sLV) may miss localized deficits or be affected by media opacities. Conversely, point-by-point analysis makes progression difficult to differentiate from physiological variability, particularly when the sensitivity of a point is already low. The goal of our study was to analyse visual field progression with the EyeSuite™ Octopus Perimetry Clusters algorithm in patients with no significant changes in global indices or worsening of the analysis of pointwise linear regression. We analyzed the visual fields of 162 eyes (100 patients - 58 women, 42 men, average age 66.8 ± 10.91) with ocular hypertension or glaucoma. For inclusion, at least six reliable visual fields per eye were required, and the trend analysis (EyeSuite™ Perimetry) of visual field global indices (MD and SLV), could show no significant progression. The analysis of changes in cluster mode was then performed. In a second step, eyes with statistically significant worsening of at least one of their clusters were analyzed point-by-point with the Octopus Field Analysis (OFA). Fifty four eyes (33.33%) had a significant worsening in some clusters, while their global indices remained stable over time. In this group of patients, more advanced glaucoma was present than in stable group (MD 6.41 dB vs. 2.87); 64.82% (35/54) of those eyes in which the clusters progressed, however, had no statistically significant change in the trend analysis by pointwise linear regression. Most software algorithms for analyzing visual field progression are essentially trend analyses of global indices, or point-by-point linear regression. This study shows the potential role of analysis by clusters trend. However, for best results, it is preferable to compare the analyses of several tests in combination with morphologic exam. Copyright © 2012 Elsevier Masson SAS. All rights reserved.

  10. Statistical detection of geographic clusters of resistant Escherichia coli in a regional network with WHONET and SaTScan

    PubMed Central

    Park, Rachel; O'Brien, Thomas F.; Huang, Susan S.; Baker, Meghan A.; Yokoe, Deborah S.; Kulldorff, Martin; Barrett, Craig; Swift, Jamie; Stelling, John

    2016-01-01

    Objectives While antimicrobial resistance threatens the prevention, treatment, and control of infectious diseases, systematic analysis of routine microbiology laboratory test results worldwide can alert new threats and promote timely response. This study explores statistical algorithms for recognizing geographic clustering of multi-resistant microbes within a healthcare network and monitoring the dissemination of new strains over time. Methods Escherichia coli antimicrobial susceptibility data from a three-year period stored in WHONET were analyzed across ten facilities in a healthcare network utilizing SaTScan's spatial multinomial model with two models for defining geographic proximity. We explored geographic clustering of multi-resistance phenotypes within the network and changes in clustering over time. Results Geographic clustering identified from both latitude/longitude and non-parametric facility groupings geographic models were similar, while the latter was offers greater flexibility and generalizability. Iterative application of the clustering algorithms suggested the possible recognition of the initial appearance of invasive E. coli ST131 in the clinical database of a single hospital and subsequent dissemination to others. Conclusion Systematic analysis of routine antimicrobial resistance susceptibility test results supports the recognition of geographic clustering of microbial phenotypic subpopulations with WHONET and SaTScan, and iterative application of these algorithms can detect the initial appearance in and dissemination across a region prompting early investigation, response, and containment measures. PMID:27530311

  11. Statistical analyses and characteristics of volcanic tremor on Stromboli Volcano (Italy)

    NASA Astrophysics Data System (ADS)

    Falsaperla, S.; Langer, H.; Spampinato, S.

    A study of volcanic tremor on Stromboli is carried out on the basis of data recorded daily between 1993 and 1995 by a permanent seismic station (STR) located 1.8km away from the active craters. We also consider the signal of a second station (TF1), which operated for a shorter time span. Changes in the spectral tremor characteristics can be related to modifications in volcanic activity, particularly to lava effusions and explosive sequences. Statistical analyses were carried out on a set of spectra calculated daily from seismic signals where explosion quakes were present or excluded. Principal component analysis and cluster analysis were applied to identify different classes of spectra. Three clusters of spectra are associated with two different states of volcanic activity. One cluster corresponds to a state of low to moderate activity, whereas the two other clusters are present during phases with a high magma column as inferred from the occurrence of lava fountains or effusions. We therefore conclude that variations in volcanic activity at Stromboli are usually linked to changes in the spectral characteristics of volcanic tremor. Site effects are evident when comparing the spectra calculated from signals synchronously recorded at STR and TF1. However, some major spectral peaks at both stations may reflect source properties. Statistical considerations and polarization analysis are in favor of a prevailing presence of P-waves in the tremor signal along with a position of the source northwest of the craters and at shallow depth.

  12. Defining syndromes using cattle meat inspection data for syndromic surveillance purposes: a statistical approach with the 2005-2010 data from ten French slaughterhouses.

    PubMed

    Dupuy, Céline; Morignat, Eric; Maugey, Xavier; Vinard, Jean-Luc; Hendrikx, Pascal; Ducrot, Christian; Calavas, Didier; Gay, Emilie

    2013-04-30

    The slaughterhouse is a central processing point for food animals and thus a source of both demographic data (age, breed, sex) and health-related data (reason for condemnation and condemned portions) that are not available through other sources. Using these data for syndromic surveillance is therefore tempting. However many possible reasons for condemnation and condemned portions exist, making the definition of relevant syndromes challenging.The objective of this study was to determine a typology of cattle with at least one portion of the carcass condemned in order to define syndromes. Multiple factor analysis (MFA) in combination with clustering methods was performed using both health-related data and demographic data. Analyses were performed on 381,186 cattle with at least one portion of the carcass condemned among the 1,937,917 cattle slaughtered in ten French abattoirs. Results of the MFA and clustering methods led to 12 clusters considered as stable according to year of slaughter and slaughterhouse. One cluster was specific to a disease of public health importance (cysticercosis). Two clusters were linked to the slaughtering process (fecal contamination of heart or lungs and deterioration lesions). Two clusters respectively characterized by chronic liver lesions and chronic peritonitis could be linked to diseases of economic importance to farmers. Three clusters could be linked respectively to reticulo-pericarditis, fatty liver syndrome and farmer's lung syndrome, which are related to both diseases of economic importance to farmers and herd management issues. Three clusters respectively characterized by arthritis, myopathy and Dark Firm Dry (DFD) meat could notably be linked to animal welfare issues. Finally, one cluster, characterized by bronchopneumonia, could be linked to both animal health and herd management issues. The statistical approach of combining multiple factor analysis with cluster analysis showed its relevance for the detection of syndromes using available large and complex slaughterhouse data. The advantages of this statistical approach are to i) define groups of reasons for condemnation based on meat inspection data, ii) help grouping reasons for condemnation among a list of various possible reasons for condemnation for which a consensus among experts could be difficult to reach, iii) assign each animal to a single syndrome which allows the detection of changes in trends of syndromes to detect unusual patterns in known diseases and emergence of new diseases.

  13. Cosmological Constraints from Galaxy Cluster Velocity Statistics

    NASA Astrophysics Data System (ADS)

    Bhattacharya, Suman; Kosowsky, Arthur

    2007-04-01

    Future microwave sky surveys will have the sensitivity to detect the kinematic Sunyaev-Zeldovich signal from moving galaxy clusters, thus providing a direct measurement of their line-of-sight peculiar velocity. We show that cluster peculiar velocity statistics applied to foreseeable surveys will put significant constraints on fundamental cosmological parameters. We consider three statistical quantities that can be constructed from a cluster peculiar velocity catalog: the probability density function, the mean pairwise streaming velocity, and the pairwise velocity dispersion. These quantities are applied to an envisioned data set that measures line-of-sight cluster velocities with normal errors of 100 km s-1 for all clusters with masses larger than 1014 Msolar over a sky area of up to 5000 deg2. A simple Fisher matrix analysis of this survey shows that the normalization of the matter power spectrum and the dark energy equation of state can be constrained to better than 10%, and that the Hubble constant and the primordial power spectrum index can be constrained to a few percent, independent of any other cosmological observations. We also find that the current constraint on the power spectrum normalization can be improved by more than a factor of 2 using data from a 400 deg2 survey and WMAP third-year priors. We also show how the constraints on cosmological parameters change if cluster velocities are measured with normal errors of 300 km s-1.

  14. Multiple imputation methods for bivariate outcomes in cluster randomised trials.

    PubMed

    DiazOrdaz, K; Kenward, M G; Gomes, M; Grieve, R

    2016-09-10

    Missing observations are common in cluster randomised trials. The problem is exacerbated when modelling bivariate outcomes jointly, as the proportion of complete cases is often considerably smaller than the proportion having either of the outcomes fully observed. Approaches taken to handling such missing data include the following: complete case analysis, single-level multiple imputation that ignores the clustering, multiple imputation with a fixed effect for each cluster and multilevel multiple imputation. We contrasted the alternative approaches to handling missing data in a cost-effectiveness analysis that uses data from a cluster randomised trial to evaluate an exercise intervention for care home residents. We then conducted a simulation study to assess the performance of these approaches on bivariate continuous outcomes, in terms of confidence interval coverage and empirical bias in the estimated treatment effects. Missing-at-random clustered data scenarios were simulated following a full-factorial design. Across all the missing data mechanisms considered, the multiple imputation methods provided estimators with negligible bias, while complete case analysis resulted in biased treatment effect estimates in scenarios where the randomised treatment arm was associated with missingness. Confidence interval coverage was generally in excess of nominal levels (up to 99.8%) following fixed-effects multiple imputation and too low following single-level multiple imputation. Multilevel multiple imputation led to coverage levels of approximately 95% throughout. © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd. © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.

  15. A statistical study of EMIC waves observed by Cluster. 1. Wave properties. EMIC Wave Properties

    DOE PAGES

    Allen, R. C.; Zhang, J. -C.; Kistler, L. M.; ...

    2015-07-23

    Electromagnetic ion cyclotron (EMIC) waves are an important mechanism for particle energization and losses inside the magnetosphere. In order to better understand the effects of these waves on particle dynamics, detailed information about the occurrence rate, wave power, ellipticity, normal angle, energy propagation angle distributions, and local plasma parameters are required. Previous statistical studies have used in situ observations to investigate the distribution of these parameters in the magnetic local time versus L-shell (MLT-L) frame within a limited magnetic latitude (MLAT) range. In our study, we present a statistical analysis of EMIC wave properties using 10 years (2001–2010) of datamore » from Cluster, totaling 25,431 min of wave activity. Due to the polar orbit of Cluster, we are able to investigate EMIC waves at all MLATs and MLTs. This allows us to further investigate the MLAT dependence of various wave properties inside different MLT sectors and further explore the effects of Shabansky orbits on EMIC wave generation and propagation. Thus, the statistical analysis is presented in two papers. OUr paper focuses on the wave occurrence distribution as well as the distribution of wave properties. The companion paper focuses on local plasma parameters during wave observations as well as wave generation proxies.« less

  16. A scoping review of spatial cluster analysis techniques for point-event data.

    PubMed

    Fritz, Charles E; Schuurman, Nadine; Robertson, Colin; Lear, Scott

    2013-05-01

    Spatial cluster analysis is a uniquely interdisciplinary endeavour, and so it is important to communicate and disseminate ideas, innovations, best practices and challenges across practitioners, applied epidemiology researchers and spatial statisticians. In this research we conducted a scoping review to systematically search peer-reviewed journal databases for research that has employed spatial cluster analysis methods on individual-level, address location, or x and y coordinate derived data. To illustrate the thematic issues raised by our results, methods were tested using a dataset where known clusters existed. Point pattern methods, spatial clustering and cluster detection tests, and a locally weighted spatial regression model were most commonly used for individual-level, address location data (n = 29). The spatial scan statistic was the most popular method for address location data (n = 19). Six themes were identified relating to the application of spatial cluster analysis methods and subsequent analyses, which we recommend researchers to consider; exploratory analysis, visualization, spatial resolution, aetiology, scale and spatial weights. It is our intention that researchers seeking direction for using spatial cluster analysis methods, consider the caveats and strengths of each approach, but also explore the numerous other methods available for this type of analysis. Applied spatial epidemiology researchers and practitioners should give special consideration to applying multiple tests to a dataset. Future research should focus on developing frameworks for selecting appropriate methods and the corresponding spatial weighting schemes.

  17. The Peculiarities in O-Type Galaxy Clusters

    NASA Astrophysics Data System (ADS)

    Panko, E. A.; Emelyanov, S. I.

    We present the results of analysis of 2D distribution of galaxies in galaxy cluster fields. The Catalogue of Galaxy Clusters and Groups PF (Panko & Flin) was used as input observational data set. We selected open rich PF galaxy clusters, containing 100 and more galaxies for our study. According to Panko classification scheme open galaxy clusters (O-type) have no concentration to the cluster center. The data set contains both pure O-type clusters and O-type clusters with overdence belts, namely OL and OF types. According to Rood & Sastry and Struble & Rood ideas, the open galaxy clusters are the beginning stage of cluster evolution. We found in the O-type clusters some types of statistically significant regular peculiarities, such as two crossed belts or curved strip. We suppose founded features connected with galaxy clusters evolution and the distribution of DM inside the clusters.

  18. A time-series approach for clustering farms based on slaughterhouse health aberration data.

    PubMed

    Hulsegge, B; de Greef, K H

    2018-05-01

    A large amount of data is collected routinely in meat inspection in pig slaughterhouses. A time series clustering approach is presented and applied that groups farms based on similar statistical characteristics of meat inspection data over time. A three step characteristic-based clustering approach was used from the idea that the data contain more info than the incidence figures. A stratified subset containing 511,645 pigs was derived as a study set from 3.5 years of meat inspection data. The monthly averages of incidence of pleuritis and of pneumonia of 44 Dutch farms (delivering 5149 batches to 2 pig slaughterhouses) were subjected to 1) derivation of farm level data characteristics 2) factor analysis and 3) clustering into groups of farms. The characteristic-based clustering was able to cluster farms for both lung aberrations. Three groups of data characteristics were informative, describing incidence, time pattern and degree of autocorrelation. The consistency of clustering similar farms was confirmed by repetition of the analysis in a larger dataset. The robustness of the clustering was tested on a substantially extended dataset. This confirmed the earlier results, three data distribution aspects make up the majority of distinction between groups of farms and in these groups (clusters) the majority of the farms was allocated comparable to the earlier allocation (75% and 62% for pleuritis and pneumonia, respectively). The difference between pleuritis and pneumonia in their seasonal dependency was confirmed, supporting the biological relevance of the clustering. Comparison of the identified clusters of statistically comparable farms can be used to detect farm level risk factors causing the health aberrations beyond comparison on disease incidence and trend alone. Copyright © 2018 Elsevier B.V. All rights reserved.

  19. Grouping of Bulgarian wines according to grape variety by using statistical methods

    NASA Astrophysics Data System (ADS)

    Milev, M.; Nikolova, Kr.; Ivanova, Ir.; Minkova, St.; Evtimov, T.; Krustev, St.

    2017-12-01

    68 different types of Bulgarian wines were studied in accordance with 9 optical parameters as follows: color parameters in XYZ and SIE Lab color systems, lightness, Hue angle, chroma, fluorescence intensity and emission wavelength. The main objective of this research is using hierarchical cluster analysis to evaluate the similarity and the distance between examined different types of Bulgarian wines and their grouping based on physical parameters. We have found that wines are grouped in clusters on the base of the degree of identity between them. There are two main clusters each one with two subclusters. The first one contains white wines and Sira, the second contains red wines and rose. The results from cluster analysis are presented graphically by a dendrogram. The other statistical technique used is factor analysis performed by the Method of Principal Components (PCA). The aim is to reduce the large number of variables to a few factors by grouping the correlated variables into one factor and subdividing the noncorrelated variables into different factors. Moreover the factor analysis provided the possibility to determine the parameters with the greatest influence over the distribution of samples in different clusters. In our study after the rotation of the factors with Varimax method the parameters were combined into two factors, which explain about 80 % of the total variation. The first one explains the 61.49% and correlates with color characteristics, the second one explains 18.34% from the variation and correlates with the parameters connected with fluorescence spectroscopy.

  20. Understanding Teacher Users of a Digital Library Service: A Clustering Approach

    ERIC Educational Resources Information Center

    Xu, Beijie

    2011-01-01

    This research examined teachers' online behaviors while using a digital library service--the Instructional Architect (IA)--through three consecutive studies. In the first two studies, a statistical model called latent class analysis (LCA) was applied to cluster different groups of IA teachers according to their diverse online behaviors. The third…

  1. Developing Appropriate Methods for Cost-Effectiveness Analysis of Cluster Randomized Trials

    PubMed Central

    Gomes, Manuel; Ng, Edmond S.-W.; Nixon, Richard; Carpenter, James; Thompson, Simon G.

    2012-01-01

    Aim. Cost-effectiveness analyses (CEAs) may use data from cluster randomized trials (CRTs), where the unit of randomization is the cluster, not the individual. However, most studies use analytical methods that ignore clustering. This article compares alternative statistical methods for accommodating clustering in CEAs of CRTs. Methods. Our simulation study compared the performance of statistical methods for CEAs of CRTs with 2 treatment arms. The study considered a method that ignored clustering—seemingly unrelated regression (SUR) without a robust standard error (SE)—and 4 methods that recognized clustering—SUR and generalized estimating equations (GEEs), both with robust SE, a “2-stage” nonparametric bootstrap (TSB) with shrinkage correction, and a multilevel model (MLM). The base case assumed CRTs with moderate numbers of balanced clusters (20 per arm) and normally distributed costs. Other scenarios included CRTs with few clusters, imbalanced cluster sizes, and skewed costs. Performance was reported as bias, root mean squared error (rMSE), and confidence interval (CI) coverage for estimating incremental net benefits (INBs). We also compared the methods in a case study. Results. Each method reported low levels of bias. Without the robust SE, SUR gave poor CI coverage (base case: 0.89 v. nominal level: 0.95). The MLM and TSB performed well in each scenario (CI coverage, 0.92–0.95). With few clusters, the GEE and SUR (with robust SE) had coverage below 0.90. In the case study, the mean INBs were similar across all methods, but ignoring clustering underestimated statistical uncertainty and the value of further research. Conclusions. MLMs and the TSB are appropriate analytical methods for CEAs of CRTs with the characteristics described. SUR and GEE are not recommended for studies with few clusters. PMID:22016450

  2. Spatial analysis for the epidemiological study of cardiovascular diseases: A systematic literature search.

    PubMed

    Mena, Carlos; Sepúlveda, Cesar; Fuentes, Eduardo; Ormazábal, Yony; Palomo, Iván

    2018-05-07

    Cardiovascular diseases (CVDs) are the primary cause of death and disability in de world, and the detection of populations at risk as well as localization of vulnerable areas is essential for adequate epidemiological management. Techniques developed for spatial analysis, among them geographical information systems and spatial statistics, such as cluster detection and spatial correlation, are useful for the study of the distribution of the CVDs. These techniques, enabling recognition of events at different geographical levels of study (e.g., rural, deprived neighbourhoods, etc.), make it possible to relate CVDs to factors present in the immediate environment. The systemic literature presented here shows that this group of diseases is clustered with regard to incidence, mortality and hospitalization as well as obesity, smoking, increased glycated haemoglobin levels, hypertension physical activity and age. In addition, acquired variables such as income, residency (rural or urban) and education, contribute to CVD clustering. Both local cluster detection and spatial regression techniques give statistical weight to the findings providing valuable information that can influence response mechanisms in the health services by indicating locations in need of intervention and assignment of available resources.

  3. Identification and characterization of near-fatal asthma phenotypes by cluster analysis.

    PubMed

    Serrano-Pariente, J; Rodrigo, G; Fiz, J A; Crespo, A; Plaza, V

    2015-09-01

    Near-fatal asthma (NFA) is a heterogeneous clinical entity and several profiles of patients have been described according to different clinical, pathophysiological and histological features. However, there are no previous studies that identify in a unbiased way--using statistical methods such as clusters analysis--different phenotypes of NFA. Therefore, the aim of the present study was to identify and to characterize phenotypes of near fatal asthma using a cluster analysis. Over a period of 2 years, 33 Spanish hospitals enrolled 179 asthmatics admitted for an episode of NFA. A cluster analysis using two-steps algorithm was performed from data of 84 of these cases. The analysis defined three clusters of patients with NFA: cluster 1, the largest, including older patients with clinical and therapeutic criteria of severe asthma; cluster 2, with an high proportion of respiratory arrest (68%), impaired consciousness level (82%) and mechanical ventilation (93%); and cluster 3, which included younger patients, characterized by an insufficient anti-inflammatory treatment and frequent sensitization to Alternaria alternata and soybean. These results identify specific asthma phenotypes involved in NFA, confirming in part previous findings observed in studies with a clinical approach. The identification of patients with a specific NFA phenotype could suggest interventions to prevent future severe asthma exacerbations. © 2015 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.

  4. Is It Feasible to Identify Natural Clusters of TSC-Associated Neuropsychiatric Disorders (TAND)?

    PubMed

    Leclezio, Loren; Gardner-Lubbe, Sugnet; de Vries, Petrus J

    2018-04-01

    Tuberous sclerosis complex (TSC) is a genetic disorder with multisystem involvement. The lifetime prevalence of TSC-Associated Neuropsychiatric Disorders (TAND) is in the region of 90% in an apparently unique, individual pattern. This "uniqueness" poses significant challenges for diagnosis, psycho-education, and intervention planning. To date, no studies have explored whether there may be natural clusters of TAND. The purpose of this feasibility study was (1) to investigate the practicability of identifying natural TAND clusters, and (2) to identify appropriate multivariate data analysis techniques for larger-scale studies. TAND Checklist data were collected from 56 individuals with a clinical diagnosis of TSC (n = 20 from South Africa; n = 36 from Australia). Using R, the open-source statistical platform, mean squared contingency coefficients were calculated to produce a correlation matrix, and various cluster analyses and exploratory factor analysis were examined. Ward's method rendered six TAND clusters with good face validity and significant convergence with a six-factor exploratory factor analysis solution. The "bottom-up" data-driven strategies identified a "scholastic" cluster of TAND manifestations, an "autism spectrum disorder-like" cluster, a "dysregulated behavior" cluster, a "neuropsychological" cluster, a "hyperactive/impulsive" cluster, and a "mixed/mood" cluster. These feasibility results suggest that a combination of cluster analysis and exploratory factor analysis methods may be able to identify clinically meaningful natural TAND clusters. Findings require replication and expansion in larger dataset, and could include quantification of cluster or factor scores at an individual level. Copyright © 2018 Elsevier Inc. All rights reserved.

  5. Clustangles: An Open Library for Clustering Angular Data.

    PubMed

    Sargsyan, Karen; Hua, Yun Hao; Lim, Carmay

    2015-08-24

    Dihedral angles are good descriptors of the numerous conformations visited by large, flexible systems, but their analysis requires directional statistics. A single package including the various multivariate statistical methods for angular data that accounts for the distinct topology of such data does not exist. Here, we present a lightweight standalone, operating-system independent package called Clustangles to fill this gap. Clustangles will be useful in analyzing the ever-increasing number of structures in the Protein Data Bank and clustering the copious conformations from increasingly long molecular dynamics simulations.

  6. Open star clusters and Galactic structure

    NASA Astrophysics Data System (ADS)

    Joshi, Yogesh C.

    2018-04-01

    In order to understand the Galactic structure, we perform a statistical analysis of the distribution of various cluster parameters based on an almost complete sample of Galactic open clusters yet available. The geometrical and physical characteristics of a large number of open clusters given in the MWSC catalogue are used to study the spatial distribution of clusters in the Galaxy and determine the scale height, solar offset, local mass density and distribution of reddening material in the solar neighbourhood. We also explored the mass-radius and mass-age relations in the Galactic open star clusters. We find that the estimated parameters of the Galactic disk are largely influenced by the choice of cluster sample.

  7. Disease clusters, exact distributions of maxima, and P-values.

    PubMed

    Grimson, R C

    1993-10-01

    This paper presents combinatorial (exact) methods that are useful in the analysis of disease cluster data obtained from small environments, such as buildings and neighbourhoods. Maxwell-Boltzmann and Fermi-Dirac occupancy models are compared in terms of appropriateness of representation of disease incidence patterns (space and/or time) in these environments. The methods are illustrated by a statistical analysis of the incidence pattern of bone fractures in a setting wherein fracture clustering was alleged to be occurring. One of the methodological results derived in this paper is the exact distribution of the maximum cell frequency in occupancy models.

  8. Genotyping and spatial analysis of pulmonary tuberculosis and diabetes cases in the state of Veracruz, Mexico.

    PubMed

    Blanco-Guillot, Francles; Castañeda-Cediel, M Lucía; Cruz-Hervert, Pablo; Ferreyra-Reyes, Leticia; Delgado-Sánchez, Guadalupe; Ferreira-Guerrero, Elizabeth; Montero-Campos, Rogelio; Bobadilla-Del-Valle, Miriam; Martínez-Gamboa, Rosa Areli; Torres-González, Pedro; Téllez-Vazquez, Norma; Canizales-Quintero, Sergio; Yanes-Lane, Mercedes; Mongua-Rodríguez, Norma; Ponce-de-León, Alfredo; Sifuentes-Osornio, José; García-García, Lourdes

    2018-01-01

    Genotyping and georeferencing in tuberculosis (TB) have been used to characterize the distribution of the disease and occurrence of transmission within specific groups and communities. The objective of this study was to test the hypothesis that diabetes mellitus (DM) and pulmonary TB may occur in spatial and molecular aggregations. Retrospective cohort study of patients with pulmonary TB. The study area included 12 municipalities in the Sanitary Jurisdiction of Orizaba, Veracruz, México. Patients with acid-fast bacilli in sputum smears and/or Mycobacterium tuberculosis in sputum cultures were recruited from 1995 to 2010. Clinical (standardized questionnaire, physical examination, chest X-ray, blood glucose test and HIV test), microbiological, epidemiological, and molecular evaluations were carried out. Patients were considered "genotype-clustered" if two or more isolates from different patients were identified within 12 months of each other and had six or more IS6110 bands in an identical pattern, or < 6 bands with identical IS6110 RFLP patterns and spoligotype with the same spacer oligonucleotides. Residential and health care centers addresses were georeferenced. We used a Jeep hand GPS. The coordinates were transferred from the GPS files to ArcGIS using ArcMap 9.3. We evaluated global spatial aggregation of patients in IS6110-RFLP/ spoligotype clusters using global Moran´s I. Since global distribution was not random, we evaluated "hotspots" using Getis-Ord Gi* statistic. Using bivariate and multivariate analysis we analyzed sociodemographic, behavioral, clinic and bacteriological conditions associated with "hotspots". We used STATA® v13.1 for all statistical analysis. From 1995 to 2010, 1,370 patients >20 years were diagnosed with pulmonary TB; 33% had DM. The proportion of isolates that were genotyped was 80.7% (n = 1105), of which 31% (n = 342) were grouped in 91 genotype clusters with 2 to 23 patients each; 65.9% of total clusters were small (2 members) involving 35.08% of patients. Twenty three (22.7) percent of cases were classified as recent transmission. Moran`s I indicated that distribution of patients in IS6110-RFLP/spoligotype clusters was not random (Moran`s I = 0.035468, Z value = 7.0, p = 0.00). Local spatial analysis showed statistically significant spatial aggregation of patients in IS6110-RFLP/spoligotype clusters identifying "hotspots" and "coldspots". GI* statistic showed that the hotspot for spatial clustering was located in Camerino Z. Mendoza municipality; 14.6% (50/342) of patients in genotype clusters were located in a hotspot; of these, 60% (30/50) lived with DM. Using logistic regression the statistically significant variables associated with hotspots were: DM [adjusted Odds Ratio (aOR) 7.04, 95% Confidence interval (CI) 3.03-16.38] and attending the health center in Camerino Z. Mendoza (aOR18.04, 95% CI 7.35-44.28). The combination of molecular and epidemiological information with geospatial data allowed us to identify the concurrence of molecular clustering and spatial aggregation of patients with DM and TB. This information may be highly useful for TB control programs.

  9. Geovisual analytics to enhance spatial scan statistic interpretation: an analysis of U.S. cervical cancer mortality

    PubMed Central

    Chen, Jin; Roth, Robert E; Naito, Adam T; Lengerich, Eugene J; MacEachren, Alan M

    2008-01-01

    Background Kulldorff's spatial scan statistic and its software implementation – SaTScan – are widely used for detecting and evaluating geographic clusters. However, two issues make using the method and interpreting its results non-trivial: (1) the method lacks cartographic support for understanding the clusters in geographic context and (2) results from the method are sensitive to parameter choices related to cluster scaling (abbreviated as scaling parameters), but the system provides no direct support for making these choices. We employ both established and novel geovisual analytics methods to address these issues and to enhance the interpretation of SaTScan results. We demonstrate our geovisual analytics approach in a case study analysis of cervical cancer mortality in the U.S. Results We address the first issue by providing an interactive visual interface to support the interpretation of SaTScan results. Our research to address the second issue prompted a broader discussion about the sensitivity of SaTScan results to parameter choices. Sensitivity has two components: (1) the method can identify clusters that, while being statistically significant, have heterogeneous contents comprised of both high-risk and low-risk locations and (2) the method can identify clusters that are unstable in location and size as the spatial scan scaling parameter is varied. To investigate cluster result stability, we conducted multiple SaTScan runs with systematically selected parameters. The results, when scanning a large spatial dataset (e.g., U.S. data aggregated by county), demonstrate that no single spatial scan scaling value is known to be optimal to identify clusters that exist at different scales; instead, multiple scans that vary the parameters are necessary. We introduce a novel method of measuring and visualizing reliability that facilitates identification of homogeneous clusters that are stable across analysis scales. Finally, we propose a logical approach to proceed through the analysis of SaTScan results. Conclusion The geovisual analytics approach described in this manuscript facilitates the interpretation of spatial cluster detection methods by providing cartographic representation of SaTScan results and by providing visualization methods and tools that support selection of SaTScan parameters. Our methods distinguish between heterogeneous and homogeneous clusters and assess the stability of clusters across analytic scales. Method We analyzed the cervical cancer mortality data for the United States aggregated by county between 2000 and 2004. We ran SaTScan on the dataset fifty times with different parameter choices. Our geovisual analytics approach couples SaTScan with our visual analytic platform, allowing users to interactively explore and compare SaTScan results produced by different parameter choices. The Standardized Mortality Ratio and reliability scores are visualized for all the counties to identify stable, homogeneous clusters. We evaluated our analysis result by comparing it to that produced by other independent techniques including the Empirical Bayes Smoothing and Kafadar spatial smoother methods. The geovisual analytics approach introduced here is developed and implemented in our Java-based Visual Inquiry Toolkit. PMID:18992163

  10. Geovisual analytics to enhance spatial scan statistic interpretation: an analysis of U.S. cervical cancer mortality.

    PubMed

    Chen, Jin; Roth, Robert E; Naito, Adam T; Lengerich, Eugene J; Maceachren, Alan M

    2008-11-07

    Kulldorff's spatial scan statistic and its software implementation - SaTScan - are widely used for detecting and evaluating geographic clusters. However, two issues make using the method and interpreting its results non-trivial: (1) the method lacks cartographic support for understanding the clusters in geographic context and (2) results from the method are sensitive to parameter choices related to cluster scaling (abbreviated as scaling parameters), but the system provides no direct support for making these choices. We employ both established and novel geovisual analytics methods to address these issues and to enhance the interpretation of SaTScan results. We demonstrate our geovisual analytics approach in a case study analysis of cervical cancer mortality in the U.S. We address the first issue by providing an interactive visual interface to support the interpretation of SaTScan results. Our research to address the second issue prompted a broader discussion about the sensitivity of SaTScan results to parameter choices. Sensitivity has two components: (1) the method can identify clusters that, while being statistically significant, have heterogeneous contents comprised of both high-risk and low-risk locations and (2) the method can identify clusters that are unstable in location and size as the spatial scan scaling parameter is varied. To investigate cluster result stability, we conducted multiple SaTScan runs with systematically selected parameters. The results, when scanning a large spatial dataset (e.g., U.S. data aggregated by county), demonstrate that no single spatial scan scaling value is known to be optimal to identify clusters that exist at different scales; instead, multiple scans that vary the parameters are necessary. We introduce a novel method of measuring and visualizing reliability that facilitates identification of homogeneous clusters that are stable across analysis scales. Finally, we propose a logical approach to proceed through the analysis of SaTScan results. The geovisual analytics approach described in this manuscript facilitates the interpretation of spatial cluster detection methods by providing cartographic representation of SaTScan results and by providing visualization methods and tools that support selection of SaTScan parameters. Our methods distinguish between heterogeneous and homogeneous clusters and assess the stability of clusters across analytic scales. We analyzed the cervical cancer mortality data for the United States aggregated by county between 2000 and 2004. We ran SaTScan on the dataset fifty times with different parameter choices. Our geovisual analytics approach couples SaTScan with our visual analytic platform, allowing users to interactively explore and compare SaTScan results produced by different parameter choices. The Standardized Mortality Ratio and reliability scores are visualized for all the counties to identify stable, homogeneous clusters. We evaluated our analysis result by comparing it to that produced by other independent techniques including the Empirical Bayes Smoothing and Kafadar spatial smoother methods. The geovisual analytics approach introduced here is developed and implemented in our Java-based Visual Inquiry Toolkit.

  11. CytoCluster: A Cytoscape Plugin for Cluster Analysis and Visualization of Biological Networks.

    PubMed

    Li, Min; Li, Dongyan; Tang, Yu; Wu, Fangxiang; Wang, Jianxin

    2017-08-31

    Nowadays, cluster analysis of biological networks has become one of the most important approaches to identifying functional modules as well as predicting protein complexes and network biomarkers. Furthermore, the visualization of clustering results is crucial to display the structure of biological networks. Here we present CytoCluster, a cytoscape plugin integrating six clustering algorithms, HC-PIN (Hierarchical Clustering algorithm in Protein Interaction Networks), OH-PIN (identifying Overlapping and Hierarchical modules in Protein Interaction Networks), IPCA (Identifying Protein Complex Algorithm), ClusterONE (Clustering with Overlapping Neighborhood Expansion), DCU (Detecting Complexes based on Uncertain graph model), IPC-MCE (Identifying Protein Complexes based on Maximal Complex Extension), and BinGO (the Biological networks Gene Ontology) function. Users can select different clustering algorithms according to their requirements. The main function of these six clustering algorithms is to detect protein complexes or functional modules. In addition, BinGO is used to determine which Gene Ontology (GO) categories are statistically overrepresented in a set of genes or a subgraph of a biological network. CytoCluster can be easily expanded, so that more clustering algorithms and functions can be added to this plugin. Since it was created in July 2013, CytoCluster has been downloaded more than 9700 times in the Cytoscape App store and has already been applied to the analysis of different biological networks. CytoCluster is available from http://apps.cytoscape.org/apps/cytocluster.

  12. CytoCluster: A Cytoscape Plugin for Cluster Analysis and Visualization of Biological Networks

    PubMed Central

    Li, Min; Li, Dongyan; Tang, Yu; Wang, Jianxin

    2017-01-01

    Nowadays, cluster analysis of biological networks has become one of the most important approaches to identifying functional modules as well as predicting protein complexes and network biomarkers. Furthermore, the visualization of clustering results is crucial to display the structure of biological networks. Here we present CytoCluster, a cytoscape plugin integrating six clustering algorithms, HC-PIN (Hierarchical Clustering algorithm in Protein Interaction Networks), OH-PIN (identifying Overlapping and Hierarchical modules in Protein Interaction Networks), IPCA (Identifying Protein Complex Algorithm), ClusterONE (Clustering with Overlapping Neighborhood Expansion), DCU (Detecting Complexes based on Uncertain graph model), IPC-MCE (Identifying Protein Complexes based on Maximal Complex Extension), and BinGO (the Biological networks Gene Ontology) function. Users can select different clustering algorithms according to their requirements. The main function of these six clustering algorithms is to detect protein complexes or functional modules. In addition, BinGO is used to determine which Gene Ontology (GO) categories are statistically overrepresented in a set of genes or a subgraph of a biological network. CytoCluster can be easily expanded, so that more clustering algorithms and functions can be added to this plugin. Since it was created in July 2013, CytoCluster has been downloaded more than 9700 times in the Cytoscape App store and has already been applied to the analysis of different biological networks. CytoCluster is available from http://apps.cytoscape.org/apps/cytocluster. PMID:28858211

  13. Evaluation of the Gini Coefficient in Spatial Scan Statistics for Detecting Irregularly Shaped Clusters

    PubMed Central

    Kim, Jiyu; Jung, Inkyung

    2017-01-01

    Spatial scan statistics with circular or elliptic scanning windows are commonly used for cluster detection in various applications, such as the identification of geographical disease clusters from epidemiological data. It has been pointed out that the method may have difficulty in correctly identifying non-compact, arbitrarily shaped clusters. In this paper, we evaluated the Gini coefficient for detecting irregularly shaped clusters through a simulation study. The Gini coefficient, the use of which in spatial scan statistics was recently proposed, is a criterion measure for optimizing the maximum reported cluster size. Our simulation study results showed that using the Gini coefficient works better than the original spatial scan statistic for identifying irregularly shaped clusters, by reporting an optimized and refined collection of clusters rather than a single larger cluster. We have provided a real data example that seems to support the simulation results. We think that using the Gini coefficient in spatial scan statistics can be helpful for the detection of irregularly shaped clusters. PMID:28129368

  14. The median hazard ratio: a useful measure of variance and general contextual effects in multilevel survival analysis.

    PubMed

    Austin, Peter C; Wagner, Philippe; Merlo, Juan

    2017-03-15

    Multilevel data occurs frequently in many research areas like health services research and epidemiology. A suitable way to analyze such data is through the use of multilevel regression models (MLRM). MLRM incorporate cluster-specific random effects which allow one to partition the total individual variance into between-cluster variation and between-individual variation. Statistically, MLRM account for the dependency of the data within clusters and provide correct estimates of uncertainty around regression coefficients. Substantively, the magnitude of the effect of clustering provides a measure of the General Contextual Effect (GCE). When outcomes are binary, the GCE can also be quantified by measures of heterogeneity like the Median Odds Ratio (MOR) calculated from a multilevel logistic regression model. Time-to-event outcomes within a multilevel structure occur commonly in epidemiological and medical research. However, the Median Hazard Ratio (MHR) that corresponds to the MOR in multilevel (i.e., 'frailty') Cox proportional hazards regression is rarely used. Analogously to the MOR, the MHR is the median relative change in the hazard of the occurrence of the outcome when comparing identical subjects from two randomly selected different clusters that are ordered by risk. We illustrate the application and interpretation of the MHR in a case study analyzing the hazard of mortality in patients hospitalized for acute myocardial infarction at hospitals in Ontario, Canada. We provide R code for computing the MHR. The MHR is a useful and intuitive measure for expressing cluster heterogeneity in the outcome and, thereby, estimating general contextual effects in multilevel survival analysis. © 2016 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd. © 2016 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.

  15. Examining the Effectiveness of Discriminant Function Analysis and Cluster Analysis in Species Identification of Male Field Crickets Based on Their Calling Songs

    PubMed Central

    Jaiswara, Ranjana; Nandi, Diptarup; Balakrishnan, Rohini

    2013-01-01

    Traditional taxonomy based on morphology has often failed in accurate species identification owing to the occurrence of cryptic species, which are reproductively isolated but morphologically identical. Molecular data have thus been used to complement morphology in species identification. The sexual advertisement calls in several groups of acoustically communicating animals are species-specific and can thus complement molecular data as non-invasive tools for identification. Several statistical tools and automated identifier algorithms have been used to investigate the efficiency of acoustic signals in species identification. Despite a plethora of such methods, there is a general lack of knowledge regarding the appropriate usage of these methods in specific taxa. In this study, we investigated the performance of two commonly used statistical methods, discriminant function analysis (DFA) and cluster analysis, in identification and classification based on acoustic signals of field cricket species belonging to the subfamily Gryllinae. Using a comparative approach we evaluated the optimal number of species and calling song characteristics for both the methods that lead to most accurate classification and identification. The accuracy of classification using DFA was high and was not affected by the number of taxa used. However, a constraint in using discriminant function analysis is the need for a priori classification of songs. Accuracy of classification using cluster analysis, which does not require a priori knowledge, was maximum for 6–7 taxa and decreased significantly when more than ten taxa were analysed together. We also investigated the efficacy of two novel derived acoustic features in improving the accuracy of identification. Our results show that DFA is a reliable statistical tool for species identification using acoustic signals. Our results also show that cluster analysis of acoustic signals in crickets works effectively for species classification and identification. PMID:24086666

  16. Relative risk estimates from spatial and space-time scan statistics: Are they biased?

    PubMed Central

    Prates, Marcos O.; Kulldorff, Martin; Assunção, Renato M.

    2014-01-01

    The purely spatial and space-time scan statistics have been successfully used by many scientists to detect and evaluate geographical disease clusters. Although the scan statistic has high power in correctly identifying a cluster, no study has considered the estimates of the cluster relative risk in the detected cluster. In this paper we evaluate whether there is any bias on these estimated relative risks. Intuitively, one may expect that the estimated relative risks has upward bias, since the scan statistic cherry picks high rate areas to include in the cluster. We show that this intuition is correct for clusters with low statistical power, but with medium to high power the bias becomes negligible. The same behaviour is not observed for the prospective space-time scan statistic, where there is an increasing conservative downward bias of the relative risk as the power to detect the cluster increases. PMID:24639031

  17. Prediction of operon-like gene clusters in the Arabidopsis thaliana genome based on co-expression analysis of neighboring genes.

    PubMed

    Wada, Masayoshi; Takahashi, Hiroki; Altaf-Ul-Amin, Md; Nakamura, Kensuke; Hirai, Masami Y; Ohta, Daisaku; Kanaya, Shigehiko

    2012-07-15

    Operon-like arrangements of genes occur in eukaryotes ranging from yeasts and filamentous fungi to nematodes, plants, and mammals. In plants, several examples of operon-like gene clusters involved in metabolic pathways have recently been characterized, e.g. the cyclic hydroxamic acid pathways in maize, the avenacin biosynthesis gene clusters in oat, the thalianol pathway in Arabidopsis thaliana, and the diterpenoid momilactone cluster in rice. Such operon-like gene clusters are defined by their co-regulation or neighboring positions within immediate vicinity of chromosomal regions. A comprehensive analysis of the expression of neighboring genes therefore accounts a crucial step to reveal the complete set of operon-like gene clusters within a genome. Genome-wide prediction of operon-like gene clusters should contribute to functional annotation efforts and provide novel insight into evolutionary aspects acquiring certain biological functions as well. We predicted co-expressed gene clusters by comparing the Pearson correlation coefficient of neighboring genes and randomly selected gene pairs, based on a statistical method that takes false discovery rate (FDR) into consideration for 1469 microarray gene expression datasets of A. thaliana. We estimated that A. thaliana contains 100 operon-like gene clusters in total. We predicted 34 statistically significant gene clusters consisting of 3 to 22 genes each, based on a stringent FDR threshold of 0.1. Functional relationships among genes in individual clusters were estimated by sequence similarity and functional annotation of genes. Duplicated gene pairs (determined based on BLAST with a cutoff of E<10(-5)) are included in 27 clusters. Five clusters are associated with metabolism, containing P450 genes restricted to the Brassica family and predicted to be involved in secondary metabolism. Operon-like clusters tend to include genes encoding bio-machinery associated with ribosomes, the ubiquitin/proteasome system, secondary metabolic pathways, lipid and fatty-acid metabolism, and the lipid transfer system. Copyright © 2012 Elsevier B.V. All rights reserved.

  18. Bias and inference from misspecified mixed-effect models in stepped wedge trial analysis.

    PubMed

    Thompson, Jennifer A; Fielding, Katherine L; Davey, Calum; Aiken, Alexander M; Hargreaves, James R; Hayes, Richard J

    2017-10-15

    Many stepped wedge trials (SWTs) are analysed by using a mixed-effect model with a random intercept and fixed effects for the intervention and time periods (referred to here as the standard model). However, it is not known whether this model is robust to misspecification. We simulated SWTs with three groups of clusters and two time periods; one group received the intervention during the first period and two groups in the second period. We simulated period and intervention effects that were either common-to-all or varied-between clusters. Data were analysed with the standard model or with additional random effects for period effect or intervention effect. In a second simulation study, we explored the weight given to within-cluster comparisons by simulating a larger intervention effect in the group of the trial that experienced both the control and intervention conditions and applying the three analysis models described previously. Across 500 simulations, we computed bias and confidence interval coverage of the estimated intervention effect. We found up to 50% bias in intervention effect estimates when period or intervention effects varied between clusters and were treated as fixed effects in the analysis. All misspecified models showed undercoverage of 95% confidence intervals, particularly the standard model. A large weight was given to within-cluster comparisons in the standard model. In the SWTs simulated here, mixed-effect models were highly sensitive to departures from the model assumptions, which can be explained by the high dependence on within-cluster comparisons. Trialists should consider including a random effect for time period in their SWT analysis model. © 2017 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd. © 2017 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.

  19. Using spatial analysis to demonstrate the heterogeneity of the cardiovascular drug-prescribing pattern in Taiwan

    PubMed Central

    2011-01-01

    Background Geographic Information Systems (GIS) combined with spatial analytical methods could be helpful in examining patterns of drug use. Little attention has been paid to geographic variation of cardiovascular prescription use in Taiwan. The main objective was to use local spatial association statistics to test whether or not the cardiovascular medication-prescribing pattern is homogenous across 352 townships in Taiwan. Methods The statistical methods used were the global measures of Moran's I and Local Indicators of Spatial Association (LISA). While Moran's I provides information on the overall spatial distribution of the data, LISA provides information on types of spatial association at the local level. LISA statistics can also be used to identify influential locations in spatial association analysis. The major classes of prescription cardiovascular drugs were taken from Taiwan's National Health Insurance Research Database (NHIRD), which has a coverage rate of over 97%. The dosage of each prescription was converted into defined daily doses to measure the consumption of each class of drugs. Data were analyzed with ArcGIS and GeoDa at the township level. Results The LISA statistics showed an unusual use of cardiovascular medications in the southern townships with high local variation. Patterns of drug use also showed more low-low spatial clusters (cold spots) than high-high spatial clusters (hot spots), and those low-low associations were clustered in the rural areas. Conclusions The cardiovascular drug prescribing patterns were heterogeneous across Taiwan. In particular, a clear pattern of north-south disparity exists. Such spatial clustering helps prioritize the target areas that require better education concerning drug use. PMID:21609462

  20. On the X-ray spectrum of the volume emissivity arising from Abell clusters

    NASA Technical Reports Server (NTRS)

    Stottlemyer, A. R.; Boldt, E. A.

    1984-01-01

    HEAO 1 A-2 X-ray spectra (2-15 keV) for an optically selected sample of Abell clusters of galaxies with z less than 0.1 have been analyzed to determine the energy dependence of the cosmological X-ray volume emissivity arising from such clusters. This spectrum is well fitted by an isothermal-bremsstrahlung model with kT = 7.4 + or - 1.5 KeV. This result is a test of the isothermal-volume-emissivity spectrum to be inferred from the conjecture that all contributing clusters may be characterized by kT = 7 keV, as assumed by McKee et al. (1980) in estimating the underlying luminosity function for the same sample. Although satisfied at the statistical level indicated, the analysis of a low-luminosity subsample suggests that this assumption of identical isothermal spectra would lead to a systematic error for a more statistically precise determination of the luminosity function's form.

  1. Clustering of health-related behaviors among early and mid-adolescents in Tuscany: results from a representative cross-sectional study

    PubMed Central

    Lazzeri, Giacomo; Panatto, Donatella; Domnich, Alexander; Arata, Lucia; Pammolli, Andrea; Simi, Rita; Giacchi, Mariano Vincenzo; Amicizia, Daniela; Gasparini, Roberto

    2018-01-01

    Abstract Background A huge amount of literature suggests that adolescents’ health-related behaviors tend to occur in clusters, and the understanding of such behavioral clustering may have direct implications for the effective tailoring of health-promotion interventions. Despite the usefulness of analyzing clustering, Italian data on this topic are scant. This study aimed to evaluate the clustering patterns of health-related behaviors. Methods The present study is based on data from the Health Behaviors in School-aged Children (HBSC) study conducted in Tuscany in 2010, which involved 3291 11-, 13- and 15-year olds. To aggregate students’ data on 22 health-related behaviors, factor analysis and subsequent cluster analysis were performed. Results Factor analysis revealed eight factors, which were dubbed in accordance with their main traits: ‘Alcohol drinking’, ‘Smoking’, ‘Physical activity’, ‘Screen time’, ‘Signs & symptoms’, ‘Healthy eating’, ‘Violence’ and ‘Sweet tooth’. These factors explained 67% of variance and underwent cluster analysis. A six-cluster κ-means solution was established with a 93.8% level of classification validity. The between-cluster differences in both mean age and gender distribution were highly statistically significant. Conclusions Health-compromising behaviors are common among Tuscan teens and occur in distinct clusters. These results may be used by schools, health-promotion authorities and other stakeholders to design and implement tailored preventive interventions in Tuscany. PMID:27908972

  2. Clustering of health-related behaviors among early and mid-adolescents in Tuscany: results from a representative cross-sectional study.

    PubMed

    Lazzeri, Giacomo; Panatto, Donatella; Domnich, Alexander; Arata, Lucia; Pammolli, Andrea; Simi, Rita; Giacchi, Mariano Vincenzo; Amicizia, Daniela; Gasparini, Roberto

    2018-03-01

    A huge amount of literature suggests that adolescents' health-related behaviors tend to occur in clusters, and the understanding of such behavioral clustering may have direct implications for the effective tailoring of health-promotion interventions. Despite the usefulness of analyzing clustering, Italian data on this topic are scant. This study aimed to evaluate the clustering patterns of health-related behaviors. The present study is based on data from the Health Behaviors in School-aged Children (HBSC) study conducted in Tuscany in 2010, which involved 3291 11-, 13- and 15-year olds. To aggregate students' data on 22 health-related behaviors, factor analysis and subsequent cluster analysis were performed. Factor analysis revealed eight factors, which were dubbed in accordance with their main traits: 'Alcohol drinking', 'Smoking', 'Physical activity', 'Screen time', 'Signs & symptoms', 'Healthy eating', 'Violence' and 'Sweet tooth'. These factors explained 67% of variance and underwent cluster analysis. A six-cluster κ-means solution was established with a 93.8% level of classification validity. The between-cluster differences in both mean age and gender distribution were highly statistically significant. Health-compromising behaviors are common among Tuscan teens and occur in distinct clusters. These results may be used by schools, health-promotion authorities and other stakeholders to design and implement tailored preventive interventions in Tuscany.

  3. Statistical Analysis of Small-Scale Magnetic Flux Emergence Patterns: A Useful Subsurface Diagnostic?

    NASA Astrophysics Data System (ADS)

    Lamb, Derek A.

    2016-10-01

    While sunspots follow a well-defined pattern of emergence in space and time, small-scale flux emergence is assumed to occur randomly at all times in the quiet Sun. HMI's full-disk coverage, high cadence, spatial resolution, and duty cycle allow us to probe that basic assumption. Some case studies of emergence suggest that temporal clustering on spatial scales of 50-150 Mm may occur. If clustering is present, it could serve as a diagnostic of large-scale subsurface magnetic field structures. We present the results of a manual survey of small-scale flux emergence events over a short time period, and a statistical analysis addressing the question of whether these events show spatio-temporal behavior that is anything other than random.

  4. An Approach to Cluster EU Member States into Groups According to Pathways of Salmonella in the Farm-to-Consumption Chain for Pork Products.

    PubMed

    Vigre, Håkan; Domingues, Ana Rita Coutinho Calado; Pedersen, Ulrik Bo; Hald, Tine

    2016-03-01

    The aim of the project as the cluster analysis was to in part to develop a generic structured quantitative microbiological risk assessment (QMRA) model of human salmonellosis due to pork consumption in EU member states (MSs), and the objective of the cluster analysis was to group the EU MSs according to the relative contribution of different pathways of Salmonella in the farm-to-consumption chain of pork products. In the development of the model, by selecting a case study MS from each cluster the model was developed to represent different aspects of pig production, pork production, and consumption of pork products across EU states. The objective of the cluster analysis was to aggregate MSs into groups of countries with similar importance of different pathways of Salmonella in the farm-to-consumption chain using available, and where possible, universal register data related to the pork production and consumption in each country. Based on MS-specific information about distribution of (i) small and large farms, (ii) small and large slaughterhouses, (iii) amount of pork meat consumed, and (iv) amount of sausages consumed we used nonhierarchical and hierarchical cluster analysis to group the MSs. The cluster solutions were validated internally using statistic measures and externally by comparing the clustered MSs with an estimated human incidence of salmonellosis due to pork products in the MSs. Finally, each cluster was characterized qualitatively using the centroids of the clusters. © 2016 Society for Risk Analysis.

  5. STATISTICAL TECHNIQUES FOR DETERMINATION AND PREDICTION OF FUNDAMENTAL FISH ASSEMBLAGES OF THE MID-ATLANTIC HIGHLANDS

    EPA Science Inventory

    A statistical software tool, Stream Fish Community Predictor (SFCP), based on EMAP stream sampling in the mid-Atlantic Highlands, was developed to predict stream fish communities using stream and watershed characteristics. Step one in the tool development was a cluster analysis t...

  6. Lagged segmented Poincaré plot analysis for risk stratification in patients with dilated cardiomyopathy.

    PubMed

    Voss, Andreas; Fischer, Claudia; Schroeder, Rico; Figulla, Hans R; Goernig, Matthias

    2012-07-01

    The objectives of this study were to introduce a new type of heart-rate variability analysis improving risk stratification in patients with idiopathic dilated cardiomyopathy (DCM) and to provide additional information about impaired heart beat generation in these patients. Beat-to-beat intervals (BBI) of 30-min ECGs recorded from 91 DCM patients and 21 healthy subjects were analyzed applying the lagged segmented Poincaré plot analysis (LSPPA) method. LSPPA includes the Poincaré plot reconstruction with lags of 1-100, rotating the cloud of points, its normalized segmentation adapted to their standard deviations, and finally, a frequency-dependent clustering. The lags were combined into eight different clusters representing specific frequency bands within 0.012-1.153 Hz. Statistical differences between low- and high-risk DCM could be found within the clusters II-VIII (e.g., cluster IV: 0.033-0.038 Hz; p = 0.0002; sensitivity = 85.7 %; specificity = 71.4 %). The multivariate statistics led to a sensitivity of 92.9 %, specificity of 85.7 % and an area under the curve of 92.1 % discriminating these patient groups. We introduced the LSPPA method to investigate time correlations in BBI time series. We found that LSPPA contributes considerably to risk stratification in DCM and yields the highest discriminant power in the low and very low-frequency bands.

  7. An investigation on thermal patterns in Iran based on spatial autocorrelation

    NASA Astrophysics Data System (ADS)

    Fallah Ghalhari, Gholamabbas; Dadashi Roudbari, Abbasali

    2018-02-01

    The present study aimed at investigating temporal-spatial patterns and monthly patterns of temperature in Iran using new spatial statistical methods such as cluster and outlier analysis, and hotspot analysis. To do so, climatic parameters, monthly average temperature of 122 synoptic stations, were assessed. Statistical analysis showed that January with 120.75% had the most fluctuation among the studied months. Global Moran's Index revealed that yearly changes of temperature in Iran followed a strong spatially clustered pattern. Findings showed that the biggest thermal cluster pattern in Iran, 0.975388, occurred in May. Cluster and outlier analyses showed that thermal homogeneity in Iran decreases in cold months, while it increases in warm months. This is due to the radiation angle and synoptic systems which strongly influence thermal order in Iran. The elevations, however, have the most notable part proved by Geographically weighted regression model. Iran's thermal analysis through hotspot showed that hot thermal patterns (very hot, hot, and semi-hot) were dominant in the South, covering an area of 33.5% (about 552,145.3 km2). Regions such as mountain foot and low lands lack any significant spatial autocorrelation, 25.2% covering about 415,345.1 km2. The last is the cold thermal area (very cold, cold, and semi-cold) with about 25.2% covering about 552,145.3 km2 of the whole area of Iran.

  8. Automated classification of mouse pup isolation syllables: from cluster analysis to an Excel-based "mouse pup syllable classification calculator".

    PubMed

    Grimsley, Jasmine M S; Gadziola, Marie A; Wenstrup, Jeffrey J

    2012-01-01

    Mouse pups vocalize at high rates when they are cold or isolated from the nest. The proportions of each syllable type produced carry information about disease state and are being used as behavioral markers for the internal state of animals. Manual classifications of these vocalizations identified 10 syllable types based on their spectro-temporal features. However, manual classification of mouse syllables is time consuming and vulnerable to experimenter bias. This study uses an automated cluster analysis to identify acoustically distinct syllable types produced by CBA/CaJ mouse pups, and then compares the results to prior manual classification methods. The cluster analysis identified two syllable types, based on their frequency bands, that have continuous frequency-time structure, and two syllable types featuring abrupt frequency transitions. Although cluster analysis computed fewer syllable types than manual classification, the clusters represented well the probability distributions of the acoustic features within syllables. These probability distributions indicate that some of the manually classified syllable types are not statistically distinct. The characteristics of the four classified clusters were used to generate a Microsoft Excel-based mouse syllable classifier that rapidly categorizes syllables, with over a 90% match, into the syllable types determined by cluster analysis.

  9. Multifractal Approach to Time Clustering of Earthquakes. Application to Mt. Vesuvio Seismicity

    NASA Astrophysics Data System (ADS)

    Codano, C.; Alonzo, M. L.; Vilardo, G.

    The clustering structure of the Vesuvian earthquakes occurring is investigated by means of statistical tools: the inter-event time distribution, the running mean and the multifractal analysis. The first cannot clearly distinguish between a Poissonian process and a clustered one due to the difficulties of clearly distinguishing between an exponential distribution and a power law one. The running mean test reveals the clustering of the earthquakes, but looses information about the structure of the distribution at global scales. The multifractal approach can enlighten the clustering at small scales, while the global behaviour remains Poissonian. Subsequently the clustering of the events is interpreted in terms of diffusive processes of the stress in the earth crust.

  10. Detecting Genomic Clustering of Risk Variants from Sequence Data: Cases vs. Controls

    PubMed Central

    Schaid, Daniel J.; Sinnwell, Jason P.; McDonnell, Shannon K.; Thibodeau, Stephen N.

    2013-01-01

    As the ability to measure dense genetic markers approaches the limit of the DNA sequence itself, taking advantage of possible clustering of genetic variants in, and around, a gene would benefit genetic association analyses, and likely provide biological insights. The greatest benefit might be realized when multiple rare variants cluster in a functional region. Several statistical tests have been developed, one of which is based on the popular Kulldorff scan statistic for spatial clustering of disease. We extended another popular spatial clustering method – Tango’s statistic – to genomic sequence data. An advantage of Tango’s method is that it is rapid to compute, and when single test statistic is computed, its distribution is well approximated by a scaled chi-square distribution, making computation of p-values very rapid. We compared the Type-I error rates and power of several clustering statistics, as well as the omnibus sequence kernel association test (SKAT). Although our version of Tango’s statistic, which we call “Kernel Distance” statistic, took approximately half the time to compute than the Kulldorff scan statistic, it had slightly less power than the scan statistic. Our results showed that the Ionita-Laza version of Kulldorff’s scan statistic had the greatest power over a range of clustering scenarios. PMID:23842950

  11. Sequential analysis of hydrochemical data for watershed characterization.

    PubMed

    Thyne, Geoffrey; Güler, Cüneyt; Poeter, Eileen

    2004-01-01

    A methodology for characterizing the hydrogeology of watersheds using hydrochemical data that combine statistical, geochemical, and spatial techniques is presented. Surface water and ground water base flow and spring runoff samples (180 total) from a single watershed are first classified using hierarchical cluster analysis. The statistical clusters are analyzed for spatial coherence confirming that the clusters have a geological basis corresponding to topographic flowpaths and showing that the fractured rock aquifer behaves as an equivalent porous medium on the watershed scale. Then principal component analysis (PCA) is used to determine the sources of variation between parameters. PCA analysis shows that the variations within the dataset are related to variations in calcium, magnesium, SO4, and HCO3, which are derived from natural weathering reactions, and pH, NO3, and chlorine, which indicate anthropogenic impact. PHREEQC modeling is used to quantitatively describe the natural hydrochemical evolution for the watershed and aid in discrimination of samples that have an anthropogenic component. Finally, the seasonal changes in the water chemistry of individual sites were analyzed to better characterize the spatial variability of vertical hydraulic conductivity. The integrated result provides a method to characterize the hydrogeology of the watershed that fully utilizes traditional data.

  12. Resemblance profiles as clustering decision criteria: Estimating statistical power, error, and correspondence for a hypothesis test for multivariate structure.

    PubMed

    Kilborn, Joshua P; Jones, David L; Peebles, Ernst B; Naar, David F

    2017-04-01

    Clustering data continues to be a highly active area of data analysis, and resemblance profiles are being incorporated into ecological methodologies as a hypothesis testing-based approach to clustering multivariate data. However, these new clustering techniques have not been rigorously tested to determine the performance variability based on the algorithm's assumptions or any underlying data structures. Here, we use simulation studies to estimate the statistical error rates for the hypothesis test for multivariate structure based on dissimilarity profiles (DISPROF). We concurrently tested a widely used algorithm that employs the unweighted pair group method with arithmetic mean (UPGMA) to estimate the proficiency of clustering with DISPROF as a decision criterion. We simulated unstructured multivariate data from different probability distributions with increasing numbers of objects and descriptors, and grouped data with increasing overlap, overdispersion for ecological data, and correlation among descriptors within groups. Using simulated data, we measured the resolution and correspondence of clustering solutions achieved by DISPROF with UPGMA against the reference grouping partitions used to simulate the structured test datasets. Our results highlight the dynamic interactions between dataset dimensionality, group overlap, and the properties of the descriptors within a group (i.e., overdispersion or correlation structure) that are relevant to resemblance profiles as a clustering criterion for multivariate data. These methods are particularly useful for multivariate ecological datasets that benefit from distance-based statistical analyses. We propose guidelines for using DISPROF as a clustering decision tool that will help future users avoid potential pitfalls during the application of methods and the interpretation of results.

  13. Genotyping and spatial analysis of pulmonary tuberculosis and diabetes cases in the state of Veracruz, Mexico

    PubMed Central

    Blanco-Guillot, Francles; Ferreyra-Reyes, Leticia; Delgado-Sánchez, Guadalupe; Ferreira-Guerrero, Elizabeth; Montero-Campos, Rogelio; Bobadilla-del-Valle, Miriam; Martínez-Gamboa, Rosa Areli; Torres-González, Pedro; Téllez-Vazquez, Norma; Canizales-Quintero, Sergio; Yanes-Lane, Mercedes; Mongua-Rodríguez, Norma; Ponce-de-León, Alfredo; Sifuentes-Osornio, José

    2018-01-01

    Background Genotyping and georeferencing in tuberculosis (TB) have been used to characterize the distribution of the disease and occurrence of transmission within specific groups and communities. Objective The objective of this study was to test the hypothesis that diabetes mellitus (DM) and pulmonary TB may occur in spatial and molecular aggregations. Material and methods Retrospective cohort study of patients with pulmonary TB. The study area included 12 municipalities in the Sanitary Jurisdiction of Orizaba, Veracruz, México. Patients with acid-fast bacilli in sputum smears and/or Mycobacterium tuberculosis in sputum cultures were recruited from 1995 to 2010. Clinical (standardized questionnaire, physical examination, chest X-ray, blood glucose test and HIV test), microbiological, epidemiological, and molecular evaluations were carried out. Patients were considered “genotype-clustered” if two or more isolates from different patients were identified within 12 months of each other and had six or more IS6110 bands in an identical pattern, or < 6 bands with identical IS6110 RFLP patterns and spoligotype with the same spacer oligonucleotides. Residential and health care centers addresses were georeferenced. We used a Jeep hand GPS. The coordinates were transferred from the GPS files to ArcGIS using ArcMap 9.3. We evaluated global spatial aggregation of patients in IS6110-RFLP/ spoligotype clusters using global Moran´s I. Since global distribution was not random, we evaluated “hotspots” using Getis-Ord Gi* statistic. Using bivariate and multivariate analysis we analyzed sociodemographic, behavioral, clinic and bacteriological conditions associated with “hotspots”. We used STATA® v13.1 for all statistical analysis. Results From 1995 to 2010, 1,370 patients >20 years were diagnosed with pulmonary TB; 33% had DM. The proportion of isolates that were genotyped was 80.7% (n = 1105), of which 31% (n = 342) were grouped in 91 genotype clusters with 2 to 23 patients each; 65.9% of total clusters were small (2 members) involving 35.08% of patients. Twenty three (22.7) percent of cases were classified as recent transmission. Moran`s I indicated that distribution of patients in IS6110-RFLP/spoligotype clusters was not random (Moran`s I = 0.035468, Z value = 7.0, p = 0.00). Local spatial analysis showed statistically significant spatial aggregation of patients in IS6110-RFLP/spoligotype clusters identifying “hotspots” and “coldspots”. GI* statistic showed that the hotspot for spatial clustering was located in Camerino Z. Mendoza municipality; 14.6% (50/342) of patients in genotype clusters were located in a hotspot; of these, 60% (30/50) lived with DM. Using logistic regression the statistically significant variables associated with hotspots were: DM [adjusted Odds Ratio (aOR) 7.04, 95% Confidence interval (CI) 3.03–16.38] and attending the health center in Camerino Z. Mendoza (aOR18.04, 95% CI 7.35–44.28). Conclusions The combination of molecular and epidemiological information with geospatial data allowed us to identify the concurrence of molecular clustering and spatial aggregation of patients with DM and TB. This information may be highly useful for TB control programs. PMID:29534104

  14. Spatial distribution and cluster analysis of risky sexual behaviours and STDs reported by Chinese adults in Guangzhou, China: a representative population-based study

    PubMed Central

    Chen, Wen; Zhou, Fangjing; Hall, Brian J; Wang, Yu; Latkin, Carl; Ling, Li; Tucker, Joseph D

    2016-01-01

    Objectives To assess associations between residences location, risky sexual behaviours and sexually transmitted diseases (STDs) among adults living in Guangzhou, China. Methods Data were obtained from 751 Chinese adults aged 18–59 years in Guangzhou, China, using stratified random sampling by using spatial epidemiological methods. Face-to-face household interviews were conducted to collect self-report data on risky sexual behaviours and diagnosed STDs. Kulldorff’s spatial scan statistic was implemented to identify and detect spatial distribution and clusters of risky sexual behaviours and STDs. The presence and location of statistically significant clusters were mapped in the study areas using ArcGIS software. Results The prevalence of self-reported risky sexual behaviours was between 5.1% and 50.0%. The self-reported lifetime prevalence of diagnosed STDs was 7.06%. Anal intercourse clustered in an area located along the border within the rural–urban continuum (p=0.001). High rate clusters for alcohol or other drugs using before sex (p=0.008) and migrants who lived in Guangzhou <1 year (p=0.007) overlapped this cluster. Excess cases for unprotected sex (p=0.031) overlapped the cluster for college students (p<0.001). Five of nine (55.6%) students who had sexual experience during the last 12 months located in the cluster of unprotected sex. Conclusions Short-term migrants and college students reported greater risky sexual behaviours. Programmes to increase safer sex within these communities to reduce the risk of STDs are warranted in Guangzhou. Spatial analysis identified geographical clusters of risky sexual behaviours, which is critical for optimising surveillance and targeting control measures for these locations in the future. PMID:26843400

  15. Clusters in the distribution of pulsars in period, pulse-width, and age. [statistical analysis/statistical distributions

    NASA Technical Reports Server (NTRS)

    Baker, K. B.; Sturrock, P. A.

    1975-01-01

    The question of whether pulsars form a single group or whether pulsars come in two or more different groups is discussed. It is proposed that such groups might be related to several factors such as the initial creation of the neutron star, or the orientation of the magnetic field axis with the spin axis. Various statistical models are examined.

  16. Reconstruction of sediment transport pathways in modern microtidal sand flat by multiple classification analysis

    NASA Astrophysics Data System (ADS)

    Yamashita, S.; Nakajo, T.; Naruse, H.

    2009-12-01

    In this study, we statistically classified the grain size distribution of the bottom surface sediment on a microtidal sand flat to analyze the depositional processes of the sediment. Multiple classification analysis revealed that two types of sediment populations exist in the bottom surface sediment. Then, we employed the sediment trend model developed by Gao and Collins (1992) for the estimation of sediment transport pathways. As a result, we found that statistical discrimination of the bottom surface sediment provides useful information for the sediment trend model while dealing with various types of sediment transport processes. The microtidal sand flat along the Kushida River estuary, Ise Bay, central Japan, was investigated, and 102 bottom surface sediment samples were obtained. Then, their grain size distribution patterns were measured by the settling tube method, and each grain size distribution parameter (mud and gravel contents, mean grain size, coefficient of variance (CV), skewness, kurtosis, 5, 25, 50, 75, and 95 percentile) was calculated. Here, CV is the normalized sorting value divided by the mean grain size. Two classical statistical methods—principal component analysis (PCA) and fuzzy cluster analysis—were applied. The results of PCA showed that the bottom surface sediment of the study area is mainly characterized by grain size (mean grain size and 5-95 percentile) and the CV value, indicating predominantly large absolute values of factor loadings in primal component (PC) 1. PC1 is interpreted as being indicative of the grain-size trend, in which a finer grain-size distribution indicates better size sorting. The frequency distribution of PC1 has a bimodal shape and suggests the existence of two types of sediment populations. Therefore, we applied fuzzy cluster analysis, the results of which revealed two groupings of the sediment (Cluster 1 and Cluster 2). Cluster 1 shows a lower value of PC1, indicating coarse and poorly sorted sediments. Cluster 1 sediments are distributed around the branched channel from Kushida River and show an expanding distribution from the river mouth toward the northeast direction. Cluster 2 shows a higher value of PC1, indicating fine and well-sorted sediments; this cluster is distributed in a distant area from the river mouth, including the offshore region. Therefore, Cluster 1 and Cluster 2 are interpreted as being deposited by fluvial and wave processes, respectively. Finally, on the basis of this distribution pattern, the sediment trend model was applied in areas dominated separately by fluvial and wave processes. Resultant sediment transport patterns showed good agreement with those obtained by field observations. The results of this study provide an important insight into the numerical models of sediment transport.

  17. Connectivity-based fixel enhancement: Whole-brain statistical analysis of diffusion MRI measures in the presence of crossing fibres

    PubMed Central

    Raffelt, David A.; Smith, Robert E.; Ridgway, Gerard R.; Tournier, J-Donald; Vaughan, David N.; Rose, Stephen; Henderson, Robert; Connelly, Alan

    2015-01-01

    In brain regions containing crossing fibre bundles, voxel-average diffusion MRI measures such as fractional anisotropy (FA) are difficult to interpret, and lack within-voxel single fibre population specificity. Recent work has focused on the development of more interpretable quantitative measures that can be associated with a specific fibre population within a voxel containing crossing fibres (herein we use fixel to refer to a specific fibre population within a single voxel). Unfortunately, traditional 3D methods for smoothing and cluster-based statistical inference cannot be used for voxel-based analysis of these measures, since the local neighbourhood for smoothing and cluster formation can be ambiguous when adjacent voxels may have different numbers of fixels, or ill-defined when they belong to different tracts. Here we introduce a novel statistical method to perform whole-brain fixel-based analysis called connectivity-based fixel enhancement (CFE). CFE uses probabilistic tractography to identify structurally connected fixels that are likely to share underlying anatomy and pathology. Probabilistic connectivity information is then used for tract-specific smoothing (prior to the statistical analysis) and enhancement of the statistical map (using a threshold-free cluster enhancement-like approach). To investigate the characteristics of the CFE method, we assessed sensitivity and specificity using a large number of combinations of CFE enhancement parameters and smoothing extents, using simulated pathology generated with a range of test-statistic signal-to-noise ratios in five different white matter regions (chosen to cover a broad range of fibre bundle features). The results suggest that CFE input parameters are relatively insensitive to the characteristics of the simulated pathology. We therefore recommend a single set of CFE parameters that should give near optimal results in future studies where the group effect is unknown. We then demonstrate the proposed method by comparing apparent fibre density between motor neurone disease (MND) patients with control subjects. The MND results illustrate the benefit of fixel-specific statistical inference in white matter regions that contain crossing fibres. PMID:26004503

  18. Astrophysical properties of star clusters in the Magellanic Clouds homogeneously estimated by ASteCA

    NASA Astrophysics Data System (ADS)

    Perren, G. I.; Piatti, A. E.; Vázquez, R. A.

    2017-06-01

    Aims: We seek to produce a homogeneous catalog of astrophysical parameters of 239 resolved star clusters, located in the Small and Large Magellanic Clouds, observed in the Washington photometric system. Methods: The cluster sample was processed with the recently introduced Automated Stellar Cluster Analysis (ASteCA) package, which ensures both an automatized and a fully reproducible treatment, together with a statistically based analysis of their fundamental parameters and associated uncertainties. The fundamental parameters determined for each cluster with this tool, via a color-magnitude diagram (CMD) analysis, are metallicity, age, reddening, distance modulus, and total mass. Results: We generated a homogeneous catalog of structural and fundamental parameters for the studied cluster sample and performed a detailed internal error analysis along with a thorough comparison with values taken from 26 published articles. We studied the distribution of cluster fundamental parameters in both Clouds and obtained their age-metallicity relationships. Conclusions: The ASteCA package can be applied to an unsupervised determination of fundamental cluster parameters, which is a task of increasing relevance as more data becomes available through upcoming surveys. A table with the estimated fundamental parameters for the 239 clusters analyzed is only available at the CDS via anonymous ftp to http://cdsarc.u-strasbg.fr (http://130.79.128.5) or via http://cdsarc.u-strasbg.fr/viz-bin/qcat?J/A+A/602/A89

  19. Defining syndromes using cattle meat inspection data for syndromic surveillance purposes: a statistical approach with the 2005–2010 data from ten French slaughterhouses

    PubMed Central

    2013-01-01

    Background The slaughterhouse is a central processing point for food animals and thus a source of both demographic data (age, breed, sex) and health-related data (reason for condemnation and condemned portions) that are not available through other sources. Using these data for syndromic surveillance is therefore tempting. However many possible reasons for condemnation and condemned portions exist, making the definition of relevant syndromes challenging. The objective of this study was to determine a typology of cattle with at least one portion of the carcass condemned in order to define syndromes. Multiple factor analysis (MFA) in combination with clustering methods was performed using both health-related data and demographic data. Results Analyses were performed on 381,186 cattle with at least one portion of the carcass condemned among the 1,937,917 cattle slaughtered in ten French abattoirs. Results of the MFA and clustering methods led to 12 clusters considered as stable according to year of slaughter and slaughterhouse. One cluster was specific to a disease of public health importance (cysticercosis). Two clusters were linked to the slaughtering process (fecal contamination of heart or lungs and deterioration lesions). Two clusters respectively characterized by chronic liver lesions and chronic peritonitis could be linked to diseases of economic importance to farmers. Three clusters could be linked respectively to reticulo-pericarditis, fatty liver syndrome and farmer’s lung syndrome, which are related to both diseases of economic importance to farmers and herd management issues. Three clusters respectively characterized by arthritis, myopathy and Dark Firm Dry (DFD) meat could notably be linked to animal welfare issues. Finally, one cluster, characterized by bronchopneumonia, could be linked to both animal health and herd management issues. Conclusion The statistical approach of combining multiple factor analysis with cluster analysis showed its relevance for the detection of syndromes using available large and complex slaughterhouse data. The advantages of this statistical approach are to i) define groups of reasons for condemnation based on meat inspection data, ii) help grouping reasons for condemnation among a list of various possible reasons for condemnation for which a consensus among experts could be difficult to reach, iii) assign each animal to a single syndrome which allows the detection of changes in trends of syndromes to detect unusual patterns in known diseases and emergence of new diseases. PMID:23628140

  20. Linear regression models and k-means clustering for statistical analysis of fNIRS data.

    PubMed

    Bonomini, Viola; Zucchelli, Lucia; Re, Rebecca; Ieva, Francesca; Spinelli, Lorenzo; Contini, Davide; Paganoni, Anna; Torricelli, Alessandro

    2015-02-01

    We propose a new algorithm, based on a linear regression model, to statistically estimate the hemodynamic activations in fNIRS data sets. The main concern guiding the algorithm development was the minimization of assumptions and approximations made on the data set for the application of statistical tests. Further, we propose a K-means method to cluster fNIRS data (i.e. channels) as activated or not activated. The methods were validated both on simulated and in vivo fNIRS data. A time domain (TD) fNIRS technique was preferred because of its high performances in discriminating cortical activation and superficial physiological changes. However, the proposed method is also applicable to continuous wave or frequency domain fNIRS data sets.

  1. Linear regression models and k-means clustering for statistical analysis of fNIRS data

    PubMed Central

    Bonomini, Viola; Zucchelli, Lucia; Re, Rebecca; Ieva, Francesca; Spinelli, Lorenzo; Contini, Davide; Paganoni, Anna; Torricelli, Alessandro

    2015-01-01

    We propose a new algorithm, based on a linear regression model, to statistically estimate the hemodynamic activations in fNIRS data sets. The main concern guiding the algorithm development was the minimization of assumptions and approximations made on the data set for the application of statistical tests. Further, we propose a K-means method to cluster fNIRS data (i.e. channels) as activated or not activated. The methods were validated both on simulated and in vivo fNIRS data. A time domain (TD) fNIRS technique was preferred because of its high performances in discriminating cortical activation and superficial physiological changes. However, the proposed method is also applicable to continuous wave or frequency domain fNIRS data sets. PMID:25780751

  2. Order statistics applied to the most massive and most distant galaxy clusters

    NASA Astrophysics Data System (ADS)

    Waizmann, J.-C.; Ettori, S.; Bartelmann, M.

    2013-06-01

    In this work, we present an analytic framework for calculating the individual and joint distributions of the nth most massive or nth highest redshift galaxy cluster for a given survey characteristic allowing us to formulate Λ cold dark matter (ΛCDM) exclusion criteria. We show that the cumulative distribution functions steepen with increasing order, giving them a higher constraining power with respect to the extreme value statistics. Additionally, we find that the order statistics in mass (being dominated by clusters at lower redshifts) is sensitive to the matter density and the normalization of the matter fluctuations, whereas the order statistics in redshift is particularly sensitive to the geometric evolution of the Universe. For a fixed cosmology, both order statistics are efficient probes of the functional shape of the mass function at the high-mass end. To allow a quick assessment of both order statistics, we provide fits as a function of the survey area that allow percentile estimation with an accuracy better than 2 per cent. Furthermore, we discuss the joint distributions in the two-dimensional case and find that for the combination of the largest and the second largest observation, it is most likely to find them to be realized with similar values with a broadly peaked distribution. When combining the largest observation with higher orders, it is more likely to find a larger gap between the observations and when combining higher orders in general, the joint probability density function peaks more strongly. Having introduced the theory, we apply the order statistical analysis to the Southpole Telescope (SPT) massive cluster sample and metacatalogue of X-ray detected clusters of galaxies catalogue and find that the 10 most massive clusters in the sample are consistent with ΛCDM and the Tinker mass function. For the order statistics in redshift, we find a discrepancy between the data and the theoretical distributions, which could in principle indicate a deviation from the standard cosmology. However, we attribute this deviation to the uncertainty in the modelling of the SPT survey selection function. In turn, by assuming the ΛCDM reference cosmology, order statistics can also be utilized for consistency checks of the completeness of the observed sample and of the modelling of the survey selection function.

  3. A flexible spatial scan statistic with a restricted likelihood ratio for detecting disease clusters.

    PubMed

    Tango, Toshiro; Takahashi, Kunihiko

    2012-12-30

    Spatial scan statistics are widely used tools for detection of disease clusters. Especially, the circular spatial scan statistic proposed by Kulldorff (1997) has been utilized in a wide variety of epidemiological studies and disease surveillance. However, as it cannot detect noncircular, irregularly shaped clusters, many authors have proposed different spatial scan statistics, including the elliptic version of Kulldorff's scan statistic. The flexible spatial scan statistic proposed by Tango and Takahashi (2005) has also been used for detecting irregularly shaped clusters. However, this method sets a feasible limitation of a maximum of 30 nearest neighbors for searching candidate clusters because of heavy computational load. In this paper, we show a flexible spatial scan statistic implemented with a restricted likelihood ratio proposed by Tango (2008) to (1) eliminate the limitation of 30 nearest neighbors and (2) to have surprisingly much less computational time than the original flexible spatial scan statistic. As a side effect, it is shown to be able to detect clusters with any shape reasonably well as the relative risk of the cluster becomes large via Monte Carlo simulation. We illustrate the proposed spatial scan statistic with data on mortality from cerebrovascular disease in the Tokyo Metropolitan area, Japan. Copyright © 2012 John Wiley & Sons, Ltd.

  4. Advanced analysis of forest fire clustering

    NASA Astrophysics Data System (ADS)

    Kanevski, Mikhail; Pereira, Mario; Golay, Jean

    2017-04-01

    Analysis of point pattern clustering is an important topic in spatial statistics and for many applications: biodiversity, epidemiology, natural hazards, geomarketing, etc. There are several fundamental approaches used to quantify spatial data clustering using topological, statistical and fractal measures. In the present research, the recently introduced multi-point Morisita index (mMI) is applied to study the spatial clustering of forest fires in Portugal. The data set consists of more than 30000 fire events covering the time period from 1975 to 2013. The distribution of forest fires is very complex and highly variable in space. mMI is a multi-point extension of the classical two-point Morisita index. In essence, mMI is estimated by covering the region under study by a grid and by computing how many times more likely it is that m points selected at random will be from the same grid cell than it would be in the case of a complete random Poisson process. By changing the number of grid cells (size of the grid cells), mMI characterizes the scaling properties of spatial clustering. From mMI, the data intrinsic dimension (fractal dimension) of the point distribution can be estimated as well. In this study, the mMI of forest fires is compared with the mMI of random patterns (RPs) generated within the validity domain defined as the forest area of Portugal. It turns out that the forest fires are highly clustered inside the validity domain in comparison with the RPs. Moreover, they demonstrate different scaling properties at different spatial scales. The results obtained from the mMI analysis are also compared with those of fractal measures of clustering - box counting and sand box counting approaches. REFERENCES Golay J., Kanevski M., Vega Orozco C., Leuenberger M., 2014: The multipoint Morisita index for the analysis of spatial patterns. Physica A, 406, 191-202. Golay J., Kanevski M. 2015: A new estimator of intrinsic dimension based on the multipoint Morisita index. Pattern Recognition, 48, 4070-4081.

  5. Local multiplicity adjustment for the spatial scan statistic using the Gumbel distribution.

    PubMed

    Gangnon, Ronald E

    2012-03-01

    The spatial scan statistic is an important and widely used tool for cluster detection. It is based on the simultaneous evaluation of the statistical significance of the maximum likelihood ratio test statistic over a large collection of potential clusters. In most cluster detection problems, there is variation in the extent of local multiplicity across the study region. For example, using a fixed maximum geographic radius for clusters, urban areas typically have many overlapping potential clusters, whereas rural areas have relatively few. The spatial scan statistic does not account for local multiplicity variation. We describe a previously proposed local multiplicity adjustment based on a nested Bonferroni correction and propose a novel adjustment based on a Gumbel distribution approximation to the distribution of a local scan statistic. We compare the performance of all three statistics in terms of power and a novel unbiased cluster detection criterion. These methods are then applied to the well-known New York leukemia dataset and a Wisconsin breast cancer incidence dataset. © 2011, The International Biometric Society.

  6. Local multiplicity adjustment for the spatial scan statistic using the Gumbel distribution

    PubMed Central

    Gangnon, Ronald E.

    2011-01-01

    Summary The spatial scan statistic is an important and widely used tool for cluster detection. It is based on the simultaneous evaluation of the statistical significance of the maximum likelihood ratio test statistic over a large collection of potential clusters. In most cluster detection problems, there is variation in the extent of local multiplicity across the study region. For example, using a fixed maximum geographic radius for clusters, urban areas typically have many overlapping potential clusters, while rural areas have relatively few. The spatial scan statistic does not account for local multiplicity variation. We describe a previously proposed local multiplicity adjustment based on a nested Bonferroni correction and propose a novel adjustment based on a Gumbel distribution approximation to the distribution of a local scan statistic. We compare the performance of all three statistics in terms of power and a novel unbiased cluster detection criterion. These methods are then applied to the well-known New York leukemia dataset and a Wisconsin breast cancer incidence dataset. PMID:21762118

  7. Model-based clustering for RNA-seq data.

    PubMed

    Si, Yaqing; Liu, Peng; Li, Pinghua; Brutnell, Thomas P

    2014-01-15

    RNA-seq technology has been widely adopted as an attractive alternative to microarray-based methods to study global gene expression. However, robust statistical tools to analyze these complex datasets are still lacking. By grouping genes with similar expression profiles across treatments, cluster analysis provides insight into gene functions and networks, and hence is an important technique for RNA-seq data analysis. In this manuscript, we derive clustering algorithms based on appropriate probability models for RNA-seq data. An expectation-maximization algorithm and another two stochastic versions of expectation-maximization algorithms are described. In addition, a strategy for initialization based on likelihood is proposed to improve the clustering algorithms. Moreover, we present a model-based hybrid-hierarchical clustering method to generate a tree structure that allows visualization of relationships among clusters as well as flexibility of choosing the number of clusters. Results from both simulation studies and analysis of a maize RNA-seq dataset show that our proposed methods provide better clustering results than alternative methods such as the K-means algorithm and hierarchical clustering methods that are not based on probability models. An R package, MBCluster.Seq, has been developed to implement our proposed algorithms. This R package provides fast computation and is publicly available at http://www.r-project.org

  8. Evolution of massive stars in very young clusters and associations

    NASA Technical Reports Server (NTRS)

    Stothers, R. B.

    1985-01-01

    Statistics concerning the stellar content of young galactic clusters and associations which show well defined main sequence turnups have been analyzed in order to derive information about stellar evolution in high-mass galaxies. The analytical approach is semiempirical and uses natural spectroscopic groups of stars on the H-R diagram together with the stars' apparent magnitudes. The new approach does not depend on absolute luminosities and requires only the most basic elements of stellar evolution theory. The following conclusions are offered on the basis of the statistical analysis: (1) O-tupe main-sequence stars evolve to a spectral type of B1 during core hydrogen burning; (2) most O-type blue stragglers are newly formed massive stars burning core hydrogen; (3) supergiants lying redward of the main-sequence turnup are burning core helium; and most Wolf-Rayet stars are burning core helium and originally had masses greater than 30-40 solar mass. The statistics of the natural spectroscopic stars in young galactic clusters and associations are given in a table.

  9. Scoring clustering solutions by their biological relevance.

    PubMed

    Gat-Viks, I; Sharan, R; Shamir, R

    2003-12-12

    A central step in the analysis of gene expression data is the identification of groups of genes that exhibit similar expression patterns. Clustering gene expression data into homogeneous groups was shown to be instrumental in functional annotation, tissue classification, regulatory motif identification, and other applications. Although there is a rich literature on clustering algorithms for gene expression analysis, very few works addressed the systematic comparison and evaluation of clustering results. Typically, different clustering algorithms yield different clustering solutions on the same data, and there is no agreed upon guideline for choosing among them. We developed a novel statistically based method for assessing a clustering solution according to prior biological knowledge. Our method can be used to compare different clustering solutions or to optimize the parameters of a clustering algorithm. The method is based on projecting vectors of biological attributes of the clustered elements onto the real line, such that the ratio of between-groups and within-group variance estimators is maximized. The projected data are then scored using a non-parametric analysis of variance test, and the score's confidence is evaluated. We validate our approach using simulated data and show that our scoring method outperforms several extant methods, including the separation to homogeneity ratio and the silhouette measure. We apply our method to evaluate results of several clustering methods on yeast cell-cycle gene expression data. The software is available from the authors upon request.

  10. Who are the healthy active seniors? A cluster analysis.

    PubMed

    Lai, Claudia K Y; Chan, Engle Angela; Chin, Kenny C W

    2014-12-01

    This paper reports a cluster analysis of a sample recruited from a randomized controlled trial that explored the effect of using a life story work approach to improve the psychological outcomes of older people in the community. 238 subjects from community centers were included in this analysis. After statistical testing, 169 seniors were assigned to the active ageing (AG) cluster and 69 to the inactive ageing (IG) cluster. Those in the AG were younger and healthier, with fewer chronic diseases and fewer depressive symptoms than those in the IG. They were more satisfied with their lives, and had higher self-esteem. They met with their family members more frequently, they engaged in more leisure activities and were more likely to have the ability to move freely. In summary, active ageing was observed in people with better health and functional performance. Our results echoed the limited findings reported in the literature.

  11. Generating a Magellanic star cluster catalog with ASteCA

    NASA Astrophysics Data System (ADS)

    Perren, G. I.; Piatti, A. E.; Vázquez, R. A.

    2016-08-01

    An increasing number of software tools have been employed in the recent years for the automated or semi-automated processing of astronomical data. The main advantages of using these tools over a standard by-eye analysis include: speed (particularly for large databases), homogeneity, reproducibility, and precision. At the same time, they enable a statistically correct study of the uncertainties associated with the analysis, in contrast with manually set errors, or the still widespread practice of simply not assigning errors. We present a catalog comprising 210 star clusters located in the Large and Small Magellanic Clouds, observed with Washington photometry. Their fundamental parameters were estimated through an homogeneous, automatized and completely unassisted process, via the Automated Stellar Cluster Analysis package ( ASteCA). Our results are compared with two types of studies on these clusters: one where the photometry is the same, and another where the photometric system is different than that employed by ASteCA.

  12. Cluster mass inference via random field theory.

    PubMed

    Zhang, Hui; Nichols, Thomas E; Johnson, Timothy D

    2009-01-01

    Cluster extent and voxel intensity are two widely used statistics in neuroimaging inference. Cluster extent is sensitive to spatially extended signals while voxel intensity is better for intense but focal signals. In order to leverage strength from both statistics, several nonparametric permutation methods have been proposed to combine the two methods. Simulation studies have shown that of the different cluster permutation methods, the cluster mass statistic is generally the best. However, to date, there is no parametric cluster mass inference available. In this paper, we propose a cluster mass inference method based on random field theory (RFT). We develop this method for Gaussian images, evaluate it on Gaussian and Gaussianized t-statistic images and investigate its statistical properties via simulation studies and real data. Simulation results show that the method is valid under the null hypothesis and demonstrate that it can be more powerful than the cluster extent inference method. Further, analyses with a single subject and a group fMRI dataset demonstrate better power than traditional cluster size inference, and good accuracy relative to a gold-standard permutation test.

  13. Performance of cancer cluster Q-statistics for case-control residential histories

    PubMed Central

    Sloan, Chantel D.; Jacquez, Geoffrey M.; Gallagher, Carolyn M.; Ward, Mary H.; Raaschou-Nielsen, Ole; Nordsborg, Rikke Baastrup; Meliker, Jaymie R.

    2012-01-01

    Few investigations of health event clustering have evaluated residential mobility, though causative exposures for chronic diseases such as cancer often occur long before diagnosis. Recently developed Q-statistics incorporate human mobility into disease cluster investigations by quantifying space- and time-dependent nearest neighbor relationships. Using residential histories from two cancer case-control studies, we created simulated clusters to examine Q-statistic performance. Results suggest the intersection of cases with significant clustering over their life course, Qi, with cases who are constituents of significant local clusters at given times, Qit, yielded the best performance, which improved with increasing cluster size. Upon comparison, a larger proportion of true positives were detected with Kulldorf’s spatial scan method if the time of clustering was provided. We recommend using Q-statistics to identify when and where clustering may have occurred, followed by the scan method to localize the candidate clusters. Future work should investigate the generalizability of these findings. PMID:23149326

  14. Urban Transmission of American Cutaneous Leishmaniasis in Argentina: Spatial Analysis Study

    PubMed Central

    Gil, José F.; Nasser, Julio R.; Cajal, Silvana P.; Juarez, Marisa; Acosta, Norma; Cimino, Rubén O.; Diosque, Patricio; Krolewiecki, Alejandro J.

    2010-01-01

    We used kernel density and scan statistics to examine the spatial distribution of cases of pediatric and adult American cutaneous leishmaniasis in an urban disease-endemic area in Salta Province, Argentina. Spatial analysis was used for the whole population and stratified by women > 14 years of age (n = 159), men > 14 years of age (n = 667), and children < 15 years of age (n = 213). Although kernel density for adults encompassed nearly the entire city, distribution in children was most prevalent in the peripheral areas of the city. Scan statistic analysis for adult males, adult females, and children found 11, 2, and 8 clusters, respectively. Clusters for children had the highest odds ratios (P < 0.05) and were located in proximity of plantations and secondary vegetation. The data from this study provide further evidence of the potential urban transmission of American cutaneous leishmaniasis in northern Argentina. PMID:20207869

  15. Determining the Optimal Number of Clusters with the Clustergram

    NASA Technical Reports Server (NTRS)

    Fluegemann, Joseph K.; Davies, Misty D.; Aguirre, Nathan D.

    2011-01-01

    Cluster analysis aids research in many different fields, from business to biology to aerospace. It consists of using statistical techniques to group objects in large sets of data into meaningful classes. However, this process of ordering data points presents much uncertainty because it involves several steps, many of which are subject to researcher judgment as well as inconsistencies depending on the specific data type and research goals. These steps include the method used to cluster the data, the variables on which the cluster analysis will be operating, the number of resulting clusters, and parts of the interpretation process. In most cases, the number of clusters must be guessed or estimated before employing the clustering method. Many remedies have been proposed, but none is unassailable and certainly not for all data types. Thus, the aim of current research for better techniques of determining the number of clusters is generally confined to demonstrating that the new technique excels other methods in performance for several disparate data types. Our research makes use of a new cluster-number-determination technique based on the clustergram: a graph that shows how the number of objects in the cluster and the cluster mean (the ordinate) change with the number of clusters (the abscissa). We use the features of the clustergram to make the best determination of the cluster-number.

  16. A Measurement of Gravitational Lensing of the Cosmic Microwave Background by Galaxy Clusters Using Data from the South Pole Telescope

    DOE PAGES

    Baxter, E. J.; Keisler, R.; Dodelson, S.; ...

    2015-06-22

    Clusters of galaxies are expected to gravitationally lens the cosmic microwave background (CMB) and thereby generate a distinct signal in the CMB on arcminute scales. Measurements of this effect can be used to constrain the masses of galaxy clusters with CMB data alone. Here we present a measurement of lensing of the CMB by galaxy clusters using data from the South Pole Telescope (SPT). We also develop a maximum likelihood approach to extract the CMB cluster lensing signal and validate the method on mock data. We quantify the effects on our analysis of several potential sources of systematic error andmore » find that they generally act to reduce the best-fit cluster mass. It is estimated that this bias to lower cluster mass is roughly 0.85σ in units of the statistical error bar, although this estimate should be viewed as an upper limit. Furthermore, we apply our maximum likelihood technique to 513 clusters selected via their Sunyaev–Zeldovich (SZ) signatures in SPT data, and rule out the null hypothesis of no lensing at 3.1σ. The lensing-derived mass estimate for the full cluster sample is consistent with that inferred from the SZ flux: M 200,lens = 0.83 +0.38 -0.37 M 200,SZ (68% C.L., statistical error only).« less

  17. The Wilcoxon signed rank test for paired comparisons of clustered data.

    PubMed

    Rosner, Bernard; Glynn, Robert J; Lee, Mei-Ling T

    2006-03-01

    The Wilcoxon signed rank test is a frequently used nonparametric test for paired data (e.g., consisting of pre- and posttreatment measurements) based on independent units of analysis. This test cannot be used for paired comparisons arising from clustered data (e.g., if paired comparisons are available for each of two eyes of an individual). To incorporate clustering, a generalization of the randomization test formulation for the signed rank test is proposed, where the unit of randomization is at the cluster level (e.g., person), while the individual paired units of analysis are at the subunit within cluster level (e.g., eye within person). An adjusted variance estimate of the signed rank test statistic is then derived, which can be used for either balanced (same number of subunits per cluster) or unbalanced (different number of subunits per cluster) data, with an exchangeable correlation structure, with or without tied values. The resulting test statistic is shown to be asymptotically normal as the number of clusters becomes large, if the cluster size is bounded. Simulation studies are performed based on simulating correlated ranked data from a signed log-normal distribution. These studies indicate appropriate type I error for data sets with > or =20 clusters and a superior power profile compared with either the ordinary signed rank test based on the average cluster difference score or the multivariate signed rank test of Puri and Sen. Finally, the methods are illustrated with two data sets, (i) an ophthalmologic data set involving a comparison of electroretinogram (ERG) data in retinitis pigmentosa (RP) patients before and after undergoing an experimental surgical procedure, and (ii) a nutritional data set based on a randomized prospective study of nutritional supplements in RP patients where vitamin E intake outside of study capsules is compared before and after randomization to monitor compliance with nutritional protocols.

  18. Development and optimization of SPECT gated blood pool cluster analysis for the prediction of CRT outcome.

    PubMed

    Lalonde, Michel; Wells, R Glenn; Birnie, David; Ruddy, Terrence D; Wassenaar, Richard

    2014-07-01

    Phase analysis of single photon emission computed tomography (SPECT) radionuclide angiography (RNA) has been investigated for its potential to predict the outcome of cardiac resynchronization therapy (CRT). However, phase analysis may be limited in its potential at predicting CRT outcome as valuable information may be lost by assuming that time-activity curves (TAC) follow a simple sinusoidal shape. A new method, cluster analysis, is proposed which directly evaluates the TACs and may lead to a better understanding of dyssynchrony patterns and CRT outcome. Cluster analysis algorithms were developed and optimized to maximize their ability to predict CRT response. About 49 patients (N = 27 ischemic etiology) received a SPECT RNA scan as well as positron emission tomography (PET) perfusion and viability scans prior to undergoing CRT. A semiautomated algorithm sampled the left ventricle wall to produce 568 TACs from SPECT RNA data. The TACs were then subjected to two different cluster analysis techniques, K-means, and normal average, where several input metrics were also varied to determine the optimal settings for the prediction of CRT outcome. Each TAC was assigned to a cluster group based on the comparison criteria and global and segmental cluster size and scores were used as measures of dyssynchrony and used to predict response to CRT. A repeated random twofold cross-validation technique was used to train and validate the cluster algorithm. Receiver operating characteristic (ROC) analysis was used to calculate the area under the curve (AUC) and compare results to those obtained for SPECT RNA phase analysis and PET scar size analysis methods. Using the normal average cluster analysis approach, the septal wall produced statistically significant results for predicting CRT results in the ischemic population (ROC AUC = 0.73;p < 0.05 vs. equal chance ROC AUC = 0.50) with an optimal operating point of 71% sensitivity and 60% specificity. Cluster analysis results were similar to SPECT RNA phase analysis (ROC AUC = 0.78, p = 0.73 vs cluster AUC; sensitivity/specificity = 59%/89%) and PET scar size analysis (ROC AUC = 0.73, p = 1.0 vs cluster AUC; sensitivity/specificity = 76%/67%). A SPECT RNA cluster analysis algorithm was developed for the prediction of CRT outcome. Cluster analysis results produced results equivalent to those obtained from Fourier and scar analysis.

  19. Development and optimization of SPECT gated blood pool cluster analysis for the prediction of CRT outcome

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lalonde, Michel, E-mail: mlalonde15@rogers.com; Wassenaar, Richard; Wells, R. Glenn

    2014-07-15

    Purpose: Phase analysis of single photon emission computed tomography (SPECT) radionuclide angiography (RNA) has been investigated for its potential to predict the outcome of cardiac resynchronization therapy (CRT). However, phase analysis may be limited in its potential at predicting CRT outcome as valuable information may be lost by assuming that time-activity curves (TAC) follow a simple sinusoidal shape. A new method, cluster analysis, is proposed which directly evaluates the TACs and may lead to a better understanding of dyssynchrony patterns and CRT outcome. Cluster analysis algorithms were developed and optimized to maximize their ability to predict CRT response. Methods: Aboutmore » 49 patients (N = 27 ischemic etiology) received a SPECT RNA scan as well as positron emission tomography (PET) perfusion and viability scans prior to undergoing CRT. A semiautomated algorithm sampled the left ventricle wall to produce 568 TACs from SPECT RNA data. The TACs were then subjected to two different cluster analysis techniques, K-means, and normal average, where several input metrics were also varied to determine the optimal settings for the prediction of CRT outcome. Each TAC was assigned to a cluster group based on the comparison criteria and global and segmental cluster size and scores were used as measures of dyssynchrony and used to predict response to CRT. A repeated random twofold cross-validation technique was used to train and validate the cluster algorithm. Receiver operating characteristic (ROC) analysis was used to calculate the area under the curve (AUC) and compare results to those obtained for SPECT RNA phase analysis and PET scar size analysis methods. Results: Using the normal average cluster analysis approach, the septal wall produced statistically significant results for predicting CRT results in the ischemic population (ROC AUC = 0.73;p < 0.05 vs. equal chance ROC AUC = 0.50) with an optimal operating point of 71% sensitivity and 60% specificity. Cluster analysis results were similar to SPECT RNA phase analysis (ROC AUC = 0.78, p = 0.73 vs cluster AUC; sensitivity/specificity = 59%/89%) and PET scar size analysis (ROC AUC = 0.73, p = 1.0 vs cluster AUC; sensitivity/specificity = 76%/67%). Conclusions: A SPECT RNA cluster analysis algorithm was developed for the prediction of CRT outcome. Cluster analysis results produced results equivalent to those obtained from Fourier and scar analysis.« less

  20. Spike sorting using locality preserving projection with gap statistics and landmark-based spectral clustering.

    PubMed

    Nguyen, Thanh; Khosravi, Abbas; Creighton, Douglas; Nahavandi, Saeid

    2014-12-30

    Understanding neural functions requires knowledge from analysing electrophysiological data. The process of assigning spikes of a multichannel signal into clusters, called spike sorting, is one of the important problems in such analysis. There have been various automated spike sorting techniques with both advantages and disadvantages regarding accuracy and computational costs. Therefore, developing spike sorting methods that are highly accurate and computationally inexpensive is always a challenge in the biomedical engineering practice. An automatic unsupervised spike sorting method is proposed in this paper. The method uses features extracted by the locality preserving projection (LPP) algorithm. These features afterwards serve as inputs for the landmark-based spectral clustering (LSC) method. Gap statistics (GS) is employed to evaluate the number of clusters before the LSC can be performed. The proposed LPP-LSC is highly accurate and computationally inexpensive spike sorting approach. LPP spike features are very discriminative; thereby boost the performance of clustering methods. Furthermore, the LSC method exhibits its efficiency when integrated with the cluster evaluator GS. The proposed method's accuracy is approximately 13% superior to that of the benchmark combination between wavelet transformation and superparamagnetic clustering (WT-SPC). Additionally, LPP-LSC computing time is six times less than that of the WT-SPC. LPP-LSC obviously demonstrates a win-win spike sorting solution meeting both accuracy and computational cost criteria. LPP and LSC are linear algorithms that help reduce computational burden and thus their combination can be applied into real-time spike analysis. Copyright © 2014 Elsevier B.V. All rights reserved.

  1. The statistical average of optical properties for alumina particle cluster in aircraft plume

    NASA Astrophysics Data System (ADS)

    Li, Jingying; Bai, Lu; Wu, Zhensen; Guo, Lixin

    2018-04-01

    We establish a model for lognormal distribution of monomer radius and number of alumina particle clusters in plume. According to the Multi-Sphere T Matrix (MSTM) theory, we provide a method for finding the statistical average of optical properties for alumina particle clusters in plume, analyze the effect of different distributions and different detection wavelengths on the statistical average of optical properties for alumina particle cluster, and compare the statistical average optical properties under the alumina particle cluster model established in this study and those under three simplified alumina particle models. The calculation results show that the monomer number of alumina particle cluster and its size distribution have a considerable effect on its statistical average optical properties. The statistical average of optical properties for alumina particle cluster at common detection wavelengths exhibit obvious differences, whose differences have a great effect on modeling IR and UV radiation properties of plume. Compared with the three simplified models, the alumina particle cluster model herein features both higher extinction and scattering efficiencies. Therefore, we may find that an accurate description of the scattering properties of alumina particles in aircraft plume is of great significance in the study of plume radiation properties.

  2. A spatial scan statistic for nonisotropic two-level risk cluster.

    PubMed

    Li, Xiao-Zhou; Wang, Jin-Feng; Yang, Wei-Zhong; Li, Zhong-Jie; Lai, Sheng-Jie

    2012-01-30

    Spatial scan statistic methods are commonly used for geographical disease surveillance and cluster detection. The standard spatial scan statistic does not model any variability in the underlying risks of subregions belonging to a detected cluster. For a multilevel risk cluster, the isotonic spatial scan statistic could model a centralized high-risk kernel in the cluster. Because variations in disease risks are anisotropic owing to different social, economical, or transport factors, the real high-risk kernel will not necessarily take the central place in a whole cluster area. We propose a spatial scan statistic for a nonisotropic two-level risk cluster, which could be used to detect a whole cluster and a noncentralized high-risk kernel within the cluster simultaneously. The performance of the three methods was evaluated through an intensive simulation study. Our proposed nonisotropic two-level method showed better power and geographical precision with two-level risk cluster scenarios, especially for a noncentralized high-risk kernel. Our proposed method is illustrated using the hand-foot-mouth disease data in Pingdu City, Shandong, China in May 2009, compared with two other methods. In this practical study, the nonisotropic two-level method is the only way to precisely detect a high-risk area in a detected whole cluster. Copyright © 2011 John Wiley & Sons, Ltd.

  3. Geographic clustering of elevated blood heavy metal levels in pregnant women.

    PubMed

    King, Katherine E; Darrah, Thomas H; Money, Eric; Meentemeyer, Ross; Maguire, Rachel L; Nye, Monica D; Michener, Lloyd; Murtha, Amy P; Jirtle, Randy; Murphy, Susan K; Mendez, Michelle A; Robarge, Wayne; Vengosh, Avner; Hoyo, Cathrine

    2015-10-09

    Cadmium (Cd), lead (Pb), mercury (Hg), and arsenic (As) exposure is ubiquitous and has been associated with higher risk of growth restriction and cardiometabolic and neurodevelopmental disorders. However, cost-efficient strategies to identify at-risk populations and potential sources of exposure to inform mitigation efforts are limited. The objective of this study was to describe the spatial distribution and identify factors associated with Cd, Pb, Hg, and As concentrations in peripheral blood of pregnant women. Heavy metals were measured in whole peripheral blood of 310 pregnant women obtained at gestational age ~12 weeks. Prenatal residential addresses were geocoded and geospatial analysis (Getis-Ord Gi* statistics) was used to determine if elevated blood concentrations were geographically clustered. Logistic regression models were used to identify factors associated with elevated blood metal levels and cluster membership. Geospatial clusters for Cd and Pb were identified with high confidence (p-value for Gi* statistic <0.01). The Cd and Pb clusters comprised 10.5 and 9.2 % of Durham County residents, respectively. Medians and interquartile ranges of blood concentrations (μg/dL) for all participants were Cd 0.02 (0.01-0.04), Hg 0.03 (0.01-0.07), Pb 0.34 (0.16-0.83), and As 0.04 (0.04-0.05). In the Cd cluster, medians and interquartile ranges of blood concentrations (μg/dL) were Cd 0.06 (0.02-0.16), Hg 0.02 (0.00-0.05), Pb 0.54 (0.23-1.23), and As 0.05 (0.04-0.05). In the Pb cluster, medians and interquartile ranges of blood concentrations (μg/dL) were Cd 0.03 (0.02-0.15), Hg 0.01 (0.01-0.05), Pb 0.39 (0.24-0.74), and As 0.04 (0.04-0.05). Co-exposure with Pb and Cd was also clustered, the p-values for the Gi* statistic for Pb and Cd was <0.01. Cluster membership was associated with lower education levels and higher pre-pregnancy BMI. Our data support that elevated blood concentrations of Cd and Pb are spatially clustered in this urban environment compared to the surrounding areas. Spatial analysis of metals concentrations in peripheral blood or urine obtained routinely during prenatal care can be useful in surveillance of heavy metal exposure.

  4. Use of multiple cluster analysis methods to explore the validity of a community outcomes concept map.

    PubMed

    Orsi, Rebecca

    2017-02-01

    Concept mapping is now a commonly-used technique for articulating and evaluating programmatic outcomes. However, research regarding validity of knowledge and outcomes produced with concept mapping is sparse. The current study describes quantitative validity analyses using a concept mapping dataset. We sought to increase the validity of concept mapping evaluation results by running multiple cluster analysis methods and then using several metrics to choose from among solutions. We present four different clustering methods based on analyses using the R statistical software package: partitioning around medoids (PAM), fuzzy analysis (FANNY), agglomerative nesting (AGNES) and divisive analysis (DIANA). We then used the Dunn and Davies-Bouldin indices to assist in choosing a valid cluster solution for a concept mapping outcomes evaluation. We conclude that the validity of the outcomes map is high, based on the analyses described. Finally, we discuss areas for further concept mapping methods research. Copyright © 2016 Elsevier Ltd. All rights reserved.

  5. Cluster Analysis of Longidorus Species (Nematoda: Longidoridae), a New Approach in Species Identification

    PubMed Central

    Ye, Weimin; Robbins, R. T.

    2004-01-01

    Hierarchical cluster analysis based on female morphometric character means including body length, distance from vulva opening to anterior end, head width, odontostyle length, esophagus length, body width, tail length, and tail width were used to examine the morphometric relationships and create dendrograms for (i) 62 populations belonging to 9 Longidorus species from Arkansas, (ii) 137 published Longidorus species, and (iii) 137 published Longidorus species plus 86 populations of 16 Longidorus species from Arkansas and various other locations by using JMP 4.02 software (SAS Institute, Cary, NC). Cluster analysis dendograms visually illustrated the grouping and morphometric relationships of the species and populations. It provided a computerized statistical approach to assist by helping to identify and distinguish species, by indicating morphometric relationships among species, and by assisting with new species diagnosis. The preliminary species identification can be accomplished by running cluster analysis for unknown species together with the data matrix of known published Longidorus species. PMID:19262809

  6. Structures in magnetohydrodynamic turbulence: Detection and scaling

    NASA Astrophysics Data System (ADS)

    Uritsky, V. M.; Pouquet, A.; Rosenberg, D.; Mininni, P. D.; Donovan, E. F.

    2010-11-01

    We present a systematic analysis of statistical properties of turbulent current and vorticity structures at a given time using cluster analysis. The data stem from numerical simulations of decaying three-dimensional magnetohydrodynamic turbulence in the absence of an imposed uniform magnetic field; the magnetic Prandtl number is taken equal to unity, and we use a periodic box with grids of up to 15363 points and with Taylor Reynolds numbers up to 1100. The initial conditions are either an X -point configuration embedded in three dimensions, the so-called Orszag-Tang vortex, or an Arn’old-Beltrami-Childress configuration with a fully helical velocity and magnetic field. In each case two snapshots are analyzed, separated by one turn-over time, starting just after the peak of dissipation. We show that the algorithm is able to select a large number of structures (in excess of 8000) for each snapshot and that the statistical properties of these clusters are remarkably similar for the two snapshots as well as for the two flows under study in terms of scaling laws for the cluster characteristics, with the structures in the vorticity and in the current behaving in the same way. We also study the effect of Reynolds number on cluster statistics, and we finally analyze the properties of these clusters in terms of their velocity-magnetic-field correlation. Self-organized criticality features have been identified in the dissipative range of scales. A different scaling arises in the inertial range, which cannot be identified for the moment with a known self-organized criticality class consistent with magnetohydrodynamics. We suggest that this range can be governed by turbulence dynamics as opposed to criticality and propose an interpretation of intermittency in terms of propagation of local instabilities.

  7. Evaluation of Primary Immunization Coverage of Infants Under Universal Immunization Programme in an Urban Area of Bangalore City Using Cluster Sampling and Lot Quality Assurance Sampling Techniques

    PubMed Central

    K, Punith; K, Lalitha; G, Suman; BS, Pradeep; Kumar K, Jayanth

    2008-01-01

    Research Question: Is LQAS technique better than cluster sampling technique in terms of resources to evaluate the immunization coverage in an urban area? Objective: To assess and compare the lot quality assurance sampling against cluster sampling in the evaluation of primary immunization coverage. Study Design: Population-based cross-sectional study. Study Setting: Areas under Mathikere Urban Health Center. Study Subjects: Children aged 12 months to 23 months. Sample Size: 220 in cluster sampling, 76 in lot quality assurance sampling. Statistical Analysis: Percentages and Proportions, Chi square Test. Results: (1) Using cluster sampling, the percentage of completely immunized, partially immunized and unimmunized children were 84.09%, 14.09% and 1.82%, respectively. With lot quality assurance sampling, it was 92.11%, 6.58% and 1.31%, respectively. (2) Immunization coverage levels as evaluated by cluster sampling technique were not statistically different from the coverage value as obtained by lot quality assurance sampling techniques. Considering the time and resources required, it was found that lot quality assurance sampling is a better technique in evaluating the primary immunization coverage in urban area. PMID:19876474

  8. Multivariate statistical assessment of heavy metal pollution sources of groundwater around a lead and zinc plant.

    PubMed

    Zamani, Abbas Ali; Yaftian, Mohammad Reza; Parizanganeh, Abdolhossein

    2012-12-17

    The contamination of groundwater by heavy metal ions around a lead and zinc plant has been studied. As a case study groundwater contamination in Bonab Industrial Estate (Zanjan-Iran) for iron, cobalt, nickel, copper, zinc, cadmium and lead content was investigated using differential pulse polarography (DPP). Although, cobalt, copper and zinc were found correspondingly in 47.8%, 100.0%, and 100.0% of the samples, they did not contain these metals above their maximum contaminant levels (MCLs). Cadmium was detected in 65.2% of the samples and 17.4% of them were polluted by this metal. All samples contained detectable levels of lead and iron with 8.7% and 13.0% of the samples higher than their MCLs. Nickel was also found in 78.3% of the samples, out of which 8.7% were polluted. In general, the results revealed the contamination of groundwater sources in the studied zone. The higher health risks are related to lead, nickel, and cadmium ions. Multivariate statistical techniques were applied for interpreting the experimental data and giving a description for the sources. The data analysis showed correlations and similarities between investigated heavy metals and helps to classify these ion groups. Cluster analysis identified five clusters among the studied heavy metals. Cluster 1 consisted of Pb, Cu, and cluster 3 included Cd, Fe; also each of the elements Zn, Co and Ni was located in groups with single member. The same results were obtained by factor analysis. Statistical investigations revealed that anthropogenic factors and notably lead and zinc plant and pedo-geochemical pollution sources are influencing water quality in the studied area.

  9. Multivariate statistical assessment of heavy metal pollution sources of groundwater around a lead and zinc plant

    PubMed Central

    2012-01-01

    The contamination of groundwater by heavy metal ions around a lead and zinc plant has been studied. As a case study groundwater contamination in Bonab Industrial Estate (Zanjan-Iran) for iron, cobalt, nickel, copper, zinc, cadmium and lead content was investigated using differential pulse polarography (DPP). Although, cobalt, copper and zinc were found correspondingly in 47.8%, 100.0%, and 100.0% of the samples, they did not contain these metals above their maximum contaminant levels (MCLs). Cadmium was detected in 65.2% of the samples and 17.4% of them were polluted by this metal. All samples contained detectable levels of lead and iron with 8.7% and 13.0% of the samples higher than their MCLs. Nickel was also found in 78.3% of the samples, out of which 8.7% were polluted. In general, the results revealed the contamination of groundwater sources in the studied zone. The higher health risks are related to lead, nickel, and cadmium ions. Multivariate statistical techniques were applied for interpreting the experimental data and giving a description for the sources. The data analysis showed correlations and similarities between investigated heavy metals and helps to classify these ion groups. Cluster analysis identified five clusters among the studied heavy metals. Cluster 1 consisted of Pb, Cu, and cluster 3 included Cd, Fe; also each of the elements Zn, Co and Ni was located in groups with single member. The same results were obtained by factor analysis. Statistical investigations revealed that anthropogenic factors and notably lead and zinc plant and pedo-geochemical pollution sources are influencing water quality in the studied area. PMID:23369182

  10. Application of Scan Statistics to Detect Suicide Clusters in Australia

    PubMed Central

    Cheung, Yee Tak Derek; Spittal, Matthew J.; Williamson, Michelle Kate; Tung, Sui Jay; Pirkis, Jane

    2013-01-01

    Background Suicide clustering occurs when multiple suicide incidents take place in a small area or/and within a short period of time. In spite of the multi-national research attention and particular efforts in preparing guidelines for tackling suicide clusters, the broader picture of epidemiology of suicide clustering remains unclear. This study aimed to develop techniques in using scan statistics to detect clusters, with the detection of suicide clusters in Australia as example. Methods and Findings Scan statistics was applied to detect clusters among suicides occurring between 2004 and 2008. Manipulation of parameter settings and change of area for scan statistics were performed to remedy shortcomings in existing methods. In total, 243 suicides out of 10,176 (2.4%) were identified as belonging to 15 suicide clusters. These clusters were mainly located in the Northern Territory, the northern part of Western Australia, and the northern part of Queensland. Among the 15 clusters, 4 (26.7%) were detected by both national and state cluster detections, 8 (53.3%) were only detected by the state cluster detection, and 3 (20%) were only detected by the national cluster detection. Conclusions These findings illustrate that the majority of spatial-temporal clusters of suicide were located in the inland northern areas, with socio-economic deprivation and higher proportions of indigenous people. Discrepancies between national and state/territory cluster detection by scan statistics were due to the contrast of the underlying suicide rates across states/territories. Performing both small-area and large-area analyses, and applying multiple parameter settings may yield the maximum benefits for exploring clusters. PMID:23342098

  11. Spatial variation of volcanic rock geochemistry in the Virunga Volcanic Province: Statistical analysis of an integrated database

    NASA Astrophysics Data System (ADS)

    Barette, Florian; Poppe, Sam; Smets, Benoît; Benbakkar, Mhammed; Kervyn, Matthieu

    2017-10-01

    We present an integrated, spatially-explicit database of existing geochemical major-element analyses available from (post-) colonial scientific reports, PhD Theses and international publications for the Virunga Volcanic Province, located in the western branch of the East African Rift System. This volcanic province is characterised by alkaline volcanism, including silica-undersaturated, alkaline and potassic lavas. The database contains a total of 908 geochemical analyses of eruptive rocks for the entire volcanic province with a localisation for most samples. A preliminary analysis of the overall consistency of the database, using statistical techniques on sets of geochemical analyses with contrasted analytical methods or dates, demonstrates that the database is consistent. We applied a principal component analysis and cluster analysis on whole-rock major element compositions included in the database to study the spatial variation of the chemical composition of eruptive products in the Virunga Volcanic Province. These statistical analyses identify spatially distributed clusters of eruptive products. The known geochemical contrasts are highlighted by the spatial analysis, such as the unique geochemical signature of Nyiragongo lavas compared to other Virunga lavas, the geochemical heterogeneity of the Bulengo area, and the trachyte flows of Karisimbi volcano. Most importantly, we identified separate clusters of eruptive products which originate from primitive magmatic sources. These lavas of primitive composition are preferentially located along NE-SW inherited rift structures, often at distance from the central Virunga volcanoes. Our results illustrate the relevance of a spatial analysis on integrated geochemical data for a volcanic province, as a complement to classical petrological investigations. This approach indeed helps to characterise geochemical variations within a complex of magmatic systems and to identify specific petrologic and geochemical investigations that should be tackled within a study area.

  12. A Cyber-Attack Detection Model Based on Multivariate Analyses

    NASA Astrophysics Data System (ADS)

    Sakai, Yuto; Rinsaka, Koichiro; Dohi, Tadashi

    In the present paper, we propose a novel cyber-attack detection model based on two multivariate-analysis methods to the audit data observed on a host machine. The statistical techniques used here are the well-known Hayashi's quantification method IV and cluster analysis method. We quantify the observed qualitative audit event sequence via the quantification method IV, and collect similar audit event sequence in the same groups based on the cluster analysis. It is shown in simulation experiments that our model can improve the cyber-attack detection accuracy in some realistic cases where both normal and attack activities are intermingled.

  13. A scan statistic for binary outcome based on hypergeometric probability model, with an application to detecting spatial clusters of Japanese encephalitis.

    PubMed

    Zhao, Xing; Zhou, Xiao-Hua; Feng, Zijian; Guo, Pengfei; He, Hongyan; Zhang, Tao; Duan, Lei; Li, Xiaosong

    2013-01-01

    As a useful tool for geographical cluster detection of events, the spatial scan statistic is widely applied in many fields and plays an increasingly important role. The classic version of the spatial scan statistic for the binary outcome is developed by Kulldorff, based on the Bernoulli or the Poisson probability model. In this paper, we apply the Hypergeometric probability model to construct the likelihood function under the null hypothesis. Compared with existing methods, the likelihood function under the null hypothesis is an alternative and indirect method to identify the potential cluster, and the test statistic is the extreme value of the likelihood function. Similar with Kulldorff's methods, we adopt Monte Carlo test for the test of significance. Both methods are applied for detecting spatial clusters of Japanese encephalitis in Sichuan province, China, in 2009, and the detected clusters are identical. Through a simulation to independent benchmark data, it is indicated that the test statistic based on the Hypergeometric model outweighs Kulldorff's statistics for clusters of high population density or large size; otherwise Kulldorff's statistics are superior.

  14. Spatial characterization of dissolved trace elements and heavy metals in the upper Han River (China) using multivariate statistical techniques.

    PubMed

    Li, Siyue; Zhang, Quanfa

    2010-04-15

    A data matrix (4032 observations), obtained during a 2-year monitoring period (2005-2006) from 42 sites in the upper Han River is subjected to various multivariate statistical techniques including cluster analysis, principal component analysis (PCA), factor analysis (FA), correlation analysis and analysis of variance to determine the spatial characterization of dissolved trace elements and heavy metals. Our results indicate that waters in the upper Han River are primarily polluted by Al, As, Cd, Pb, Sb and Se, and the potential pollutants include Ba, Cr, Hg, Mn and Ni. Spatial distribution of trace metals indicates the polluted sections mainly concentrate in the Danjiang, Danjiangkou Reservoir catchment and Hanzhong Plain, and the most contaminated river is in the Hanzhong Plain. Q-model clustering depends on geographical location of sampling sites and groups the 42 sampling sites into four clusters, i.e., Danjiang, Danjiangkou Reservoir region (lower catchment), upper catchment and one river in headwaters pertaining to water quality. The headwaters, Danjiang and lower catchment, and upper catchment correspond to very high polluted, moderate polluted and relatively low polluted regions, respectively. Additionally, PCA/FA and correlation analysis demonstrates that Al, Cd, Mn, Ni, Fe, Si and Sr are controlled by natural sources, whereas the other metals appear to be primarily controlled by anthropogenic origins though geogenic source contributing to them. 2009 Elsevier B.V. All rights reserved.

  15. HICOSMO: cosmology with a complete sample of galaxy clusters - II. Cosmological results

    NASA Astrophysics Data System (ADS)

    Schellenberger, G.; Reiprich, T. H.

    2017-10-01

    The X-ray bright, hot gas in the potential well of a galaxy cluster enables systematic X-ray studies of samples of galaxy clusters to constrain cosmological parameters. HIFLUGCS consists of the 64 X-ray brightest galaxy clusters in the Universe, building up a local sample. Here, we utilize this sample to determine, for the first time, individual hydrostatic mass estimates for all the clusters of the sample and, by making use of the completeness of the sample, we quantify constraints on the two interesting cosmological parameters, Ωm and σ8. We apply our total hydrostatic and gas mass estimates from the X-ray analysis to a Bayesian cosmological likelihood analysis and leave several parameters free to be constrained. We find Ωm = 0.30 ± 0.01 and σ8 = 0.79 ± 0.03 (statistical uncertainties, 68 per cent credibility level) using our default analysis strategy combining both a mass function analysis and the gas mass fraction results. The main sources of biases that we correct here are (1) the influence of galaxy groups (incompleteness in parent samples and differing behaviour of the Lx-M relation), (2) the hydrostatic mass bias, (3) the extrapolation of the total mass (comparing various methods), (4) the theoretical halo mass function and (5) other physical effects (non-negligible neutrino mass). We find that galaxy groups introduce a strong bias, since their number density seems to be over predicted by the halo mass function. On the other hand, incorporating baryonic effects does not result in a significant change in the constraints. The total (uncorrected) systematic uncertainties (∼20 per cent) clearly dominate the statistical uncertainties on cosmological parameters for our sample.

  16. Groundwater flow and hydrogeochemical evolution in the Jianghan Plain, central China

    NASA Astrophysics Data System (ADS)

    Gan, Yiqun; Zhao, Ke; Deng, Yamin; Liang, Xing; Ma, Teng; Wang, Yanxin

    2018-05-01

    Hydrogeochemical analysis and multivariate statistics were applied to identify flow patterns and major processes controlling the hydrogeochemistry of groundwater in the Jianghan Plain, which is located in central Yangtze River Basin (central China) and characterized by intensive surface-water/groundwater interaction. Although HCO3-Ca-(Mg) type water predominated in the study area, the 457 (21 surface water and 436 groundwater) samples were effectively classified into five clusters by hierarchical cluster analysis. The hydrochemical variations among these clusters were governed by three factors from factor analysis. Major components (e.g., Ca, Mg and HCO3) in surface water and groundwater originated from carbonate and silicate weathering (factor 1). Redox conditions (factor 2) influenced the geogenic Fe and As contamination in shallow confined groundwater. Anthropogenic activities (factor 3) primarily caused high levels of Cl and SO4 in surface water and phreatic groundwater. Furthermore, the factor score 1 of samples in the shallow confined aquifer gradually increased along the flow paths. This study demonstrates that enhanced information on hydrochemistry in complex groundwater flow systems, by multivariate statistical methods, improves the understanding of groundwater flow and hydrogeochemical evolution due to natural and anthropogenic impacts.

  17. A comparison of performance of automatic cloud coverage assessment algorithm for Formosat-2 image using clustering-based and spatial thresholding methods

    NASA Astrophysics Data System (ADS)

    Hsu, Kuo-Hsien

    2012-11-01

    Formosat-2 image is a kind of high-spatial-resolution (2 meters GSD) remote sensing satellite data, which includes one panchromatic band and four multispectral bands (Blue, Green, Red, near-infrared). An essential sector in the daily processing of received Formosat-2 image is to estimate the cloud statistic of image using Automatic Cloud Coverage Assessment (ACCA) algorithm. The information of cloud statistic of image is subsequently recorded as an important metadata for image product catalog. In this paper, we propose an ACCA method with two consecutive stages: preprocessing and post-processing analysis. For pre-processing analysis, the un-supervised K-means classification, Sobel's method, thresholding method, non-cloudy pixels reexamination, and cross-band filter method are implemented in sequence for cloud statistic determination. For post-processing analysis, Box-Counting fractal method is implemented. In other words, the cloud statistic is firstly determined via pre-processing analysis, the correctness of cloud statistic of image of different spectral band is eventually cross-examined qualitatively and quantitatively via post-processing analysis. The selection of an appropriate thresholding method is very critical to the result of ACCA method. Therefore, in this work, We firstly conduct a series of experiments of the clustering-based and spatial thresholding methods that include Otsu's, Local Entropy(LE), Joint Entropy(JE), Global Entropy(GE), and Global Relative Entropy(GRE) method, for performance comparison. The result shows that Otsu's and GE methods both perform better than others for Formosat-2 image. Additionally, our proposed ACCA method by selecting Otsu's method as the threshoding method has successfully extracted the cloudy pixels of Formosat-2 image for accurate cloud statistic estimation.

  18. Point process statistics in atom probe tomography.

    PubMed

    Philippe, T; Duguay, S; Grancher, G; Blavette, D

    2013-09-01

    We present a review of spatial point processes as statistical models that we have designed for the analysis and treatment of atom probe tomography (APT) data. As a major advantage, these methods do not require sampling. The mean distance to nearest neighbour is an attractive approach to exhibit a non-random atomic distribution. A χ(2) test based on distance distributions to nearest neighbour has been developed to detect deviation from randomness. Best-fit methods based on first nearest neighbour distance (1 NN method) and pair correlation function are presented and compared to assess the chemical composition of tiny clusters. Delaunay tessellation for cluster selection has been also illustrated. These statistical tools have been applied to APT experiments on microelectronics materials. Copyright © 2012 Elsevier B.V. All rights reserved.

  19. Testing prediction methods: Earthquake clustering versus the Poisson model

    USGS Publications Warehouse

    Michael, A.J.

    1997-01-01

    Testing earthquake prediction methods requires statistical techniques that compare observed success to random chance. One technique is to produce simulated earthquake catalogs and measure the relative success of predicting real and simulated earthquakes. The accuracy of these tests depends on the validity of the statistical model used to simulate the earthquakes. This study tests the effect of clustering in the statistical earthquake model on the results. Three simulation models were used to produce significance levels for a VLF earthquake prediction method. As the degree of simulated clustering increases, the statistical significance drops. Hence, the use of a seismicity model with insufficient clustering can lead to overly optimistic results. A successful method must pass the statistical tests with a model that fully replicates the observed clustering. However, a method can be rejected based on tests with a model that contains insufficient clustering. U.S. copyright. Published in 1997 by the American Geophysical Union.

  20. Agro-ecoregionalization of Iowa using multivariate geographical clustering

    Treesearch

    Carol L. Williams; William W. Hargrove; Matt Leibman; David E. James

    2008-01-01

    Agro-ecoregionalization is categorization of landscapes for use in crop suitability analysis, strategic agroeconomic development, risk analysis, and other purposes. Past agro-ecoregionalizations have been subjective, expert opinion driven, crop specific, and unsuitable for statistical extrapolation. Use of quantitative analytical methods provides an opportunity for...

  1. Determining the Number of Component Clusters in the Standard Multivariate Normal Mixture Model Using Model-Selection Criteria.

    DTIC Science & Technology

    1983-06-16

    has been advocated by Gnanadesikan and 𔃾ilk (1969), and others in the literature. This suggests that, if we use the formal signficance test type...American Statistical Asso., 62, 1159-1178. Gnanadesikan , R., and Wilk, M..B. (1969). Data Analytic Methods in Multi- variate Statistical Analysis. In

  2. COVARIATE-ADAPTIVE CLUSTERING OF EXPOSURES FOR AIR POLLUTION EPIDEMIOLOGY COHORTS*

    PubMed Central

    Keller, Joshua P.; Drton, Mathias; Larson, Timothy; Kaufman, Joel D.; Sandler, Dale P.; Szpiro, Adam A.

    2017-01-01

    Cohort studies in air pollution epidemiology aim to establish associations between health outcomes and air pollution exposures. Statistical analysis of such associations is complicated by the multivariate nature of the pollutant exposure data as well as the spatial misalignment that arises from the fact that exposure data are collected at regulatory monitoring network locations distinct from cohort locations. We present a novel clustering approach for addressing this challenge. Specifically, we present a method that uses geographic covariate information to cluster multi-pollutant observations and predict cluster membership at cohort locations. Our predictive k-means procedure identifies centers using a mixture model and is followed by multi-class spatial prediction. In simulations, we demonstrate that predictive k-means can reduce misclassification error by over 50% compared to ordinary k-means, with minimal loss in cluster representativeness. The improved prediction accuracy results in large gains of 30% or more in power for detecting effect modification by cluster in a simulated health analysis. In an analysis of the NIEHS Sister Study cohort using predictive k-means, we find that the association between systolic blood pressure (SBP) and long-term fine particulate matter (PM2.5) exposure varies significantly between different clusters of PM2.5 component profiles. Our cluster-based analysis shows that for subjects assigned to a cluster located in the Midwestern U.S., a 10 μg/m3 difference in exposure is associated with 4.37 mmHg (95% CI, 2.38, 6.35) higher SBP. PMID:28572869

  3. Accounting for isotopic clustering in Fourier transform mass spectrometry data analysis for clinical diagnostic studies.

    PubMed

    Kakourou, Alexia; Vach, Werner; Nicolardi, Simone; van der Burgt, Yuri; Mertens, Bart

    2016-10-01

    Mass spectrometry based clinical proteomics has emerged as a powerful tool for high-throughput protein profiling and biomarker discovery. Recent improvements in mass spectrometry technology have boosted the potential of proteomic studies in biomedical research. However, the complexity of the proteomic expression introduces new statistical challenges in summarizing and analyzing the acquired data. Statistical methods for optimally processing proteomic data are currently a growing field of research. In this paper we present simple, yet appropriate methods to preprocess, summarize and analyze high-throughput MALDI-FTICR mass spectrometry data, collected in a case-control fashion, while dealing with the statistical challenges that accompany such data. The known statistical properties of the isotopic distribution of the peptide molecules are used to preprocess the spectra and translate the proteomic expression into a condensed data set. Information on either the intensity level or the shape of the identified isotopic clusters is used to derive summary measures on which diagnostic rules for disease status allocation will be based. Results indicate that both the shape of the identified isotopic clusters and the overall intensity level carry information on the class outcome and can be used to predict the presence or absence of the disease.

  4. Bringing Clouds into Our Lab! - The Influence of Turbulence on the Early Stage Rain Droplets

    NASA Astrophysics Data System (ADS)

    Yavuz, Mehmet Altug; Kunnen, Rudie; Heijst, Gertjan; Clercx, Herman

    2015-11-01

    We are investigating a droplet-laden flow in an air-filled turbulence chamber, forced by speaker-driven air jets. The speakers are running in a random manner; yet they allow us to control and define the statistics of the turbulence. We study the motion of droplets with tunable size (Stokes numbers ~ 0.13 - 9) in a turbulent flow, mimicking the early stages of raindrop formation. 3D Particle Tracking Velocimetry (PTV) together with Laser Induced Fluorescence (LIF) methods are chosen as the experimental method to track the droplets and collect data for statistical analysis. Thereby it is possible to study the spatial distribution of the droplets in turbulence using the so-called Radial Distribution Function (RDF), a statistical measure to quantify the clustering of particles. Additionally, 3D-PTV technique allows us to measure velocity statistics of the droplets and the influence of the turbulence on droplet trajectories, both individually and collectively. In this contribution, we will present the clustering probability quantified by the RDF for different Stokes numbers. We will explain the physics underlying the influence of turbulence on droplet cluster behavior. This study supported by FOM/NWO Netherlands.

  5. A spatial scan statistic for multiple clusters.

    PubMed

    Li, Xiao-Zhou; Wang, Jin-Feng; Yang, Wei-Zhong; Li, Zhong-Jie; Lai, Sheng-Jie

    2011-10-01

    Spatial scan statistics are commonly used for geographical disease surveillance and cluster detection. While there are multiple clusters coexisting in the study area, they become difficult to detect because of clusters' shadowing effect to each other. The recently proposed sequential method showed its better power for detecting the second weaker cluster, but did not improve the ability of detecting the first stronger cluster which is more important than the second one. We propose a new extension of the spatial scan statistic which could be used to detect multiple clusters. Through constructing two or more clusters in the alternative hypothesis, our proposed method accounts for other coexisting clusters in the detecting and evaluating process. The performance of the proposed method is compared to the sequential method through an intensive simulation study, in which our proposed method shows better power in terms of both rejecting the null hypothesis and accurately detecting the coexisting clusters. In the real study of hand-foot-mouth disease data in Pingdu city, a true cluster town is successfully detected by our proposed method, which cannot be evaluated to be statistically significant by the standard method due to another cluster's shadowing effect. Copyright © 2011 Elsevier Inc. All rights reserved.

  6. Using Geographic Information Science to Explore Associations between Air Pollution, Environmental Amenities, and Preterm Births

    PubMed Central

    Ogneva-Himmelberger, Yelena; Dahlberg, Tyler; Kelly, Kristen; Simas, Tiffany A. Moore

    2015-01-01

    The study uses geographic information science (GIS) and statistics to find out if there are statistical differences between full term and preterm births to non-Hispanic white, non-Hispanic Black, and Hispanic mothers in their exposure to air pollution and access to environmental amenities (green space and vendors of healthy food) in the second largest city in New England, Worcester, Massachusetts. Proximity to a Toxic Release Inventory site has a statistically significant effect on preterm birth regardless of race. The air-pollution hazard score from the Risk Screening Environmental Indicators Model is also a statistically significant factor when preterm births are categorized into three groups based on the degree of prematurity. Proximity to green space and to a healthy food vendor did not have an effect on preterm births. The study also used cluster analysis and found statistically significant spatial clusters of high preterm birth volume for non-Hispanic white, non-Hispanic Black, and Hispanic mothers. PMID:29546120

  7. Using Geographic Information Science to Explore Associations between Air Pollution, Environmental Amenities, and Preterm Births.

    PubMed

    Ogneva-Himmelberger, Yelena; Dahlberg, Tyler; Kelly, Kristen; Simas, Tiffany A Moore

    2015-01-01

    The study uses geographic information science (GIS) and statistics to find out if there are statistical differences between full term and preterm births to non-Hispanic white, non-Hispanic Black, and Hispanic mothers in their exposure to air pollution and access to environmental amenities (green space and vendors of healthy food) in the second largest city in New England, Worcester, Massachusetts. Proximity to a Toxic Release Inventory site has a statistically significant effect on preterm birth regardless of race. The air-pollution hazard score from the Risk Screening Environmental Indicators Model is also a statistically significant factor when preterm births are categorized into three groups based on the degree of prematurity. Proximity to green space and to a healthy food vendor did not have an effect on preterm births. The study also used cluster analysis and found statistically significant spatial clusters of high preterm birth volume for non-Hispanic white, non-Hispanic Black, and Hispanic mothers.

  8. Multivariate Statistical Analysis of Water Quality data in Indian River Lagoon, Florida

    NASA Astrophysics Data System (ADS)

    Sayemuzzaman, M.; Ye, M.

    2015-12-01

    The Indian River Lagoon, is part of the longest barrier island complex in the United States, is a region of particular concern to the environmental scientist because of the rapid rate of human development throughout the region and the geographical position in between the colder temperate zone and warmer sub-tropical zone. Thus, the surface water quality analysis in this region always brings the newer information. In this present study, multivariate statistical procedures were applied to analyze the spatial and temporal water quality in the Indian River Lagoon over the period 1998-2013. Twelve parameters have been analyzed on twelve key water monitoring stations in and beside the lagoon on monthly datasets (total of 27,648 observations). The dataset was treated using cluster analysis (CA), principle component analysis (PCA) and non-parametric trend analysis. The CA was used to cluster twelve monitoring stations into four groups, with stations on the similar surrounding characteristics being in the same group. The PCA was then applied to the similar groups to find the important water quality parameters. The principal components (PCs), PC1 to PC5 was considered based on the explained cumulative variances 75% to 85% in each cluster groups. Nutrient species (phosphorus and nitrogen), salinity, specific conductivity and erosion factors (TSS, Turbidity) were major variables involved in the construction of the PCs. Statistical significant positive or negative trends and the abrupt trend shift were detected applying Mann-Kendall trend test and Sequential Mann-Kendall (SQMK), for each individual stations for the important water quality parameters. Land use land cover change pattern, local anthropogenic activities and extreme climate such as drought might be associated with these trends. This study presents the multivariate statistical assessment in order to get better information about the quality of surface water. Thus, effective pollution control/management of the surface waters can be undertaken.

  9. Optimization method of superpixel analysis for multi-contrast Jones matrix tomography (Conference Presentation)

    NASA Astrophysics Data System (ADS)

    Miyazawa, Arata; Hong, Young-Joo; Makita, Shuichi; Kasaragod, Deepa K.; Miura, Masahiro; Yasuno, Yoshiaki

    2017-02-01

    Local statistics are widely utilized for quantification and image processing of OCT. For example, local mean is used to reduce speckle, local variation of polarization state (degree-of-polarization-uniformity (DOPU)) is used to visualize melanin. Conventionally, these statistics are calculated in a rectangle kernel whose size is uniform over the image. However, the fixed size and shape of the kernel result in a tradeoff between image sharpness and statistical accuracy. Superpixel is a cluster of pixels which is generated by grouping image pixels based on the spatial proximity and similarity of signal values. Superpixels have variant size and flexible shapes which preserve the tissue structure. Here we demonstrate a new superpixel method which is tailored for multifunctional Jones matrix OCT (JM-OCT). This new method forms the superpixels by clustering image pixels in a 6-dimensional (6-D) feature space (spatial two dimensions and four dimensions of optical features). All image pixels were clustered based on their spatial proximity and optical feature similarity. The optical features are scattering, OCT-A, birefringence and DOPU. The method is applied to retinal OCT. Generated superpixels preserve the tissue structures such as retinal layers, sclera, vessels, and retinal pigment epithelium. Hence, superpixel can be utilized as a local statistics kernel which would be more suitable than a uniform rectangle kernel. Superpixelized image also can be used for further image processing and analysis. Since it reduces the number of pixels to be analyzed, it reduce the computational cost of such image processing.

  10. Population changes in residential clusters in Japan.

    PubMed

    Sekiguchi, Takuya; Tamura, Kohei; Masuda, Naoki

    2018-01-01

    Population dynamics in urban and rural areas are different. Understanding factors that contribute to local population changes has various socioeconomic and political implications. In the present study, we use population census data in Japan to examine contributors to the population growth of residential clusters between years 2005 and 2010. The data set covers the entirety of Japan and has a high spatial resolution of 500 × 500 m2, enabling us to examine population dynamics in various parts of the country (urban and rural) using statistical analysis. We found that, in addition to the area, population density, and age, the shape of the cluster and the spatial distribution of inhabitants within the cluster are significantly related to the population growth rate of a residential cluster. Specifically, the population tends to grow if the cluster is "round" shaped (given the area) and the population is concentrated near the center rather than periphery of the cluster. Combination of the present results and analysis framework with other factors that have been omitted in the present study, such as migration, terrain, and transportation infrastructure, will be fruitful.

  11. A critical analysis of high-redshift, massive, galaxy clusters. Part I

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hoyle, Ben; Jimenez, Raul; Verde, Licia

    2012-02-01

    We critically investigate current statistical tests applied to high redshift clusters of galaxies in order to test the standard cosmological model and describe their range of validity. We carefully compare a sample of high-redshift, massive, galaxy clusters with realistic Poisson sample simulations of the theoretical mass function, which include the effect of Eddington bias. We compare the observations and simulations using the following statistical tests: the distributions of ensemble and individual existence probabilities (in the > M, > z sense), the redshift distributions, and the 2d Kolmogorov-Smirnov test. Using seemingly rare clusters from Hoyle et al. (2011), and Jee etmore » al. (2011) and assuming the same survey geometry as in Jee et al. (2011, which is less conservative than Hoyle et al. 2011), we find that the ( > M, > z) existence probabilities of all clusters are fully consistent with ΛCDM. However assuming the same survey geometry, we use the 2d K-S test probability to show that the observed clusters are not consistent with being the least probable clusters from simulations at > 95% confidence, and are also not consistent with being a random selection of clusters, which may be caused by the non-trivial selection function and survey geometry. Tension can be removed if we examine only a X-ray selected sub sample, with simulations performed assuming a modified survey geometry.« less

  12. Scalable Integrated Region-Based Image Retrieval Using IRM and Statistical Clustering.

    ERIC Educational Resources Information Center

    Wang, James Z.; Du, Yanping

    Statistical clustering is critical in designing scalable image retrieval systems. This paper presents a scalable algorithm for indexing and retrieving images based on region segmentation. The method uses statistical clustering on region features and IRM (Integrated Region Matching), a measure developed to evaluate overall similarity between images…

  13. Tsallis p⊥ distribution from statistical clusters

    NASA Astrophysics Data System (ADS)

    Bialas, A.

    2015-07-01

    It is shown that the transverse momentum distributions of particles emerging from the decay of statistical clusters, distributed according to a power law in their transverse energy, closely resemble those following from the Tsallis non-extensive statistical model. The experimental data are well reproduced with the cluster temperature T ≈ 160 MeV.

  14. A study on phenomenology of Dhat syndrome in men in a general medical setting

    PubMed Central

    Prakash, Sathya; Sharan, Pratap; Sood, Mamta

    2016-01-01

    Background: “Dhat syndrome” is believed to be a culture-bound syndrome of the Indian subcontinent. Although many studies have been performed, many have methodological limitations and there is a lack of agreement in many areas. Aims: The aim is to study the phenomenology of “Dhat syndrome” in men and to explore the possibility of subtypes within this entity. Settings and Design: It is a cross-sectional descriptive study conducted at a sex and marriage counseling clinic of a tertiary care teaching hospital in Northern India. Materials and Methods: An operational definition and assessment instrument for “Dhat syndrome” was developed after taking all concerned stakeholders into account and review of literature. It was applied on 100 patients along with socio-demographic profile, Hamilton Depression Rating Scale, Hamilton Anxiety Rating Scale, Mini International Neuropsychiatric Interview, and Postgraduate Institute Neuroticism Scale. Statistical Analysis: For statistical analysis, descriptive statistics, group comparisons, and Pearson's product moment correlations were carried out. Factor analysis and cluster analysis were done to determine the factor structure and subtypes of “Dhat syndrome.” Results: A diagnostic and assessment instrument for “Dhat syndrome” has been developed and the phenomenology in 100 patients has been described. Both the health beliefs scale and associated symptoms scale demonstrated a three-factor structure. The patients with “Dhat syndrome” could be categorized into three clusters based on severity. Conclusions: There appears to be a significant agreement among various stakeholders on the phenomenology of “Dhat syndrome” although some differences exist. “Dhat syndrome” could be subtyped into three clusters based on severity. PMID:27385844

  15. Characteristics and Classification of Least Altered Streamflows in Massachusetts

    USGS Publications Warehouse

    Armstrong, David S.; Parker, Gene W.; Richards, Todd A.

    2008-01-01

    Streamflow records from 85 streamflow-gaging stations at which streamflows were considered to be least altered were used to characterize natural streamflows within southern New England. Period-of-record streamflow data were used to determine annual hydrographs of median monthly flows. The shapes and magnitudes of annual hydrographs of median monthly flows, normalized by drainage area, differed among stations in different geographic areas of southern New England. These differences were gradational across southern New England and were attributed to differences in basin and climate characteristics. Period-of-record streamflow data were also used to analyze the statistical properties of daily streamflows at 61 stations across southern New England by using L-moment ratios. An L-moment ratio diagram of L-skewness and L-kurtosis showed a continuous gradation in these properties between stations and indicated differences between base-flow dominated and runoff-dominated rivers. Streamflow records from a concurrent period (1960-2004) for 61 stations were used in a multivariate statistical analysis to develop a hydrologic classification of rivers in southern New England. Missing records from 46 of these stations were extended by using a Maintenance of Variation Extension technique. The concurrent-period streamflows were used in the Indicators of Hydrologic Alteration and Hydrologic Index Tool programs to determine 224 hydrologic indices for the 61 stations. Principal-components analysis (PCA) was used to reduce the number of hydrologic indices to 20 that provided nonredundant information. The PCA also indicated that the major patterns of variability in the dataset are related to differences in flow variability and low-flow magnitude among the stations. Hierarchical cluster analysis was used to classify stations into groups with similar hydrologic properties. The cluster analysis classified rivers in southern New England into two broad groups: (1) base-flow dominated rivers, whose statistical properties indicated less flow variability and high magnitudes of low flow, and (2) runoff-dominated rivers, whose statistical properties indicated greater flow variability and lower magnitudes of low flow. A four-cluster classification further classified the runoff-dominated streams into three groups that varied in gradient, elevation, and differences in winter streamflow conditions: high-gradient runoff-dominated rivers, northern runoff-dominated rivers, and southern runoff-dominated rivers. A nine-cluster division indicated that basin size also becomes a distinguishing factor among basins at finer levels of classification. Smaller basins (less than 10 square miles) were classified into different groups than larger basins. A comparison of station classifications indicated that a classification based on multiple hydrologic indices that represent different aspects of the flow regime did not result in the same classification of stations as a classification based on a single type of statistic such as a monthly median. River basins identified by the cluster analysis as having similar hydrologic properties tended to have similar basin and climate characteristics and to be in close proximity to one another. Stations were not classified in the same cluster on the basis of geographic location alone; as a result, boundaries cannot be drawn between geographic regions with similar streamflow characteristics. Rivers with different basin and climate characteristics were classified in different clusters, even if they were in adjacent basins or upstream and downstream within the same basin.

  16. Unequal cluster sizes in stepped-wedge cluster randomised trials: a systematic review

    PubMed Central

    Morris, Tom; Gray, Laura

    2017-01-01

    Objectives To investigate the extent to which cluster sizes vary in stepped-wedge cluster randomised trials (SW-CRT) and whether any variability is accounted for during the sample size calculation and analysis of these trials. Setting Any, not limited to healthcare settings. Participants Any taking part in an SW-CRT published up to March 2016. Primary and secondary outcome measures The primary outcome is the variability in cluster sizes, measured by the coefficient of variation (CV) in cluster size. Secondary outcomes include the difference between the cluster sizes assumed during the sample size calculation and those observed during the trial, any reported variability in cluster sizes and whether the methods of sample size calculation and methods of analysis accounted for any variability in cluster sizes. Results Of the 101 included SW-CRTs, 48% mentioned that the included clusters were known to vary in size, yet only 13% of these accounted for this during the calculation of the sample size. However, 69% of the trials did use a method of analysis appropriate for when clusters vary in size. Full trial reports were available for 53 trials. The CV was calculated for 23 of these: the median CV was 0.41 (IQR: 0.22–0.52). Actual cluster sizes could be compared with those assumed during the sample size calculation for 14 (26%) of the trial reports; the cluster sizes were between 29% and 480% of that which had been assumed. Conclusions Cluster sizes often vary in SW-CRTs. Reporting of SW-CRTs also remains suboptimal. The effect of unequal cluster sizes on the statistical power of SW-CRTs needs further exploration and methods appropriate to studies with unequal cluster sizes need to be employed. PMID:29146637

  17. Binomial outcomes in dataset with some clusters of size two: can the dependence of twins be accounted for? A simulation study comparing the reliability of statistical methods based on a dataset of preterm infants.

    PubMed

    Sauzet, Odile; Peacock, Janet L

    2017-07-20

    The analysis of perinatal outcomes often involves datasets with some multiple births. These are datasets mostly formed of independent observations and a limited number of clusters of size two (twins) and maybe of size three or more. This non-independence needs to be accounted for in the statistical analysis. Using simulated data based on a dataset of preterm infants we have previously investigated the performance of several approaches to the analysis of continuous outcomes in the presence of some clusters of size two. Mixed models have been developed for binomial outcomes but very little is known about their reliability when only a limited number of small clusters are present. Using simulated data based on a dataset of preterm infants we investigated the performance of several approaches to the analysis of binomial outcomes in the presence of some clusters of size two. Logistic models, several methods of estimation for the logistic random intercept models and generalised estimating equations were compared. The presence of even a small percentage of twins means that a logistic regression model will underestimate all parameters but a logistic random intercept model fails to estimate the correlation between siblings if the percentage of twins is too small and will provide similar estimates to logistic regression. The method which seems to provide the best balance between estimation of the standard error and the parameter for any percentage of twins is the generalised estimating equations. This study has shown that the number of covariates or the level two variance do not necessarily affect the performance of the various methods used to analyse datasets containing twins but when the percentage of small clusters is too small, mixed models cannot capture the dependence between siblings.

  18. On intracluster Faraday rotation. II - Statistical analysis

    NASA Technical Reports Server (NTRS)

    Lawler, J. M.; Dennison, B.

    1982-01-01

    The comparison of a reliable sample of radio source Faraday rotation measurements seen through rich clusters of galaxies, with sources seen through the outer parts of clusters and therefore having little intracluster Faraday rotation, indicates that the distribution of rotation in the former population is broadened, but only at the 80% level of statistical confidence. Employing a physical model for the intracluster medium in which the square root of magnetic field strength/turbulent cell per gas core radius number ratio equals approximately 0.07 microgauss, a Monte Carlo simulation is able to reproduce the observed broadening. An upper-limit analysis figure of less than 0.20 microgauss for the field strength/turbulent cell ratio, combined with lower limits on field strength imposed by limitations on the Compton-scattered flux, shows that intracluster magnetic fields must be tangled on scales greater than about 20 kpc.

  19. Chemical modeling of groundwater in the Banat Plain, southwestern Romania, with elevated As content and co-occurring species by combining diagrams and unsupervised multivariate statistical approaches.

    PubMed

    Butaciu, Sinziana; Senila, Marin; Sarbu, Costel; Ponta, Michaela; Tanaselia, Claudiu; Cadar, Oana; Roman, Marius; Radu, Emil; Sima, Mihaela; Frentiu, Tiberiu

    2017-04-01

    The study proposes a combined model based on diagrams (Gibbs, Piper, Stuyfzand Hydrogeochemical Classification System) and unsupervised statistical approaches (Cluster Analysis, Principal Component Analysis, Fuzzy Principal Component Analysis, Fuzzy Hierarchical Cross-Clustering) to describe natural enrichment of inorganic arsenic and co-occurring species in groundwater in the Banat Plain, southwestern Romania. Speciation of inorganic As (arsenite, arsenate), ion concentrations (Na + , K + , Ca 2+ , Mg 2+ , HCO 3 - , Cl - , F - , SO 4 2- , PO 4 3- , NO 3 - ), pH, redox potential, conductivity and total dissolved substances were performed. Classical diagrams provided the hydrochemical characterization, while statistical approaches were helpful to establish (i) the mechanism of naturally occurring of As and F - species and the anthropogenic one for NO 3 - , SO 4 2- , PO 4 3- and K + and (ii) classification of groundwater based on content of arsenic species. The HCO 3 - type of local groundwater and alkaline pH (8.31-8.49) were found to be responsible for the enrichment of arsenic species and occurrence of F - but by different paths. The PO 4 3- -AsO 4 3- ion exchange, water-rock interaction (silicates hydrolysis and desorption from clay) were associated to arsenate enrichment in the oxidizing aquifer. Fuzzy Hierarchical Cross-Clustering was the strongest tool for the rapid simultaneous classification of groundwaters as a function of arsenic content and hydrogeochemical characteristics. The approach indicated the Na + -F - -pH cluster as marker for groundwater with naturally elevated As and highlighted which parameters need to be monitored. A chemical conceptual model illustrating the natural and anthropogenic paths and enrichment of As and co-occurring species in the local groundwater supported by mineralogical analysis of rocks was established. Copyright © 2016 Elsevier Ltd. All rights reserved.

  20. A Gender Bias Habit-Breaking Intervention Led to Increased Hiring of Female Faculty in STEMM Departments.

    PubMed

    Devine, Patricia G; Forscher, Patrick S; Cox, William T L; Kaatz, Anna; Sheridan, Jennifer; Carnes, Molly

    2017-11-01

    Addressing the underrepresentation of women in science is a top priority for many institutions, but the majority of efforts to increase representation of women are neither evidence-based nor rigorously assessed. One exception is the gender bias habit-breaking intervention (Carnes et al., 2015), which, in a cluster-randomized trial involving all but two departmental clusters ( N = 92) in the 6 STEMM focused schools/colleges at the University of Wisconsin - Madison, led to increases in gender bias awareness and self-efficacy to promote gender equity in academic science departments. Following this initial success, the present study compares, in a preregistered analysis, hiring rates of new female faculty pre- and post-manipulation. Whereas the proportion of women hired by control departments remained stable over time, the proportion of women hired by intervention departments increased by an estimated 18 percentage points ( OR = 2.23, d OR = 0.34). Though the preregistered analysis did not achieve conventional levels of statistical significance ( p < 0.07), our study has a hard upper limit on statistical power, as the cluster-randomized trial has a maximum sample size of 92 departmental clusters. These patterns have undeniable practical significance for the advancement of women in science, and provide promising evidence that psychological interventions can facilitate gender equity and diversity.

  1. Detecting Statistically Significant Communities of Triangle Motifs in Undirected Networks

    DTIC Science & Technology

    2016-04-26

    REPORT TYPE Final 3. DATES COVERED (From - To) 15 Oct 2014 to 14 Jan 2015 4. TITLE AND SUBTITLE Detecting statistically significant clusters of...extend the work of Perry et al. [6] by developing a statistical framework that supports the detection of triangle motif-based clusters in complex...priori, the need for triangle motif-based clustering . 2. Developed an algorithm for clustering undirected networks, where the triangle con guration was

  2. Alteration mapping at Goldfield, Nevada, by cluster and discriminant analysis of LANDSAT digital data

    NASA Technical Reports Server (NTRS)

    Ballew, G.

    1977-01-01

    The ability of Landsat multispectral digital data to differentiate among 62 combinations of rock and alteration types at the Goldfield mining district of Western Nevada was investigated by using statistical techniques of cluster and discriminant analysis. Multivariate discriminant analysis was not effective in classifying each of the 62 groups, with classification results essentially the same whether data of four channels alone or combined with six ratios of channels were used. Bivariate plots of group means revealed a cluster of three groups including mill tailings, basalt and all other rock and alteration types. Automatic hierarchical clustering based on the fourth dimensional Mahalanobis distance between group means of 30 groups having five or more samples was performed. The results of the cluster analysis revealed hierarchies of mill tailings vs. natural materials, basalt vs. non-basalt, highly reflectant rocks vs. other rocks and exclusively unaltered rocks vs. predominantly altered rocks. The hierarchies were used to determine the order in which sets of multiple discriminant analyses were to be performed and the resulting discriminant functions were used to produce a map of geology and alteration which has an overall accuracy of 70 percent for discriminating exclusively altered rocks from predominantly altered rocks.

  3. Automated clustering-based workload characterization

    NASA Technical Reports Server (NTRS)

    Pentakalos, Odysseas I.; Menasce, Daniel A.; Yesha, Yelena

    1996-01-01

    The demands placed on the mass storage systems at various federal agencies and national laboratories are continuously increasing in intensity. This forces system managers to constantly monitor the system, evaluate the demand placed on it, and tune it appropriately using either heuristics based on experience or analytic models. Performance models require an accurate workload characterization. This can be a laborious and time consuming process. It became evident from our experience that a tool is necessary to automate the workload characterization process. This paper presents the design and discusses the implementation of a tool for workload characterization of mass storage systems. The main features of the tool discussed here are: (1)Automatic support for peak-period determination. Histograms of system activity are generated and presented to the user for peak-period determination; (2) Automatic clustering analysis. The data collected from the mass storage system logs is clustered using clustering algorithms and tightness measures to limit the number of generated clusters; (3) Reporting of varied file statistics. The tool computes several statistics on file sizes such as average, standard deviation, minimum, maximum, frequency, as well as average transfer time. These statistics are given on a per cluster basis; (4) Portability. The tool can easily be used to characterize the workload in mass storage systems of different vendors. The user needs to specify through a simple log description language how the a specific log should be interpreted. The rest of this paper is organized as follows. Section two presents basic concepts in workload characterization as they apply to mass storage systems. Section three describes clustering algorithms and tightness measures. The following section presents the architecture of the tool. Section five presents some results of workload characterization using the tool.Finally, section six presents some concluding remarks.

  4. Identifying clusters of active transportation using spatial scan statistics.

    PubMed

    Huang, Lan; Stinchcomb, David G; Pickle, Linda W; Dill, Jennifer; Berrigan, David

    2009-08-01

    There is an intense interest in the possibility that neighborhood characteristics influence active transportation such as walking or biking. The purpose of this paper is to illustrate how a spatial cluster identification method can evaluate the geographic variation of active transportation and identify neighborhoods with unusually high/low levels of active transportation. Self-reported walking/biking prevalence, demographic characteristics, street connectivity variables, and neighborhood socioeconomic data were collected from respondents to the 2001 California Health Interview Survey (CHIS; N=10,688) in Los Angeles County (LAC) and San Diego County (SDC). Spatial scan statistics were used to identify clusters of high or low prevalence (with and without age-adjustment) and the quantity of time spent walking and biking. The data, a subset from the 2001 CHIS, were analyzed in 2007-2008. Geographic clusters of significantly high or low prevalence of walking and biking were detected in LAC and SDC. Structural variables such as street connectivity and shorter block lengths are consistently associated with higher levels of active transportation, but associations between active transportation and socioeconomic variables at the individual and neighborhood levels are mixed. Only one cluster with less time spent walking and biking among walkers/bikers was detected in LAC, and this was of borderline significance. Age-adjustment affects the clustering pattern of walking/biking prevalence in LAC, but not in SDC. The use of spatial scan statistics to identify significant clustering of health behaviors such as active transportation adds to the more traditional regression analysis that examines associations between behavior and environmental factors by identifying specific geographic areas with unusual levels of the behavior independent of predefined administrative units.

  5. Identifying Clusters of Active Transportation Using Spatial Scan Statistics

    PubMed Central

    Huang, Lan; Stinchcomb, David G.; Pickle, Linda W.; Dill, Jennifer; Berrigan, David

    2009-01-01

    Background There is an intense interest in the possibility that neighborhood characteristics influence active transportation such as walking or biking. The purpose of this paper is to illustrate how a spatial cluster identification method can evaluate the geographic variation of active transportation and identify neighborhoods with unusually high/low levels of active transportation. Methods Self-reported walking/biking prevalence, demographic characteristics, street connectivity variables, and neighborhood socioeconomic data were collected from respondents to the 2001 California Health Interview Survey (CHIS; N=10,688) in Los Angeles County (LAC) and San Diego County (SDC). Spatial scan statistics were used to identify clusters of high or low prevalence (with and without age-adjustment) and the quantity of time spent walking and biking. The data, a subset from the 2001 CHIS, were analyzed in 2007–2008. Results Geographic clusters of significantly high or low prevalence of walking and biking were detected in LAC and SDC. Structural variables such as street connectivity and shorter block lengths are consistently associated with higher levels of active transportation, but associations between active transportation and socioeconomic variables at the individual and neighborhood levels are mixed. Only one cluster with less time spent walking and biking among walkers/bikers was detected in LAC, and this was of borderline significance. Age-adjustment affects the clustering pattern of walking/biking prevalence in LAC, but not in SDC. Conclusions The use of spatial scan statistics to identify significant clustering of health behaviors such as active transportation adds to the more traditional regression analysis that examines associations between behavior and environmental factors by identifying specific geographic areas with unusual levels of the behavior independent of predefined administrative units. PMID:19589451

  6. Cardiovascular reactivity patterns and pathways to hypertension: a multivariate cluster analysis.

    PubMed

    Brindle, R C; Ginty, A T; Jones, A; Phillips, A C; Roseboom, T J; Carroll, D; Painter, R C; de Rooij, S R

    2016-12-01

    Substantial evidence links exaggerated mental stress induced blood pressure reactivity to future hypertension, but the results for heart rate reactivity are less clear. For this reason multivariate cluster analysis was carried out to examine the relationship between heart rate and blood pressure reactivity patterns and hypertension in a large prospective cohort (age range 55-60 years). Four clusters emerged with statistically different systolic and diastolic blood pressure and heart rate reactivity patterns. Cluster 1 was characterised by a relatively exaggerated blood pressure and heart rate response while the blood pressure and heart rate responses of cluster 2 were relatively modest and in line with the sample mean. Cluster 3 was characterised by blunted cardiovascular stress reactivity across all variables and cluster 4, by an exaggerated blood pressure response and modest heart rate response. Membership to cluster 4 conferred an increased risk of hypertension at 5-year follow-up (hazard ratio=2.98 (95% CI: 1.50-5.90), P<0.01) that survived adjustment for a host of potential confounding variables. These results suggest that the cardiac reactivity plays a potentially important role in the link between blood pressure reactivity and hypertension and support the use of multivariate approaches to stress psychophysiology.

  7. Epidemiologic Surveillance of Teenage Birth Rates in the United States, 2006-2012.

    PubMed

    Amin, Raid; Decesare, Julie Zemaitis; Hans, Jennifer; Roussos-Ross, Kay

    2017-06-01

    To investigate the geographic variation in the average teenage birth rates by county in the contiguous United States. Data from the National Center for Health Statistics were used in this retrospective cohort to count the total number of live births to females aged 15-19 years by county between 2006 and 2012. Software for disease surveillance and spatial cluster analysis was used to identify clusters of high or low teenage births in counties or areas of greater than 100,000 teenage females. The analysis was then adjusted for percentage of poverty and high school diploma achievement. The unadjusted analysis identified the top 10 clusters of teenage births. The cluster with the highest rate was a city and the surrounding 40 counties, demonstrating an average teen birth rate of 67 per 1,000 females in the age range, 87% higher than the rate in the contiguous United States. Adjustments for poverty rates and high school diploma achievement shifted the top clusters to other areas. Despite an overall national decline in the teenage birth rate, clusters of elevated teenage birth rates remain. These clusters are not random and remain higher than expected when adjusted for poverty and education. This data set provides a framework to focus targeted interventions to reduce teenage birth rates in this high-risk population.

  8. Theory of mind predicts severity level in autism.

    PubMed

    Hoogenhout, Michelle; Malcolm-Smith, Susan

    2017-02-01

    We investigated whether theory of mind skills can indicate autism spectrum disorder severity. In all, 62 children with autism spectrum disorder completed a developmentally sensitive theory of mind battery. We used intelligence quotient, Diagnostic and Statistical Manual of Mental Disorders (4th ed.) diagnosis and level of support needed as indicators of severity level. Using hierarchical cluster analysis, we found three distinct clusters of theory of mind ability: early-developing theory of mind (Cluster 1), false-belief reasoning (Cluster 2) and sophisticated theory of mind understanding (Cluster 3). The clusters corresponded to severe, moderate and mild autism spectrum disorder. As an indicator of level of support needed, cluster grouping predicted the type of school children attended. All Cluster 1 children attended autism-specific schools; Cluster 2 was divided between autism-specific and special needs schools and nearly all Cluster 3 children attended general special needs and mainstream schools. Assessing theory of mind skills can reliably discriminate severity levels within autism spectrum disorder.

  9. A comparison of hierarchical cluster analysis and league table rankings as methods for analysis and presentation of district health system performance data in Uganda.

    PubMed

    Tashobya, Christine K; Dubourg, Dominique; Ssengooba, Freddie; Speybroeck, Niko; Macq, Jean; Criel, Bart

    2016-03-01

    In 2003, the Uganda Ministry of Health introduced the district league table for district health system performance assessment. The league table presents district performance against a number of input, process and output indicators and a composite index to rank districts. This study explores the use of hierarchical cluster analysis for analysing and presenting district health systems performance data and compares this approach with the use of the league table in Uganda. Ministry of Health and district plans and reports, and published documents were used to provide information on the development and utilization of the Uganda district league table. Quantitative data were accessed from the Ministry of Health databases. Statistical analysis using SPSS version 20 and hierarchical cluster analysis, utilizing Wards' method was used. The hierarchical cluster analysis was conducted on the basis of seven clusters determined for each year from 2003 to 2010, ranging from a cluster of good through moderate-to-poor performers. The characteristics and membership of clusters varied from year to year and were determined by the identity and magnitude of performance of the individual variables. Criticisms of the league table include: perceived unfairness, as it did not take into consideration district peculiarities; and being oversummarized and not adequately informative. Clustering organizes the many data points into clusters of similar entities according to an agreed set of indicators and can provide the beginning point for identifying factors behind the observed performance of districts. Although league table ranking emphasize summation and external control, clustering has the potential to encourage a formative, learning approach. More research is required to shed more light on factors behind observed performance of the different clusters. Other countries especially low-income countries that share many similarities with Uganda can learn from these experiences. © The Author 2015. Published by Oxford University Press in association with The London School of Hygiene and Tropical Medicine.

  10. A comparison of hierarchical cluster analysis and league table rankings as methods for analysis and presentation of district health system performance data in Uganda†

    PubMed Central

    Tashobya, Christine K; Dubourg, Dominique; Ssengooba, Freddie; Speybroeck, Niko; Macq, Jean; Criel, Bart

    2016-01-01

    In 2003, the Uganda Ministry of Health introduced the district league table for district health system performance assessment. The league table presents district performance against a number of input, process and output indicators and a composite index to rank districts. This study explores the use of hierarchical cluster analysis for analysing and presenting district health systems performance data and compares this approach with the use of the league table in Uganda. Ministry of Health and district plans and reports, and published documents were used to provide information on the development and utilization of the Uganda district league table. Quantitative data were accessed from the Ministry of Health databases. Statistical analysis using SPSS version 20 and hierarchical cluster analysis, utilizing Wards’ method was used. The hierarchical cluster analysis was conducted on the basis of seven clusters determined for each year from 2003 to 2010, ranging from a cluster of good through moderate-to-poor performers. The characteristics and membership of clusters varied from year to year and were determined by the identity and magnitude of performance of the individual variables. Criticisms of the league table include: perceived unfairness, as it did not take into consideration district peculiarities; and being oversummarized and not adequately informative. Clustering organizes the many data points into clusters of similar entities according to an agreed set of indicators and can provide the beginning point for identifying factors behind the observed performance of districts. Although league table ranking emphasize summation and external control, clustering has the potential to encourage a formative, learning approach. More research is required to shed more light on factors behind observed performance of the different clusters. Other countries especially low-income countries that share many similarities with Uganda can learn from these experiences. PMID:26024882

  11. Nursing home care quality: a cluster analysis.

    PubMed

    Grøndahl, Vigdis Abrahamsen; Fagerli, Liv Berit

    2017-02-13

    Purpose The purpose of this paper is to explore potential differences in how nursing home residents rate care quality and to explore cluster characteristics. Design/methodology/approach A cross-sectional design was used, with one questionnaire including questions from quality from patients' perspective and Big Five personality traits, together with questions related to socio-demographic aspects and health condition. Residents ( n=103) from four Norwegian nursing homes participated (74.1 per cent response rate). Hierarchical cluster analysis identified clusters with respect to care quality perceptions. χ 2 tests and one-way between-groups ANOVA were performed to characterise the clusters ( p<0.05). Findings Two clusters were identified; Cluster 1 residents (28.2 per cent) had the best care quality perceptions and Cluster 2 (67.0 per cent) had the worst perceptions. The clusters were statistically significant and characterised by personal-related conditions: gender, psychological well-being, preferences, admission, satisfaction with staying in the nursing home, emotional stability and agreeableness, and by external objective care conditions: healthcare personnel and registered nurses. Research limitations/implications Residents assessed as having no cognitive impairments were included, thus excluding the largest group. By choosing questionnaire design and structured interviews, the number able to participate may increase. Practical implications Findings may provide healthcare personnel and managers with increased knowledge on which to develop strategies to improve specific care quality perceptions. Originality/value Cluster analysis can be an effective tool for differentiating between nursing homes residents' care quality perceptions.

  12. Fascioliasis risk factors and space-time clusters in domestic ruminants in Bangladesh.

    PubMed

    Rahman, A K M Anisur; Islam, S K Shaheenur; Talukder, Md Hasanuzzaman; Hassan, Md Kumrul; Dhand, Navneet K; Ward, Michael P

    2017-05-08

    A retrospective observational study was conducted to identify fascioliasis hotspots, clusters, potential risk factors and to map fascioliasis risk in domestic ruminants in Bangladesh. Cases of fascioliasis in cattle, buffalo, sheep and goats from all districts in Bangladesh between 2011 and 2013 were identified via secondary surveillance data from the Department of Livestock Services' Epidemiology Unit. From each case report, date of report, species affected and district data were extracted. The total number of domestic ruminants in each district was used to calculate fascioliasis cases per ten thousand animals at risk per district, and this was used for cluster and hotspot analysis. Clustering was assessed with Moran's spatial autocorrelation statistic, hotspots with the local indicator of spatial association (LISA) statistic and space-time clusters with the scan statistic (Poisson model). The association between district fascioliasis prevalence and climate (temperature, precipitation), elevation, land cover and water bodies was investigated using a spatial regression model. A total of 1,723,971 cases of fascioliasis were reported in the three-year study period in cattle (1,164,560), goats (424,314), buffalo (88,924) and sheep (46,173). A total of nine hotspots were identified; one of these persisted in each of the three years. Only two local clusters were found. Five space-time clusters located within 22 districts were also identified. Annual risk maps of fascioliasis cases correlated with the hotspots and clusters detected. Cultivated and managed (P < 0.001) and artificial surface (P = 0.04) land cover areas, and elevation (P = 0.003) were positively and negatively associated with fascioliasis in Bangladesh, respectively. Results indicate that due to land use characteristics some areas of Bangladesh are at greater risk of fascioliasis. The potential risk factors, hot spots and clusters identified in this study can be used to guide science-based treatment and control decisions for fascioliasis in Bangladesh and in other similar geo-climatic zones throughout the world.

  13. Groundwater source contamination mechanisms: Physicochemical profile clustering, risk factor analysis and multivariate modelling

    NASA Astrophysics Data System (ADS)

    Hynds, Paul; Misstear, Bruce D.; Gill, Laurence W.; Murphy, Heather M.

    2014-04-01

    An integrated domestic well sampling and "susceptibility assessment" programme was undertaken in the Republic of Ireland from April 2008 to November 2010. Overall, 211 domestic wells were sampled, assessed and collated with local climate data. Based upon groundwater physicochemical profile, three clusters have been identified and characterised by source type (borehole or hand-dug well) and local geological setting. Statistical analysis indicates that cluster membership is significantly associated with the prevalence of bacteria (p = 0.001), with mean Escherichia coli presence within clusters ranging from 15.4% (Cluster-1) to 47.6% (Cluster-3). Bivariate risk factor analysis shows that on-site septic tank presence was the only risk factor significantly associated (p < 0.05) with bacterial presence within all clusters. Point agriculture adjacency was significantly associated with both borehole-related clusters. Well design criteria were associated with hand-dug wells and boreholes in areas characterised by high permeability subsoils, while local geological setting was significant for hand-dug wells and boreholes in areas dominated by low/moderate permeability subsoils. Multivariate susceptibility models were developed for all clusters, with predictive accuracies of 84% (Cluster-1) to 91% (Cluster-2) achieved. Septic tank setback was a common variable within all multivariate models, while agricultural sources were also significant, albeit to a lesser degree. Furthermore, well liner clearance was a significant factor in all models, indicating that direct surface ingress is a significant well contamination mechanism. Identification and elucidation of cluster-specific contamination mechanisms may be used to develop improved overall risk management and wellhead protection strategies, while also informing future remediation and maintenance efforts.

  14. An integrated workflow for robust alignment and simplified quantitative analysis of NMR spectrometry data.

    PubMed

    Vu, Trung N; Valkenborg, Dirk; Smets, Koen; Verwaest, Kim A; Dommisse, Roger; Lemière, Filip; Verschoren, Alain; Goethals, Bart; Laukens, Kris

    2011-10-20

    Nuclear magnetic resonance spectroscopy (NMR) is a powerful technique to reveal and compare quantitative metabolic profiles of biological tissues. However, chemical and physical sample variations make the analysis of the data challenging, and typically require the application of a number of preprocessing steps prior to data interpretation. For example, noise reduction, normalization, baseline correction, peak picking, spectrum alignment and statistical analysis are indispensable components in any NMR analysis pipeline. We introduce a novel suite of informatics tools for the quantitative analysis of NMR metabolomic profile data. The core of the processing cascade is a novel peak alignment algorithm, called hierarchical Cluster-based Peak Alignment (CluPA). The algorithm aligns a target spectrum to the reference spectrum in a top-down fashion by building a hierarchical cluster tree from peak lists of reference and target spectra and then dividing the spectra into smaller segments based on the most distant clusters of the tree. To reduce the computational time to estimate the spectral misalignment, the method makes use of Fast Fourier Transformation (FFT) cross-correlation. Since the method returns a high-quality alignment, we can propose a simple methodology to study the variability of the NMR spectra. For each aligned NMR data point the ratio of the between-group and within-group sum of squares (BW-ratio) is calculated to quantify the difference in variability between and within predefined groups of NMR spectra. This differential analysis is related to the calculation of the F-statistic or a one-way ANOVA, but without distributional assumptions. Statistical inference based on the BW-ratio is achieved by bootstrapping the null distribution from the experimental data. The workflow performance was evaluated using a previously published dataset. Correlation maps, spectral and grey scale plots show clear improvements in comparison to other methods, and the down-to-earth quantitative analysis works well for the CluPA-aligned spectra. The whole workflow is embedded into a modular and statistically sound framework that is implemented as an R package called "speaq" ("spectrum alignment and quantitation"), which is freely available from http://code.google.com/p/speaq/.

  15. Eye-gaze determination of user intent at the computer interface

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Goldberg, J.H.; Schryver, J.C.

    1993-12-31

    Determination of user intent at the computer interface through eye-gaze monitoring can significantly aid applications for the disabled, as well as telerobotics and process control interfaces. Whereas current eye-gaze control applications are limited to object selection and x/y gazepoint tracking, a methodology was developed here to discriminate a more abstract interface operation: zooming-in or out. This methodology first collects samples of eve-gaze location looking at controlled stimuli, at 30 Hz, just prior to a user`s decision to zoom. The sample is broken into data frames, or temporal snapshots. Within a data frame, all spatial samples are connected into a minimummore » spanning tree, then clustered, according to user defined parameters. Each cluster is mapped to one in the prior data frame, and statistics are computed from each cluster. These characteristics include cluster size, position, and pupil size. A multiple discriminant analysis uses these statistics both within and between data frames to formulate optimal rules for assigning the observations into zooming, zoom-out, or no zoom conditions. The statistical procedure effectively generates heuristics for future assignments, based upon these variables. Future work will enhance the accuracy and precision of the modeling technique, and will empirically test users in controlled experiments.« less

  16. Dissociation kinetics of metal clusters on multiple electronic states including electronic level statistics into the vibronic soup

    NASA Astrophysics Data System (ADS)

    Shvartsburg, Alexandre A.; Siu, K. W. Michael

    2001-06-01

    Modeling the delayed dissociation of clusters had been over the last decade a frontline development area in chemical physics. It is of fundamental interest how statistical kinetics methods previously validated for regular molecules and atomic nuclei may apply to clusters, as this would help to understand the transferability of statistical models for disintegration of complex systems across various classes of physical objects. From a practical perspective, accurate simulation of unimolecular decomposition is critical for the extraction of true thermochemical values from measurements on the decay of energized clusters. Metal clusters are particularly challenging because of the multitude of low-lying electronic states that are coupled to vibrations. This has previously been accounted for assuming the average electronic structure of a conducting cluster approximated by the levels of electron in a cavity. While this provides a reasonable time-averaged description, it ignores the distribution of instantaneous electronic structures in a "boiling" cluster around that average. Here we set up a new treatment that incorporates the statistical distribution of electronic levels around the average picture using random matrix theory. This approach faithfully reflects the completely chaotic "vibronic soup" nature of hot metal clusters. We found that the consideration of electronic level statistics significantly promotes electronic excitation and thus increases the magnitude of its effect. As this excitation always depresses the decay rates, the inclusion of level statistics results in slower dissociation of metal clusters.

  17. Modeling stock price dynamics by continuum percolation system and relevant complex systems analysis

    NASA Astrophysics Data System (ADS)

    Xiao, Di; Wang, Jun

    2012-10-01

    The continuum percolation system is developed to model a random stock price process in this work. Recent empirical research has demonstrated various statistical features of stock price changes, the financial model aiming at understanding price fluctuations needs to define a mechanism for the formation of the price, in an attempt to reproduce and explain this set of empirical facts. The continuum percolation model is usually referred to as a random coverage process or a Boolean model, the local interaction or influence among traders is constructed by the continuum percolation, and a cluster of continuum percolation is applied to define the cluster of traders sharing the same opinion about the market. We investigate and analyze the statistical behaviors of normalized returns of the price model by some analysis methods, including power-law tail distribution analysis, chaotic behavior analysis and Zipf analysis. Moreover, we consider the daily returns of Shanghai Stock Exchange Composite Index from January 1997 to July 2011, and the comparisons of return behaviors between the actual data and the simulation data are exhibited.

  18. Identification of atypical flight patterns

    NASA Technical Reports Server (NTRS)

    Statler, Irving C. (Inventor); Ferryman, Thomas A. (Inventor); Amidan, Brett G. (Inventor); Whitney, Paul D. (Inventor); White, Amanda M. (Inventor); Willse, Alan R. (Inventor); Cooley, Scott K. (Inventor); Jay, Joseph Griffith (Inventor); Lawrence, Robert E. (Inventor); Mosbrucker, Chris (Inventor)

    2005-01-01

    Method and system for analyzing aircraft data, including multiple selected flight parameters for a selected phase of a selected flight, and for determining when the selected phase of the selected flight is atypical, when compared with corresponding data for the same phase for other similar flights. A flight signature is computed using continuous-valued and discrete-valued flight parameters for the selected flight parameters and is optionally compared with a statistical distribution of other observed flight signatures, yielding atypicality scores for the same phase for other similar flights. A cluster analysis is optionally applied to the flight signatures to define an optimal collection of clusters. A level of atypicality for a selected flight is estimated, based upon an index associated with the cluster analysis.

  19. Effects of additional data on Bayesian clustering.

    PubMed

    Yamazaki, Keisuke

    2017-10-01

    Hierarchical probabilistic models, such as mixture models, are used for cluster analysis. These models have two types of variables: observable and latent. In cluster analysis, the latent variable is estimated, and it is expected that additional information will improve the accuracy of the estimation of the latent variable. Many proposed learning methods are able to use additional data; these include semi-supervised learning and transfer learning. However, from a statistical point of view, a complex probabilistic model that encompasses both the initial and additional data might be less accurate due to having a higher-dimensional parameter. The present paper presents a theoretical analysis of the accuracy of such a model and clarifies which factor has the greatest effect on its accuracy, the advantages of obtaining additional data, and the disadvantages of increasing the complexity. Copyright © 2017 Elsevier Ltd. All rights reserved.

  20. Comparative Performance Analysis of a Hyper-Temporal Ndvi Analysis Approach and a Landscape-Ecological Mapping Approach

    NASA Astrophysics Data System (ADS)

    Ali, A.; de Bie, C. A. J. M.; Scarrott, R. G.; Ha, N. T. T.; Skidmore, A. K.

    2012-07-01

    Both agricultural area expansion and intensification are necessary to cope with the growing demand for food, and the growing threat of food insecurity which is rapidly engulfing poor and under-privileged sections of the global population. Therefore, it is of paramount importance to have the ability to accurately estimate crop area and spatial distribution. Remote sensing has become a valuable tool for estimating and mapping cropland areas, useful in food security monitoring. This work contributes to addressing this broad issue, focusing on the comparative performance analysis of two mapping approaches (i) a hyper-temporal Normalized Difference Vegetation Index (NDVI) analysis approach and (ii) a Landscape-ecological approach. The hyper-temporal NDVI analysis approach utilized SPOT 10-day NDVI imagery from April 1998-December 2008, whilst the Landscape-ecological approach used multitemporal Landsat-7 ETM+ imagery acquired intermittently between 1992 and 2002. Pixels in the time-series NDVI dataset were clustered using an ISODATA clustering algorithm adapted to determine the optimal number of pixel clusters to successfully generalize hyper-temporal datasets. Clusters were then characterized with crop cycle information, and flooding information to produce an NDVI unit map of rice classes with flood regime and NDVI profile information. A Landscape-ecological map was generated using a combination of digitized homogenous map units in the Landsat-7 ETM+ imagery, a Land use map 2005 of the Mekong delta, and supplementary datasets on the regions terrain, geo-morphology and flooding depths. The output maps were validated using reported crop statistics, and regression analyses were used to ascertain the relationship between land use area estimated from maps, and those reported in district crop statistics. The regression analysis showed that the hyper-temporal NDVI analysis approach explained 74% and 76% of the variability in reported crop statistics in two rice crop and three rice crop land use systems respectively. In contrast, 64% and 63% of the variability was explained respectively by the Landscape-ecological map. Overall, the results indicate the hyper-temporal NDVI analysis approach is more accurate and more useful in exploring when, why and how agricultural land use manifests itself in space and time. Furthermore, the NDVI analysis approach was found to be easier to implement, was more cost effective, and involved less subjective user intervention than the landscape-ecological approach.

  1. Spatial-temporal clustering of companion animal enteric syndrome: detection and investigation through the use of electronic medical records from participating private practices.

    PubMed

    Anholt, R M; Berezowski, J; Robertson, C; Stephen, C

    2015-09-01

    There is interest in the potential of companion animal surveillance to provide data to improve pet health and to provide early warning of environmental hazards to people. We implemented a companion animal surveillance system in Calgary, Alberta and the surrounding communities. Informatics technologies automatically extracted electronic medical records from participating veterinary practices and identified cases of enteric syndrome in the warehoused records. The data were analysed using time-series analyses and a retrospective space-time permutation scan statistic. We identified a seasonal pattern of reports of occurrences of enteric syndromes in companion animals and four statistically significant clusters of enteric syndrome cases. The cases within each cluster were examined and information about the animals involved (species, age, sex), their vaccination history, possible exposure or risk behaviour history, information about disease severity, and the aetiological diagnosis was collected. We then assessed whether the cases within the cluster were unusual and if they represented an animal or public health threat. There was often insufficient information recorded in the medical record to characterize the clusters by aetiology or exposures. Space-time analysis of companion animal enteric syndrome cases found evidence of clustering. Collection of more epidemiologically relevant data would enhance the utility of practice-based companion animal surveillance.

  2. Global, local and focused geographic clustering for case-control data with residential histories

    PubMed Central

    Jacquez, Geoffrey M; Kaufmann, Andy; Meliker, Jaymie; Goovaerts, Pierre; AvRuskin, Gillian; Nriagu, Jerome

    2005-01-01

    Background This paper introduces a new approach for evaluating clustering in case-control data that accounts for residential histories. Although many statistics have been proposed for assessing local, focused and global clustering in health outcomes, few, if any, exist for evaluating clusters when individuals are mobile. Methods Local, global and focused tests for residential histories are developed based on sets of matrices of nearest neighbor relationships that reflect the changing topology of cases and controls. Exposure traces are defined that account for the latency between exposure and disease manifestation, and that use exposure windows whose duration may vary. Several of the methods so derived are applied to evaluate clustering of residential histories in a case-control study of bladder cancer in south eastern Michigan. These data are still being collected and the analysis is conducted for demonstration purposes only. Results Statistically significant clustering of residential histories of cases was found but is likely due to delayed reporting of cases by one of the hospitals participating in the study. Conclusion Data with residential histories are preferable when causative exposures and disease latencies occur on a long enough time span that human mobility matters. To analyze such data, methods are needed that take residential histories into account. PMID:15784151

  3. Hydrometeor classification through statistical clustering of polarimetric radar measurements: a semi-supervised approach

    NASA Astrophysics Data System (ADS)

    Besic, Nikola; Ventura, Jordi Figueras i.; Grazioli, Jacopo; Gabella, Marco; Germann, Urs; Berne, Alexis

    2016-09-01

    Polarimetric radar-based hydrometeor classification is the procedure of identifying different types of hydrometeors by exploiting polarimetric radar observations. The main drawback of the existing supervised classification methods, mostly based on fuzzy logic, is a significant dependency on a presumed electromagnetic behaviour of different hydrometeor types. Namely, the results of the classification largely rely upon the quality of scattering simulations. When it comes to the unsupervised approach, it lacks the constraints related to the hydrometeor microphysics. The idea of the proposed method is to compensate for these drawbacks by combining the two approaches in a way that microphysical hypotheses can, to a degree, adjust the content of the classes obtained statistically from the observations. This is done by means of an iterative approach, performed offline, which, in a statistical framework, examines clustered representative polarimetric observations by comparing them to the presumed polarimetric properties of each hydrometeor class. Aside from comparing, a routine alters the content of clusters by encouraging further statistical clustering in case of non-identification. By merging all identified clusters, the multi-dimensional polarimetric signatures of various hydrometeor types are obtained for each of the studied representative datasets, i.e. for each radar system of interest. These are depicted by sets of centroids which are then employed in operational labelling of different hydrometeors. The method has been applied on three C-band datasets, each acquired by different operational radar from the MeteoSwiss Rad4Alp network, as well as on two X-band datasets acquired by two research mobile radars. The results are discussed through a comparative analysis which includes a corresponding supervised and unsupervised approach, emphasising the operational potential of the proposed method.

  4. Cluster stability in the analysis of mass cytometry data.

    PubMed

    Melchiotti, Rossella; Gracio, Filipe; Kordasti, Shahram; Todd, Alan K; de Rinaldis, Emanuele

    2017-01-01

    Manual gating has been traditionally applied to cytometry data sets to identify cells based on protein expression. The advent of mass cytometry allows for a higher number of proteins to be simultaneously measured on cells, therefore providing a means to define cell clusters in a high dimensional expression space. This enhancement, whilst opening unprecedented opportunities for single cell-level analyses, makes the incremental replacement of manual gating with automated clustering a compelling need. To this aim many methods have been implemented and their successful applications demonstrated in different settings. However, the reproducibility of automatically generated clusters is proving challenging and an analytical framework to distinguish spurious clusters from more stable entities, and presumably more biologically relevant ones, is still missing. One way to estimate cell clusters' stability is the evaluation of their consistent re-occurrence within- and between-algorithms, a metric that is commonly used to evaluate results from gene expression. Herein we report the usage and importance of cluster stability evaluations, when applied to results generated from three popular clustering algorithms - SPADE, FLOCK and PhenoGraph - run on four different data sets. These algorithms were shown to generate clusters with various degrees of statistical stability, many of them being unstable. By comparing the results of automated clustering with manually gated populations, we illustrate how information on cluster stability can assist towards a more rigorous and informed interpretation of clustering results. We also explore the relationships between statistical stability and other properties such as clusters' compactness and isolation, demonstrating that whilst cluster stability is linked to other properties it cannot be reliably predicted by any of them. Our study proposes the introduction of cluster stability as a necessary checkpoint for cluster interpretation and contributes to the construction of a more systematic and standardized analytical framework for the assessment of cytometry clustering results. © 2016 International Society for Advancement of Cytometry. © 2016 International Society for Advancement of Cytometry.

  5. Computational cluster validation for microarray data analysis: experimental assessment of Clest, Consensus Clustering, Figure of Merit, Gap Statistics and Model Explorer.

    PubMed

    Giancarlo, Raffaele; Scaturro, Davide; Utro, Filippo

    2008-10-29

    Inferring cluster structure in microarray datasets is a fundamental task for the so-called -omic sciences. It is also a fundamental question in Statistics, Data Analysis and Classification, in particular with regard to the prediction of the number of clusters in a dataset, usually established via internal validation measures. Despite the wealth of internal measures available in the literature, new ones have been recently proposed, some of them specifically for microarray data. We consider five such measures: Clest, Consensus (Consensus Clustering), FOM (Figure of Merit), Gap (Gap Statistics) and ME (Model Explorer), in addition to the classic WCSS (Within Cluster Sum-of-Squares) and KL (Krzanowski and Lai index). We perform extensive experiments on six benchmark microarray datasets, using both Hierarchical and K-means clustering algorithms, and we provide an analysis assessing both the intrinsic ability of a measure to predict the correct number of clusters in a dataset and its merit relative to the other measures. We pay particular attention both to precision and speed. Moreover, we also provide various fast approximation algorithms for the computation of Gap, FOM and WCSS. The main result is a hierarchy of those measures in terms of precision and speed, highlighting some of their merits and limitations not reported before in the literature. Based on our analysis, we draw several conclusions for the use of those internal measures on microarray data. We report the main ones. Consensus is by far the best performer in terms of predictive power and remarkably algorithm-independent. Unfortunately, on large datasets, it may be of no use because of its non-trivial computer time demand (weeks on a state of the art PC). FOM is the second best performer although, quite surprisingly, it may not be competitive in this scenario: it has essentially the same predictive power of WCSS but it is from 6 to 100 times slower in time, depending on the dataset. The approximation algorithms for the computation of FOM, Gap and WCSS perform very well, i.e., they are faster while still granting a very close approximation of FOM and WCSS. The approximation algorithm for the computation of Gap deserves to be singled-out since it has a predictive power far better than Gap, it is competitive with the other measures, but it is at least two order of magnitude faster in time with respect to Gap. Another important novel conclusion that can be drawn from our analysis is that all the measures we have considered show severe limitations on large datasets, either due to computational demand (Consensus, as already mentioned, Clest and Gap) or to lack of precision (all of the other measures, including their approximations). The software and datasets are available under the GNU GPL on the supplementary material web page.

  6. Robust statistical methods for hit selection in RNA interference high-throughput screening experiments.

    PubMed

    Zhang, Xiaohua Douglas; Yang, Xiting Cindy; Chung, Namjin; Gates, Adam; Stec, Erica; Kunapuli, Priya; Holder, Dan J; Ferrer, Marc; Espeseth, Amy S

    2006-04-01

    RNA interference (RNAi) high-throughput screening (HTS) experiments carried out using large (>5000 short interfering [si]RNA) libraries generate a huge amount of data. In order to use these data to identify the most effective siRNAs tested, it is critical to adopt and develop appropriate statistical methods. To address the questions in hit selection of RNAi HTS, we proposed a quartile-based method which is robust to outliers, true hits and nonsymmetrical data. We compared it with the more traditional tests, mean +/- k standard deviation (SD) and median +/- 3 median of absolute deviation (MAD). The results suggested that the quartile-based method selected more hits than mean +/- k SD under the same preset error rate. The number of hits selected by median +/- k MAD was close to that by the quartile-based method. Further analysis suggested that the quartile-based method had the greatest power in detecting true hits, especially weak or moderate true hits. Our investigation also suggested that platewise analysis (determining effective siRNAs on a plate-by-plate basis) can adjust for systematic errors in different plates, while an experimentwise analysis, in which effective siRNAs are identified in an analysis of the entire experiment, cannot. However, experimentwise analysis may detect a cluster of true positive hits placed together in one or several plates, while platewise analysis may not. To display hit selection results, we designed a specific figure called a plate-well series plot. We thus suggest the following strategy for hit selection in RNAi HTS experiments. First, choose the quartile-based method, or median +/- k MAD, for identifying effective siRNAs. Second, perform the chosen method experimentwise on transformed/normalized data, such as percentage inhibition, to check the possibility of hit clusters. If a cluster of selected hits are observed, repeat the analysis based on untransformed data to determine whether the cluster is due to an artifact in the data. If no clusters of hits are observed, select hits by performing platewise analysis on transformed data. Third, adopt the plate-well series plot to visualize both the data and the hit selection results, as well as to check for artifacts.

  7. Cognitive Clusters in Specific Learning Disorder.

    PubMed

    Poletti, Michele; Carretta, Elisa; Bonvicini, Laura; Giorgi-Rossi, Paolo

    The heterogeneity among children with learning disabilities still represents a barrier and a challenge in their conceptualization. Although a dimensional approach has been gaining support, the categorical approach is still the most adopted, as in the recent fifth edition of the Diagnostic and Statistical Manual of Mental Disorders. The introduction of the single overarching diagnostic category of specific learning disorder (SLD) could underemphasize interindividual clinical differences regarding intracategory cognitive functioning and learning proficiency, according to current models of multiple cognitive deficits at the basis of neurodevelopmental disorders. The characterization of specific cognitive profiles associated with an already manifest SLD could help identify possible early cognitive markers of SLD risk and distinct trajectories of atypical cognitive development leading to SLD. In this perspective, we applied a cluster analysis to identify groups of children with a Diagnostic and Statistical Manual-based diagnosis of SLD with similar cognitive profiles and to describe the association between clusters and SLD subtypes. A sample of 205 children with a diagnosis of SLD were enrolled. Cluster analyses (agglomerative hierarchical and nonhierarchical iterative clustering technique) were used successively on 10 core subtests of the Wechsler Intelligence Scale for Children-Fourth Edition. The 4-cluster solution was adopted, and external validation found differences in terms of SLD subtype frequencies and learning proficiency among clusters. Clinical implications of these findings are discussed, tracing directions for further studies.

  8. A Data Analytics Approach to Discovering Unique Microstructural Configurations Susceptible to Fatigue

    NASA Astrophysics Data System (ADS)

    Jha, S. K.; Brockman, R. A.; Hoffman, R. M.; Sinha, V.; Pilchak, A. L.; Porter, W. J.; Buchanan, D. J.; Larsen, J. M.; John, R.

    2018-05-01

    Principal component analysis and fuzzy c-means clustering algorithms were applied to slip-induced strain and geometric metric data in an attempt to discover unique microstructural configurations and their frequencies of occurrence in statistically representative instantiations of a titanium alloy microstructure. Grain-averaged fatigue indicator parameters were calculated for the same instantiation. The fatigue indicator parameters strongly correlated with the spatial location of the microstructural configurations in the principal components space. The fuzzy c-means clustering method identified clusters of data that varied in terms of their average fatigue indicator parameters. Furthermore, the number of points in each cluster was inversely correlated to the average fatigue indicator parameter. This analysis demonstrates that data-driven methods have significant potential for providing unbiased determination of unique microstructural configurations and their frequencies of occurrence in a given volume from the point of view of strain localization and fatigue crack initiation.

  9. Efficient and Flexible Climate Analysis with Python in a Cloud-Based Distributed Computing Framework

    NASA Astrophysics Data System (ADS)

    Gannon, C.

    2017-12-01

    As climate models become progressively more advanced, and spatial resolution further improved through various downscaling projects, climate projections at a local level are increasingly insightful and valuable. However, the raw size of climate datasets presents numerous hurdles for analysts wishing to develop customized climate risk metrics or perform site-specific statistical analysis. Four Twenty Seven, a climate risk consultancy, has implemented a Python-based distributed framework to analyze large climate datasets in the cloud. With the freedom afforded by efficiently processing these datasets, we are able to customize and continually develop new climate risk metrics using the most up-to-date data. Here we outline our process for using Python packages such as XArray and Dask to evaluate netCDF files in a distributed framework, StarCluster to operate in a cluster-computing environment, cloud computing services to access publicly hosted datasets, and how this setup is particularly valuable for generating climate change indicators and performing localized statistical analysis.

  10. SparRec: An effective matrix completion framework of missing data imputation for GWAS

    NASA Astrophysics Data System (ADS)

    Jiang, Bo; Ma, Shiqian; Causey, Jason; Qiao, Linbo; Hardin, Matthew Price; Bitts, Ian; Johnson, Daniel; Zhang, Shuzhong; Huang, Xiuzhen

    2016-10-01

    Genome-wide association studies present computational challenges for missing data imputation, while the advances of genotype technologies are generating datasets of large sample sizes with sample sets genotyped on multiple SNP chips. We present a new framework SparRec (Sparse Recovery) for imputation, with the following properties: (1) The optimization models of SparRec, based on low-rank and low number of co-clusters of matrices, are different from current statistics methods. While our low-rank matrix completion (LRMC) model is similar to Mendel-Impute, our matrix co-clustering factorization (MCCF) model is completely new. (2) SparRec, as other matrix completion methods, is flexible to be applied to missing data imputation for large meta-analysis with different cohorts genotyped on different sets of SNPs, even when there is no reference panel. This kind of meta-analysis is very challenging for current statistics based methods. (3) SparRec has consistent performance and achieves high recovery accuracy even when the missing data rate is as high as 90%. Compared with Mendel-Impute, our low-rank based method achieves similar accuracy and efficiency, while the co-clustering based method has advantages in running time. The testing results show that SparRec has significant advantages and competitive performance over other state-of-the-art existing statistics methods including Beagle and fastPhase.

  11. Country clustering applied to the water & sanitation sector: a new tool with potential applications in research & policy

    PubMed Central

    Onda, Kyle; Crocker, Jonny; Kayser, Georgia Lyn; Bartram, Jamie

    2013-01-01

    The fields of global health and international development commonly cluster countries by geography and income to target resources and describe progress. For any given sector of interest, a range of relevant indicators can serve as a more appropriate basis for classification. We create a new typology of country clusters specific to the water and sanitation (WatSan) sector based on similarities across multiple WatSan-related indicators. After a literature review and consultation with experts in the WatSan sector, nine indicators were selected. Indicator selection was based on relevance to and suggested influence on national water and sanitation service delivery, and to maximize data availability across as many countries as possible. A hierarchical clustering method and a gap statistic analysis were used to group countries into a natural number of relevant clusters. Two stages of clustering resulted in five clusters, representing 156 countries or 6.75 billion people. The five clusters were not well explained by income or geography, and were unique from existing country clusters used in international development. Analysis of these five clusters revealed that they were more compact and well separated than United Nations and World Bank country clusters. This analysis and resulting country typology suggest that previous geography- or income-based country groupings can be improved upon for applications in the WatSan sector by utilizing globally available WatSan-related indicators. Potential applications include guiding and discussing research, informing policy, improving resource targeting, describing sector progress, and identifying critical knowledge gaps in the WatSan sector. PMID:24054545

  12. Population Genomics and the Statistical Values of Race: An Interdisciplinary Perspective on the Biological Classification of Human Populations and Implications for Clinical Genetic Epidemiological Research

    PubMed Central

    Maglo, Koffi N.; Mersha, Tesfaye B.; Martin, Lisa J.

    2016-01-01

    The biological status and biomedical significance of the concept of race as applied to humans continue to be contentious issues despite the use of advanced statistical and clustering methods to determine continental ancestry. It is thus imperative for researchers to understand the limitations as well as potential uses of the concept of race in biology and biomedicine. This paper deals with the theoretical assumptions behind cluster analysis in human population genomics. Adopting an interdisciplinary approach, it demonstrates that the hypothesis that attributes the clustering of human populations to “frictional” effects of landform barriers at continental boundaries is empirically incoherent. It then contrasts the scientific status of the “cluster” and “cline” constructs in human population genomics, and shows how cluster may be instrumentally produced. It also shows how statistical values of race vindicate Darwin's argument that race is evolutionarily meaningless. Finally, the paper explains why, due to spatiotemporal parameters, evolutionary forces, and socio-cultural factors influencing population structure, continental ancestry may be pragmatically relevant to global and public health genomics. Overall, this work demonstrates that, from a biological systematic and evolutionary taxonomical perspective, human races/continental groups or clusters have no natural meaning or objective biological reality. In fact, the utility of racial categorizations in research and in clinics can be explained by spatiotemporal parameters, socio-cultural factors, and evolutionary forces affecting disease causation and treatment response. PMID:26925096

  13. Kappa statistic for clustered matched-pair data.

    PubMed

    Yang, Zhao; Zhou, Ming

    2014-07-10

    Kappa statistic is widely used to assess the agreement between two procedures in the independent matched-pair data. For matched-pair data collected in clusters, on the basis of the delta method and sampling techniques, we propose a nonparametric variance estimator for the kappa statistic without within-cluster correlation structure or distributional assumptions. The results of an extensive Monte Carlo simulation study demonstrate that the proposed kappa statistic provides consistent estimation and the proposed variance estimator behaves reasonably well for at least a moderately large number of clusters (e.g., K ≥50). Compared with the variance estimator ignoring dependence within a cluster, the proposed variance estimator performs better in maintaining the nominal coverage probability when the intra-cluster correlation is fair (ρ ≥0.3), with more pronounced improvement when ρ is further increased. To illustrate the practical application of the proposed estimator, we analyze two real data examples of clustered matched-pair data. Copyright © 2014 John Wiley & Sons, Ltd.

  14. Statistical indicators of collective behavior and functional clusters in gene networks of yeast

    NASA Astrophysics Data System (ADS)

    Živković, J.; Tadić, B.; Wick, N.; Thurner, S.

    2006-03-01

    We analyze gene expression time-series data of yeast (S. cerevisiae) measured along two full cell-cycles. We quantify these data by using q-exponentials, gene expression ranking and a temporal mean-variance analysis. We construct gene interaction networks based on correlation coefficients and study the formation of the corresponding giant components and minimum spanning trees. By coloring genes according to their cell function we find functional clusters in the correlation networks and functional branches in the associated trees. Our results suggest that a percolation point of functional clusters can be identified on these gene expression correlation networks.

  15. Update on ONC's Substellar IMF: A Second Peak in the Brown Dwarf Regime

    NASA Astrophysics Data System (ADS)

    Drass, Holger; Bayo, A.; Chini, R.; Haas, M.

    2017-06-01

    The Orion Nebular Cluster (ONC) has become the prototype cluster for studying the Initial Mass Function (IMF). In a deep JHK survey of the ONC with HAWK-I we detected a large population of 900 Brown Dwarfs and Planetary Mass Object candidates presenting a pronounced second peak in the substellar IMF. One of the most obvious issues of this result is the verification of cluster membership. The analysis so far was mainly based on statistical consideration. In this presentation I will show the results from using different high-resolution extinction map to determine the ONC membership.

  16. Spatial Analysis of Hemorrhagic Fever with Renal Syndrome in Zibo City, China, 2009–2012

    PubMed Central

    Wang, Ling; Yang, Shuxia; Zhang, Ling; Cao, Haixia; Zhang, Yan; Hu, Haodong; Zhai, Shenyong

    2013-01-01

    Background Hemorrhagic fever with renal syndrome (HFRS) is highly endemic in mainland China, where human cases account for 90% of the total global cases. Zibo City is one of the most serious affected areas in Shandong Province China with the HFRS incidence increasing sharply from 2009 to 2012. However, the hotspots of HFRS in Zibo remained unclear. Thus, a spatial analysis was conducted with the aim to explore the spatial, spatial-temporal and seasonal patterns of HFRS in Zibo from 2009 to 2012, and to provide guidance for formulating regional prevention and control strategies. Methods The study was based on the reported cases of HFRS from the National Notifiable Disease Surveillance System. Annualized incidence maps and seasonal incidence maps were produced to analyze the spatial and seasonal distribution of HFRS in Zibo City. Then spatial scan statistics and space-time scan statistics were conducted to identify clusters of HFRS. Results There were 200 cases reported in Zibo City during the 4-year study period. One most likely cluster and one secondary cluster for high incidence of HFRS were identified by the space-time analysis. And the most likely cluster was found to exist at Yiyuan County in October to December 2012. The human infections in the fall and winter reflected a seasonal characteristic pattern of Hantaan virus (HTNV) transmission. The secondary cluster was detected at the center of Zibo in May to June 2009, presenting a seasonal characteristic of Seoul virus (SEOV) transmission. Conclusion To control and prevent HFRS in Zibo city, the comprehensive preventive strategy should be implemented in the southern areas of Zibo in autumn and in the northern areas of Zibo in spring. PMID:23840719

  17. Source Evaluation and Trace Metal Contamination in Benthic Sediments from Equatorial Ecosystems Using Multivariate Statistical Techniques

    PubMed Central

    Benson, Nsikak U.; Asuquo, Francis E.; Williams, Akan B.; Essien, Joseph P.; Ekong, Cyril I.; Akpabio, Otobong; Olajire, Abaas A.

    2016-01-01

    Trace metals (Cd, Cr, Cu, Ni and Pb) concentrations in benthic sediments were analyzed through multi-step fractionation scheme to assess the levels and sources of contamination in estuarine, riverine and freshwater ecosystems in Niger Delta (Nigeria). The degree of contamination was assessed using the individual contamination factors (ICF) and global contamination factor (GCF). Multivariate statistical approaches including principal component analysis (PCA), cluster analysis and correlation test were employed to evaluate the interrelationships and associated sources of contamination. The spatial distribution of metal concentrations followed the pattern Pb>Cu>Cr>Cd>Ni. Ecological risk index by ICF showed significant potential mobility and bioavailability for Cu, Cu and Ni. The ICF contamination trend in the benthic sediments at all studied sites was Cu>Cr>Ni>Cd>Pb. The principal component and agglomerative clustering analyses indicate that trace metals contamination in the ecosystems was influenced by multiple pollution sources. PMID:27257934

  18. Low-level processing for real-time image analysis

    NASA Technical Reports Server (NTRS)

    Eskenazi, R.; Wilf, J. M.

    1979-01-01

    A system that detects object outlines in television images in real time is described. A high-speed pipeline processor transforms the raw image into an edge map and a microprocessor, which is integrated into the system, clusters the edges, and represents them as chain codes. Image statistics, useful for higher level tasks such as pattern recognition, are computed by the microprocessor. Peak intensity and peak gradient values are extracted within a programmable window and are used for iris and focus control. The algorithms implemented in hardware and the pipeline processor architecture are described. The strategy for partitioning functions in the pipeline was chosen to make the implementation modular. The microprocessor interface allows flexible and adaptive control of the feature extraction process. The software algorithms for clustering edge segments, creating chain codes, and computing image statistics are also discussed. A strategy for real time image analysis that uses this system is given.

  19. Minimal spanning tree algorithm for γ-ray source detection in sparse photon images: cluster parameters and selection strategies

    DOE PAGES

    Campana, R.; Bernieri, E.; Massaro, E.; ...

    2013-05-22

    We present that the minimal spanning tree (MST) algorithm is a graph-theoretical cluster-finding method. We previously applied it to γ-ray bidimensional images, showing that it is quite sensitive in finding faint sources. Possible sources are associated with the regions where the photon arrival directions clusterize. MST selects clusters starting from a particular “tree” connecting all the point of the image and performing a cut based on the angular distance between photons, with a number of events higher than a given threshold. In this paper, we show how a further filtering, based on some parameters linked to the cluster properties, canmore » be applied to reduce spurious detections. We find that the most efficient parameter for this secondary selection is the magnitudeM of a cluster, defined as the product of its number of events by its clustering degree. We test the sensitivity of the method by means of simulated and real Fermi-Large Area Telescope (LAT) fields. Our results show that √M is strongly correlated with other statistical significance parameters, derived from a wavelet based algorithm and maximum likelihood (ML) analysis, and that it can be used as a good estimator of statistical significance of MST detections. Finally, we apply the method to a 2-year LAT image at energies higher than 3 GeV, and we show the presence of new clusters, likely associated with BL Lac objects.« less

  20. Data-driven inference for the spatial scan statistic.

    PubMed

    Almeida, Alexandre C L; Duarte, Anderson R; Duczmal, Luiz H; Oliveira, Fernando L P; Takahashi, Ricardo H C

    2011-08-02

    Kulldorff's spatial scan statistic for aggregated area maps searches for clusters of cases without specifying their size (number of areas) or geographic location in advance. Their statistical significance is tested while adjusting for the multiple testing inherent in such a procedure. However, as is shown in this work, this adjustment is not done in an even manner for all possible cluster sizes. A modification is proposed to the usual inference test of the spatial scan statistic, incorporating additional information about the size of the most likely cluster found. A new interpretation of the results of the spatial scan statistic is done, posing a modified inference question: what is the probability that the null hypothesis is rejected for the original observed cases map with a most likely cluster of size k, taking into account only those most likely clusters of size k found under null hypothesis for comparison? This question is especially important when the p-value computed by the usual inference process is near the alpha significance level, regarding the correctness of the decision based in this inference. A practical procedure is provided to make more accurate inferences about the most likely cluster found by the spatial scan statistic.

  1. Combined Analyses of Bacterial, Fungal and Nematode Communities in Andosolic Agricultural Soils in Japan

    PubMed Central

    Bao, Zhihua; Ikunaga, Yoko; Matsushita, Yuko; Morimoto, Sho; Takada-Hoshino, Yuko; Okada, Hiroaki; Oba, Hirosuke; Takemoto, Shuhei; Niwa, Shigeru; Ohigashi, Kentaro; Suzuki, Chika; Nagaoka, Kazunari; Takenaka, Makoto; Urashima, Yasufumi; Sekiguchi, Hiroyuki; Kushida, Atsuhiko; Toyota, Koki; Saito, Masanori; Tsushima, Seiya

    2012-01-01

    We simultaneously examined the bacteria, fungi and nematode communities in Andosols from four agro-geographical sites in Japan using polymerase chain reaction-denaturing gradient gel electrophoresis (PCR-DGGE) and statistical analyses to test the effects of environmental factors including soil properties on these communities depending on geographical sites. Statistical analyses such as Principal component analysis (PCA) and Redundancy analysis (RDA) revealed that the compositions of the three soil biota communities were strongly affected by geographical sites, which were in turn strongly associated with soil characteristics such as total C (TC), total N (TN), C/N ratio and annual mean soil temperature (ST). In particular, the TC, TN and C/N ratio had stronger effects on bacterial and fungal communities than on the nematode community. Additionally, two-way cluster analysis using the combined DGGE profile also indicated that all soil samples were classified into four clusters corresponding to the four sites, showing high site specificity of soil samples, and all DNA bands were classified into four clusters, showing the coexistence of specific DGGE bands of bacteria, fungi and nematodes in Andosol fields. The results of this study suggest that geography relative to soil properties has a simultaneous impact on soil microbial and nematode community compositions. This is the first combined profile analysis of bacteria, fungi and nematodes at different sites with agricultural Andosols. PMID:22223474

  2. Combined analyses of bacterial, fungal and nematode communities in andosolic agricultural soils in Japan.

    PubMed

    Bao, Zhihua; Ikunaga, Yoko; Matsushita, Yuko; Morimoto, Sho; Takada-Hoshino, Yuko; Okada, Hiroaki; Oba, Hirosuke; Takemoto, Shuhei; Niwa, Shigeru; Ohigashi, Kentaro; Suzuki, Chika; Nagaoka, Kazunari; Takenaka, Makoto; Urashima, Yasufumi; Sekiguchi, Hiroyuki; Kushida, Atsuhiko; Toyota, Koki; Saito, Masanori; Tsushima, Seiya

    2012-01-01

    We simultaneously examined the bacteria, fungi and nematode communities in Andosols from four agro-geographical sites in Japan using polymerase chain reaction-denaturing gradient gel electrophoresis (PCR-DGGE) and statistical analyses to test the effects of environmental factors including soil properties on these communities depending on geographical sites. Statistical analyses such as Principal component analysis (PCA) and Redundancy analysis (RDA) revealed that the compositions of the three soil biota communities were strongly affected by geographical sites, which were in turn strongly associated with soil characteristics such as total C (TC), total N (TN), C/N ratio and annual mean soil temperature (ST). In particular, the TC, TN and C/N ratio had stronger effects on bacterial and fungal communities than on the nematode community. Additionally, two-way cluster analysis using the combined DGGE profile also indicated that all soil samples were classified into four clusters corresponding to the four sites, showing high site specificity of soil samples, and all DNA bands were classified into four clusters, showing the coexistence of specific DGGE bands of bacteria, fungi and nematodes in Andosol fields. The results of this study suggest that geography relative to soil properties has a simultaneous impact on soil microbial and nematode community compositions. This is the first combined profile analysis of bacteria, fungi and nematodes at different sites with agricultural Andosols.

  3. Spatial scan statistics for detection of multiple clusters with arbitrary shapes.

    PubMed

    Lin, Pei-Sheng; Kung, Yi-Hung; Clayton, Murray

    2016-12-01

    In applying scan statistics for public health research, it would be valuable to develop a detection method for multiple clusters that accommodates spatial correlation and covariate effects in an integrated model. In this article, we connect the concepts of the likelihood ratio (LR) scan statistic and the quasi-likelihood (QL) scan statistic to provide a series of detection procedures sufficiently flexible to apply to clusters of arbitrary shape. First, we use an independent scan model for detection of clusters and then a variogram tool to examine the existence of spatial correlation and regional variation based on residuals of the independent scan model. When the estimate of regional variation is significantly different from zero, a mixed QL estimating equation is developed to estimate coefficients of geographic clusters and covariates. We use the Benjamini-Hochberg procedure (1995) to find a threshold for p-values to address the multiple testing problem. A quasi-deviance criterion is used to regroup the estimated clusters to find geographic clusters with arbitrary shapes. We conduct simulations to compare the performance of the proposed method with other scan statistics. For illustration, the method is applied to enterovirus data from Taiwan. © 2016, The International Biometric Society.

  4. A scan statistic to extract causal gene clusters from case-control genome-wide rare CNV data.

    PubMed

    Nishiyama, Takeshi; Takahashi, Kunihiko; Tango, Toshiro; Pinto, Dalila; Scherer, Stephen W; Takami, Satoshi; Kishino, Hirohisa

    2011-05-26

    Several statistical tests have been developed for analyzing genome-wide association data by incorporating gene pathway information in terms of gene sets. Using these methods, hundreds of gene sets are typically tested, and the tested gene sets often overlap. This overlapping greatly increases the probability of generating false positives, and the results obtained are difficult to interpret, particularly when many gene sets show statistical significance. We propose a flexible statistical framework to circumvent these problems. Inspired by spatial scan statistics for detecting clustering of disease occurrence in the field of epidemiology, we developed a scan statistic to extract disease-associated gene clusters from a whole gene pathway. Extracting one or a few significant gene clusters from a global pathway limits the overall false positive probability, which results in increased statistical power, and facilitates the interpretation of test results. In the present study, we applied our method to genome-wide association data for rare copy-number variations, which have been strongly implicated in common diseases. Application of our method to a simulated dataset demonstrated the high accuracy of this method in detecting disease-associated gene clusters in a whole gene pathway. The scan statistic approach proposed here shows a high level of accuracy in detecting gene clusters in a whole gene pathway. This study has provided a sound statistical framework for analyzing genome-wide rare CNV data by incorporating topological information on the gene pathway.

  5. Descriptive epidemiology of typhoid fever during an epidemic in Harare, Zimbabwe, 2012.

    PubMed

    Polonsky, Jonathan A; Martínez-Pino, Isabel; Nackers, Fabienne; Chonzi, Prosper; Manangazira, Portia; Van Herp, Michel; Maes, Peter; Porten, Klaudia; Luquero, Francisco J

    2014-01-01

    Typhoid fever remains a significant public health problem in developing countries. In October 2011, a typhoid fever epidemic was declared in Harare, Zimbabwe - the fourth enteric infection epidemic since 2008. To orient control activities, we described the epidemiology and spatiotemporal clustering of the epidemic in Dzivaresekwa and Kuwadzana, the two most affected suburbs of Harare. A typhoid fever case-patient register was analysed to describe the epidemic. To explore clustering, we constructed a dataset comprising GPS coordinates of case-patient residences and randomly sampled residential locations (spatial controls). The scale and significance of clustering was explored with Ripley K functions. Cluster locations were determined by a random labelling technique and confirmed using Kulldorff's spatial scan statistic. We analysed data from 2570 confirmed and suspected case-patients, and found significant spatiotemporal clustering of typhoid fever in two non-overlapping areas, which appeared to be linked to environmental sources. Peak relative risk was more than six times greater than in areas lying outside the cluster ranges. Clusters were identified in similar geographical ranges by both random labelling and Kulldorff's spatial scan statistic. The spatial scale at which typhoid fever clustered was highly localised, with significant clustering at distances up to 4.5 km and peak levels at approximately 3.5 km. The epicentre of infection transmission shifted from one cluster to the other during the course of the epidemic. This study demonstrated highly localised clustering of typhoid fever during an epidemic in an urban African setting, and highlights the importance of spatiotemporal analysis for making timely decisions about targetting prevention and control activities and reinforcing treatment during epidemics. This approach should be integrated into existing surveillance systems to facilitate early detection of epidemics and identify their spatial range.

  6. Descriptive Epidemiology of Typhoid Fever during an Epidemic in Harare, Zimbabwe, 2012

    PubMed Central

    Polonsky, Jonathan A.; Martínez-Pino, Isabel; Nackers, Fabienne; Chonzi, Prosper; Manangazira, Portia; Van Herp, Michel; Maes, Peter; Porten, Klaudia; Luquero, Francisco J.

    2014-01-01

    Background Typhoid fever remains a significant public health problem in developing countries. In October 2011, a typhoid fever epidemic was declared in Harare, Zimbabwe - the fourth enteric infection epidemic since 2008. To orient control activities, we described the epidemiology and spatiotemporal clustering of the epidemic in Dzivaresekwa and Kuwadzana, the two most affected suburbs of Harare. Methods A typhoid fever case-patient register was analysed to describe the epidemic. To explore clustering, we constructed a dataset comprising GPS coordinates of case-patient residences and randomly sampled residential locations (spatial controls). The scale and significance of clustering was explored with Ripley K functions. Cluster locations were determined by a random labelling technique and confirmed using Kulldorff's spatial scan statistic. Principal Findings We analysed data from 2570 confirmed and suspected case-patients, and found significant spatiotemporal clustering of typhoid fever in two non-overlapping areas, which appeared to be linked to environmental sources. Peak relative risk was more than six times greater than in areas lying outside the cluster ranges. Clusters were identified in similar geographical ranges by both random labelling and Kulldorff's spatial scan statistic. The spatial scale at which typhoid fever clustered was highly localised, with significant clustering at distances up to 4.5 km and peak levels at approximately 3.5 km. The epicentre of infection transmission shifted from one cluster to the other during the course of the epidemic. Conclusions This study demonstrated highly localised clustering of typhoid fever during an epidemic in an urban African setting, and highlights the importance of spatiotemporal analysis for making timely decisions about targetting prevention and control activities and reinforcing treatment during epidemics. This approach should be integrated into existing surveillance systems to facilitate early detection of epidemics and identify their spatial range. PMID:25486292

  7. Analysis of risk factors for cluster behavior of dental implant failures.

    PubMed

    Chrcanovic, Bruno Ramos; Kisch, Jenö; Albrektsson, Tomas; Wennerberg, Ann

    2017-08-01

    Some studies indicated that implant failures are commonly concentrated in few patients. To identify and analyze cluster behavior of dental implant failures among subjects of a retrospective study. This retrospective study included patients receiving at least three implants only. Patients presenting at least three implant failures were classified as presenting a cluster behavior. Univariate and multivariate logistic regression models and generalized estimating equations analysis evaluated the effect of explanatory variables on the cluster behavior. There were 1406 patients with three or more implants (8337 implants, 592 failures). Sixty-seven (4.77%) patients presented cluster behavior, with 56.8% of all implant failures. The intake of antidepressants and bruxism were identified as potential negative factors exerting a statistically significant influence on a cluster behavior at the patient-level. The negative factors at the implant-level were turned implants, short implants, poor bone quality, age of the patient, the intake of medicaments to reduce the acid gastric production, smoking, and bruxism. A cluster pattern among patients with implant failure is highly probable. Factors of interest as predictors for implant failures could be a number of systemic and local factors, although a direct causal relationship cannot be ascertained. © 2017 Wiley Periodicals, Inc.

  8. Characterization of spatial and temporal variability in hydrochemistry of Johor Straits, Malaysia.

    PubMed

    Abdullah, Pauzi; Abdullah, Sharifah Mastura Syed; Jaafar, Othman; Mahmud, Mastura; Khalik, Wan Mohd Afiq Wan Mohd

    2015-12-15

    Characterization of hydrochemistry changes in Johor Straits within 5 years of monitoring works was successfully carried out. Water quality data sets (27 stations and 19 parameters) collected in this area were interpreted subject to multivariate statistical analysis. Cluster analysis grouped all the stations into four clusters ((Dlink/Dmax) × 100<90) and two clusters ((Dlink/Dmax) × 100<80) for site and period similarities. Principal component analysis rendered six significant components (eigenvalue>1) that explained 82.6% of the total variance of the data set. Classification matrix of discriminant analysis assigned 88.9-92.6% and 83.3-100% correctness in spatial and temporal variability, respectively. Times series analysis then confirmed that only four parameters were not significant over time change. Therefore, it is imperative that the environmental impact of reclamation and dredging works, municipal or industrial discharge, marine aquaculture and shipping activities in this area be effectively controlled and managed. Copyright © 2015 Elsevier Ltd. All rights reserved.

  9. Demographic characterization and spatial cluster analysis of human Salmonella 1,4,[5],12:i:- infections in Portugal: A 10year study.

    PubMed

    Seixas, R; Nunes, T; Machado, J; Tavares, L; Owen, S P; Bernardo, F; Oliveira, M

    Salmonella 1,4,[5],12:i:- is presently considered one of the major serovars responsible for human salmonellosis worldwide. Due to its recent emergence, studies assessing the demographic characterization and spatial epidemiology of salmonellosis 1,4,[5],12:i:- at local- or country-level are lacking. In this study, a analysis was conducted over a 10year period, from 2000 to the first quarter of 2011 at the Portuguese National Laboratory in Portugal mainland, with a total of 215 Salmonella 1,4,[5],12:i:- serotyped isolates obtained from human infections by a passive surveillance system. Data regarding source, year and month of sampling, gender, age, district and municipality of the patients were registered. Descriptive statistical analysis and a spatial scan statistic combined with a geographic information system were employed to characterize the epidemiology and identify spatial clusters. Results showed that most districts have reports of Salmonella 1,4,[5],12:i:-, with a higher number of cases at the Portuguese coastland, including districts like Porto (n=60, 27.9%), Lisboa (n=29, 13.5%) and Aveiro (n=28, 13.0%). An increased incidence was observed in the period from 2004 to 2011 and most infections occurred during May and October. Spatial analysis revealed 4 clusters of higher than expected infection rates. Three were located in the north of Portugal, including two at the coastland (Cluster 1 [RR=3.58, p≤0.001] and 4 [RR=10.42 p≤0.230]), and one at the countryside (Cluster 3 [RR=17.76, p≤0.001]). A larger cluster was detected involving the center and south of Portugal (Cluster 2 [RR=4.85, p≤0.001]). The present study was elaborated with data provided by a passive surveillance system, which may originate an underestimation of disease burden. However, this is the first report describing the incidence and the distribution of areas with higher risk of infection in Portugal, revealing that Salmonella 1,4,[5],12:i:- displayed a significant geographic clustering and these areas should be further evaluated to identify risk factors in order to establish prevention programs. Copyright © 2017 The Authors. Published by Elsevier Ltd.. All rights reserved.

  10. Effect of spatial smoothing on t-maps: arguments for going back from t-maps to masked contrast images.

    PubMed

    Reimold, Matthias; Slifstein, Mark; Heinz, Andreas; Mueller-Schauenburg, Wolfgang; Bares, Roland

    2006-06-01

    Voxelwise statistical analysis has become popular in explorative functional brain mapping with fMRI or PET. Usually, results are presented as voxelwise levels of significance (t-maps), and for clusters that survive correction for multiple testing the coordinates of the maximum t-value are reported. Before calculating a voxelwise statistical test, spatial smoothing is required to achieve a reasonable statistical power. Little attention is being given to the fact that smoothing has a nonlinear effect on the voxel variances and thus the local characteristics of a t-map, which becomes most evident after smoothing over different types of tissue. We investigated the related artifacts, for example, white matter peaks whose position depend on the relative variance (variance over contrast) of the surrounding regions, and suggest improving spatial precision with 'masked contrast images': color-codes are attributed to the voxelwise contrast, and significant clusters (e.g., detected with statistical parametric mapping, SPM) are enlarged by including contiguous pixels with a contrast above the mean contrast in the original cluster, provided they satisfy P < 0.05. The potential benefit is demonstrated with simulations and data from a [11C]Carfentanil PET study. We conclude that spatial smoothing may lead to critical, sometimes-counterintuitive artifacts in t-maps, especially in subcortical brain regions. If significant clusters are detected, for example, with SPM, the suggested method is one way to improve spatial precision and may give the investigator a more direct sense of the underlying data. Its simplicity and the fact that no further assumptions are needed make it a useful complement for standard methods of statistical mapping.

  11. Description and typology of intensive Chios dairy sheep farms in Greece.

    PubMed

    Gelasakis, A I; Valergakis, G E; Arsenos, G; Banos, G

    2012-06-01

    The aim was to assess the intensified dairy sheep farming systems of the Chios breed in Greece, establishing a typology that may properly describe and characterize them. The study included the total of the 66 farms of the Chios sheep breeders' cooperative Macedonia. Data were collected using a structured direct questionnaire for in-depth interviews, including questions properly selected to obtain a general description of farm characteristics and overall management practices. A multivariate statistical analysis was used on the data to obtain the most appropriate typology. Initially, principal component analysis was used to produce uncorrelated variables (principal components), which would be used for the consecutive cluster analysis. The number of clusters was decided using hierarchical cluster analysis, whereas, the farms were allocated in 4 clusters using k-means cluster analysis. The identified clusters were described and afterward compared using one-way ANOVA or a chi-squared test. The main differences were evident on land availability and use, facility and equipment availability and type, expansion rates, and application of preventive flock health programs. In general, cluster 1 included newly established, intensive, well-equipped, specialized farms and cluster 2 included well-established farms with balanced sheep and feed/crop production. In cluster 3 were assigned small flock farms focusing more on arable crops than on sheep farming with a tendency to evolve toward cluster 2, whereas cluster 4 included farms representing a rather conservative form of Chios sheep breeding with low/intermediate inputs and choosing not to focus on feed/crop production. In the studied set of farms, 4 different farmer attitudes were evident: 1) farming disrupts sheep breeding; feed should be purchased and economies of scale will decrease costs (mainly cluster 1), 2) only exercise/pasture land is necessary; at least part of the feed (pasture) must be home-grown to decrease costs (clusters 1 and 4), 3) providing pasture to sheep is essential; on-farm feed production decreases costs (mainly cluster 3), and 4) large-scale farming (feed production and cash crops) does not disrupt sheep breeding; all feed must be produced on-farm to decrease costs (mainly cluster 3). Conducting a profitability analysis among different clusters, exploring and discovering the most beneficial levels of intensified management and capital investment should now be considered. Copyright © 2012 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.

  12. The remarkable geographical pattern of gastric cancer mortality in Ecuador.

    PubMed

    Montero-Oleas, Nadia; Núñez-González, Solange; Simancas-Racines, Daniel

    2017-12-01

    This study was aimed to describe the gastric cancer mortality trend, and to analyze the spatial distribution of gastric cancer mortality in Ecuador, between 2004 and 2015. Data were collected from the National Institute of Statistics and Census (INEC) database. Crude gastric cancer mortality rates, standardized mortality ratios (SMRs) and indirect standardized mortality rates (ISMRs) were calculated per 100,000 persons. For time trend analysis, joinpoint regression was used. The annual percentage rate change (APC) and the average annual percent change (AAPC) was computed for each province. Spatial age-adjusted analysis was used to detect high risk clusters of gastric cancer mortality, from 2010 to 2015, using Kulldorff spatial scan statistics. In Ecuador, between 2004 and 2015, gastric cancer caused a total of 19,115 deaths: 10,679 in men and 8436 in women. When crude rates were analyzed, a significant decline was detected (AAPC: -1.8%; p<0.001). ISMR also decreased, but this change was not statistically significant (APC: -0.53%; p=0.36). From 2004 to 2007 and from 2008 to 2011 the province with the highest ISMR was Carchi; and, from 2012 to 2015, was Cotopaxi. The most likely high occurrence cluster included Bolívar, Los Ríos, Chimborazo, Tungurahua, and Cotopaxi provinces, with a relative risk of 1.34 (p<0.001). There is a substantial geographic variation in gastric cancer mortality rates among Ecuadorian provinces. The spatial analysis indicates the presence of high occurrence clusters throughout the Andes Mountains. Copyright © 2017 The Authors. Published by Elsevier Ltd.. All rights reserved.

  13. A new scoring system in Cystic Fibrosis: statistical tools for database analysis - a preliminary report.

    PubMed

    Hafen, G M; Hurst, C; Yearwood, J; Smith, J; Dzalilov, Z; Robinson, P J

    2008-10-05

    Cystic fibrosis is the most common fatal genetic disorder in the Caucasian population. Scoring systems for assessment of Cystic fibrosis disease severity have been used for almost 50 years, without being adapted to the milder phenotype of the disease in the 21st century. The aim of this current project is to develop a new scoring system using a database and employing various statistical tools. This study protocol reports the development of the statistical tools in order to create such a scoring system. The evaluation is based on the Cystic Fibrosis database from the cohort at the Royal Children's Hospital in Melbourne. Initially, unsupervised clustering of the all data records was performed using a range of clustering algorithms. In particular incremental clustering algorithms were used. The clusters obtained were characterised using rules from decision trees and the results examined by clinicians. In order to obtain a clearer definition of classes expert opinion of each individual's clinical severity was sought. After data preparation including expert-opinion of an individual's clinical severity on a 3 point-scale (mild, moderate and severe disease), two multivariate techniques were used throughout the analysis to establish a method that would have a better success in feature selection and model derivation: 'Canonical Analysis of Principal Coordinates' and 'Linear Discriminant Analysis'. A 3-step procedure was performed with (1) selection of features, (2) extracting 5 severity classes out of a 3 severity class as defined per expert-opinion and (3) establishment of calibration datasets. (1) Feature selection: CAP has a more effective "modelling" focus than DA.(2) Extraction of 5 severity classes: after variables were identified as important in discriminating contiguous CF severity groups on the 3-point scale as mild/moderate and moderate/severe, Discriminant Function (DF) was used to determine the new groups mild, intermediate moderate, moderate, intermediate severe and severe disease. (3) Generated confusion tables showed a misclassification rate of 19.1% for males and 16.5% for females, with a majority of misallocations into adjacent severity classes particularly for males. Our preliminary data show that using CAP for detection of selection features and Linear DA to derive the actual model in a CF database might be helpful in developing a scoring system. However, there are several limitations, particularly more data entry points are needed to finalize a score and the statistical tools have further to be refined and validated, with re-running the statistical methods in the larger dataset.

  14. Propensity score to detect baseline imbalance in cluster randomized trials: the role of the c-statistic.

    PubMed

    Leyrat, Clémence; Caille, Agnès; Foucher, Yohann; Giraudeau, Bruno

    2016-01-22

    Despite randomization, baseline imbalance and confounding bias may occur in cluster randomized trials (CRTs). Covariate imbalance may jeopardize the validity of statistical inferences if they occur on prognostic factors. Thus, the diagnosis of a such imbalance is essential to adjust statistical analysis if required. We developed a tool based on the c-statistic of the propensity score (PS) model to detect global baseline covariate imbalance in CRTs and assess the risk of confounding bias. We performed a simulation study to assess the performance of the proposed tool and applied this method to analyze the data from 2 published CRTs. The proposed method had good performance for large sample sizes (n =500 per arm) and when the number of unbalanced covariates was not too small as compared with the total number of baseline covariates (≥40% of unbalanced covariates). We also provide a strategy for pre selection of the covariates needed to be included in the PS model to enhance imbalance detection. The proposed tool could be useful in deciding whether covariate adjustment is required before performing statistical analyses of CRTs.

  15. Statistical interpretation of chromatic indicators in correlation to phytochemical profile of a sulfur dioxide-free mulberry (Morus nigra) wine submitted to non-thermal maturation processes.

    PubMed

    Tchabo, William; Ma, Yongkun; Kwaw, Emmanuel; Zhang, Haining; Xiao, Lulu; Apaliya, Maurice T

    2018-01-15

    The four different methods of color measurement of wine proposed by Boulton, Giusti, Glories and Commission International de l'Eclairage (CIE) were applied to assess the statistical relationship between the phytochemical profile and chromatic characteristics of sulfur dioxide-free mulberry (Morus nigra) wine submitted to non-thermal maturation processes. The alteration in chromatic properties and phenolic composition of non-thermal aged mulberry wine were examined, aided by the used of Pearson correlation, cluster and principal component analysis. The results revealed a positive effect of non-thermal processes on phytochemical families of wines. From Pearson correlation analysis relationships between chromatic indexes and flavonols as well as anthocyanins were established. Cluster analysis highlighted similarities between Boulton and Giusti parameters, as well as Glories and CIE parameters in the assessment of chromatic properties of wines. Finally, principal component analysis was able to discriminate wines subjected to different maturation techniques on the basis of their chromatic and phenolics characteristics. Copyright © 2017. Published by Elsevier Ltd.

  16. Space-Time Analysis of Testicular Cancer Clusters Using Residential Histories: A Case-Control Study in Denmark

    PubMed Central

    Sloan, Chantel D.; Nordsborg, Rikke B.; Jacquez, Geoffrey M.; Raaschou-Nielsen, Ole; Meliker, Jaymie R.

    2015-01-01

    Though the etiology is largely unknown, testicular cancer incidence has seen recent significant increases in northern Europe and throughout many Western regions. The most common cancer in males under age 40, age period cohort models have posited exposures in the in utero environment or in early childhood as possible causes of increased risk of testicular cancer. Some of these factors may be tied to geography through being associated with behavioral, cultural, sociodemographic or built environment characteristics. If so, this could result in detectable geographic clusters of cases that could lead to hypotheses regarding environmental targets for intervention. Given a latency period between exposure to an environmental carcinogen and testicular cancer diagnosis, mobility histories are beneficial for spatial cluster analyses. Nearest-neighbor based Q-statistics allow for the incorporation of changes in residency in spatial disease cluster detection. Using these methods, a space-time cluster analysis was conducted on a population-wide case-control population selected from the Danish Cancer Registry with mobility histories since 1971 extracted from the Danish Civil Registration System. Cases (N=3297) were diagnosed between 1991 and 2003, and two sets of controls (N=3297 for each set) matched on sex and date of birth were included in the study. We also examined spatial patterns in maternal residential history for those cases and controls born in 1971 or later (N= 589 case-control pairs). Several small clusters were detected when aligning individuals by year prior to diagnosis, age at diagnosis and calendar year of diagnosis. However, the largest of these clusters contained only 2 statistically significant individuals at their center, and were not replicated in SaTScan spatial-only analyses which are less susceptible to multiple testing bias. We found little evidence of local clusters in residential histories of testicular cancer cases in this Danish population. PMID:25756204

  17. Space-time analysis of testicular cancer clusters using residential histories: a case-control study in Denmark.

    PubMed

    Sloan, Chantel D; Nordsborg, Rikke B; Jacquez, Geoffrey M; Raaschou-Nielsen, Ole; Meliker, Jaymie R

    2015-01-01

    Though the etiology is largely unknown, testicular cancer incidence has seen recent significant increases in northern Europe and throughout many Western regions. The most common cancer in males under age 40, age period cohort models have posited exposures in the in utero environment or in early childhood as possible causes of increased risk of testicular cancer. Some of these factors may be tied to geography through being associated with behavioral, cultural, sociodemographic or built environment characteristics. If so, this could result in detectable geographic clusters of cases that could lead to hypotheses regarding environmental targets for intervention. Given a latency period between exposure to an environmental carcinogen and testicular cancer diagnosis, mobility histories are beneficial for spatial cluster analyses. Nearest-neighbor based Q-statistics allow for the incorporation of changes in residency in spatial disease cluster detection. Using these methods, a space-time cluster analysis was conducted on a population-wide case-control population selected from the Danish Cancer Registry with mobility histories since 1971 extracted from the Danish Civil Registration System. Cases (N=3297) were diagnosed between 1991 and 2003, and two sets of controls (N=3297 for each set) matched on sex and date of birth were included in the study. We also examined spatial patterns in maternal residential history for those cases and controls born in 1971 or later (N= 589 case-control pairs). Several small clusters were detected when aligning individuals by year prior to diagnosis, age at diagnosis and calendar year of diagnosis. However, the largest of these clusters contained only 2 statistically significant individuals at their center, and were not replicated in SaTScan spatial-only analyses which are less susceptible to multiple testing bias. We found little evidence of local clusters in residential histories of testicular cancer cases in this Danish population.

  18. Quantifying the impact of fixed effects modeling of clusters in multiple imputation for cluster randomized trials

    PubMed Central

    Andridge, Rebecca. R.

    2011-01-01

    In cluster randomized trials (CRTs), identifiable clusters rather than individuals are randomized to study groups. Resulting data often consist of a small number of clusters with correlated observations within a treatment group. Missing data often present a problem in the analysis of such trials, and multiple imputation (MI) has been used to create complete data sets, enabling subsequent analysis with well-established analysis methods for CRTs. We discuss strategies for accounting for clustering when multiply imputing a missing continuous outcome, focusing on estimation of the variance of group means as used in an adjusted t-test or ANOVA. These analysis procedures are congenial to (can be derived from) a mixed effects imputation model; however, this imputation procedure is not yet available in commercial statistical software. An alternative approach that is readily available and has been used in recent studies is to include fixed effects for cluster, but the impact of using this convenient method has not been studied. We show that under this imputation model the MI variance estimator is positively biased and that smaller ICCs lead to larger overestimation of the MI variance. Analytical expressions for the bias of the variance estimator are derived in the case of data missing completely at random (MCAR), and cases in which data are missing at random (MAR) are illustrated through simulation. Finally, various imputation methods are applied to data from the Detroit Middle School Asthma Project, a recent school-based CRT, and differences in inference are compared. PMID:21259309

  19. Novel approach to characterising individuals with low back-related leg pain: cluster identification with latent class analysis and 12-month follow-up.

    PubMed

    Stynes, Siobhán; Konstantinou, Kika; Ogollah, Reuben; Hay, Elaine M; Dunn, Kate M

    2018-04-01

    Traditionally, low back-related leg pain (LBLP) is diagnosed clinically as referred leg pain or sciatica (nerve root involvement). However, within the spectrum of LBLP, we hypothesised that there may be other unrecognised patient subgroups. This study aimed to identify clusters of patients with LBLP using latent class analysis and describe their clinical course. The study population was 609 LBLP primary care consulters. Variables from clinical assessment were included in the latent class analysis. Characteristics of the statistically identified clusters were compared, and their clinical course over 1 year was described. A 5 cluster solution was optimal. Cluster 1 (n = 104) had mild leg pain severity and was considered to represent a referred leg pain group with no clinical signs, suggesting nerve root involvement (sciatica). Cluster 2 (n = 122), cluster 3 (n = 188), and cluster 4 (n = 69) had mild, moderate, and severe pain and disability, respectively, and response to clinical assessment items suggested categories of mild, moderate, and severe sciatica. Cluster 5 (n = 126) had high pain and disability, longer pain duration, and more comorbidities and was difficult to map to a clinical diagnosis. Most improvement for pain and disability was seen in the first 4 months for all clusters. At 12 months, the proportion of patients reporting recovery ranged from 27% for cluster 5 to 45% for cluster 2 (mild sciatica). This is the first study that empirically shows the variability in profile and clinical course of patients with LBLP including sciatica. More homogenous groups were identified, which could be considered in future clinical and research settings.

  20. Advances in Statistical Methods for Substance Abuse Prevention Research

    PubMed Central

    MacKinnon, David P.; Lockwood, Chondra M.

    2010-01-01

    The paper describes advances in statistical methods for prevention research with a particular focus on substance abuse prevention. Standard analysis methods are extended to the typical research designs and characteristics of the data collected in prevention research. Prevention research often includes longitudinal measurement, clustering of data in units such as schools or clinics, missing data, and categorical as well as continuous outcome variables. Statistical methods to handle these features of prevention data are outlined. Developments in mediation, moderation, and implementation analysis allow for the extraction of more detailed information from a prevention study. Advancements in the interpretation of prevention research results include more widespread calculation of effect size and statistical power, the use of confidence intervals as well as hypothesis testing, detailed causal analysis of research findings, and meta-analysis. The increased availability of statistical software has contributed greatly to the use of new methods in prevention research. It is likely that the Internet will continue to stimulate the development and application of new methods. PMID:12940467

  1. Statistical analysis of 4 types of neck whiplash injuries based on classical meridian theory.

    PubMed

    Chen, Yemeng; Zhao, Yan; Xue, Xiaolin; Li, Hui; Wu, Xiuyan; Zhang, Qunce; Zheng, Xin; Wang, Tianfang

    2015-01-01

    As one component of the Chinese medicine meridian system, the meridian sinew (Jingjin, (see text), tendino-musculo) is specially described as being for acupuncture treatment of the musculoskeletal system because of its dynamic attributes and tender point correlations. In recent decades, the therapeutic importance of the sinew meridian has become revalued in clinical application. Based on this theory, the authors have established therapeutic strategies of acupuncture treatment in Whiplash-Associated Disorders (WAD) by categorizing four types of neck symptom presentations. The advantage of this new system is to make it much easier for the clinician to find effective acupuncture points. This study attempts to prove the significance of the proposed therapeutic strategies by analyzing data collected from a clinical survey of various WAD using non-supervised statistical methods, such as correlation analysis, factor analysis, and cluster analysis. The clinical survey data have successfully verified discrete characteristics of four neck syndromes, based upon the range of motion (ROM) and tender point location findings. A summary of the relationships among the symptoms of the four neck syndromes has shown the correlation coefficient as having a statistical significance (P < 0.01 or P < 0.05), especially with regard to ROM. Furthermore, factor and cluster analyses resulted in a total of 11 categories of general symptoms, which implies syndrome factors are more related to the Liver, as originally described in classical theory. The hypothesis of meridian sinew syndromes in WAD is clearly supported by the statistical analysis of the clinical trials. This new discovery should be beneficial in improving therapeutic outcomes.

  2. Classification and statistical analysis of mine spoils chemical composition from Oliete basin (Teruel, NE Spain)

    NASA Astrophysics Data System (ADS)

    Meseguer, S.; Sanfeliu, T.; Jordán, M. M.

    2009-02-01

    The Oliete basin (Early Cretaceous, NE Teruel, Spain) is one of the most important areas for the supply of mine spoils used as ball clays for the production of white and red stoneware in the Spanish ceramic industry of wall and floor tiles. This study corresponds to the second part of the paper published recently by Meseguer et al. (Environ Geol 2008) about the use of mine spoils from Teruel coal mining district. The present study shows a statistical data analysis from chemical data (major, minor and trace elements). The performed statistical analysis of chemical data included descriptive statistics and cluster analysis (with ANOVA and Scheffé methods). The cluster analysis of chemical data provided three main groups: C3 with the highest mean SiO2 content (66%) and lowest mean Al2O3 content (20%); C2 with lower SiO2 content (48%) and higher mean Al2O3 content (28%); and C1 with medium values for the SiO2 and Al2O3 mean content. The main applications of these materials are refractory, white and red ceramics, stoneware, heavy ceramics (including red earthenware, bricks and roof tiles), and components of white Portland cement and aluminous cement. Clays from group 2 are used in refractories (with higher kaolinite content, and constrictions to CaO + MgO and K2O + Na2O contents). All materials can be used in fine ceramics (white or red, according to the Fe2O3 + TiO2 content).

  3. Person mobility in the design and analysis of cluster-randomized cohort prevention trials.

    PubMed

    Vuchinich, Sam; Flay, Brian R; Aber, Lawrence; Bickman, Leonard

    2012-06-01

    Person mobility is an inescapable fact of life for most cluster-randomized (e.g., schools, hospitals, clinic, cities, state) cohort prevention trials. Mobility rates are an important substantive consideration in estimating the effects of an intervention. In cluster-randomized trials, mobility rates are often correlated with ethnicity, poverty and other variables associated with disparity. This raises the possibility that estimated intervention effects may generalize to only the least mobile segments of a population and, thus, create a threat to external validity. Such mobility can also create threats to the internal validity of conclusions from randomized trials. Researchers must decide how to deal with persons who leave study clusters during a trial (dropouts), persons and clusters that do not comply with an assigned intervention, and persons who enter clusters during a trial (late entrants), in addition to the persons who remain for the duration of a trial (stayers). Statistical techniques alone cannot solve the key issues of internal and external validity raised by the phenomenon of person mobility. This commentary presents a systematic, Campbellian-type analysis of person mobility in cluster-randomized cohort prevention trials. It describes four approaches for dealing with dropouts, late entrants and stayers with respect to data collection, analysis and generalizability. The questions at issue are: 1) From whom should data be collected at each wave of data collection? 2) Which cases should be included in the analyses of an intervention effect? and 3) To what populations can trial results be generalized? The conclusions lead to recommendations for the design and analysis of future cluster-randomized cohort prevention trials.

  4. The Technical and Biological Reproducibility of Matrix-Assisted Laser Desorption Ionization-Time of Flight Mass Spectrometry (MALDI-TOF MS) Based Typing: Employment of Bioinformatics in a Multicenter Study.

    PubMed

    Oberle, Michael; Wohlwend, Nadia; Jonas, Daniel; Maurer, Florian P; Jost, Geraldine; Tschudin-Sutter, Sarah; Vranckx, Katleen; Egli, Adrian

    2016-01-01

    The technical, biological, and inter-center reproducibility of matrix-assisted laser desorption ionization-time of flight mass spectrometry (MALDI TOF MS) typing data has not yet been explored. The aim of this study is to compare typing data from multiple centers employing bioinformatics using bacterial strains from two past outbreaks and non-related strains. Participants received twelve extended spectrum betalactamase-producing E. coli isolates and followed the same standard operating procedure (SOP) including a full-protein extraction protocol. All laboratories provided visually read spectra via flexAnalysis (Bruker, Germany). Raw data from each laboratory allowed calculating the technical and biological reproducibility between centers using BioNumerics (Applied Maths NV, Belgium). Technical and biological reproducibility ranged between 96.8-99.4% and 47.6-94.4%, respectively. The inter-center reproducibility showed a comparable clustering among identical isolates. Principal component analysis indicated a higher tendency to cluster within the same center. Therefore, we used a discriminant analysis, which completely separated the clusters. Next, we defined a reference center and performed a statistical analysis to identify specific peaks to identify the outbreak clusters. Finally, we used a classifier algorithm and a linear support vector machine on the determined peaks as classifier. A validation showed that within the set of the reference center, the identification of the cluster was 100% correct with a large contrast between the score with the correct cluster and the next best scoring cluster. Based on the sufficient technical and biological reproducibility of MALDI-TOF MS based spectra, detection of specific clusters is possible from spectra obtained from different centers. However, we believe that a shared SOP and a bioinformatics approach are required to make the analysis robust and reliable.

  5. The association between content of the elements S, Cl, K, Fe, Cu, Zn and Br in normal and cirrhotic liver tissue from Danes and Greenlandic Inuit examined by dual hierarchical clustering analysis.

    PubMed

    Laursen, Jens; Milman, Nils; Pind, Niels; Pedersen, Henrik; Mulvad, Gert

    2014-01-01

    Meta-analysis of previous studies evaluating associations between content of elements sulphur (S), chlorine (Cl), potassium (K), iron (Fe), copper (Cu), zinc (Zn) and bromine (Br) in normal and cirrhotic autopsy liver tissue samples. Normal liver samples from 45 Greenlandic Inuit, median age 60 years and from 71 Danes, median age 61 years. Cirrhotic liver samples from 27 Danes, median age 71 years. Element content was measured using X-ray fluorescence spectrometry. Dual hierarchical clustering analysis, creating a dual dendrogram, one clustering element contents according to calculated similarities, one clustering elements according to correlation coefficients between the element contents, both using Euclidian distance and Ward Procedure. One dendrogram separated subjects in 7 clusters showing no differences in ethnicity, gender or age. The analysis discriminated between elements in normal and cirrhotic livers. The other dendrogram clustered elements in four clusters: sulphur and chlorine; copper and bromine; potassium and zinc; iron. There were significant correlations between the elements in normal liver samples: S was associated with Cl, K, Br and Zn; Cl with S and Br; K with S, Br and Zn; Cu with Br. Zn with S and K. Br with S, Cl, K and Cu. Fe did not show significant associations with any other element. In contrast to simple statistical methods, which analyses content of elements separately one by one, dual hierarchical clustering analysis incorporates all elements at the same time and can be used to examine the linkage and interplay between multiple elements in tissue samples. Copyright © 2013 Elsevier GmbH. All rights reserved.

  6. A measurement of CMB cluster lensing with SPT and DES year 1 data

    NASA Astrophysics Data System (ADS)

    Baxter, E. J.; Raghunathan, S.; Crawford, T. M.; Fosalba, P.; Hou, Z.; Holder, G. P.; Omori, Y.; Patil, S.; Rozo, E.; Abbott, T. M. C.; Annis, J.; Aylor, K.; Benoit-Lévy, A.; Benson, B. A.; Bertin, E.; Bleem, L.; Buckley-Geer, E.; Burke, D. L.; Carlstrom, J.; Carnero Rosell, A.; Carrasco Kind, M.; Carretero, J.; Chang, C. L.; Cho, H.-M.; Crites, A. T.; Crocce, M.; Cunha, C. E.; da Costa, L. N.; D'Andrea, C. B.; Davis, C.; de Haan, T.; Desai, S.; Dietrich, J. P.; Dobbs, M. A.; Dodelson, S.; Doel, P.; Drlica-Wagner, A.; Estrada, J.; Everett, W. B.; Fausti Neto, A.; Flaugher, B.; Frieman, J.; García-Bellido, J.; George, E. M.; Gaztanaga, E.; Giannantonio, T.; Gruen, D.; Gruendl, R. A.; Gschwend, J.; Gutierrez, G.; Halverson, N. W.; Harrington, N. L.; Hartley, W. G.; Holzapfel, W. L.; Honscheid, K.; Hrubes, J. D.; Jain, B.; James, D. J.; Jarvis, M.; Jeltema, T.; Knox, L.; Krause, E.; Kuehn, K.; Kuhlmann, S.; Kuropatkin, N.; Lahav, O.; Lee, A. T.; Leitch, E. M.; Li, T. S.; Lima, M.; Luong-Van, D.; Manzotti, A.; March, M.; Marrone, D. P.; Marshall, J. L.; Martini, P.; McMahon, J. J.; Melchior, P.; Menanteau, F.; Meyer, S. S.; Miller, C. J.; Miquel, R.; Mocanu, L. M.; Mohr, J. J.; Natoli, T.; Nord, B.; Ogando, R. L. C.; Padin, S.; Plazas, A. A.; Pryke, C.; Rapetti, D.; Reichardt, C. L.; Romer, A. K.; Roodman, A.; Ruhl, J. E.; Rykoff, E.; Sako, M.; Sanchez, E.; Sayre, J. T.; Scarpine, V.; Schaffer, K. K.; Schindler, R.; Schubnell, M.; Sevilla-Noarbe, I.; Shirokoff, E.; Smith, M.; Smith, R. C.; Soares-Santos, M.; Sobreira, F.; Staniszewski, Z.; Stark, A.; Story, K.; Suchyta, E.; Tarle, G.; Thomas, D.; Troxel, M. A.; Vanderlinde, K.; Vieira, J. D.; Walker, A. R.; Williamson, R.; Zhang, Y.; Zuntz, J.

    2018-05-01

    Clusters of galaxies gravitationally lens the cosmic microwave background (CMB) radiation, resulting in a distinct imprint in the CMB on arcminute scales. Measurement of this effect offers a promising way to constrain the masses of galaxy clusters, particularly those at high redshift. We use CMB maps from the South Pole Telescope Sunyaev-Zel'dovich (SZ) survey to measure the CMB lensing signal around galaxy clusters identified in optical imaging from first year observations of the Dark Energy Survey. The cluster catalogue used in this analysis contains 3697 members with mean redshift of \\bar{z} = 0.45. We detect lensing of the CMB by the galaxy clusters at 8.1σ significance. Using the measured lensing signal, we constrain the amplitude of the relation between cluster mass and optical richness to roughly 17 {per cent} precision, finding good agreement with recent constraints obtained with galaxy lensing. The error budget is dominated by statistical noise but includes significant contributions from systematic biases due to the thermal SZ effect and cluster miscentring.

  7. Health-risk behaviour in Croatia.

    PubMed

    Bécue-Bertaut, Mónica; Kern, Josipa; Hernández-Maldonado, Maria-Luisa; Juresa, Vesna; Vuletic, Silvije

    2008-02-01

    To identify the health-risk behaviour of various homogeneous clusters of individuals. The study was conducted in 13 of the 20 Croatian counties and in Zagreb, the Croatian capital. In the first stage, general practices were selected in each county. The second-stage sample was created by drawing a random subsample of 10% of the patients registered at each selected general practice. The sample was divided into seven homogenous clusters using statistical methodology, combining multiple factor analysis with a hybrid clustering method. Seven homogeneous clusters were identified, three composed of males and four composed of females, based on statistically significant differences between selected characteristics (P<0.001). Although, in general, self-assessed health declined with age, significant variations were observed within specific age intervals. Higher levels of self-assessed health were associated with higher levels of education and/or socio-economic status. Many individuals, especially females, who self-reported poor health were heavy consumers of sleeping pills. Males and females reported different health-risk behaviours related to lifestyle, diet and use of the healthcare system. Heavy alcohol and tobacco use, unhealthy diet, risky physical activity and non-use of the healthcare system influenced self-assessed health in males. Females were slightly less satisfied with their health than males of the same age and educational level. Even highly educated females who took preventive healthcare tests and ate a healthy diet reported a less satisfactory self-assessed level of health than expected. Sociodemographic characteristics, life style, self-assessed health and use of the healthcare system were used in the identification of seven homogeneous population clusters. A comprehensive analysis of these clusters suggests health-related prevention and intervention efforts geared towards specific populations.

  8. Targeting regional pediatric congenital hearing loss using a spatial scan statistic.

    PubMed

    Bush, Matthew L; Christian, Warren Jay; Bianchi, Kristin; Lester, Cathy; Schoenberg, Nancy

    2015-01-01

    Congenital hearing loss is a common problem, and timely identification and intervention are paramount for language development. Patients from rural regions may have many barriers to timely diagnosis and intervention. The purpose of this study was to examine the spatial and hospital-based distribution of failed infant hearing screening testing and pediatric congenital hearing loss throughout Kentucky. Data on live births and audiological reporting of infant hearing loss results in Kentucky from 2009 to 2011 were analyzed. The authors used spatial scan statistics to identify high-rate clusters of failed newborn screening tests and permanent congenital hearing loss (PCHL), based on the total number of live births per county. The authors conducted further analyses on PCHL and failed newborn hearing screening tests, based on birth hospital data and method of screening. The authors observed four statistically significant (p < 0.05) high-rate clusters with failed newborn hearing screenings in Kentucky, including two in the Appalachian region. Hospitals using two-stage otoacoustic emission testing demonstrated higher rates of failed screening (p = 0.009) than those using two-stage automated auditory brainstem response testing. A significant cluster of high rate of PCHL was observed in Western Kentucky. Five of the 54 birthing hospitals were found to have higher relative risk of PCHL, and two of those hospitals are located in a very rural region of Western Kentucky within the cluster. This spatial analysis in children in Kentucky has identified specific regions throughout the state with high rates of congenital hearing loss and failed newborn hearing screening tests. Further investigation regarding causative factors is warranted. This method of analysis can be useful in the setting of hearing health disparities to focus efforts on regions facing high incidence of congenital hearing loss.

  9. Cluster-guided imaging-based CFD analysis of airflow and particle deposition in asthmatic human lungs

    NASA Astrophysics Data System (ADS)

    Choi, Jiwoong; Leblanc, Lawrence; Choi, Sanghun; Haghighi, Babak; Hoffman, Eric; Lin, Ching-Long

    2017-11-01

    The goal of this study is to assess inter-subject variability in delivery of orally inhaled drug products to small airways in asthmatic lungs. A recent multiscale imaging-based cluster analysis (MICA) of computed tomography (CT) lung images in an asthmatic cohort identified four clusters with statistically distinct structural and functional phenotypes associating with unique clinical biomarkers. Thus, we aimed to address inter-subject variability via inter-cluster variability. We selected a representative subject from each of the 4 asthma clusters as well as 1 male and 1 female healthy controls, and performed computational fluid and particle simulations on CT-based airway models of these subjects. The results from one severe and one non-severe asthmatic cluster subjects characterized by segmental airway constriction had increased particle deposition efficiency, as compared with the other two cluster subjects (one non-severe and one severe asthmatics) without airway constriction. Constriction-induced jets impinging on distal bifurcations led to excessive particle deposition. The results emphasize the impact of airway constriction on regional particle deposition rather than disease severity, demonstrating the potential of using cluster membership to tailor drug delivery. NIH Grants U01HL114494 and S10-RR022421, and FDA Grant U01FD005837. XSEDE.

  10. HICOSMO - X-ray analysis of a complete sample of galaxy clusters

    NASA Astrophysics Data System (ADS)

    Schellenberger, G.; Reiprich, T.

    2017-10-01

    Galaxy clusters are known to be the largest virialized objects in the Universe. Based on the theory of structure formation one can use them as cosmological probes, since they originate from collapsed overdensities in the early Universe and witness its history. The X-ray regime provides the unique possibility to measure in detail the most massive visible component, the intra cluster medium. Using Chandra observations of a local sample of 64 bright clusters (HIFLUGCS) we provide total (hydrostatic) and gas mass estimates of each cluster individually. Making use of the completeness of the sample we quantify two interesting cosmological parameters by a Bayesian cosmological likelihood analysis. We find Ω_{M}=0.3±0.01 and σ_{8}=0.79±0.03 (statistical uncertainties) using our default analysis strategy combining both, a mass function analysis and the gas mass fraction results. The main sources of biases that we discuss and correct here are (1) the influence of galaxy groups (higher incompleteness in parent samples and a differing behavior of the L_{x} - M relation), (2) the hydrostatic mass bias (as determined by recent hydrodynamical simulations), (3) the extrapolation of the total mass (comparing various methods), (4) the theoretical halo mass function and (5) other cosmological (non-negligible neutrino mass), and instrumental (calibration) effects.

  11. Alteration mapping at Goldfield, Nevada, by cluster and discriminant analysis of Landsat digital data. [mapping of hydrothermally altered volcanic rocks

    NASA Technical Reports Server (NTRS)

    Ballew, G.

    1977-01-01

    The ability of Landsat multispectral digital data to differentiate among 62 combinations of rock and alteration types at the Goldfield mining district of Western Nevada was investigated by using statistical techniques of cluster and discriminant analysis. Multivariate discriminant analysis was not effective in classifying each of the 62 groups, with classification results essentially the same whether data of four channels alone or combined with six ratios of channels were used. Bivariate plots of group means revealed a cluster of three groups including mill tailings, basalt and all other rock and alteration types. Automatic hierarchical clustering based on the fourth dimensional Mahalanobis distance between group means of 30 groups having five or more samples was performed using Johnson's HICLUS program. The results of the cluster analysis revealed hierarchies of mill tailings vs. natural materials, basalt vs. non-basalt, highly reflectant rocks vs. other rocks and exclusively unaltered rocks vs. predominantly altered rocks. The hierarchies were used to determine the order in which sets of multiple discriminant analyses were to be performed and the resulting discriminant functions were used to produce a map of geology and alteration which has an overall accuracy of 70 percent for discriminating exclusively altered rocks from predominantly altered rocks.

  12. Geographic clusters in underimmunization and vaccine refusal.

    PubMed

    Lieu, Tracy A; Ray, G Thomas; Klein, Nicola P; Chung, Cindy; Kulldorff, Martin

    2015-02-01

    Parental refusal and delay of childhood vaccines has increased in recent years and is believed to cluster in some communities. Such clusters could pose public health risks and barriers to achieving immunization quality benchmarks. Our aims were to (1) describe geographic clusters of underimmunization and vaccine refusal, (2) compare clusters of underimmunization with different vaccines, and (3) evaluate whether vaccine refusal clusters may pose barriers to achieving high immunization rates. We analyzed electronic health records among children born between 2000 and 2011 with membership in Kaiser Permanente Northern California. The study population included 154,424 children in 13 counties with continuous membership from birth to 36 months of age. We used spatial scan statistics to identify clusters of underimmunization (having missed 1 or more vaccines by 36 months of age) and vaccine refusal (based on International Classification of Diseases, Ninth Revision, Clinical Modification codes). We identified 5 statistically significant clusters of underimmunization among children who turned 36 months old during 2010-2012. The underimmunization rate within clusters ranged from 18% to 23%, and the rate outside them was 11%. Children in the most statistically significant cluster had 1.58 (P < .001) times the rate of underimmunization as others. Underimmunization with measles, mumps, rubella vaccine and varicella vaccines clustered in similar geographic areas. Vaccine refusal also clustered, with rates of 5.5% to 13.5% within clusters, compared with 2.6% outside them. Underimmunization and vaccine refusal cluster geographically. Spatial scan statistics may be a useful tool to identify locations with challenges to achieving high immunization rates, which deserve focused intervention. Copyright © 2015 by the American Academy of Pediatrics.

  13. Detection of Clostridium difficile infection clusters, using the temporal scan statistic, in a community hospital in southern Ontario, Canada, 2006-2011.

    PubMed

    Faires, Meredith C; Pearl, David L; Ciccotelli, William A; Berke, Olaf; Reid-Smith, Richard J; Weese, J Scott

    2014-05-12

    In hospitals, Clostridium difficile infection (CDI) surveillance relies on unvalidated guidelines or threshold criteria to identify outbreaks. This can result in false-positive and -negative cluster alarms. The application of statistical methods to identify and understand CDI clusters may be a useful alternative or complement to standard surveillance techniques. The objectives of this study were to investigate the utility of the temporal scan statistic for detecting CDI clusters and determine if there are significant differences in the rate of CDI cases by month, season, and year in a community hospital. Bacteriology reports of patients identified with a CDI from August 2006 to February 2011 were collected. For patients detected with CDI from March 2010 to February 2011, stool specimens were obtained. Clostridium difficile isolates were characterized by ribotyping and investigated for the presence of toxin genes by PCR. CDI clusters were investigated using a retrospective temporal scan test statistic. Statistically significant clusters were compared to known CDI outbreaks within the hospital. A negative binomial regression model was used to identify associations between year, season, month and the rate of CDI cases. Overall, 86 CDI cases were identified. Eighteen specimens were analyzed and nine ribotypes were classified with ribotype 027 (n = 6) the most prevalent. The temporal scan statistic identified significant CDI clusters at the hospital (n = 5), service (n = 6), and ward (n = 4) levels (P ≤ 0.05). Three clusters were concordant with the one C. difficile outbreak identified by hospital personnel. Two clusters were identified as potential outbreaks. The negative binomial model indicated years 2007-2010 (P ≤ 0.05) had decreased CDI rates compared to 2006 and spring had an increased CDI rate compared to the fall (P = 0.023). Application of the temporal scan statistic identified several clusters, including potential outbreaks not detected by hospital personnel. The identification of time periods with decreased or increased CDI rates may have been a result of specific hospital events. Understanding the clustering of CDIs can aid in the interpretation of surveillance data and lead to the development of better early detection systems.

  14. Unequal cluster sizes in stepped-wedge cluster randomised trials: a systematic review.

    PubMed

    Kristunas, Caroline; Morris, Tom; Gray, Laura

    2017-11-15

    To investigate the extent to which cluster sizes vary in stepped-wedge cluster randomised trials (SW-CRT) and whether any variability is accounted for during the sample size calculation and analysis of these trials. Any, not limited to healthcare settings. Any taking part in an SW-CRT published up to March 2016. The primary outcome is the variability in cluster sizes, measured by the coefficient of variation (CV) in cluster size. Secondary outcomes include the difference between the cluster sizes assumed during the sample size calculation and those observed during the trial, any reported variability in cluster sizes and whether the methods of sample size calculation and methods of analysis accounted for any variability in cluster sizes. Of the 101 included SW-CRTs, 48% mentioned that the included clusters were known to vary in size, yet only 13% of these accounted for this during the calculation of the sample size. However, 69% of the trials did use a method of analysis appropriate for when clusters vary in size. Full trial reports were available for 53 trials. The CV was calculated for 23 of these: the median CV was 0.41 (IQR: 0.22-0.52). Actual cluster sizes could be compared with those assumed during the sample size calculation for 14 (26%) of the trial reports; the cluster sizes were between 29% and 480% of that which had been assumed. Cluster sizes often vary in SW-CRTs. Reporting of SW-CRTs also remains suboptimal. The effect of unequal cluster sizes on the statistical power of SW-CRTs needs further exploration and methods appropriate to studies with unequal cluster sizes need to be employed. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2017. All rights reserved. No commercial use is permitted unless otherwise expressly granted.

  15. Large-Scale Genomic Analysis of Codon Usage in Dengue Virus and Evaluation of Its Phylogenetic Dependence

    PubMed Central

    Lara-Ramírez, Edgar E.; Salazar, Ma Isabel; López-López, María de Jesús; Salas-Benito, Juan Santiago; Sánchez-Varela, Alejandro

    2014-01-01

    The increasing number of dengue virus (DENV) genome sequences available allows identifying the contributing factors to DENV evolution. In the present study, the codon usage in serotypes 1–4 (DENV1–4) has been explored for 3047 sequenced genomes using different statistics methods. The correlation analysis of total GC content (GC) with GC content at the three nucleotide positions of codons (GC1, GC2, and GC3) as well as the effective number of codons (ENC, ENCp) versus GC3 plots revealed mutational bias and purifying selection pressures as the major forces influencing the codon usage, but with distinct pressure on specific nucleotide position in the codon. The correspondence analysis (CA) and clustering analysis on relative synonymous codon usage (RSCU) within each serotype showed similar clustering patterns to the phylogenetic analysis of nucleotide sequences for DENV1–4. These clustering patterns are strongly related to the virus geographic origin. The phylogenetic dependence analysis also suggests that stabilizing selection acts on the codon usage bias. Our analysis of a large scale reveals new feature on DENV genomic evolution. PMID:25136631

  16. A note on the kappa statistic for clustered dichotomous data.

    PubMed

    Zhou, Ming; Yang, Zhao

    2014-06-30

    The kappa statistic is widely used to assess the agreement between two raters. Motivated by a simulation-based cluster bootstrap method to calculate the variance of the kappa statistic for clustered physician-patients dichotomous data, we investigate its special correlation structure and develop a new simple and efficient data generation algorithm. For the clustered physician-patients dichotomous data, based on the delta method and its special covariance structure, we propose a semi-parametric variance estimator for the kappa statistic. An extensive Monte Carlo simulation study is performed to evaluate the performance of the new proposal and five existing methods with respect to the empirical coverage probability, root-mean-square error, and average width of the 95% confidence interval for the kappa statistic. The variance estimator ignoring the dependence within a cluster is generally inappropriate, and the variance estimators from the new proposal, bootstrap-based methods, and the sampling-based delta method perform reasonably well for at least a moderately large number of clusters (e.g., the number of clusters K ⩾50). The new proposal and sampling-based delta method provide convenient tools for efficient computations and non-simulation-based alternatives to the existing bootstrap-based methods. Moreover, the new proposal has acceptable performance even when the number of clusters is as small as K = 25. To illustrate the practical application of all the methods, one psychiatric research data and two simulated clustered physician-patients dichotomous data are analyzed. Copyright © 2014 John Wiley & Sons, Ltd.

  17. Bootstrap-based methods for estimating standard errors in Cox's regression analyses of clustered event times.

    PubMed

    Xiao, Yongling; Abrahamowicz, Michal

    2010-03-30

    We propose two bootstrap-based methods to correct the standard errors (SEs) from Cox's model for within-cluster correlation of right-censored event times. The cluster-bootstrap method resamples, with replacement, only the clusters, whereas the two-step bootstrap method resamples (i) the clusters, and (ii) individuals within each selected cluster, with replacement. In simulations, we evaluate both methods and compare them with the existing robust variance estimator and the shared gamma frailty model, which are available in statistical software packages. We simulate clustered event time data, with latent cluster-level random effects, which are ignored in the conventional Cox's model. For cluster-level covariates, both proposed bootstrap methods yield accurate SEs, and type I error rates, and acceptable coverage rates, regardless of the true random effects distribution, and avoid serious variance under-estimation by conventional Cox-based standard errors. However, the two-step bootstrap method over-estimates the variance for individual-level covariates. We also apply the proposed bootstrap methods to obtain confidence bands around flexible estimates of time-dependent effects in a real-life analysis of cluster event times.

  18. Cluster: A New Application for Spatial Analysis of Pixelated Data for Epiphytotics.

    PubMed

    Nelson, Scot C; Corcoja, Iulian; Pethybridge, Sarah J

    2017-12-01

    Spatial analysis of epiphytotics is essential to develop and test hypotheses about pathogen ecology, disease dynamics, and to optimize plant disease management strategies. Data collection for spatial analysis requires substantial investment in time to depict patterns in various frames and hierarchies. We developed a new approach for spatial analysis of pixelated data in digital imagery and incorporated the method in a stand-alone desktop application called Cluster. The user isolates target entities (clusters) by designating up to 24 pixel colors as nontargets and moves a threshold slider to visualize the targets. The app calculates the percent area occupied by targeted pixels, identifies the centroids of targeted clusters, and computes the relative compass angle of orientation for each cluster. Users can deselect anomalous clusters manually and/or automatically by specifying a size threshold value to exclude smaller targets from the analysis. Up to 1,000 stochastic simulations randomly place the centroids of each cluster in ranked order of size (largest to smallest) within each matrix while preserving their calculated angles of orientation for the long axes. A two-tailed probability t test compares the mean inter-cluster distances for the observed versus the values derived from randomly simulated maps. This is the basis for statistical testing of the null hypothesis that the clusters are randomly distributed within the frame of interest. These frames can assume any shape, from natural (e.g., leaf) to arbitrary (e.g., a rectangular or polygonal field). Cluster summarizes normalized attributes of clusters, including pixel number, axis length, axis width, compass orientation, and the length/width ratio, available to the user as a downloadable spreadsheet. Each simulated map may be saved as an image and inspected. Provided examples demonstrate the utility of Cluster to analyze patterns at various spatial scales in plant pathology and ecology and highlight the limitations, trade-offs, and considerations for the sensitivities of variables and the biological interpretations of results. The Cluster app is available as a free download for Apple computers at iTunes, with a link to a user guide website.

  19. Unsupervised classification of remote multispectral sensing data

    NASA Technical Reports Server (NTRS)

    Su, M. Y.

    1972-01-01

    The new unsupervised classification technique for classifying multispectral remote sensing data which can be either from the multispectral scanner or digitized color-separation aerial photographs consists of two parts: (a) a sequential statistical clustering which is a one-pass sequential variance analysis and (b) a generalized K-means clustering. In this composite clustering technique, the output of (a) is a set of initial clusters which are input to (b) for further improvement by an iterative scheme. Applications of the technique using an IBM-7094 computer on multispectral data sets over Purdue's Flight Line C-1 and the Yellowstone National Park test site have been accomplished. Comparisons between the classification maps by the unsupervised technique and the supervised maximum liklihood technique indicate that the classification accuracies are in agreement.

  20. Country clustering applied to the water and sanitation sector: a new tool with potential applications in research and policy.

    PubMed

    Onda, Kyle; Crocker, Jonny; Kayser, Georgia Lyn; Bartram, Jamie

    2014-03-01

    The fields of global health and international development commonly cluster countries by geography and income to target resources and describe progress. For any given sector of interest, a range of relevant indicators can serve as a more appropriate basis for classification. We create a new typology of country clusters specific to the water and sanitation (WatSan) sector based on similarities across multiple WatSan-related indicators. After a literature review and consultation with experts in the WatSan sector, nine indicators were selected. Indicator selection was based on relevance to and suggested influence on national water and sanitation service delivery, and to maximize data availability across as many countries as possible. A hierarchical clustering method and a gap statistic analysis were used to group countries into a natural number of relevant clusters. Two stages of clustering resulted in five clusters, representing 156 countries or 6.75 billion people. The five clusters were not well explained by income or geography, and were distinct from existing country clusters used in international development. Analysis of these five clusters revealed that they were more compact and well separated than United Nations and World Bank country clusters. This analysis and resulting country typology suggest that previous geography- or income-based country groupings can be improved upon for applications in the WatSan sector by utilizing globally available WatSan-related indicators. Potential applications include guiding and discussing research, informing policy, improving resource targeting, describing sector progress, and identifying critical knowledge gaps in the WatSan sector. Copyright © 2013 Elsevier GmbH. All rights reserved.

  1. Quasi-Likelihood Techniques in a Logistic Regression Equation for Identifying Simulium damnosum s.l. Larval Habitats Intra-cluster Covariates in Togo.

    PubMed

    Jacob, Benjamin G; Novak, Robert J; Toe, Laurent; Sanfo, Moussa S; Afriyie, Abena N; Ibrahim, Mohammed A; Griffith, Daniel A; Unnasch, Thomas R

    2012-01-01

    The standard methods for regression analyses of clustered riverine larval habitat data of Simulium damnosum s.l. a major black-fly vector of Onchoceriasis, postulate models relating observational ecological-sampled parameter estimators to prolific habitats without accounting for residual intra-cluster error correlation effects. Generally, this correlation comes from two sources: (1) the design of the random effects and their assumed covariance from the multiple levels within the regression model; and, (2) the correlation structure of the residuals. Unfortunately, inconspicuous errors in residual intra-cluster correlation estimates can overstate precision in forecasted S.damnosum s.l. riverine larval habitat explanatory attributes regardless how they are treated (e.g., independent, autoregressive, Toeplitz, etc). In this research, the geographical locations for multiple riverine-based S. damnosum s.l. larval ecosystem habitats sampled from 2 pre-established epidemiological sites in Togo were identified and recorded from July 2009 to June 2010. Initially the data was aggregated into proc genmod. An agglomerative hierarchical residual cluster-based analysis was then performed. The sampled clustered study site data was then analyzed for statistical correlations using Monthly Biting Rates (MBR). Euclidean distance measurements and terrain-related geomorphological statistics were then generated in ArcGIS. A digital overlay was then performed also in ArcGIS using the georeferenced ground coordinates of high and low density clusters stratified by Annual Biting Rates (ABR). This data was overlain onto multitemporal sub-meter pixel resolution satellite data (i.e., QuickBird 0.61m wavbands ). Orthogonal spatial filter eigenvectors were then generated in SAS/GIS. Univariate and non-linear regression-based models (i.e., Logistic, Poisson and Negative Binomial) were also employed to determine probability distributions and to identify statistically significant parameter estimators from the sampled data. Thereafter, Durbin-Watson test statistics were used to test the null hypothesis that the regression residuals were not autocorrelated against the alternative that the residuals followed an autoregressive process in AUTOREG. Bayesian uncertainty matrices were also constructed employing normal priors for each of the sampled estimators in PROC MCMC. The residuals revealed both spatially structured and unstructured error effects in the high and low ABR-stratified clusters. The analyses also revealed that the estimators, levels of turbidity and presence of rocks were statistically significant for the high-ABR-stratified clusters, while the estimators distance between habitats and floating vegetation were important for the low-ABR-stratified cluster. Varying and constant coefficient regression models, ABR- stratified GIS-generated clusters, sub-meter resolution satellite imagery, a robust residual intra-cluster diagnostic test, MBR-based histograms, eigendecomposition spatial filter algorithms and Bayesian matrices can enable accurate autoregressive estimation of latent uncertainity affects and other residual error probabilities (i.e., heteroskedasticity) for testing correlations between georeferenced S. damnosum s.l. riverine larval habitat estimators. The asymptotic distribution of the resulting residual adjusted intra-cluster predictor error autocovariate coefficients can thereafter be established while estimates of the asymptotic variance can lead to the construction of approximate confidence intervals for accurately targeting productive S. damnosum s.l habitats based on spatiotemporal field-sampled count data.

  2. Multivariate Statistical Analysis of MSL APXS Bulk Geochemical Data

    NASA Astrophysics Data System (ADS)

    Hamilton, V. E.; Edwards, C. S.; Thompson, L. M.; Schmidt, M. E.

    2014-12-01

    We apply cluster and factor analyses to bulk chemical data of 130 soil and rock samples measured by the Alpha Particle X-ray Spectrometer (APXS) on the Mars Science Laboratory (MSL) rover Curiosity through sol 650. Multivariate approaches such as principal components analysis (PCA), cluster analysis, and factor analysis compliment more traditional approaches (e.g., Harker diagrams), with the advantage of simultaneously examining the relationships between multiple variables for large numbers of samples. Principal components analysis has been applied with success to APXS, Pancam, and Mössbauer data from the Mars Exploration Rovers. Factor analysis and cluster analysis have been applied with success to thermal infrared (TIR) spectral data of Mars. Cluster analyses group the input data by similarity, where there are a number of different methods for defining similarity (hierarchical, density, distribution, etc.). For example, without any assumptions about the chemical contributions of surface dust, preliminary hierarchical and K-means cluster analyses clearly distinguish the physically adjacent rock targets Windjana and Stephen as being distinctly different than lithologies observed prior to Curiosity's arrival at The Kimberley. In addition, they are separated from each other, consistent with chemical trends observed in variation diagrams but without requiring assumptions about chemical relationships. We will discuss the variation in cluster analysis results as a function of clustering method and pre-processing (e.g., log transformation, correction for dust cover) and implications for interpreting chemical data. Factor analysis shares some similarities with PCA, and examines the variability among observed components of a dataset so as to reveal variations attributable to unobserved components. Factor analysis has been used to extract the TIR spectra of components that are typically observed in mixtures and only rarely in isolation; there is the potential for similar results with data from APXS. These techniques offer new ways to understand the chemical relationships between the materials interrogated by Curiosity, and potentially their relation to materials observed by APXS instruments on other landed missions.

  3. Multi-particle correlations in transverse momenta from statistical clusters

    NASA Astrophysics Data System (ADS)

    Bialas, Andrzej; Bzdak, Adam

    2016-09-01

    We evaluate n-particle (n = 2 , 3 , 4 , 5) transverse momentum correlations for pions and kaons following from the decay of statistical clusters. These correlation functions could provide strong constraints on a possible existence of thermal clusters in the process of particle production.

  4. Precipitation Cluster Distributions: Current Climate Storm Statistics and Projected Changes Under Global Warming

    NASA Astrophysics Data System (ADS)

    Quinn, Kevin Martin

    The total amount of precipitation integrated across a precipitation cluster (contiguous precipitating grid cells exceeding a minimum rain rate) is a useful measure of the aggregate size of the disturbance, expressed as the rate of water mass lost or latent heat released, i.e. the power of the disturbance. Probability distributions of cluster power are examined during boreal summer (May-September) and winter (January-March) using satellite-retrieved rain rates from the Tropical Rainfall Measuring Mission (TRMM) 3B42 and Special Sensor Microwave Imager and Sounder (SSM/I and SSMIS) programs, model output from the High Resolution Atmospheric Model (HIRAM, roughly 0.25-0.5 0 resolution), seven 1-2° resolution members of the Coupled Model Intercomparison Project Phase 5 (CMIP5) experiment, and National Center for Atmospheric Research Large Ensemble (NCAR LENS). Spatial distributions of precipitation-weighted centroids are also investigated in observations (TRMM-3B42) and climate models during winter as a metric for changes in mid-latitude storm tracks. Observed probability distributions for both seasons are scale-free from the smallest clusters up to a cutoff scale at high cluster power, after which the probability density drops rapidly. When low rain rates are excluded by choosing a minimum rain rate threshold in defining clusters, the models accurately reproduce observed cluster power statistics and winter storm tracks. Changes in behavior in the tail of the distribution, above the cutoff, are important for impacts since these quantify the frequency of the most powerful storms. End-of-century cluster power distributions and storm track locations are investigated in these models under a "business as usual" global warming scenario. The probability of high cluster power events increases by end-of-century across all models, by up to an order of magnitude for the highest-power events for which statistics can be computed. For the three models in the suite with continuous time series of high resolution output, there is substantial variability on when these probability increases for the most powerful precipitation clusters become detectable, ranging from detectable within the observational period to statistically significant trends emerging only after 2050. A similar analysis of National Centers for Environmental Prediction (NCEP) Reanalysis 2 and SSM/I-SSMIS rain rate retrievals in the recent observational record does not yield reliable evidence of trends in high-power cluster probabilities at this time. Large impacts to mid-latitude storm tracks are projected over the West Coast and eastern North America, with no less than 8 of the 9 models examined showing large increases by end-of-century in the probability density of the most powerful storms, ranging up to a factor of 6.5 in the highest range bin for which historical statistics are computed. However, within these regional domains, there is considerable variation among models in pinpointing exactly where the largest increases will occur.

  5. Identifying irregularly shaped crime hot-spots using a multiobjective evolutionary algorithm

    NASA Astrophysics Data System (ADS)

    Wu, Xiaolan; Grubesic, Tony H.

    2010-12-01

    Spatial cluster detection techniques are widely used in criminology, geography, epidemiology, and other fields. In particular, spatial scan statistics are popular and efficient techniques for detecting areas of elevated crime or disease events. The majority of spatial scan approaches attempt to delineate geographic zones by evaluating the significance of clusters using likelihood ratio statistics tested with the Poisson distribution. While this can be effective, many scan statistics give preference to circular clusters, diminishing their ability to identify elongated and/or irregular shaped clusters. Although adjusting the shape of the scan window can mitigate some of these problems, both the significance of irregular clusters and their spatial structure must be accounted for in a meaningful way. This paper utilizes a multiobjective evolutionary algorithm to find clusters with maximum significance while quantitatively tracking their geographic structure. Crime data for the city of Cincinnati are utilized to demonstrate the advantages of the new approach and highlight its benefits versus more traditional scan statistics.

  6. Statistical uncertainty of extreme wind storms over Europe derived from a probabilistic clustering technique

    NASA Astrophysics Data System (ADS)

    Walz, Michael; Leckebusch, Gregor C.

    2016-04-01

    Extratropical wind storms pose one of the most dangerous and loss intensive natural hazards for Europe. However, due to only 50 years of high quality observational data, it is difficult to assess the statistical uncertainty of these sparse events just based on observations. Over the last decade seasonal ensemble forecasts have become indispensable in quantifying the uncertainty of weather prediction on seasonal timescales. In this study seasonal forecasts are used in a climatological context: By making use of the up to 51 ensemble members, a broad and physically consistent statistical base can be created. This base can then be used to assess the statistical uncertainty of extreme wind storm occurrence more accurately. In order to determine the statistical uncertainty of storms with different paths of progression, a probabilistic clustering approach using regression mixture models is used to objectively assign storm tracks (either based on core pressure or on extreme wind speeds) to different clusters. The advantage of this technique is that the entire lifetime of a storm is considered for the clustering algorithm. Quadratic curves are found to describe the storm tracks most accurately. Three main clusters (diagonal, horizontal or vertical progression of the storm track) can be identified, each of which have their own particulate features. Basic storm features like average velocity and duration are calculated and compared for each cluster. The main benefit of this clustering technique, however, is to evaluate if the clusters show different degrees of uncertainty, e.g. more (less) spread for tracks approaching Europe horizontally (diagonally). This statistical uncertainty is compared for different seasonal forecast products.

  7. Towards Accurate Modelling of Galaxy Clustering on Small Scales: Testing the Standard ΛCDM + Halo Model

    NASA Astrophysics Data System (ADS)

    Sinha, Manodeep; Berlind, Andreas A.; McBride, Cameron K.; Scoccimarro, Roman; Piscionere, Jennifer A.; Wibking, Benjamin D.

    2018-04-01

    Interpreting the small-scale clustering of galaxies with halo models can elucidate the connection between galaxies and dark matter halos. Unfortunately, the modelling is typically not sufficiently accurate for ruling out models statistically. It is thus difficult to use the information encoded in small scales to test cosmological models or probe subtle features of the galaxy-halo connection. In this paper, we attempt to push halo modelling into the "accurate" regime with a fully numerical mock-based methodology and careful treatment of statistical and systematic errors. With our forward-modelling approach, we can incorporate clustering statistics beyond the traditional two-point statistics. We use this modelling methodology to test the standard ΛCDM + halo model against the clustering of SDSS DR7 galaxies. Specifically, we use the projected correlation function, group multiplicity function and galaxy number density as constraints. We find that while the model fits each statistic separately, it struggles to fit them simultaneously. Adding group statistics leads to a more stringent test of the model and significantly tighter constraints on model parameters. We explore the impact of varying the adopted halo definition and cosmological model and find that changing the cosmology makes a significant difference. The most successful model we tried (Planck cosmology with Mvir halos) matches the clustering of low luminosity galaxies, but exhibits a 2.3σ tension with the clustering of luminous galaxies, thus providing evidence that the "standard" halo model needs to be extended. This work opens the door to adding interesting freedom to the halo model and including additional clustering statistics as constraints.

  8. An analysis of pilot error-related aircraft accidents

    NASA Technical Reports Server (NTRS)

    Kowalsky, N. B.; Masters, R. L.; Stone, R. B.; Babcock, G. L.; Rypka, E. W.

    1974-01-01

    A multidisciplinary team approach to pilot error-related U.S. air carrier jet aircraft accident investigation records successfully reclaimed hidden human error information not shown in statistical studies. New analytic techniques were developed and applied to the data to discover and identify multiple elements of commonality and shared characteristics within this group of accidents. Three techniques of analysis were used: Critical element analysis, which demonstrated the importance of a subjective qualitative approach to raw accident data and surfaced information heretofore unavailable. Cluster analysis, which was an exploratory research tool that will lead to increased understanding and improved organization of facts, the discovery of new meaning in large data sets, and the generation of explanatory hypotheses. Pattern recognition, by which accidents can be categorized by pattern conformity after critical element identification by cluster analysis.

  9. Identification of different nutritional status groups in institutionalized elderly people by cluster analysis.

    PubMed

    López-Contreras, María José; López, Maria Ángeles; Canteras, Manuel; Candela, María Emilia; Zamora, Salvador; Pérez-Llamas, Francisca

    2014-03-01

    To apply a cluster analysis to groups of individuals of similar characteristics in an attempt to identify undernutrition or the risk of undernutrition in this population. A cross-sectional study. Seven public nursing homes in the province of Murcia, on the Mediterranean coast of Spain. 205 subjects aged 65 and older (131 women and 74 men). Dietary intake (energy and nutrients), anthropometric (body mass index, skinfold thickness, mid-arm muscle circumference, mid-arm muscle area, corrected arm muscle area, waist to hip ratio) and biochemical and haematological (serum albumin, transferrin, total cholesterol, total lymphocyte count). Variables were analyzed by cluster analysis. The results of the cluster analysis, including intake, anthropometric and analytical data showed that, of the 205 elderly subjects, 66 (32.2%) were over - weight/obese, 72 (35.1%) had an adequate nutritional status and 67 (32.7%) were undernourished or at risk of undernutrition. The undernourished or at risk of undernutrition group showed the lowest values for dietary intake and the anthropometric and analytical parameters measured. Our study shows that cluster analysis is a useful statistical method for assessing the nutritional status of institutionalized elderly populations. In contrast, use of the specific reference values frequently described in the literature might fail to detect real cases of undernourishment or those at risk of undernutrition. Copyright AULA MEDICA EDICIONES 2014. Published by AULA MEDICA. All rights reserved.

  10. Non-specific filtering of beta-distributed data.

    PubMed

    Wang, Xinhui; Laird, Peter W; Hinoue, Toshinori; Groshen, Susan; Siegmund, Kimberly D

    2014-06-19

    Non-specific feature selection is a dimension reduction procedure performed prior to cluster analysis of high dimensional molecular data. Not all measured features are expected to show biological variation, so only the most varying are selected for analysis. In DNA methylation studies, DNA methylation is measured as a proportion, bounded between 0 and 1, with variance a function of the mean. Filtering on standard deviation biases the selection of probes to those with mean values near 0.5. We explore the effect this has on clustering, and develop alternate filter methods that utilize a variance stabilizing transformation for Beta distributed data and do not share this bias. We compared results for 11 different non-specific filters on eight Infinium HumanMethylation data sets, selected to span a variety of biological conditions. We found that for data sets having a small fraction of samples showing abnormal methylation of a subset of normally unmethylated CpGs, a characteristic of the CpG island methylator phenotype in cancer, a novel filter statistic that utilized a variance-stabilizing transformation for Beta distributed data outperformed the common filter of using standard deviation of the DNA methylation proportion, or its log-transformed M-value, in its ability to detect the cancer subtype in a cluster analysis. However, the standard deviation filter always performed among the best for distinguishing subgroups of normal tissue. The novel filter and standard deviation filter tended to favour features in different genome contexts; for the same data set, the novel filter always selected more features from CpG island promoters and the standard deviation filter always selected more features from non-CpG island intergenic regions. Interestingly, despite selecting largely non-overlapping sets of features, the two filters did find sample subsets that overlapped for some real data sets. We found two different filter statistics that tended to prioritize features with different characteristics, each performed well for identifying clusters of cancer and non-cancer tissue, and identifying a cancer CpG island hypermethylation phenotype. Since cluster analysis is for discovery, we would suggest trying both filters on any new data sets, evaluating the overlap of features selected and clusters discovered.

  11. Evaluating and implementing temporal, spatial, and spatio-temporal methods for outbreak detection in a local syndromic surveillance system

    PubMed Central

    Lall, Ramona; Levin-Rector, Alison; Sell, Jessica; Paladini, Marc; Konty, Kevin J.; Olson, Don; Weiss, Don

    2017-01-01

    The New York City Department of Health and Mental Hygiene has operated an emergency department syndromic surveillance system since 2001, using temporal and spatial scan statistics run on a daily basis for cluster detection. Since the system was originally implemented, a number of new methods have been proposed for use in cluster detection. We evaluated six temporal and four spatial/spatio-temporal detection methods using syndromic surveillance data spiked with simulated injections. The algorithms were compared on several metrics, including sensitivity, specificity, positive predictive value, coherence, and timeliness. We also evaluated each method’s implementation, programming time, run time, and the ease of use. Among the temporal methods, at a set specificity of 95%, a Holt-Winters exponential smoother performed the best, detecting 19% of the simulated injects across all shapes and sizes, followed by an autoregressive moving average model (16%), a generalized linear model (15%), a modified version of the Early Aberration Reporting System’s C2 algorithm (13%), a temporal scan statistic (11%), and a cumulative sum control chart (<2%). Of the spatial/spatio-temporal methods we tested, a spatial scan statistic detected 3% of all injects, a Bayes regression found 2%, and a generalized linear mixed model and a space-time permutation scan statistic detected none at a specificity of 95%. Positive predictive value was low (<7%) for all methods. Overall, the detection methods we tested did not perform well in identifying the temporal and spatial clusters of cases in the inject dataset. The spatial scan statistic, our current method for spatial cluster detection, performed slightly better than the other tested methods across different inject magnitudes and types. Furthermore, we found the scan statistics, as applied in the SaTScan software package, to be the easiest to program and implement for daily data analysis. PMID:28886112

  12. Evaluating and implementing temporal, spatial, and spatio-temporal methods for outbreak detection in a local syndromic surveillance system.

    PubMed

    Mathes, Robert W; Lall, Ramona; Levin-Rector, Alison; Sell, Jessica; Paladini, Marc; Konty, Kevin J; Olson, Don; Weiss, Don

    2017-01-01

    The New York City Department of Health and Mental Hygiene has operated an emergency department syndromic surveillance system since 2001, using temporal and spatial scan statistics run on a daily basis for cluster detection. Since the system was originally implemented, a number of new methods have been proposed for use in cluster detection. We evaluated six temporal and four spatial/spatio-temporal detection methods using syndromic surveillance data spiked with simulated injections. The algorithms were compared on several metrics, including sensitivity, specificity, positive predictive value, coherence, and timeliness. We also evaluated each method's implementation, programming time, run time, and the ease of use. Among the temporal methods, at a set specificity of 95%, a Holt-Winters exponential smoother performed the best, detecting 19% of the simulated injects across all shapes and sizes, followed by an autoregressive moving average model (16%), a generalized linear model (15%), a modified version of the Early Aberration Reporting System's C2 algorithm (13%), a temporal scan statistic (11%), and a cumulative sum control chart (<2%). Of the spatial/spatio-temporal methods we tested, a spatial scan statistic detected 3% of all injects, a Bayes regression found 2%, and a generalized linear mixed model and a space-time permutation scan statistic detected none at a specificity of 95%. Positive predictive value was low (<7%) for all methods. Overall, the detection methods we tested did not perform well in identifying the temporal and spatial clusters of cases in the inject dataset. The spatial scan statistic, our current method for spatial cluster detection, performed slightly better than the other tested methods across different inject magnitudes and types. Furthermore, we found the scan statistics, as applied in the SaTScan software package, to be the easiest to program and implement for daily data analysis.

  13. Spatial and temporal changes in household structure locations using high-resolution satellite imagery for population assessment: an analysis in southern Zambia, 2006-2011.

    PubMed

    Shields, Timothy; Pinchoff, Jessie; Lubinda, Jailos; Hamapumbu, Harry; Searle, Kelly; Kobayashi, Tamaki; Thuma, Philip E; Moss, William J; Curriero, Frank C

    2016-05-31

    Satellite imagery is increasingly available at high spatial resolution and can be used for various purposes in public health research and programme implementation. Comparing a census generated from two satellite images of the same region in rural southern Zambia obtained four and a half years apart identified patterns of household locations and change over time. The length of time that a satellite image-based census is accurate determines its utility. Households were enumerated manually from satellite images obtained in 2006 and 2011 of the same area. Spatial statistics were used to describe clustering, cluster detection, and spatial variation in the location of households. A total of 3821 household locations were enumerated in 2006 and 4256 in 2011, a net change of 435 houses (11.4% increase). Comparison of the images indicated that 971 (25.4%) structures were added and 536 (14.0%) removed. Further analysis suggested similar household clustering in the two images and no substantial difference in concentration of households across the study area. Cluster detection analysis identified a small area where significantly more household structures were removed than expected; however, the amount of change was of limited practical significance. These findings suggest that random sampling of households for study participation would not induce geographic bias if based on a 4.5-year-old image in this region. Application of spatial statistical methods provides insights into the population distribution changes between two time periods and can be helpful in assessing the accuracy of satellite imagery.

  14. Cluster analysis of quantitative parametric maps from DCE-MRI: application in evaluating heterogeneity of tumor response to antiangiogenic treatment.

    PubMed

    Longo, Dario Livio; Dastrù, Walter; Consolino, Lorena; Espak, Miklos; Arigoni, Maddalena; Cavallo, Federica; Aime, Silvio

    2015-07-01

    The objective of this study was to compare a clustering approach to conventional analysis methods for assessing changes in pharmacokinetic parameters obtained from dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) during antiangiogenic treatment in a breast cancer model. BALB/c mice bearing established transplantable her2+ tumors were treated with a DNA-based antiangiogenic vaccine or with an empty plasmid (untreated group). DCE-MRI was carried out by administering a dose of 0.05 mmol/kg of Gadocoletic acid trisodium salt, a Gd-based blood pool contrast agent (CA) at 1T. Changes in pharmacokinetic estimates (K(trans) and vp) in a nine-day interval were compared between treated and untreated groups on a voxel-by-voxel analysis. The tumor response to therapy was assessed by a clustering approach and compared with conventional summary statistics, with sub-regions analysis and with histogram analysis. Both the K(trans) and vp estimates, following blood-pool CA injection, showed marked and spatial heterogeneous changes with antiangiogenic treatment. Averaged values for the whole tumor region, as well as from the rim/core sub-regions analysis were unable to assess the antiangiogenic response. Histogram analysis resulted in significant changes only in the vp estimates (p<0.05). The proposed clustering approach depicted marked changes in both the K(trans) and vp estimates, with significant spatial heterogeneity in vp maps in response to treatment (p<0.05), provided that DCE-MRI data are properly clustered in three or four sub-regions. This study demonstrated the value of cluster analysis applied to pharmacokinetic DCE-MRI parametric maps for assessing tumor response to antiangiogenic therapy. Copyright © 2015 Elsevier Inc. All rights reserved.

  15. 20 Years Spatial-Temporal Analysis of Dengue Fever and Hemorrhagic Fever in Mexico.

    PubMed

    Hernández-Gaytán, Sendy Isarel; Díaz-Vásquez, Francisco Javier; Duran-Arenas, Luis Gerardo; López Cervantes, Malaquías; Rothenberg, Stephen J

    2017-10-01

    Dengue Fever (DF) is a human vector-borne disease and a major public health problem worldwide. In Mexico, DF and Dengue Hemorrhagic Fever (DHF) cases have increased in recent years. The aim of this study was to identify variations in the spatial distribution of DF and DHF cases over time using space-time statistical analysis and geographic information systems. Official data of DF and DHF cases were obtained in 32 states from 1995-2015. Space-time scan statistics were used to determine the space-time clusters of DF and DHF cases nationwide, and a geographic information system was used to display the location of clusters. A total of 885,748 DF cases was registered of which 13.4% (n = 119,174) correspond to DHF in the 32 states from 1995-2015. The most likely cluster of DF (relative risk = 25.5) contained the states of Jalisco, Colima, and Nayarit, on the Pacific coast in 2009, and the most likely cluster of DHF (relative risk = 8.5) was in the states of Chiapas, Tabasco, Campeche, Oaxaca, Veracruz, Quintana Roo, Yucatán, Puebla, Morelos, and Guerrero principally on the Gulf coast over 2006-2015. The geographic distribution of DF and DHF cases has increased in recent years and cases are significantly clustered in two coastal areas (Pacific and Gulf of Mexico). This provides the basis for further investigation of risk factors as well as interventions in specific areas. Copyright © 2018 IMSS. Published by Elsevier Inc. All rights reserved.

  16. Finding approximate gene clusters with Gecko 3.

    PubMed

    Winter, Sascha; Jahn, Katharina; Wehner, Stefanie; Kuchenbecker, Leon; Marz, Manja; Stoye, Jens; Böcker, Sebastian

    2016-11-16

    Gene-order-based comparison of multiple genomes provides signals for functional analysis of genes and the evolutionary process of genome organization. Gene clusters are regions of co-localized genes on genomes of different species. The rapid increase in sequenced genomes necessitates bioinformatics tools for finding gene clusters in hundreds of genomes. Existing tools are often restricted to few (in many cases, only two) genomes, and often make restrictive assumptions such as short perfect conservation, conserved gene order or monophyletic gene clusters. We present Gecko 3, an open-source software for finding gene clusters in hundreds of bacterial genomes, that comes with an easy-to-use graphical user interface. The underlying gene cluster model is intuitive, can cope with low degrees of conservation as well as misannotations and is complemented by a sound statistical evaluation. To evaluate the biological benefit of Gecko 3 and to exemplify our method, we search for gene clusters in a dataset of 678 bacterial genomes using Synechocystis sp. PCC 6803 as a reference. We confirm detected gene clusters reviewing the literature and comparing them to a database of operons; we detect two novel clusters, which were confirmed by publicly available experimental RNA-Seq data. The computational analysis is carried out on a laptop computer in <40 min. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  17. Implementation of novel statistical procedures and other advanced approaches to improve analysis of CASA data.

    PubMed

    Ramón, M; Martínez-Pastor, F

    2018-04-23

    Computer-aided sperm analysis (CASA) produces a wealth of data that is frequently ignored. The use of multiparametric statistical methods can help explore these datasets, unveiling the subpopulation structure of sperm samples. In this review we analyse the significance of the internal heterogeneity of sperm samples and its relevance. We also provide a brief description of the statistical tools used for extracting sperm subpopulations from the datasets, namely unsupervised clustering (with non-hierarchical, hierarchical and two-step methods) and the most advanced supervised methods, based on machine learning. The former method has allowed exploration of subpopulation patterns in many species, whereas the latter offering further possibilities, especially considering functional studies and the practical use of subpopulation analysis. We also consider novel approaches, such as the use of geometric morphometrics or imaging flow cytometry. Finally, although the data provided by CASA systems provides valuable information on sperm samples by applying clustering analyses, there are several caveats. Protocols for capturing and analysing motility or morphometry should be standardised and adapted to each experiment, and the algorithms should be open in order to allow comparison of results between laboratories. Moreover, we must be aware of new technology that could change the paradigm for studying sperm motility and morphology.

  18. A geographic analysis of individual and environmental risk factors for hypospadias births

    PubMed Central

    Winston, Jennifer J; Meyer, Robert E; Emch, Michael E

    2014-01-01

    Background Hypospadias is a relatively common birth defect affecting the male urinary tract. We explored the etiology of hypospadias by examining its spatial distribution in North Carolina and the spatial clustering of residuals from individual and environmental risk factors. Methods We used data collected by the North Carolina Birth Defects Monitoring Program from 2003-2005 to estimate local Moran's I statistics to identify geographic clustering of overall and severe hypospadias, using 995 overall cases and 16,013 controls. We conducted logistic regression and local Moran's I statistics on standardized residuals to consider the contribution of individual variables (maternal age, maternal race/ethnicity, maternal education, smoking, parity, and diabetes) and environmental variables (block group land cover) to this clustering. Results Local Moran's I statistics indicated significant clustering of overall and severe hypospadias in eastern central North Carolina. Spatial clustering of hypospadias persisted when controlling for individual factors, but diminished somewhat when controlling for environmental factors. In adjusted models, maternal residence in a block group with more than 5% crop cover was associated with overall hypospadias (OR = 1.22; 95% CI = 1.04 – 1.43); that is living in a block group with greater than 5% crop cover was associated with a 22% increase in the odds of having a baby with hypospadias. Land cover was not associated with severe hypospadias. Conclusions This study illustrates the potential contribution of mapping in generating hypotheses about disease etiology. Results suggest that environmental factors including proximity to agriculture may play some role in the spatial distribution of hypospadias. PMID:25196538

  19. Use of Spatial Epidemiology and Hot Spot Analysis to Target Women Eligible for Prenatal Women, Infants, and Children Services

    PubMed Central

    Krawczyk, Christopher; Gradziel, Pat; Geraghty, Estella M.

    2014-01-01

    Objectives. We used a geographic information system and cluster analyses to determine locations in need of enhanced Special Supplemental Nutrition Program for Women, Infants, and Children (WIC) Program services. Methods. We linked documented births in the 2010 California Birth Statistical Master File with the 2010 data from the WIC Integrated Statewide Information System. Analyses focused on the density of pregnant women who were eligible for but not receiving WIC services in California’s 7049 census tracts. We used incremental spatial autocorrelation and hot spot analyses to identify clusters of WIC-eligible nonparticipants. Results. We detected clusters of census tracts with higher-than-expected densities, compared with the state mean density of WIC-eligible nonparticipants, in 21 of 58 (36.2%) California counties (P < .05). In subsequent county-level analyses, we located neighborhood-level clusters of higher-than-expected densities of eligible nonparticipants in Sacramento, San Francisco, Fresno, and Los Angeles Counties (P < .05). Conclusions. Hot spot analyses provided a rigorous and objective approach to determine the locations of statistically significant clusters of WIC-eligible nonparticipants. Results helped inform WIC program and funding decisions, including the opening of new WIC centers, and offered a novel approach for targeting public health services. PMID:24354821

  20. Exploring the Micro-Social Geography of Children's Interactions in Preschool: A Long-Term Observational Study and Analysis Using Geographic Information Technologies

    ERIC Educational Resources Information Center

    Torrens, Paul M.; Griffin, William A.

    2013-01-01

    The authors describe an observational and analytic methodology for recording and interpreting dynamic microprocesses that occur during social interaction, making use of space--time data collection techniques, spatial-statistical analysis, and visualization. The scheme has three investigative foci: Structure, Activity Composition, and Clustering.…

  1. Pattern Activity Clustering and Evaluation (PACE)

    NASA Astrophysics Data System (ADS)

    Blasch, Erik; Banas, Christopher; Paul, Michael; Bussjager, Becky; Seetharaman, Guna

    2012-06-01

    With the vast amount of network information available on activities of people (i.e. motions, transportation routes, and site visits) there is a need to explore the salient properties of data that detect and discriminate the behavior of individuals. Recent machine learning approaches include methods of data mining, statistical analysis, clustering, and estimation that support activity-based intelligence. We seek to explore contemporary methods in activity analysis using machine learning techniques that discover and characterize behaviors that enable grouping, anomaly detection, and adversarial intent prediction. To evaluate these methods, we describe the mathematics and potential information theory metrics to characterize behavior. A scenario is presented to demonstrate the concept and metrics that could be useful for layered sensing behavior pattern learning and analysis. We leverage work on group tracking, learning and clustering approaches; as well as utilize information theoretical metrics for classification, behavioral and event pattern recognition, and activity and entity analysis. The performance evaluation of activity analysis supports high-level information fusion of user alerts, data queries and sensor management for data extraction, relations discovery, and situation analysis of existing data.

  2. Use of Fouler Transforms to define landscape scales of analysis for disturbances: A case study of thinned and unthinned forest stands

    Treesearch

    J. E. Lundquist; R. A. Sommerfeld

    2002-01-01

    Various disturbances such as disease and management practices cause canopy gaps that change patterns of forest stand structure. This study examined the usefulness of digital image analysis using aerial photos, Fourier Tranforms, and cluster analysis to investigate how different spatial statistics are affected by spatial scale. The specific aims were to: 1) evaluate how...

  3. Identifying technical aliases in SELDI mass spectra of complex mixtures of proteins

    PubMed Central

    2013-01-01

    Background Biomarker discovery datasets created using mass spectrum protein profiling of complex mixtures of proteins contain many peaks that represent the same protein with different charge states. Correlated variables such as these can confound the statistical analyses of proteomic data. Previously we developed an algorithm that clustered mass spectrum peaks that were biologically or technically correlated. Here we demonstrate an algorithm that clusters correlated technical aliases only. Results In this paper, we propose a preprocessing algorithm that can be used for grouping technical aliases in mass spectrometry protein profiling data. The stringency of the variance allowed for clustering is customizable, thereby affecting the number of peaks that are clustered. Subsequent analysis of the clusters, instead of individual peaks, helps reduce difficulties associated with technically-correlated data, and can aid more efficient biomarker identification. Conclusions This software can be used to pre-process and thereby decrease the complexity of protein profiling proteomics data, thus simplifying the subsequent analysis of biomarkers by decreasing the number of tests. The software is also a practical tool for identifying which features to investigate further by purification, identification and confirmation. PMID:24010718

  4. Prediction of strontium bromide laser efficiency using cluster and decision tree analysis

    NASA Astrophysics Data System (ADS)

    Iliev, Iliycho; Gocheva-Ilieva, Snezhana; Kulin, Chavdar

    2018-01-01

    Subject of investigation is a new high-powered strontium bromide (SrBr2) vapor laser emitting in multiline region of wavelengths. The laser is an alternative to the atom strontium lasers and electron free lasers, especially at the line 6.45 μm which line is used in surgery for medical processing of biological tissues and bones with minimal damage. In this paper the experimental data from measurements of operational and output characteristics of the laser are statistically processed by means of cluster analysis and tree-based regression techniques. The aim is to extract the more important relationships and dependences from the available data which influence the increase of the overall laser efficiency. There are constructed and analyzed a set of cluster models. It is shown by using different cluster methods that the seven investigated operational characteristics (laser tube diameter, length, supplied electrical power, and others) and laser efficiency are combined in 2 clusters. By the built regression tree models using Classification and Regression Trees (CART) technique there are obtained dependences to predict the values of efficiency, and especially the maximum efficiency with over 95% accuracy.

  5. Using data mining to segment healthcare markets from patients' preference perspectives.

    PubMed

    Liu, Sandra S; Chen, Jie

    2009-01-01

    This paper aims to provide an example of how to use data mining techniques to identify patient segments regarding preferences for healthcare attributes and their demographic characteristics. Data were derived from a number of individuals who received in-patient care at a health network in 2006. Data mining and conventional hierarchical clustering with average linkage and Pearson correlation procedures are employed and compared to show how each procedure best determines segmentation variables. Data mining tools identified three differentiable segments by means of cluster analysis. These three clusters have significantly different demographic profiles. The study reveals, when compared with traditional statistical methods, that data mining provides an efficient and effective tool for market segmentation. When there are numerous cluster variables involved, researchers and practitioners need to incorporate factor analysis for reducing variables to clearly and meaningfully understand clusters. Interests and applications in data mining are increasing in many businesses. However, this technology is seldom applied to healthcare customer experience management. The paper shows that efficient and effective application of data mining methods can aid the understanding of patient healthcare preferences.

  6. Analysis of the effects of the global financial crisis on the Turkish economy, using hierarchical methods

    NASA Astrophysics Data System (ADS)

    Kantar, Ersin; Keskin, Mustafa; Deviren, Bayram

    2012-04-01

    We have analyzed the topology of 50 important Turkish companies for the period 2006-2010 using the concept of hierarchical methods (the minimal spanning tree (MST) and hierarchical tree (HT)). We investigated the statistical reliability of links between companies in the MST by using the bootstrap technique. We also used the average linkage cluster analysis (ALCA) technique to observe the cluster structures much better. The MST and HT are known as useful tools to perceive and detect global structure, taxonomy, and hierarchy in financial data. We obtained four clusters of companies according to their proximity. We also observed that the Banks and Holdings cluster always forms in the centre of the MSTs for the periods 2006-2007, 2008, and 2009-2010. The clusters match nicely with their common production activities or their strong interrelationship. The effects of the Automobile sector increased after the global financial crisis due to the temporary incentives provided by the Turkish government. We find that Turkish companies were not very affected by the global financial crisis.

  7. Exploring the individual patterns of spiritual well-being in people newly diagnosed with advanced cancer: a cluster analysis.

    PubMed

    Bai, Mei; Dixon, Jane; Williams, Anna-Leila; Jeon, Sangchoon; Lazenby, Mark; McCorkle, Ruth

    2016-11-01

    Research shows that spiritual well-being correlates positively with quality of life (QOL) for people with cancer, whereas contradictory findings are frequently reported with respect to the differentiated associations between dimensions of spiritual well-being, namely peace, meaning and faith, and QOL. This study aimed to examine individual patterns of spiritual well-being among patients newly diagnosed with advanced cancer. Cluster analysis was based on the twelve items of the 12-item Functional Assessment of Chronic Illness Therapy-Spiritual Well-Being Scale at Time 1. A combination of hierarchical and k-means (non-hierarchical) clustering methods was employed to jointly determine the number of clusters. Self-rated health, depressive symptoms, peace, meaning and faith, and overall QOL were compared at Time 1 and Time 2. Hierarchical and k-means clustering methods both suggested four clusters. Comparison of the four clusters supported statistically significant and clinically meaningful differences in QOL outcomes among clusters while revealing contrasting relations of faith with QOL. Cluster 1, Cluster 3, and Cluster 4 represented high, medium, and low levels of overall QOL, respectively, with correspondingly high, medium, and low levels of peace, meaning, and faith. Cluster 2 was distinguished from other clusters by its medium levels of overall QOL, peace, and meaning and low level of faith. This study provides empirical support for individual difference in response to a newly diagnosed cancer and brings into focus conceptual and methodological challenges associated with the measure of spiritual well-being, which may partly contribute to the attenuated relation between faith and QOL.

  8. Identifying and Assessing Interesting Subgroups in a Heterogeneous Population.

    PubMed

    Lee, Woojoo; Alexeyenko, Andrey; Pernemalm, Maria; Guegan, Justine; Dessen, Philippe; Lazar, Vladimir; Lehtiö, Janne; Pawitan, Yudi

    2015-01-01

    Biological heterogeneity is common in many diseases and it is often the reason for therapeutic failures. Thus, there is great interest in classifying a disease into subtypes that have clinical significance in terms of prognosis or therapy response. One of the most popular methods to uncover unrecognized subtypes is cluster analysis. However, classical clustering methods such as k-means clustering or hierarchical clustering are not guaranteed to produce clinically interesting subtypes. This could be because the main statistical variability--the basis of cluster generation--is dominated by genes not associated with the clinical phenotype of interest. Furthermore, a strong prognostic factor might be relevant for a certain subgroup but not for the whole population; thus an analysis of the whole sample may not reveal this prognostic factor. To address these problems we investigate methods to identify and assess clinically interesting subgroups in a heterogeneous population. The identification step uses a clustering algorithm and to assess significance we use a false discovery rate- (FDR-) based measure. Under the heterogeneity condition the standard FDR estimate is shown to overestimate the true FDR value, but this is remedied by an improved FDR estimation procedure. As illustrations, two real data examples from gene expression studies of lung cancer are provided.

  9. Fast clustering using adaptive density peak detection.

    PubMed

    Wang, Xiao-Feng; Xu, Yifan

    2017-12-01

    Common limitations of clustering methods include the slow algorithm convergence, the instability of the pre-specification on a number of intrinsic parameters, and the lack of robustness to outliers. A recent clustering approach proposed a fast search algorithm of cluster centers based on their local densities. However, the selection of the key intrinsic parameters in the algorithm was not systematically investigated. It is relatively difficult to estimate the "optimal" parameters since the original definition of the local density in the algorithm is based on a truncated counting measure. In this paper, we propose a clustering procedure with adaptive density peak detection, where the local density is estimated through the nonparametric multivariate kernel estimation. The model parameter is then able to be calculated from the equations with statistical theoretical justification. We also develop an automatic cluster centroid selection method through maximizing an average silhouette index. The advantage and flexibility of the proposed method are demonstrated through simulation studies and the analysis of a few benchmark gene expression data sets. The method only needs to perform in one single step without any iteration and thus is fast and has a great potential to apply on big data analysis. A user-friendly R package ADPclust is developed for public use.

  10. Finding Groups Using Model-based Cluster Analysis: Heterogeneous Emotional Self-regulatory Processes and Heavy Alcohol Use Risk

    PubMed Central

    Mun, Eun-Young; von Eye, Alexander; Bates, Marsha E.; Vaschillo, Evgeny G.

    2010-01-01

    Model-based cluster analysis is a new clustering procedure to investigate population heterogeneity utilizing finite mixture multivariate normal densities. It is an inferentially based, statistically principled procedure that allows comparison of non-nested models using the Bayesian Information Criterion (BIC) to compare multiple models and identify the optimum number of clusters. The current study clustered 36 young men and women based on their baseline heart rate (HR) and HR variability (HRV), chronic alcohol use, and reasons for drinking. Two cluster groups were identified and labeled High Alcohol Risk and Normative groups. Compared to the Normative group, individuals in the High Alcohol Risk group had higher levels of alcohol use and more strongly endorsed disinhibition and suppression reasons for use. The High Alcohol Risk group showed significant HRV changes in response to positive and negative emotional and appetitive picture cues, compared to neutral cues. In contrast, the Normative group showed a significant HRV change only to negative cues. Findings suggest that the individuals with autonomic self-regulatory difficulties may be more susceptible to heavy alcohol use and use alcohol for emotional regulation. PMID:18331138

  11. Methodological approaches in analysing observational data: A practical example on how to address clustering and selection bias.

    PubMed

    Trutschel, Diana; Palm, Rebecca; Holle, Bernhard; Simon, Michael

    2017-11-01

    Because not every scientific question on effectiveness can be answered with randomised controlled trials, research methods that minimise bias in observational studies are required. Two major concerns influence the internal validity of effect estimates: selection bias and clustering. Hence, to reduce the bias of the effect estimates, more sophisticated statistical methods are needed. To introduce statistical approaches such as propensity score matching and mixed models into representative real-world analysis and to conduct the implementation in statistical software R to reproduce the results. Additionally, the implementation in R is presented to allow the results to be reproduced. We perform a two-level analytic strategy to address the problems of bias and clustering: (i) generalised models with different abilities to adjust for dependencies are used to analyse binary data and (ii) the genetic matching and covariate adjustment methods are used to adjust for selection bias. Hence, we analyse the data from two population samples, the sample produced by the matching method and the full sample. The different analysis methods in this article present different results but still point in the same direction. In our example, the estimate of the probability of receiving a case conference is higher in the treatment group than in the control group. Both strategies, genetic matching and covariate adjustment, have their limitations but complement each other to provide the whole picture. The statistical approaches were feasible for reducing bias but were nevertheless limited by the sample used. For each study and obtained sample, the pros and cons of the different methods have to be weighted. Copyright © 2017 The Author(s). Published by Elsevier Ltd.. All rights reserved.

  12. Assessment of the climatic potential for tourism in Iran through biometeorology clustering.

    PubMed

    Roshan, Gholamreza; Yousefi, Robabe; Błażejczyk, Krzysztof

    2018-04-01

    This study presents a spatiotemporal analysis of bioclimatic comfort conditions for Iran using mean daily meteorological data from 1995 to 2014, analyzed through Physiological Equivalent Temperature (PET) index and Universal Thermal Climate Index (UTCI) indices, and bioclimatic clustering. The results of this study demonstrate that due to the climate variability across Iran during the year, there is at any point in time a location with climatic condition suitable for tourism. Mean values demonstrate maxima in bioclimatic comfort indices for the country in late winter and spring and minima for summer. Seven statistically significant clusters in bioclimatic indices were identified. Comparing these with clustering performed on PET and UTCI, the maximum overlaps between the two indices. In the following, the outputs of this research showed that most appropriate bioclimatic clustering for Iran includes seven clusters. These clustering locations according to climatic suitability for tourism provide a valuable contribution to tourism management in the country, particularly through marketing destinations to maximize tourist flow.

  13. Robust continuous clustering

    PubMed Central

    Shah, Sohil Atul

    2017-01-01

    Clustering is a fundamental procedure in the analysis of scientific data. It is used ubiquitously across the sciences. Despite decades of research, existing clustering algorithms have limited effectiveness in high dimensions and often require tuning parameters for different domains and datasets. We present a clustering algorithm that achieves high accuracy across multiple domains and scales efficiently to high dimensions and large datasets. The presented algorithm optimizes a smooth continuous objective, which is based on robust statistics and allows heavily mixed clusters to be untangled. The continuous nature of the objective also allows clustering to be integrated as a module in end-to-end feature learning pipelines. We demonstrate this by extending the algorithm to perform joint clustering and dimensionality reduction by efficiently optimizing a continuous global objective. The presented approach is evaluated on large datasets of faces, hand-written digits, objects, newswire articles, sensor readings from the Space Shuttle, and protein expression levels. Our method achieves high accuracy across all datasets, outperforming the best prior algorithm by a factor of 3 in average rank. PMID:28851838

  14. A ground truth based comparative study on clustering of gene expression data.

    PubMed

    Zhu, Yitan; Wang, Zuyi; Miller, David J; Clarke, Robert; Xuan, Jianhua; Hoffman, Eric P; Wang, Yue

    2008-05-01

    Given the variety of available clustering methods for gene expression data analysis, it is important to develop an appropriate and rigorous validation scheme to assess the performance and limitations of the most widely used clustering algorithms. In this paper, we present a ground truth based comparative study on the functionality, accuracy, and stability of five data clustering methods, namely hierarchical clustering, K-means clustering, self-organizing maps, standard finite normal mixture fitting, and a caBIG toolkit (VIsual Statistical Data Analyzer--VISDA), tested on sample clustering of seven published microarray gene expression datasets and one synthetic dataset. We examined the performance of these algorithms in both data-sufficient and data-insufficient cases using quantitative performance measures, including cluster number detection accuracy and mean and standard deviation of partition accuracy. The experimental results showed that VISDA, an interactive coarse-to-fine maximum likelihood fitting algorithm, is a solid performer on most of the datasets, while K-means clustering and self-organizing maps optimized by the mean squared compactness criterion generally produce more stable solutions than the other methods.

  15. Modeling the Movement of Homicide by Type to Inform Public Health Prevention Efforts.

    PubMed

    Zeoli, April M; Grady, Sue; Pizarro, Jesenia M; Melde, Chris

    2015-10-01

    We modeled the spatiotemporal movement of hotspot clusters of homicide by motive in Newark, New Jersey, to investigate whether different homicide types have different patterns of clustering and movement. We obtained homicide data from the Newark Police Department Homicide Unit's investigative files from 1997 through 2007 (n = 560). We geocoded the address at which each homicide victim was found and recorded the date of and the motive for the homicide. We used cluster detection software to model the spatiotemporal movement of statistically significant homicide clusters by motive, using census tract and month of occurrence as the spatial and temporal units of analysis. Gang-motivated homicides showed evidence of clustering and diffusion through Newark. Additionally, gang-motivated homicide clusters overlapped to a degree with revenge and drug-motivated homicide clusters. Escalating dispute and nonintimate familial homicides clustered; however, there was no evidence of diffusion. Intimate partner and robbery homicides did not cluster. By tracking how homicide types diffuse through communities and determining which places have ongoing or emerging homicide problems by type, we can better inform the deployment of prevention and intervention efforts.

  16. Multivariate analysis: A statistical approach for computations

    NASA Astrophysics Data System (ADS)

    Michu, Sachin; Kaushik, Vandana

    2014-10-01

    Multivariate analysis is a type of multivariate statistical approach commonly used in, automotive diagnosis, education evaluating clusters in finance etc and more recently in the health-related professions. The objective of the paper is to provide a detailed exploratory discussion about factor analysis (FA) in image retrieval method and correlation analysis (CA) of network traffic. Image retrieval methods aim to retrieve relevant images from a collected database, based on their content. The problem is made more difficult due to the high dimension of the variable space in which the images are represented. Multivariate correlation analysis proposes an anomaly detection and analysis method based on the correlation coefficient matrix. Anomaly behaviors in the network include the various attacks on the network like DDOs attacks and network scanning.

  17. Spatial clustering of fatal, and non-fatal, suicide in new South Wales, Australia: implications for evidence-based prevention.

    PubMed

    Torok, Michelle; Konings, Paul; Batterham, Philip J; Christensen, Helen

    2017-10-06

    Rates of suicide appear to be increasing, indicating a critical need for more effective prevention initiatives. To increase the efficacy of future prevention initiatives, we examined the spatial distribution of suicide deaths and suicide attempts in New South Wales (NSW), Australia, to identify where high incidence 'suicide clusters' were occurring. Such clusters represent candidate regions where intervention is critically needed, and likely to have the greatest impact, thus providing an evidence-base for the targeted prioritisation of resources. Analysis is based on official suicide mortality statistics for NSW, provided by the Australian Bureau of Statistics, and hospital separations for non-fatal intentional self-harm, provided through the NSW Health Admitted Patient Data Collection at a Statistical Area 2 (SA2) geography. Geographical Information System (GIS) techniques were applied to detect suicide clusters occurring between 2005 and 2013 (aggregated), for persons aged over 5 years. The final dataset contained 5466 mortality and 86,017 non-fatal intentional self-harm cases. In total, 25 Local Government Areas were identified as primary or secondary likely candidate regions for intervention. Together, these regions contained approximately 200 SA2 level suicide clusters, which represented 46% (n = 39,869) of hospital separations and 43% (n = 2330) of suicide deaths between 2005 and 2013. These clusters primarily converged on the Eastern coastal fringe of NSW. Crude rates of suicide deaths and intentional self-harm differed at the Local Government Areas (LGA) level in NSW. There was a tendency for primary suicide clusters to occur within metropolitan and coastal regions, rather than rural areas. The findings demonstrate the importance of taking geographical variation of suicidal behaviour into account, prior to development and implementation of prevention initiatives, so that such initiatives can target key problem areas where they are likely to have maximal impact.

  18. Vaccines for preventing anthrax.

    PubMed

    Donegan, Sarah; Bellamy, Richard; Gamble, Carrol L

    2009-04-15

    Anthrax is a bacterial zoonosis that occasionally causes human disease and is potentially fatal. Anthrax vaccines include a live-attenuated vaccine, an alum-precipitated cell-free filtrate vaccine, and a recombinant protein vaccine. To evaluate the effectiveness, immunogenicity, and safety of vaccines for preventing anthrax. We searched the following databases (November 2008): Cochrane Infectious Diseases Group Specialized Register; CENTRAL (The Cochrane Library 2008, Issue 4); MEDLINE; EMBASE; LILACS; and mRCT. We also searched reference lists. We included randomized controlled trials (RCTs) of individuals and cluster-RCTs comparing anthrax vaccine with placebo, other (non-anthrax) vaccines, or no intervention; or comparing administration routes or treatment regimens of anthrax vaccine. Two authors independently considered trial eligibility, assessed risk of bias, and extracted data. We presented cases of anthrax and seroconversion rates using risk ratios (RR) and 95% confidence intervals (CI). We summarized immunoglobulin G (IgG) concentrations using geometric means. We carried out a sensitivity analysis to investigate the effect of clustering on the results from one cluster-RCT. No meta-analysis was undertaken. One cluster-RCT (with 157,259 participants) and four RCTs of individuals (1917 participants) met the inclusion criteria. The cluster-RCT from the former USSR showed that, compared with no vaccine, a live-attenuated vaccine (called STI) protected against clinical anthrax whether given by a needleless device (RR 0.16; 102,737 participants, 154 clusters) or the scarification method (RR 0.25; 104,496 participants, 151 clusters). Confidence intervals were statistically significant in unadjusted calculations, but when a small amount of association within clusters was assumed, the differences were not statistically significant. The four RCTs (of individuals) of inactivated vaccines (anthrax vaccine absorbed and recombinant protective antigen) showed a dose response relationship for the anti-protective antigen IgG antibody titre. Intramuscular administration was associated with fewer injection site reactions than subcutaneous injection, and injection site reaction rates were lower when the dosage interval was longer. One cluster-RCT provides limited evidence that a live-attenuated vaccine is effective in preventing cutaneous anthrax. Vaccines based on anthrax antigens are immunogenic in most vaccinees with few adverse events or reactions. Ongoing randomized controlled trials are investigating the immunogenicity and safety of anthrax vaccines.

  19. Socioscape: Real-Time Analysis of Dynamic Heterogeneous Networks In Complex Socio-Cultural Systems

    DTIC Science & Technology

    2015-10-22

    Cluster Mixed-Membership Blockmodel for Time-Evolving Networks, Proceedings of the 14th International Conference on Artifical Intelligence and...Learning With Simultaneous Orthogonal Matching Pursuit, Proceedings of the 13th International Conference on Artifical Intelligence and Statistics

  20. PANDA-view: An easy-to-use tool for statistical analysis and visualization of quantitative proteomics data.

    PubMed

    Chang, Cheng; Xu, Kaikun; Guo, Chaoping; Wang, Jinxia; Yan, Qi; Zhang, Jian; He, Fuchu; Zhu, Yunping

    2018-05-22

    Compared with the numerous software tools developed for identification and quantification of -omics data, there remains a lack of suitable tools for both downstream analysis and data visualization. To help researchers better understand the biological meanings in their -omics data, we present an easy-to-use tool, named PANDA-view, for both statistical analysis and visualization of quantitative proteomics data and other -omics data. PANDA-view contains various kinds of analysis methods such as normalization, missing value imputation, statistical tests, clustering and principal component analysis, as well as the most commonly-used data visualization methods including an interactive volcano plot. Additionally, it provides user-friendly interfaces for protein-peptide-spectrum representation of the quantitative proteomics data. PANDA-view is freely available at https://sourceforge.net/projects/panda-view/. 1987ccpacer@163.com and zhuyunping@gmail.com. Supplementary data are available at Bioinformatics online.

  1. Glaucomatous patterns in Frequency Doubling Technology (FDT) perimetry data identified by unsupervised machine learning classifiers.

    PubMed

    Bowd, Christopher; Weinreb, Robert N; Balasubramanian, Madhusudhanan; Lee, Intae; Jang, Giljin; Yousefi, Siamak; Zangwill, Linda M; Medeiros, Felipe A; Girkin, Christopher A; Liebmann, Jeffrey M; Goldbaum, Michael H

    2014-01-01

    The variational Bayesian independent component analysis-mixture model (VIM), an unsupervised machine-learning classifier, was used to automatically separate Matrix Frequency Doubling Technology (FDT) perimetry data into clusters of healthy and glaucomatous eyes, and to identify axes representing statistically independent patterns of defect in the glaucoma clusters. FDT measurements were obtained from 1,190 eyes with normal FDT results and 786 eyes with abnormal FDT results from the UCSD-based Diagnostic Innovations in Glaucoma Study (DIGS) and African Descent and Glaucoma Evaluation Study (ADAGES). For all eyes, VIM input was 52 threshold test points from the 24-2 test pattern, plus age. FDT mean deviation was -1.00 dB (S.D. = 2.80 dB) and -5.57 dB (S.D. = 5.09 dB) in FDT-normal eyes and FDT-abnormal eyes, respectively (p<0.001). VIM identified meaningful clusters of FDT data and positioned a set of statistically independent axes through the mean of each cluster. The optimal VIM model separated the FDT fields into 3 clusters. Cluster N contained primarily normal fields (1109/1190, specificity 93.1%) and clusters G1 and G2 combined, contained primarily abnormal fields (651/786, sensitivity 82.8%). For clusters G1 and G2 the optimal number of axes were 2 and 5, respectively. Patterns automatically generated along axes within the glaucoma clusters were similar to those known to be indicative of glaucoma. Fields located farther from the normal mean on each glaucoma axis showed increasing field defect severity. VIM successfully separated FDT fields from healthy and glaucoma eyes without a priori information about class membership, and identified familiar glaucomatous patterns of loss.

  2. Intermediate and advanced topics in multilevel logistic regression analysis.

    PubMed

    Austin, Peter C; Merlo, Juan

    2017-09-10

    Multilevel data occur frequently in health services, population and public health, and epidemiologic research. In such research, binary outcomes are common. Multilevel logistic regression models allow one to account for the clustering of subjects within clusters of higher-level units when estimating the effect of subject and cluster characteristics on subject outcomes. A search of the PubMed database demonstrated that the use of multilevel or hierarchical regression models is increasing rapidly. However, our impression is that many analysts simply use multilevel regression models to account for the nuisance of within-cluster homogeneity that is induced by clustering. In this article, we describe a suite of analyses that can complement the fitting of multilevel logistic regression models. These ancillary analyses permit analysts to estimate the marginal or population-average effect of covariates measured at the subject and cluster level, in contrast to the within-cluster or cluster-specific effects arising from the original multilevel logistic regression model. We describe the interval odds ratio and the proportion of opposed odds ratios, which are summary measures of effect for cluster-level covariates. We describe the variance partition coefficient and the median odds ratio which are measures of components of variance and heterogeneity in outcomes. These measures allow one to quantify the magnitude of the general contextual effect. We describe an R 2 measure that allows analysts to quantify the proportion of variation explained by different multilevel logistic regression models. We illustrate the application and interpretation of these measures by analyzing mortality in patients hospitalized with a diagnosis of acute myocardial infarction. © 2017 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd. © 2017 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.

  3. Cluster analysis of fasciolosis in dairy cow herds in Munster province of Ireland and detection of major climatic and environmental predictors of the exposure risk.

    PubMed

    Selemetas, Nikolaos; Phelan, Paul; O'Kiely, Padraig; de Waal, Theo

    2015-03-19

    Fasciolosis caused by Fasciola hepatica is a widespread parasitic disease in cattle farms. The aim of this study was to detect clusters of fasciolosis in dairy cow herds in Munster Province, Ireland and to identify significant climatic and environmental predictors of the exposure risk. In total, 1,292 dairy herds across Munster was sampled in September 2012 providing a single bulk tank milk (BTM) sample. The analysis of samples by an in-house antibody-detection enzyme-linked immunosorbent assay (ELISA), showed that 65% of the dairy herds (n = 842) had been exposed to F. hepatica. Using the Getis-Ord Gi* statistic, 16 high-risk and 24 low-risk (P <0.01) clusters of fasciolosis were identified. The spatial distribution of high-risk clusters was more dispersed and mainly located in the northern and western regions of Munster compared to the low-risk clusters that were mostly concentrated in the southern and eastern regions. The most significant classes of variables that could reflect the difference between high-risk and low-risk clusters were the total number of wet-days and rain-days, rainfall, the normalized difference vegetation index (NDVI), temperature and soil type. There was a bigger proportion of well-drained soils among the low-risk clusters, whereas poorly drained soils were more common among the high-risk clusters. These results stress the role of precipitation, grazing, temperature and drainage on the life cycle of F. hepatica in the temperate Irish climate. The findings of this study highlight the importance of cluster analysis for identifying significant differences in climatic and environmental variables between high-risk and low-risk clusters of fasciolosis in Irish dairy herds.

  4. Spectral reflectance of surface soils - A statistical analysis

    NASA Technical Reports Server (NTRS)

    Crouse, K. R.; Henninger, D. L.; Thompson, D. R.

    1983-01-01

    The relationship of the physical and chemical properties of soils to their spectral reflectance as measured at six wavebands of Thematic Mapper (TM) aboard NASA's Landsat-4 satellite was examined. The results of performing regressions of over 20 soil properties on the six TM bands indicated that organic matter, water, clay, cation exchange capacity, and calcium were the properties most readily predicted from TM data. The middle infrared bands, bands 5 and 7, were the best bands for predicting soil properties, and the near infrared band, band 4, was nearly as good. Clustering 234 soil samples on the TM bands and characterizing the clusters on the basis of soil properties revealed several clear relationships between properties and reflectance. Discriminant analysis found organic matter, fine sand, base saturation, sand, extractable acidity, and water to be significant in discriminating among clusters.

  5. A stereoscopic system for viewing the temporal evolution of brain activity clusters in response to linguistic stimuli

    NASA Astrophysics Data System (ADS)

    Forbes, Angus; Villegas, Javier; Almryde, Kyle R.; Plante, Elena

    2014-03-01

    In this paper, we present a novel application, 3D+Time Brain View, for the stereoscopic visualization of functional Magnetic Resonance Imaging (fMRI) data gathered from participants exposed to unfamiliar spoken languages. An analysis technique based on Independent Component Analysis (ICA) is used to identify statistically significant clusters of brain activity and their changes over time during different testing sessions. That is, our system illustrates the temporal evolution of participants' brain activity as they are introduced to a foreign language through displaying these clusters as they change over time. The raw fMRI data is presented as a stereoscopic pair in an immersive environment utilizing passive stereo rendering. The clusters are presented using a ray casting technique for volume rendering. Our system incorporates the temporal information and the results of the ICA into the stereoscopic 3D rendering, making it easier for domain experts to explore and analyze the data.

  6. Fast gene ontology based clustering for microarray experiments.

    PubMed

    Ovaska, Kristian; Laakso, Marko; Hautaniemi, Sampsa

    2008-11-21

    Analysis of a microarray experiment often results in a list of hundreds of disease-associated genes. In order to suggest common biological processes and functions for these genes, Gene Ontology annotations with statistical testing are widely used. However, these analyses can produce a very large number of significantly altered biological processes. Thus, it is often challenging to interpret GO results and identify novel testable biological hypotheses. We present fast software for advanced gene annotation using semantic similarity for Gene Ontology terms combined with clustering and heat map visualisation. The methodology allows rapid identification of genes sharing the same Gene Ontology cluster. Our R based semantic similarity open-source package has a speed advantage of over 2000-fold compared to existing implementations. From the resulting hierarchical clustering dendrogram genes sharing a GO term can be identified, and their differences in the gene expression patterns can be seen from the heat map. These methods facilitate advanced annotation of genes resulting from data analysis.

  7. Clustering gene expression data based on predicted differential effects of GV interaction.

    PubMed

    Pan, Hai-Yan; Zhu, Jun; Han, Dan-Fu

    2005-02-01

    Microarray has become a popular biotechnology in biological and medical research. However, systematic and stochastic variabilities in microarray data are expected and unavoidable, resulting in the problem that the raw measurements have inherent "noise" within microarray experiments. Currently, logarithmic ratios are usually analyzed by various clustering methods directly, which may introduce bias interpretation in identifying groups of genes or samples. In this paper, a statistical method based on mixed model approaches was proposed for microarray data cluster analysis. The underlying rationale of this method is to partition the observed total gene expression level into various variations caused by different factors using an ANOVA model, and to predict the differential effects of GV (gene by variety) interaction using the adjusted unbiased prediction (AUP) method. The predicted GV interaction effects can then be used as the inputs of cluster analysis. We illustrated the application of our method with a gene expression dataset and elucidated the utility of our approach using an external validation.

  8. Penalized likelihood and multi-objective spatial scans for the detection and inference of irregular clusters

    PubMed Central

    2010-01-01

    Background Irregularly shaped spatial clusters are difficult to delineate. A cluster found by an algorithm often spreads through large portions of the map, impacting its geographical meaning. Penalized likelihood methods for Kulldorff's spatial scan statistics have been used to control the excessive freedom of the shape of clusters. Penalty functions based on cluster geometry and non-connectivity have been proposed recently. Another approach involves the use of a multi-objective algorithm to maximize two objectives: the spatial scan statistics and the geometric penalty function. Results & Discussion We present a novel scan statistic algorithm employing a function based on the graph topology to penalize the presence of under-populated disconnection nodes in candidate clusters, the disconnection nodes cohesion function. A disconnection node is defined as a region within a cluster, such that its removal disconnects the cluster. By applying this function, the most geographically meaningful clusters are sifted through the immense set of possible irregularly shaped candidate cluster solutions. To evaluate the statistical significance of solutions for multi-objective scans, a statistical approach based on the concept of attainment function is used. In this paper we compared different penalized likelihoods employing the geometric and non-connectivity regularity functions and the novel disconnection nodes cohesion function. We also build multi-objective scans using those three functions and compare them with the previous penalized likelihood scans. An application is presented using comprehensive state-wide data for Chagas' disease in puerperal women in Minas Gerais state, Brazil. Conclusions We show that, compared to the other single-objective algorithms, multi-objective scans present better performance, regarding power, sensitivity and positive predicted value. The multi-objective non-connectivity scan is faster and better suited for the detection of moderately irregularly shaped clusters. The multi-objective cohesion scan is most effective for the detection of highly irregularly shaped clusters. PMID:21034451

  9. Epidemiological analysis of Salmonella clusters identified by whole genome sequencing, England and Wales 2014.

    PubMed

    Waldram, Alison; Dolan, Gayle; Ashton, Philip M; Jenkins, Claire; Dallman, Timothy J

    2018-05-01

    The unprecedented level of bacterial strain discrimination provided by whole genome sequencing (WGS) presents new challenges with respect to the utility and interpretation of the data. Whole genome sequences from 1445 isolates of Salmonella belonging to the most commonly identified serotypes in England and Wales isolated between April and August 2014 were analysed. Single linkage single nucleotide polymorphism thresholds at the 10, 5 and 0 level were explored for evidence of epidemiological links between clustered cases. Analysis of the WGS data organised 566 of the 1445 isolates into 32 clusters of five or more. A statistically significant epidemiological link was identified for 17 clusters. The clusters were associated with foreign travel (n = 8), consumption of Chinese takeaways (n = 4), chicken eaten at home (n = 2), and one each of the following; eating out, contact with another case in the home and contact with reptiles. In the same time frame, one cluster was detected using traditional outbreak detection methods. WGS can be used for the highly specific and highly sensitive detection of biologically related isolates when epidemiological links are obscured. Improvements in the collection of detailed, standardised exposure information would enhance cluster investigations. Copyright © 2017 Elsevier Ltd. All rights reserved.

  10. Noise-robust unsupervised spike sorting based on discriminative subspace learning with outlier handling.

    PubMed

    Keshtkaran, Mohammad Reza; Yang, Zhi

    2017-06-01

    Spike sorting is a fundamental preprocessing step for many neuroscience studies which rely on the analysis of spike trains. Most of the feature extraction and dimensionality reduction techniques that have been used for spike sorting give a projection subspace which is not necessarily the most discriminative one. Therefore, the clusters which appear inherently separable in some discriminative subspace may overlap if projected using conventional feature extraction approaches leading to a poor sorting accuracy especially when the noise level is high. In this paper, we propose a noise-robust and unsupervised spike sorting algorithm based on learning discriminative spike features for clustering. The proposed algorithm uses discriminative subspace learning to extract low dimensional and most discriminative features from the spike waveforms and perform clustering with automatic detection of the number of the clusters. The core part of the algorithm involves iterative subspace selection using linear discriminant analysis and clustering using Gaussian mixture model with outlier detection. A statistical test in the discriminative subspace is proposed to automatically detect the number of the clusters. Comparative results on publicly available simulated and real in vivo datasets demonstrate that our algorithm achieves substantially improved cluster distinction leading to higher sorting accuracy and more reliable detection of clusters which are highly overlapping and not detectable using conventional feature extraction techniques such as principal component analysis or wavelets. By providing more accurate information about the activity of more number of individual neurons with high robustness to neural noise and outliers, the proposed unsupervised spike sorting algorithm facilitates more detailed and accurate analysis of single- and multi-unit activities in neuroscience and brain machine interface studies.

  11. Noise-robust unsupervised spike sorting based on discriminative subspace learning with outlier handling

    NASA Astrophysics Data System (ADS)

    Keshtkaran, Mohammad Reza; Yang, Zhi

    2017-06-01

    Objective. Spike sorting is a fundamental preprocessing step for many neuroscience studies which rely on the analysis of spike trains. Most of the feature extraction and dimensionality reduction techniques that have been used for spike sorting give a projection subspace which is not necessarily the most discriminative one. Therefore, the clusters which appear inherently separable in some discriminative subspace may overlap if projected using conventional feature extraction approaches leading to a poor sorting accuracy especially when the noise level is high. In this paper, we propose a noise-robust and unsupervised spike sorting algorithm based on learning discriminative spike features for clustering. Approach. The proposed algorithm uses discriminative subspace learning to extract low dimensional and most discriminative features from the spike waveforms and perform clustering with automatic detection of the number of the clusters. The core part of the algorithm involves iterative subspace selection using linear discriminant analysis and clustering using Gaussian mixture model with outlier detection. A statistical test in the discriminative subspace is proposed to automatically detect the number of the clusters. Main results. Comparative results on publicly available simulated and real in vivo datasets demonstrate that our algorithm achieves substantially improved cluster distinction leading to higher sorting accuracy and more reliable detection of clusters which are highly overlapping and not detectable using conventional feature extraction techniques such as principal component analysis or wavelets. Significance. By providing more accurate information about the activity of more number of individual neurons with high robustness to neural noise and outliers, the proposed unsupervised spike sorting algorithm facilitates more detailed and accurate analysis of single- and multi-unit activities in neuroscience and brain machine interface studies.

  12. Space-Time Cluster Analysis to Detect Innovative Clinical Practices: A Case Study of Aripiprazole in the Department of Veterans Affairs.

    PubMed

    Penfold, Robert B; Burgess, James F; Lee, Austin F; Li, Mingfei; Miller, Christopher J; Nealon Seibert, Marjorie; Semla, Todd P; Mohr, David C; Kazis, Lewis E; Bauer, Mark S

    2018-02-01

    To identify space-time clusters of changes in prescribing aripiprazole for bipolar disorder among providers in the VA. VA administrative data from 2002 to 2010 were used to identify prescriptions of aripiprazole for bipolar disorder. Prescriber characteristics were obtained using the Personnel and Accounting Integrated Database. We conducted a retrospective space-time cluster analysis using the space-time permutation statistic. All VA service users with a diagnosis of bipolar disorder were included in the patient population. Individuals with any schizophrenia spectrum diagnoses were excluded. We also identified all clinicians who wrote a prescription for any bipolar disorder medication. The study population included 32,630 prescribers. Of these, 8,643 wrote qualifying prescriptions. We identified three clusters of aripiprazole prescribing centered in Massachusetts, Ohio, and the Pacific Northwest. Clusters were associated with prescribing by VA-employed (vs. contracted) prescribers. Nurses with prescribing privileges were more likely to make a prescription for aripiprazole in cluster locations compared with psychiatrists. Primary care physicians were less likely. Early prescribing of aripiprazole for bipolar disorder clustered geographically and was associated with prescriber subgroups. These methods support prospective surveillance of practice changes and identification of associated health system characteristics. © Health Research and Educational Trust.

  13. Study on Adaptive Parameter Determination of Cluster Analysis in Urban Management Cases

    NASA Astrophysics Data System (ADS)

    Fu, J. Y.; Jing, C. F.; Du, M. Y.; Fu, Y. L.; Dai, P. P.

    2017-09-01

    The fine management for cities is the important way to realize the smart city. The data mining which uses spatial clustering analysis for urban management cases can be used in the evaluation of urban public facilities deployment, and support the policy decisions, and also provides technical support for the fine management of the city. Aiming at the problem that DBSCAN algorithm which is based on the density-clustering can not realize parameter adaptive determination, this paper proposed the optimizing method of parameter adaptive determination based on the spatial analysis. Firstly, making analysis of the function Ripley's K for the data set to realize adaptive determination of global parameter MinPts, which means setting the maximum aggregation scale as the range of data clustering. Calculating every point object's highest frequency K value in the range of Eps which uses K-D tree and setting it as the value of clustering density to realize the adaptive determination of global parameter MinPts. Then, the R language was used to optimize the above process to accomplish the precise clustering of typical urban management cases. The experimental results based on the typical case of urban management in XiCheng district of Beijing shows that: The new DBSCAN clustering algorithm this paper presents takes full account of the data's spatial and statistical characteristic which has obvious clustering feature, and has a better applicability and high quality. The results of the study are not only helpful for the formulation of urban management policies and the allocation of urban management supervisors in XiCheng District of Beijing, but also to other cities and related fields.

  14. Towards accurate modelling of galaxy clustering on small scales: testing the standard ΛCDM + halo model

    NASA Astrophysics Data System (ADS)

    Sinha, Manodeep; Berlind, Andreas A.; McBride, Cameron K.; Scoccimarro, Roman; Piscionere, Jennifer A.; Wibking, Benjamin D.

    2018-07-01

    Interpreting the small-scale clustering of galaxies with halo models can elucidate the connection between galaxies and dark matter haloes. Unfortunately, the modelling is typically not sufficiently accurate for ruling out models statistically. It is thus difficult to use the information encoded in small scales to test cosmological models or probe subtle features of the galaxy-halo connection. In this paper, we attempt to push halo modelling into the `accurate' regime with a fully numerical mock-based methodology and careful treatment of statistical and systematic errors. With our forward-modelling approach, we can incorporate clustering statistics beyond the traditional two-point statistics. We use this modelling methodology to test the standard Λ cold dark matter (ΛCDM) + halo model against the clustering of Sloan Digital Sky Survey (SDSS) seventh data release (DR7) galaxies. Specifically, we use the projected correlation function, group multiplicity function, and galaxy number density as constraints. We find that while the model fits each statistic separately, it struggles to fit them simultaneously. Adding group statistics leads to a more stringent test of the model and significantly tighter constraints on model parameters. We explore the impact of varying the adopted halo definition and cosmological model and find that changing the cosmology makes a significant difference. The most successful model we tried (Planck cosmology with Mvir haloes) matches the clustering of low-luminosity galaxies, but exhibits a 2.3σ tension with the clustering of luminous galaxies, thus providing evidence that the `standard' halo model needs to be extended. This work opens the door to adding interesting freedom to the halo model and including additional clustering statistics as constraints.

  15. Estimating multilevel logistic regression models when the number of clusters is low: a comparison of different statistical software procedures.

    PubMed

    Austin, Peter C

    2010-04-22

    Multilevel logistic regression models are increasingly being used to analyze clustered data in medical, public health, epidemiological, and educational research. Procedures for estimating the parameters of such models are available in many statistical software packages. There is currently little evidence on the minimum number of clusters necessary to reliably fit multilevel regression models. We conducted a Monte Carlo study to compare the performance of different statistical software procedures for estimating multilevel logistic regression models when the number of clusters was low. We examined procedures available in BUGS, HLM, R, SAS, and Stata. We found that there were qualitative differences in the performance of different software procedures for estimating multilevel logistic models when the number of clusters was low. Among the likelihood-based procedures, estimation methods based on adaptive Gauss-Hermite approximations to the likelihood (glmer in R and xtlogit in Stata) or adaptive Gaussian quadrature (Proc NLMIXED in SAS) tended to have superior performance for estimating variance components when the number of clusters was small, compared to software procedures based on penalized quasi-likelihood. However, only Bayesian estimation with BUGS allowed for accurate estimation of variance components when there were fewer than 10 clusters. For all statistical software procedures, estimation of variance components tended to be poor when there were only five subjects per cluster, regardless of the number of clusters.

  16. Integrated GIS and multivariate statistical analysis for regional scale assessment of heavy metal soil contamination: A critical review.

    PubMed

    Hou, Deyi; O'Connor, David; Nathanail, Paul; Tian, Li; Ma, Yan

    2017-12-01

    Heavy metal soil contamination is associated with potential toxicity to humans or ecotoxicity. Scholars have increasingly used a combination of geographical information science (GIS) with geostatistical and multivariate statistical analysis techniques to examine the spatial distribution of heavy metals in soils at a regional scale. A review of such studies showed that most soil sampling programs were based on grid patterns and composite sampling methodologies. Many programs intended to characterize various soil types and land use types. The most often used sampling depth intervals were 0-0.10 m, or 0-0.20 m, below surface; and the sampling densities used ranged from 0.0004 to 6.1 samples per km 2 , with a median of 0.4 samples per km 2 . The most widely used spatial interpolators were inverse distance weighted interpolation and ordinary kriging; and the most often used multivariate statistical analysis techniques were principal component analysis and cluster analysis. The review also identified several determining and correlating factors in heavy metal distribution in soils, including soil type, soil pH, soil organic matter, land use type, Fe, Al, and heavy metal concentrations. The major natural and anthropogenic sources of heavy metals were found to derive from lithogenic origin, roadway and transportation, atmospheric deposition, wastewater and runoff from industrial and mining facilities, fertilizer application, livestock manure, and sewage sludge. This review argues that the full potential of integrated GIS and multivariate statistical analysis for assessing heavy metal distribution in soils on a regional scale has not yet been fully realized. It is proposed that future research be conducted to map multivariate results in GIS to pinpoint specific anthropogenic sources, to analyze temporal trends in addition to spatial patterns, to optimize modeling parameters, and to expand the use of different multivariate analysis tools beyond principal component analysis (PCA) and cluster analysis (CA). Copyright © 2017 Elsevier Ltd. All rights reserved.

  17. Gap Shape Classification using Landscape Indices and Multivariate Statistics

    PubMed Central

    Wu, Chih-Da; Cheng, Chi-Chuan; Chang, Che-Chang; Lin, Chinsu; Chang, Kun-Cheng; Chuang, Yung-Chung

    2016-01-01

    This study proposed a novel methodology to classify the shape of gaps using landscape indices and multivariate statistics. Patch-level indices were used to collect the qualified shape and spatial configuration characteristics for canopy gaps in the Lienhuachih Experimental Forest in Taiwan in 1998 and 2002. Non-hierarchical cluster analysis was used to assess the optimal number of gap clusters and canonical discriminant analysis was used to generate the discriminant functions for canopy gap classification. The gaps for the two periods were optimally classified into three categories. In general, gap type 1 had a more complex shape, gap type 2 was more elongated and gap type 3 had the largest gaps that were more regular in shape. The results were evaluated using Wilks’ lambda as satisfactory (p < 0.001). The agreement rate of confusion matrices exceeded 96%. Differences in gap characteristics between the classified gap types that were determined using a one-way ANOVA showed a statistical significance in all patch indices (p = 0.00), except for the Euclidean nearest neighbor distance (ENN) in 2002. Taken together, these results demonstrated the feasibility and applicability of the proposed methodology to classify the shape of a gap. PMID:27901127

  18. Gap Shape Classification using Landscape Indices and Multivariate Statistics.

    PubMed

    Wu, Chih-Da; Cheng, Chi-Chuan; Chang, Che-Chang; Lin, Chinsu; Chang, Kun-Cheng; Chuang, Yung-Chung

    2016-11-30

    This study proposed a novel methodology to classify the shape of gaps using landscape indices and multivariate statistics. Patch-level indices were used to collect the qualified shape and spatial configuration characteristics for canopy gaps in the Lienhuachih Experimental Forest in Taiwan in 1998 and 2002. Non-hierarchical cluster analysis was used to assess the optimal number of gap clusters and canonical discriminant analysis was used to generate the discriminant functions for canopy gap classification. The gaps for the two periods were optimally classified into three categories. In general, gap type 1 had a more complex shape, gap type 2 was more elongated and gap type 3 had the largest gaps that were more regular in shape. The results were evaluated using Wilks' lambda as satisfactory (p < 0.001). The agreement rate of confusion matrices exceeded 96%. Differences in gap characteristics between the classified gap types that were determined using a one-way ANOVA showed a statistical significance in all patch indices (p = 0.00), except for the Euclidean nearest neighbor distance (ENN) in 2002. Taken together, these results demonstrated the feasibility and applicability of the proposed methodology to classify the shape of a gap.

  19. Applying spatial analysis tools in public health: an example using SaTScan to detect geographic targets for colorectal cancer screening interventions.

    PubMed

    Sherman, Recinda L; Henry, Kevin A; Tannenbaum, Stacey L; Feaster, Daniel J; Kobetz, Erin; Lee, David J

    2014-03-20

    Epidemiologists are gradually incorporating spatial analysis into health-related research as geocoded cases of disease become widely available and health-focused geospatial computer applications are developed. One health-focused application of spatial analysis is cluster detection. Using cluster detection to identify geographic areas with high-risk populations and then screening those populations for disease can improve cancer control. SaTScan is a free cluster-detection software application used by epidemiologists around the world to describe spatial clusters of infectious and chronic disease, as well as disease vectors and risk factors. The objectives of this article are to describe how spatial analysis can be used in cancer control to detect geographic areas in need of colorectal cancer screening intervention, identify issues commonly encountered by SaTScan users, detail how to select the appropriate methods for using SaTScan, and explain how method selection can affect results. As an example, we used various methods to detect areas in Florida where the population is at high risk for late-stage diagnosis of colorectal cancer. We found that much of our analysis was underpowered and that no single method detected all clusters of statistical or public health significance. However, all methods detected 1 area as high risk; this area is potentially a priority area for a screening intervention. Cluster detection can be incorporated into routine public health operations, but the challenge is to identify areas in which the burden of disease can be alleviated through public health intervention. Reliance on SaTScan's default settings does not always produce pertinent results.

  20. Minimum number of clusters and comparison of analysis methods for cross sectional stepped wedge cluster randomised trials with binary outcomes: A simulation study.

    PubMed

    Barker, Daniel; D'Este, Catherine; Campbell, Michael J; McElduff, Patrick

    2017-03-09

    Stepped wedge cluster randomised trials frequently involve a relatively small number of clusters. The most common frameworks used to analyse data from these types of trials are generalised estimating equations and generalised linear mixed models. A topic of much research into these methods has been their application to cluster randomised trial data and, in particular, the number of clusters required to make reasonable inferences about the intervention effect. However, for stepped wedge trials, which have been claimed by many researchers to have a statistical power advantage over the parallel cluster randomised trial, the minimum number of clusters required has not been investigated. We conducted a simulation study where we considered the most commonly used methods suggested in the literature to analyse cross-sectional stepped wedge cluster randomised trial data. We compared the per cent bias, the type I error rate and power of these methods in a stepped wedge trial setting with a binary outcome, where there are few clusters available and when the appropriate adjustment for a time trend is made, which by design may be confounding the intervention effect. We found that the generalised linear mixed modelling approach is the most consistent when few clusters are available. We also found that none of the common analysis methods for stepped wedge trials were both unbiased and maintained a 5% type I error rate when there were only three clusters. Of the commonly used analysis approaches, we recommend the generalised linear mixed model for small stepped wedge trials with binary outcomes. We also suggest that in a stepped wedge design with three steps, at least two clusters be randomised at each step, to ensure that the intervention effect estimator maintains the nominal 5% significance level and is also reasonably unbiased.

  1. Spatial clusters of suicide in the municipality of São Paulo 1996-2005: an ecological study.

    PubMed

    Bando, Daniel H; Moreira, Rafael S; Pereira, Julio C R; Barrozo, Ligia V

    2012-08-23

    In a classical study, Durkheim mapped suicide rates, wealth, and low family density and realized that they clustered in northern France. Assessing others variables, such as religious society, he constructed a framework for the analysis of the suicide, which still allows international comparisons using the same basic methodology. The present study aims to identify possible significantly clusters of suicide in the city of São Paulo, and then, verify their statistical associations with socio-economic and cultural characteristics. A spatial scan statistical test was performed to analyze the geographical pattern of suicide deaths of residents in the city of São Paulo by Administrative District, from 1996 to 2005. Relative risks and high and/or low clusters were calculated accounting for gender and age as co-variates, were analyzed using spatial scan statistics to identify geographical patterns. Logistic regression was used to estimate associations with socioeconomic variables, considering, the spatial cluster of high suicide rates as the response variable. Drawing from Durkheim's original work, current World Health Organization (WHO) reports and recent reviews, the following independent variables were considered: marital status, income, education, religion, and migration. The mean suicide rate was 4.1/100,000 inhabitant-years. Against this baseline, two clusters were identified: the first, of increased risk (RR=1.66), comprising 18 districts in the central region; the second, of decreased risk (RR=0.78), including 14 districts in the southern region. The downtown area toward the southwestern region of the city displayed the highest risk for suicide, and though the overall risk may be considered low, the rate climbs up to an intermediate level in this region. One logistic regression analysis contrasted the risk cluster (18 districts) against the other remaining 78 districts, testing the effects of socioeconomic-cultural variables. The following categories of proportion of persons within the clusters were identified as risk factors: singles (OR=2.36), migrants (OR=1.50), Catholics (OR=1.37) and higher income (OR=1.06). In a second logistic model, likewise conceived, the following categories of proportion of persons were identified as protective factors: married (OR=0.49) and Evangelical (OR=0.60). This risk/ protection profile is in accordance with the interpretation that, as a social phenomenon, suicide is related to social isolation. Thus, the classical framework put forward by Durkheim seems to still hold, even though its categorical expression requires re-interpretation.

  2. Heavy metal contamination of agricultural soils affected by mining activities around the Ganxi River in Chenzhou, Southern China.

    PubMed

    Ma, Li; Sun, Jing; Yang, Zhaoguang; Wang, Lin

    2015-12-01

    Heavy metal contamination attracted a wide spread attention due to their strong toxicity and persistence. The Ganxi River, located in Chenzhou City, Southern China, has been severely polluted by lead/zinc ore mining activities. This work investigated the heavy metal pollution in agricultural soils around the Ganxi River. The total concentrations of heavy metals were determined by inductively coupled plasma-mass spectrometry. The potential risk associated with the heavy metals in soil was assessed by Nemerow comprehensive index and potential ecological risk index. In both methods, the study area was rated as very high risk. Multivariate statistical methods including Pearson's correlation analysis, hierarchical cluster analysis, and principal component analysis were employed to evaluate the relationships between heavy metals, as well as the correlation between heavy metals and pH, to identify the metal sources. Three distinct clusters have been observed by hierarchical cluster analysis. In principal component analysis, a total of two components were extracted to explain over 90% of the total variance, both of which were associated with anthropogenic sources.

  3. Cumulative trauma, hyperarousal, and suicidality in the general population: a path analysis.

    PubMed

    Briere, John; Godbout, Natacha; Dias, Colin

    2015-01-01

    Although trauma exposure and posttraumatic stress disorder (PTSD) both have been linked to suicidal thoughts and behavior, the underlying basis for this relationship is not clear. In a sample of 357 trauma-exposed individuals from the general population, younger participant age, cumulative trauma exposure, and all three Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition, PTSD clusters (reexperiencing, avoidance, and hyperarousal) were correlated with clinical levels of suicidality. However, logistic regression analysis indicated that when all PTSD clusters were considered simultaneously, only hyperarousal continued to be predictive. A path analysis confirmed that posttraumatic hyperarousal (but not other components of PTSD) fully mediated the relationship between extent of trauma exposure and degree of suicidal thoughts and behaviors.

  4. Searching for a Gulf War syndrome using cluster analysis.

    PubMed

    Everitt, B; Ismail, K; David, A S; Wessely, S

    2002-11-01

    Gulf veterans report medically unexplained symptoms more frequently than non-Gulf veterans did. We examined whether Gulf and non-Gulf veterans could be distinguished by their patterns of symptom reporting. A k-means cluster analysis was applied to 500 randomly sampled veterans from each of three United Kingdom military cohorts of veterans; those deployed to the Gulf conflict between 1990 and 1991; to the Bosnia peacekeeping mission between 1992 and 1997; and military personnel who were in active service but not deployed to the Gulf (Era). Sociodemographic, health variables and scores for ten symptom groups were calculated. The gap statistic indicated the five-group solution as one that provided a particularly informative description of the structure in the data. Cluster 1 consisted of low scores for all symptom groups. Cluster 2 had veterans with highest symptom scores for musculoskeletal symptoms and high scores for psychiatric symptoms. Cluster 3 had high scores for psychiatric symptoms and marginally elevated scores for the remaining nine groups symptom groups. Cluster 4 had elevated scores for musculoskeletal symptoms only and cluster 5 was distinguishable from the other clusters in having high scores in all symptom groups, especially psychiatric and musculoskeletal. The findings do not support the existence of a unique syndrome affecting a subgroup of Gulf veterans but emphasize the excess of non-specific self-reported ill health in this group.

  5. Classifying Higher Education Institutions in Korea: A Performance-Based Approach

    ERIC Educational Resources Information Center

    Shin, Jung Cheol

    2009-01-01

    The purpose of this study was to classify higher education institutions according to institutional performance rather than predetermined benchmarks. Institutional performance was defined as research performance and classified using Hierarchical Cluster Analysis, a statistical method that classifies objects according to specified classification…

  6. Statistical and clustering analysis for disturbances: A case study of voltage dips in wind farms

    DOE PAGES

    Garcia-Sanchez, Tania; Gomez-Lazaro, Emilio; Muljadi, Eduard; ...

    2016-01-28

    This study proposes and evaluates an alternative statistical methodology to analyze a large number of voltage dips. For a given voltage dip, a set of lengths is first identified to characterize the root mean square (rms) voltage evolution along the disturbance, deduced from partial linearized time intervals and trajectories. Principal component analysis and K-means clustering processes are then applied to identify rms-voltage patterns and propose a reduced number of representative rms-voltage profiles from the linearized trajectories. This reduced group of averaged rms-voltage profiles enables the representation of a large amount of disturbances, which offers a visual and graphical representation ofmore » their evolution along the events, aspects that were not previously considered in other contributions. The complete process is evaluated on real voltage dips collected in intense field-measurement campaigns carried out in a wind farm in Spain among different years. The results are included in this paper.« less

  7. The application of data mining techniques to oral cancer prognosis.

    PubMed

    Tseng, Wan-Ting; Chiang, Wei-Fan; Liu, Shyun-Yeu; Roan, Jinsheng; Lin, Chun-Nan

    2015-05-01

    This study adopted an integrated procedure that combines the clustering and classification features of data mining technology to determine the differences between the symptoms shown in past cases where patients died from or survived oral cancer. Two data mining tools, namely decision tree and artificial neural network, were used to analyze the historical cases of oral cancer, and their performance was compared with that of logistic regression, the popular statistical analysis tool. Both decision tree and artificial neural network models showed superiority to the traditional statistical model. However, as to clinician, the trees created by the decision tree models are relatively easier to interpret compared to that of the artificial neural network models. Cluster analysis also discovers that those stage 4 patients whose also possess the following four characteristics are having an extremely low survival rate: pN is N2b, level of RLNM is level I-III, AJCC-T is T4, and cells mutate situation (G) is moderate.

  8. Effect of the non-thermal Sunyaev-Zel'dovich effect on the temperature determination of galaxy clusters

    NASA Astrophysics Data System (ADS)

    Marchegiani, P.; Colafrancesco, S.

    2017-08-01

    A recent stacking analysis of Planck HFI data of galaxy clusters led to the derivation of the cluster temperatures using the relativistic corrections to the Sunyaev-Zel'dovich effect (SZE). However, the temperatures of high-temperature clusters, as derived from this analysis, were basically higher than the temperatures derived from X-ray measurements, at a moderate statistical significance of 1.5σ. This discrepancy has been attributed by Hurier to calibration issues. In this paper, we discuss an alternative explanation for this discrepancy in terms of a non-thermal SZE astrophysical component. We find that this explanation can work if non-thermal electrons in galaxy clusters have a low minimum momentum (p1 ˜ 0.5-1), and if their pressure is of the order of 20-30 per cent of the thermal gas pressure. Both these conditions are hard to obtain if the non-thermal electrons are mixed with the hot gas in the intracluster medium, but can be possibly obtained if the non-thermal electrons are mainly confined in bubbles with a high amount of non-thermal plasma and a low amount of thermal plasma, or are in giant radio lobes/relics in the outskirts of the clusters. To derive more precise results on the properties of the non-thermal electrons in clusters, and in view of more solid detections of a discrepancy between X-ray- and SZE-derived cluster temperatures that cannot be explained in other ways, it would be necessary to reproduce the full analysis done by Hurier by systematically adding the non-thermal component of the SZE.

  9. Analysis of the mutations induced by conazole fungicides in vivo.

    PubMed

    Ross, Jeffrey A; Leavitt, Sharon A

    2010-05-01

    The mouse liver tumorigenic conazole fungicides triadimefon and propiconazole have previously been shown to be in vivo mouse liver mutagens in the Big Blue transgenic mutation assay when administered in feed at tumorigenic doses, whereas the non-tumorigenic conazole myclobutanil was not mutagenic. DNA sequencing of the mutants recovered from each treatment group as well as from animals receiving control diet was conducted to gain additional insight into the mode of action by which tumorigenic conazoles induce mutations. Relative dinucleotide mutabilities (RDMs) were calculated for each possible dinucleotide in each treatment group and then examined by multivariate statistical analysis techniques. Unsupervised hierarchical clustering analysis of RDM values segregated two independent control groups together, along with the non-tumorigen myclobutanil. The two tumorigenic conazoles clustered together in a distinct grouping. Partitioning around mediods of RDM values into two clusters also groups the triadimefon and propiconazole together in one cluster and the two control groups and myclobutanil together in a second cluster. Principal component analysis of these results identifies two components that account for 88.3% of the variability in the points. Taken together, these results are consistent with the hypothesis that propiconazole- and triadimefon-induced mutations do not represent clonal expansion of background mutations and support the hypothesis that they arise from the accumulation of reactive electrophilic metabolic intermediates within the liver in vivo.

  10. A Highly Efficient Design Strategy for Regression with Outcome Pooling

    PubMed Central

    Mitchell, Emily M.; Lyles, Robert H.; Manatunga, Amita K.; Perkins, Neil J.; Schisterman, Enrique F.

    2014-01-01

    The potential for research involving biospecimens can be hindered by the prohibitive cost of performing laboratory assays on individual samples. To mitigate this cost, strategies such as randomly selecting a portion of specimens for analysis or randomly pooling specimens prior to performing laboratory assays may be employed. These techniques, while effective in reducing cost, are often accompanied by a considerable loss of statistical efficiency. We propose a novel pooling strategy based on the k-means clustering algorithm to reduce laboratory costs while maintaining a high level of statistical efficiency when predictor variables are measured on all subjects, but the outcome of interest is assessed in pools. We perform simulations motivated by the BioCycle study to compare this k-means pooling strategy with current pooling and selection techniques under simple and multiple linear regression models. While all of the methods considered produce unbiased estimates and confidence intervals with appropriate coverage, pooling under k-means clustering provides the most precise estimates, closely approximating results from the full data and losing minimal precision as the total number of pools decreases. The benefits of k-means clustering evident in the simulation study are then applied to an analysis of the BioCycle dataset. In conclusion, when the number of lab tests is limited by budget, pooling specimens based on k-means clustering prior to performing lab assays can be an effective way to save money with minimal information loss in a regression setting. PMID:25220822

  11. A highly efficient design strategy for regression with outcome pooling.

    PubMed

    Mitchell, Emily M; Lyles, Robert H; Manatunga, Amita K; Perkins, Neil J; Schisterman, Enrique F

    2014-12-10

    The potential for research involving biospecimens can be hindered by the prohibitive cost of performing laboratory assays on individual samples. To mitigate this cost, strategies such as randomly selecting a portion of specimens for analysis or randomly pooling specimens prior to performing laboratory assays may be employed. These techniques, while effective in reducing cost, are often accompanied by a considerable loss of statistical efficiency. We propose a novel pooling strategy based on the k-means clustering algorithm to reduce laboratory costs while maintaining a high level of statistical efficiency when predictor variables are measured on all subjects, but the outcome of interest is assessed in pools. We perform simulations motivated by the BioCycle study to compare this k-means pooling strategy with current pooling and selection techniques under simple and multiple linear regression models. While all of the methods considered produce unbiased estimates and confidence intervals with appropriate coverage, pooling under k-means clustering provides the most precise estimates, closely approximating results from the full data and losing minimal precision as the total number of pools decreases. The benefits of k-means clustering evident in the simulation study are then applied to an analysis of the BioCycle dataset. In conclusion, when the number of lab tests is limited by budget, pooling specimens based on k-means clustering prior to performing lab assays can be an effective way to save money with minimal information loss in a regression setting. Copyright © 2014 John Wiley & Sons, Ltd.

  12. A measurement of CMB cluster lensing with SPT and DES year 1 data

    DOE PAGES

    Baxter, E. J.; Raghunathan, S.; Crawford, T. M.; ...

    2018-02-09

    Clusters of galaxies gravitationally lens the cosmic microwave background (CMB) radiation, resulting in a distinct imprint in the CMB on arcminute scales. Measurement of this effect offers a promising way to constrain the masses of galaxy clusters, particularly those at high redshift. We use CMB maps from the South Pole Telescope Sunyaev-Zel'dovich (SZ) survey to measure the CMB lensing signal around galaxy clusters identified in optical imaging from first year observations of the Dark Energy Survey. The cluster catalog used in this analysis contains 3697 members with mean redshift ofmore » $$\\bar{z} = 0.45$$. We detect lensing of the CMB by the galaxy clusters at $$8.1\\sigma$$ significance. Using the measured lensing signal, we constrain the amplitude of the relation between cluster mass and optical richness to roughly $$17\\%$$ precision, finding good agreement with recent constraints obtained with galaxy lensing. The error budget is dominated by statistical noise but includes significant contributions from systematic biases due to the thermal SZ effect and cluster miscentering.« less

  13. Mapping of terrain by computer clustering techniques using multispectral scanner data and using color aerial film

    NASA Technical Reports Server (NTRS)

    Smedes, H. W.; Linnerud, H. J.; Woolaver, L. B.; Su, M. Y.; Jayroe, R. R.

    1972-01-01

    Two clustering techniques were used for terrain mapping by computer of test sites in Yellowstone National Park. One test was made with multispectral scanner data using a composite technique which consists of (1) a strictly sequential statistical clustering which is a sequential variance analysis, and (2) a generalized K-means clustering. In this composite technique, the output of (1) is a first approximation of the cluster centers. This is the input to (2) which consists of steps to improve the determination of cluster centers by iterative procedures. Another test was made using the three emulsion layers of color-infrared aerial film as a three-band spectrometer. Relative film densities were analyzed using a simple clustering technique in three-color space. Important advantages of the clustering technique over conventional supervised computer programs are (1) human intervention, preparation time, and manipulation of data are reduced, (2) the computer map, gives unbiased indication of where best to select the reference ground control data, (3) use of easy to obtain inexpensive film, and (4) the geometric distortions can be easily rectified by simple standard photogrammetric techniques.

  14. A measurement of CMB cluster lensing with SPT and DES year 1 data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Baxter, E. J.; Raghunathan, S.; Crawford, T. M.

    Clusters of galaxies gravitationally lens the cosmic microwave background (CMB) radiation, resulting in a distinct imprint in the CMB on arcminute scales. Measurement of this effect offers a promising way to constrain the masses of galaxy clusters, particularly those at high redshift. We use CMB maps from the South Pole Telescope Sunyaev-Zel'dovich (SZ) survey to measure the CMB lensing signal around galaxy clusters identified in optical imaging from first year observations of the Dark Energy Survey. The cluster catalog used in this analysis contains 3697 members with mean redshift ofmore » $$\\bar{z} = 0.45$$. We detect lensing of the CMB by the galaxy clusters at $$8.1\\sigma$$ significance. Using the measured lensing signal, we constrain the amplitude of the relation between cluster mass and optical richness to roughly $$17\\%$$ precision, finding good agreement with recent constraints obtained with galaxy lensing. The error budget is dominated by statistical noise but includes significant contributions from systematic biases due to the thermal SZ effect and cluster miscentering.« less

  15. [Analysis of efficacy of radiofrequency obliteration with due regard for the target vein's diameter].

    PubMed

    Shaĭdakov, E V; Grigorian, A G; Iliukhin, E A; Bulatov, V L; Gal'chenko, M I

    2014-01-01

    Data concerning the effect of the target vein's diameter on efficacy of radiofrequency obliteration (RFO) in the current literature are limited. To assess efficacy of RFO and stripping, peculiarities of the postoperative period course with due regard for the diameter of the target veins, to compare the outcomes of RFO and classical phlebectomy in treatment of varicose disease during 1-year follow up by a composite end point. A multicenter prospective non-randomized study based on analysing therapeutic outcomes in a total of 218 patients presenting with varicose disease (C2-C3 according to the CEAP). RFO was performed in 108 patients and phlebectomy in 110 subjects. The results were assessed by means of a composite end point including four components: technical outcome at 1-year follow-up, pain, subcutaneous haemorrhage, and paresthesias. The groups of patients who endured RFO and phlebectomy were subdivided into two subgroups according to the target vein's diameter with a border of 14 mm. Statistical analysis. We used the methods of non-parametric statistics (contingency tables, chi squared test), calculating the odds ratio (OR) for a favourable outcome with a 95% confidential interval. Pain dynamics was assessed by means of intellectual data analysis (cluster analysis). «Phelbectomy ≥ 14 mm» and «RFO ≥ 14 mm». The incidence rate of a good outcome in the subgroups amounted to 20 (30.8%) and 61 (95.3%), respectively. The odds ratio for favourable outcome between the subgroups of RFA and phlebectomy amounted to 45.8; 95% CI (44.5-47.0). "RFA ≥ 14 mm" and "RFA < 14 mm". Favourable outcome rate in the subgroups amounted to 25 (39.1%) and 17 (38.6%), respectively. The differences were not statistically significant, p=0.24. The odds ratio for a good outcome between the RFO subgroups amounted to: OR=0.98; 95% CI (0.18-1.77). Comparative analysis of RFO outcomes between the clinics. Favourable outcome rate in the first clinic was 50 (92.6%), in the second 34 (87.2%), and in the third 13 (86.6%), with the difference being statistically insignificant, p=0.7. The cluster analysis of the pain dynamics after the intervention. The clusters with moderate pain were composed of the patients after phlebectomy. These clusters showed association of pain intensity with increased BMI and greater vein diameter. 1) RFA of great-diameter veins by a favourable outcome by the composite end point (CED) turned out to be superior to the classic phlebectomy. 2) For RFA the incidence rate of a favourable outcome by the CED does not depend on the target vein's diameter. 3) A pronounced pain syndrome after phlebectomy was associated with excessive body weight or obesity and greater diameter of the vein.

  16. Multi-scale visual analysis of time-varying electrocorticography data via clustering of brain regions

    DOE PAGES

    Murugesan, Sugeerth; Bouchard, Kristofer; Chang, Edward; ...

    2017-06-06

    There exists a need for effective and easy-to-use software tools supporting the analysis of complex Electrocorticography (ECoG) data. Understanding how epileptic seizures develop or identifying diagnostic indicators for neurological diseases require the in-depth analysis of neural activity data from ECoG. Such data is multi-scale and is of high spatio-temporal resolution. Comprehensive analysis of this data should be supported by interactive visual analysis methods that allow a scientist to understand functional patterns at varying levels of granularity and comprehend its time-varying behavior. We introduce a novel multi-scale visual analysis system, ECoG ClusterFlow, for the detailed exploration of ECoG data. Our systemmore » detects and visualizes dynamic high-level structures, such as communities, derived from the time-varying connectivity network. The system supports two major views: 1) an overview summarizing the evolution of clusters over time and 2) an electrode view using hierarchical glyph-based design to visualize the propagation of clusters in their spatial, anatomical context. We present case studies that were performed in collaboration with neuroscientists and neurosurgeons using simulated and recorded epileptic seizure data to demonstrate our system's effectiveness. ECoG ClusterFlow supports the comparison of spatio-temporal patterns for specific time intervals and allows a user to utilize various clustering algorithms. Neuroscientists can identify the site of seizure genesis and its spatial progression during various the stages of a seizure. Our system serves as a fast and powerful means for the generation of preliminary hypotheses that can be used as a basis for subsequent application of rigorous statistical methods, with the ultimate goal being the clinical treatment of epileptogenic zones.« less

  17. Statistical Analysis of Large Scale Structure by the Discrete Wavelet Transform

    NASA Astrophysics Data System (ADS)

    Pando, Jesus

    1997-10-01

    The discrete wavelet transform (DWT) is developed as a general statistical tool for the study of large scale structures (LSS) in astrophysics. The DWT is used in all aspects of structure identification including cluster analysis, spectrum and two-point correlation studies, scale-scale correlation analysis and to measure deviations from Gaussian behavior. The techniques developed are demonstrated on 'academic' signals, on simulated models of the Lymanα (Lyα) forests, and on observational data of the Lyα forests. This technique can detect clustering in the Ly-α clouds where traditional techniques such as the two-point correlation function have failed. The position and strength of these clusters in both real and simulated data is determined and it is shown that clusters exist on scales as large as at least 20 h-1 Mpc at significance levels of 2-4 σ. Furthermore, it is found that the strength distribution of the clusters can be used to distinguish between real data and simulated samples even where other traditional methods have failed to detect differences. Second, a method for measuring the power spectrum of a density field using the DWT is developed. All common features determined by the usual Fourier power spectrum can be calculated by the DWT. These features, such as the index of a power law or typical scales, can be detected even when the samples are geometrically complex, the samples are incomplete, or the mean density on larger scales is not known (the infrared uncertainty). Using this method the spectra of Ly-α forests in both simulated and real samples is calculated. Third, a method for measuring hierarchical clustering is introduced. Because hierarchical evolution is characterized by a set of rules of how larger dark matter halos are formed by the merging of smaller halos, scale-scale correlations of the density field should be one of the most sensitive quantities in determining the merging history. We show that these correlations can be completely determined by the correlations between discrete wavelet coefficients on adjacent scales and at nearly the same spatial position, Cj,j+12/cdot2. Scale-scale correlations on two samples of the QSO Ly-α forests absorption spectra are computed. Lastly, higher order statistics are developed to detect deviations from Gaussian behavior. These higher order statistics are necessary to fully characterize the Ly-α forests because the usual 2nd order statistics, such as the two-point correlation function or power spectrum, give inconclusive results. It is shown how this technique takes advantage of the locality of the DWT to circumvent the central limit theorem. A non-Gaussian spectrum is defined and this spectrum reveals not only the magnitude, but the scales of non-Gaussianity. When applied to simulated and observational samples of the Ly-α clouds, it is found that different popular models of structure formation have different spectra while two, independent observational data sets, have the same spectra. Moreover, the non-Gaussian spectra of real data sets are significantly different from the spectra of various possible random samples. (Abstract shortened by UMI.)

  18. Supervised group Lasso with applications to microarray data analysis

    PubMed Central

    Ma, Shuangge; Song, Xiao; Huang, Jian

    2007-01-01

    Background A tremendous amount of efforts have been devoted to identifying genes for diagnosis and prognosis of diseases using microarray gene expression data. It has been demonstrated that gene expression data have cluster structure, where the clusters consist of co-regulated genes which tend to have coordinated functions. However, most available statistical methods for gene selection do not take into consideration the cluster structure. Results We propose a supervised group Lasso approach that takes into account the cluster structure in gene expression data for gene selection and predictive model building. For gene expression data without biological cluster information, we first divide genes into clusters using the K-means approach and determine the optimal number of clusters using the Gap method. The supervised group Lasso consists of two steps. In the first step, we identify important genes within each cluster using the Lasso method. In the second step, we select important clusters using the group Lasso. Tuning parameters are determined using V-fold cross validation at both steps to allow for further flexibility. Prediction performance is evaluated using leave-one-out cross validation. We apply the proposed method to disease classification and survival analysis with microarray data. Conclusion We analyze four microarray data sets using the proposed approach: two cancer data sets with binary cancer occurrence as outcomes and two lymphoma data sets with survival outcomes. The results show that the proposed approach is capable of identifying a small number of influential gene clusters and important genes within those clusters, and has better prediction performance than existing methods. PMID:17316436

  19. Photometric redshifts as a tool for studying the Coma cluster galaxy populations

    NASA Astrophysics Data System (ADS)

    Adami, C.; Ilbert, O.; Pelló, R.; Cuillandre, J. C.; Durret, F.; Mazure, A.; Picat, J. P.; Ulmer, M. P.

    2008-12-01

    Aims: We apply photometric redshift techniques to an investigation of the Coma cluster galaxy luminosity function (GLF) at faint magnitudes, in particular in the u* band where basically no studies are presently available at these magnitudes. Methods: Cluster members were selected based on probability distribution function from photometric redshift calculations applied to deep u^*, B, V, R, I images covering a region of almost 1 deg2 (completeness limit R ~ 24). In the area covered only by the u* image, the GLF was also derived after a statistical background subtraction. Results: Global and local GLFs in the B, V, R, and I bands obtained with photometric redshift selection are consistent with our previous results based on a statistical background subtraction. The GLF in the u* band shows an increase in the faint end slope towards the outer regions of the cluster. The analysis of the multicolor type spatial distribution reveals that late type galaxies are distributed in clumps in the cluster outskirts, where X-ray substructures are also detected and where the GLF in the u* band is steeper. Conclusions: We can reproduce the GLFs computed with classical statistical subtraction methods by applying a photometric redshift technique. The u* GLF slope is steeper in the cluster outskirts, varying from α ~ -1 in the cluster center to α ~ -2 in the cluster periphery. The concentrations of faint late type galaxies in the cluster outskirts could explain these very steep slopes, assuming a short burst of star formation in these galaxies when entering the cluster. Based on observations obtained with MegaPrime/MegaCam, a joint project of CFHT and CEA/DAPNIA, at the Canada-France-Hawaii Telescope (CFHT) which is operated by the National Research Council (NRC) of Canada, the Institut National des Sciences de l'Univers of the Centre National de la Recherche Scientifique (CNRS) of France, and the University of Hawaii. This work is also partly based on data products produced at TERAPIX and the Canadian Astronomy Data Centre as part of the Canada-France-Hawaii Telescope Legacy Survey, a collaborative project of NRC and CNRS. Also based on data from W. M. Keck Observatory which is operated as a scientific partnership between the California Institute of Technology, the University of California, and NASA. It was made possible by the generous financial support of the W. M. Keck Foundation.

  20. An approach to functionally relevant clustering of the protein universe: Active site profile‐based clustering of protein structures and sequences

    PubMed Central

    Knutson, Stacy T.; Westwood, Brian M.; Leuthaeuser, Janelle B.; Turner, Brandon E.; Nguyendac, Don; Shea, Gabrielle; Kumar, Kiran; Hayden, Julia D.; Harper, Angela F.; Brown, Shoshana D.; Morris, John H.; Ferrin, Thomas E.; Babbitt, Patricia C.

    2017-01-01

    Abstract Protein function identification remains a significant problem. Solving this problem at the molecular functional level would allow mechanistic determinant identification—amino acids that distinguish details between functional families within a superfamily. Active site profiling was developed to identify mechanistic determinants. DASP and DASP2 were developed as tools to search sequence databases using active site profiling. Here, TuLIP (Two‐Level Iterative clustering Process) is introduced as an iterative, divisive clustering process that utilizes active site profiling to separate structurally characterized superfamily members into functionally relevant clusters. Underlying TuLIP is the observation that functionally relevant families (curated by Structure‐Function Linkage Database, SFLD) self‐identify in DASP2 searches; clusters containing multiple functional families do not. Each TuLIP iteration produces candidate clusters, each evaluated to determine if it self‐identifies using DASP2. If so, it is deemed a functionally relevant group. Divisive clustering continues until each structure is either a functionally relevant group member or a singlet. TuLIP is validated on enolase and glutathione transferase structures, superfamilies well‐curated by SFLD. Correlation is strong; small numbers of structures prevent statistically significant analysis. TuLIP‐identified enolase clusters are used in DASP2 GenBank searches to identify sequences sharing functional site features. Analysis shows a true positive rate of 96%, false negative rate of 4%, and maximum false positive rate of 4%. F‐measure and performance analysis on the enolase search results and comparison to GEMMA and SCI‐PHY demonstrate that TuLIP avoids the over‐division problem of these methods. Mechanistic determinants for enolase families are evaluated and shown to correlate well with literature results. PMID:28054422

  1. An approach to functionally relevant clustering of the protein universe: Active site profile-based clustering of protein structures and sequences.

    PubMed

    Knutson, Stacy T; Westwood, Brian M; Leuthaeuser, Janelle B; Turner, Brandon E; Nguyendac, Don; Shea, Gabrielle; Kumar, Kiran; Hayden, Julia D; Harper, Angela F; Brown, Shoshana D; Morris, John H; Ferrin, Thomas E; Babbitt, Patricia C; Fetrow, Jacquelyn S

    2017-04-01

    Protein function identification remains a significant problem. Solving this problem at the molecular functional level would allow mechanistic determinant identification-amino acids that distinguish details between functional families within a superfamily. Active site profiling was developed to identify mechanistic determinants. DASP and DASP2 were developed as tools to search sequence databases using active site profiling. Here, TuLIP (Two-Level Iterative clustering Process) is introduced as an iterative, divisive clustering process that utilizes active site profiling to separate structurally characterized superfamily members into functionally relevant clusters. Underlying TuLIP is the observation that functionally relevant families (curated by Structure-Function Linkage Database, SFLD) self-identify in DASP2 searches; clusters containing multiple functional families do not. Each TuLIP iteration produces candidate clusters, each evaluated to determine if it self-identifies using DASP2. If so, it is deemed a functionally relevant group. Divisive clustering continues until each structure is either a functionally relevant group member or a singlet. TuLIP is validated on enolase and glutathione transferase structures, superfamilies well-curated by SFLD. Correlation is strong; small numbers of structures prevent statistically significant analysis. TuLIP-identified enolase clusters are used in DASP2 GenBank searches to identify sequences sharing functional site features. Analysis shows a true positive rate of 96%, false negative rate of 4%, and maximum false positive rate of 4%. F-measure and performance analysis on the enolase search results and comparison to GEMMA and SCI-PHY demonstrate that TuLIP avoids the over-division problem of these methods. Mechanistic determinants for enolase families are evaluated and shown to correlate well with literature results. © 2017 The Authors Protein Science published by Wiley Periodicals, Inc. on behalf of The Protein Society.

  2. Cluster Subcutaneous Allergen Specific Immunotherapy for the Treatment of Allergic Rhinitis: A Systematic Review and Meta-Analysis

    PubMed Central

    Sun, Yueqi; Luo, Xi; Li, Huabin

    2014-01-01

    Background Although allergen specific immunotherapy (SIT) represents the only immune- modifying and curative option available for patients with allergic rhinitis (AR), the optimal schedule for specific subcutaneous immunotherapy (SCIT) is still unknown. The objective of this study is to systematically assess the efficacy and safety of cluster SCIT for patients with AR. Methods By searching PubMed, EMBASE and the Cochrane clinical trials database from 1980 through May 10th, 2013, we collected and analyzed the randomized controlled trials (RCTs) of cluster SCIT to assess its efficacy and safety. Results Eight trials involving 567 participants were included in this systematic review. Our meta-analysis showed that cluster SCIT have similar effect in reduction of both rhinitis symptoms and the requirement for anti-allergic medication compared with conventional SCIT, but when comparing cluster SCIT with placebo, no statistic significance were found in reduction of symptom scores or medication scores. Some caution is required in this interpretation as there was significant heterogeneity between studies. Data relating to Rhinoconjunctivitis Quality of Life Questionnaire (RQLQ) in 3 included studies were analyzed, which consistently point to the efficacy of cluster SCIT in improving quality of life compared to placebo. To assess the safety of cluster SCIT, meta-analysis showed that no differences existed in the incidence of either local adverse reaction or systemic adverse reaction between the cluster group and control group. Conclusion Based on the current limited evidence, we still could not conclude affirmatively that cluster SCIT was a safe and efficacious option for the treatment of AR patients. Further large-scale, well-designed RCTs on this topic are still needed. PMID:24489740

  3. Clustering P-Wave Receiver Functions To Constrain Subsurface Seismic Structure

    NASA Astrophysics Data System (ADS)

    Chai, C.; Larmat, C. S.; Maceira, M.; Ammon, C. J.; He, R.; Zhang, H.

    2017-12-01

    The acquisition of high-quality data from permanent and temporary dense seismic networks provides the opportunity to apply statistical and machine learning techniques to a broad range of geophysical observations. Lekic and Romanowicz (2011) used clustering analysis on tomographic velocity models of the western United States to perform tectonic regionalization and the velocity-profile clusters agree well with known geomorphic provinces. A complementary and somewhat less restrictive approach is to apply cluster analysis directly to geophysical observations. In this presentation, we apply clustering analysis to teleseismic P-wave receiver functions (RFs) continuing efforts of Larmat et al. (2015) and Maceira et al. (2015). These earlier studies validated the approach with surface waves and stacked EARS RFs from the USArray stations. In this study, we experiment with both the K-means and hierarchical clustering algorithms. We also test different distance metrics defined in the vector space of RFs following Lekic and Romanowicz (2011). We cluster data from two distinct data sets. The first, corresponding to the western US, was by smoothing/interpolation of receiver-function wavefield (Chai et al. 2015). Spatial coherence and agreement with geologic region increase with this simpler, spatially smoothed set of observations. The second data set is composed of RFs for more than 800 stations of the China Digital Seismic Network (CSN). Preliminary results show a first order agreement between clusters and tectonic region and each region cluster includes a distinct Ps arrival, which probably reflects differences in crustal thickness. Regionalization remains an important step to characterize a model prior to application of full waveform and/or stochastic imaging techniques because of the computational expense of these types of studies. Machine learning techniques can provide valuable information that can be used to design and characterize formal geophysical inversion, providing information on spatial variability in the subsurface geology.

  4. The cosmological analysis of X-ray cluster surveys - I. A new method for interpreting number counts

    NASA Astrophysics Data System (ADS)

    Clerc, N.; Pierre, M.; Pacaud, F.; Sadibekova, T.

    2012-07-01

    We present a new method aimed at simplifying the cosmological analysis of X-ray cluster surveys. It is based on purely instrumental observable quantities considered in a two-dimensional X-ray colour-magnitude diagram (hardness ratio versus count rate). The basic principle is that even in rather shallow surveys, substantial information on cluster redshift and temperature is present in the raw X-ray data and can be statistically extracted; in parallel, such diagrams can be readily predicted from an ab initio cosmological modelling. We illustrate the methodology for the case of a 100-deg2XMM survey having a sensitivity of ˜10-14 erg s-1 cm-2 and fit at the same time, the survey selection function, the cluster evolutionary scaling relations and the cosmology; our sole assumption - driven by the limited size of the sample considered in the case study - is that the local cluster scaling relations are known. We devote special attention to the realistic modelling of the count-rate measurement uncertainties and evaluate the potential of the method via a Fisher analysis. In the absence of individual cluster redshifts, the count rate and hardness ratio (CR-HR) method appears to be much more efficient than the traditional approach based on cluster counts (i.e. dn/dz, requiring redshifts). In the case where redshifts are available, our method performs similar to the traditional mass function (dn/dM/dz) for the purely cosmological parameters, but constrains better parameters defining the cluster scaling relations and their evolution. A further practical advantage of the CR-HR method is its simplicity: this fully top-down approach totally bypasses the tedious steps consisting in deriving cluster masses from X-ray temperature measurements.

  5. Research Update: Spatially resolved mapping of electronic structure on atomic level by multivariate statistical analysis

    NASA Astrophysics Data System (ADS)

    Belianinov, Alex; Ganesh, Panchapakesan; Lin, Wenzhi; Sales, Brian C.; Sefat, Athena S.; Jesse, Stephen; Pan, Minghu; Kalinin, Sergei V.

    2014-12-01

    Atomic level spatial variability of electronic structure in Fe-based superconductor FeTe0.55Se0.45 (Tc = 15 K) is explored using current-imaging tunneling-spectroscopy. Multivariate statistical analysis of the data differentiates regions of dissimilar electronic behavior that can be identified with the segregation of chalcogen atoms, as well as boundaries between terminations and near neighbor interactions. Subsequent clustering analysis allows identification of the spatial localization of these dissimilar regions. Similar statistical analysis of modeled calculated density of states of chemically inhomogeneous FeTe1-xSex structures further confirms that the two types of chalcogens, i.e., Te and Se, can be identified by their electronic signature and differentiated by their local chemical environment. This approach allows detailed chemical discrimination of the scanning tunneling microscopy data including separation of atomic identities, proximity, and local configuration effects and can be universally applicable to chemically and electronically inhomogeneous surfaces.

  6. Light clusters and pasta phases in warm and dense nuclear matter

    NASA Astrophysics Data System (ADS)

    Avancini, Sidney S.; Ferreira, Márcio; Pais, Helena; Providência, Constança; Röpke, Gerd

    2017-04-01

    The pasta phases are calculated for warm stellar matter in a framework of relativistic mean-field models, including the possibility of light cluster formation. Results from three different semiclassical approaches are compared with a quantum statistical calculation. Light clusters are considered as point-like particles, and their abundances are determined from the minimization of the free energy. The couplings of the light clusters to mesons are determined from experimental chemical equilibrium constants and many-body quantum statistical calculations. The effect of these light clusters on the chemical potentials is also discussed. It is shown that, by including heavy clusters, light clusters are present up to larger nucleonic densities, although with smaller mass fractions.

  7. Major cluster mergers and the location of the brightest cluster galaxy

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Martel, Hugo; Robichaud, Fidèle; Barai, Paramita, E-mail: Hugo.Martel@phy.ulaval.ca

    Using a large N-body cosmological simulation combined with a subgrid treatment of galaxy formation, merging, and tidal destruction, we study the formation and evolution of the galaxy and cluster population in a comoving volume (100 Mpc){sup 3} in a ΛCDM universe. At z = 0, our computational volume contains 1788 clusters with mass M {sub cl} > 1.1 × 10{sup 12} M {sub ☉}, including 18 massive clusters with M {sub cl} > 10{sup 14} M {sub ☉}. It also contains 1, 088, 797 galaxies with mass M {sub gal} ≥ 2 × 10{sup 9} M {sub ☉} and luminositymore » L > 9.5 × 10{sup 5} L {sub ☉}. For each cluster, we identified the brightest cluster galaxy (BCG). We then computed two separate statistics: the fraction f {sub BNC} of clusters in which the BCG is not the closest galaxy to the center of the cluster in projection, and the ratio Δv/σ, where Δv is the difference in radial velocity between the BCG and the whole cluster and σ is the radial velocity dispersion of the cluster. We found that f {sub BNC} increases from 0.05 for low-mass clusters (M {sub cl} ∼ 10{sup 12} M {sub ☉}) to 0.5 for high-mass clusters (M {sub cl} > 10{sup 14} M {sub ☉}) with very little dependence on cluster redshift. Most of this result turns out to be a projection effect and when we consider three-dimensional distances instead of projected distances, f {sub BNC} increases only to 0.2 at high-cluster mass. The values of Δv/σ vary from 0 to 1.8, with median values in the range 0.03-0.15 when considering all clusters, and 0.12-0.31 when considering only massive clusters. These results are consistent with previous observational studies and indicate that the central galaxy paradigm, which states that the BCG should be at rest at the center of the cluster, is usually valid, but exceptions are too common to be ignored. We built merger trees for the 18 most massive clusters in the simulation. Analysis of these trees reveal that 16 of these clusters have experienced 1 or several major or semi-major mergers in the past. These mergers leave each cluster in a non-equilibrium state, but eventually the cluster settles into an equilibrium configuration, unless it is disturbed by another major or semi-major merger. We found evidence that these mergers are responsible for the off-center positions and peculiar velocities of some BCGs. Our results thus support the merging-group scenario, in which some clusters form by the merging of smaller groups in which the galaxies have already formed, including the galaxy destined to become the BCG. Finally, we argue that f {sub BNC} is not a very robust statistics, as it is very sensitive to projection and selection effects, but that Δv/σ is more robust. Still, both statistics exhibit a signature of major mergers between clusters of galaxies.« less

  8. The use of principal component and cluster analysis to differentiate banana peel flours based on their starch and dietary fibre components.

    PubMed

    Ramli, Saifullah; Ismail, Noryati; Alkarkhi, Abbas Fadhl Mubarek; Easa, Azhar Mat

    2010-08-01

    Banana peel flour (BPF) prepared from green or ripe Cavendish and Dream banana fruits were assessed for their total starch (TS), digestible starch (DS), resistant starch (RS), total dietary fibre (TDF), soluble dietary fibre (SDF) and insoluble dietary fibre (IDF). Principal component analysis (PCA) identified that only 1 component was responsible for 93.74% of the total variance in the starch and dietary fibre components that differentiated ripe and green banana flours. Cluster analysis (CA) applied to similar data obtained two statistically significant clusters (green and ripe bananas) to indicate difference in behaviours according to the stages of ripeness based on starch and dietary fibre components. We concluded that the starch and dietary fibre components could be used to discriminate between flours prepared from peels obtained from fruits of different ripeness. The results were also suggestive of the potential of green and ripe BPF as functional ingredients in food.

  9. The Use of Principal Component and Cluster Analysis to Differentiate Banana Peel Flours Based on Their Starch and Dietary Fibre Components

    PubMed Central

    Ramli, Saifullah; Ismail, Noryati; Alkarkhi, Abbas Fadhl Mubarek; Easa, Azhar Mat

    2010-01-01

    Banana peel flour (BPF) prepared from green or ripe Cavendish and Dream banana fruits were assessed for their total starch (TS), digestible starch (DS), resistant starch (RS), total dietary fibre (TDF), soluble dietary fibre (SDF) and insoluble dietary fibre (IDF). Principal component analysis (PCA) identified that only 1 component was responsible for 93.74% of the total variance in the starch and dietary fibre components that differentiated ripe and green banana flours. Cluster analysis (CA) applied to similar data obtained two statistically significant clusters (green and ripe bananas) to indicate difference in behaviours according to the stages of ripeness based on starch and dietary fibre components. We concluded that the starch and dietary fibre components could be used to discriminate between flours prepared from peels obtained from fruits of different ripeness. The results were also suggestive of the potential of green and ripe BPF as functional ingredients in food. PMID:24575193

  10. Transcriptome profiling analysis reveals biomarkers in colon cancer samples of various differentiation

    PubMed Central

    Yu, Tonghu; Zhang, Huaping; Qi, Hong

    2018-01-01

    The aim of the present study was to investigate more colon cancer-related genes in different stages. Gene expression profile E-GEOD-62932 was extracted for differentially expressed gene (DEG) screening. Series test of cluster analysis was used to obtain significant trending models. Based on the Gene Ontology and Kyoto Encyclopedia of Genes and Genomes databases, functional and pathway enrichment analysis were processed and a pathway relation network was constructed. Gene co-expression network and gene signal network were constructed for common DEGs. The DEGs with the same trend were clustered and in total, 16 clusters with statistical significance were obtained. The screened DEGs were enriched into small molecule metabolic process and metabolic pathways. The pathway relation network was constructed with 57 nodes. A total of 328 common DEGs were obtained. Gene signal network was constructed with 71 nodes. Gene co-expression network was constructed with 161 nodes and 211 edges. ABCD3, CPT2, AGL and JAM2 are potential biomarkers for the diagnosis of colon cancer. PMID:29928385

  11. The median hazard ratio: a useful measure of variance and general contextual effects in multilevel survival analysis

    PubMed Central

    Wagner, Philippe; Merlo, Juan

    2016-01-01

    Multilevel data occurs frequently in many research areas like health services research and epidemiology. A suitable way to analyze such data is through the use of multilevel regression models (MLRM). MLRM incorporate cluster‐specific random effects which allow one to partition the total individual variance into between‐cluster variation and between‐individual variation. Statistically, MLRM account for the dependency of the data within clusters and provide correct estimates of uncertainty around regression coefficients. Substantively, the magnitude of the effect of clustering provides a measure of the General Contextual Effect (GCE). When outcomes are binary, the GCE can also be quantified by measures of heterogeneity like the Median Odds Ratio (MOR) calculated from a multilevel logistic regression model. Time‐to‐event outcomes within a multilevel structure occur commonly in epidemiological and medical research. However, the Median Hazard Ratio (MHR) that corresponds to the MOR in multilevel (i.e., ‘frailty’) Cox proportional hazards regression is rarely used. Analogously to the MOR, the MHR is the median relative change in the hazard of the occurrence of the outcome when comparing identical subjects from two randomly selected different clusters that are ordered by risk. We illustrate the application and interpretation of the MHR in a case study analyzing the hazard of mortality in patients hospitalized for acute myocardial infarction at hospitals in Ontario, Canada. We provide R code for computing the MHR. The MHR is a useful and intuitive measure for expressing cluster heterogeneity in the outcome and, thereby, estimating general contextual effects in multilevel survival analysis. © 2016 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd. PMID:27885709

  12. Multivariate Statistical Analysis of Cigarette Design Feature Influence on ISO TNCO Yields.

    PubMed

    Agnew-Heard, Kimberly A; Lancaster, Vicki A; Bravo, Roberto; Watson, Clifford; Walters, Matthew J; Holman, Matthew R

    2016-06-20

    The aim of this study is to explore how differences in cigarette physical design parameters influence tar, nicotine, and carbon monoxide (TNCO) yields in mainstream smoke (MSS) using the International Organization of Standardization (ISO) smoking regimen. Standardized smoking methods were used to evaluate 50 U.S. domestic brand cigarettes and a reference cigarette representing a range of TNCO yields in MSS collected from linear smoking machines using a nonintense smoking regimen. Multivariate statistical methods were used to form clusters of cigarettes based on their ISO TNCO yields and then to explore the relationship between the ISO generated TNCO yields and the nine cigarette physical design parameters between and within each cluster simultaneously. The ISO generated TNCO yields in MSS are 1.1-17.0 mg tar/cigarette, 0.1-2.2 mg nicotine/cigarette, and 1.6-17.3 mg CO/cigarette. Cluster analysis divided the 51 cigarettes into five discrete clusters based on their ISO TNCO yields. No one physical parameter dominated across all clusters. Predicting ISO machine generated TNCO yields based on these nine physical design parameters is complex due to the correlation among and between the nine physical design parameters and TNCO yields. From these analyses, it is estimated that approximately 20% of the variability in the ISO generated TNCO yields comes from other parameters (e.g., filter material, filter type, inclusion of expanded or reconstituted tobacco, and tobacco blend composition, along with differences in tobacco leaf origin and stalk positions and added ingredients). A future article will examine the influence of these physical design parameters on TNCO yields under a Canadian Intense (CI) smoking regimen. Together, these papers will provide a more robust picture of the design features that contribute to TNCO exposure across the range of real world smoking patterns.

  13. The spatio-temporal mapping of epileptic networks: Combination of EEG–fMRI and EEG source imaging

    PubMed Central

    Vulliemoz, S.; Thornton, R.; Rodionov, R.; Carmichael, D.W.; Guye, M.; Lhatoo, S.; McEvoy, A.W.; Spinelli, L.; Michel, C.M.; Duncan, J.S.; Lemieux, L.

    2009-01-01

    Simultaneous EEG–fMRI acquisitions in patients with epilepsy often reveal distributed patterns of Blood Oxygen Level Dependant (BOLD) change correlated with epileptiform discharges. We investigated if electrical source imaging (ESI) performed on the interictal epileptiform discharges (IED) acquired during fMRI acquisition could be used to study the dynamics of the networks identified by the BOLD effect, thereby avoiding the limitations of combining results from separate recordings. Nine selected patients (13 IED types identified) with focal epilepsy underwent EEG–fMRI. Statistical analysis was performed using SPM5 to create BOLD maps. ESI was performed on the IED recorded during fMRI acquisition using a realistic head model (SMAC) and a distributed linear inverse solution (LAURA). ESI could not be performed in one case. In 10/12 remaining studies, ESI at IED onset (ESIo) was anatomically close to one BOLD cluster. Interestingly, ESIo was closest to the positive BOLD cluster with maximal statistical significance in only 4/12 cases and closest to negative BOLD responses in 4/12 cases. Very small BOLD clusters could also have clinical relevance in some cases. ESI at later time frame (ESIp) showed propagation to remote sources co-localised with other BOLD clusters in half of cases. In concordant cases, the distance between maxima of ESI and the closest EEG–fMRI cluster was less than 33 mm, in agreement with previous studies. We conclude that simultaneous ESI and EEG–fMRI analysis may be able to distinguish areas of BOLD response related to initiation of IED from propagation areas. This combination provides new opportunities for investigating epileptic networks. PMID:19408351

  14. Spatial patterns in vegetation fires in the Indian region.

    PubMed

    Vadrevu, Krishna Prasad; Badarinath, K V S; Anuradha, Eaturu

    2008-12-01

    In this study, we used fire count datasets derived from Along Track Scanning Radiometer (ATSR) satellite to characterize spatial patterns in fire occurrences across highly diverse geographical, vegetation and topographic gradients in the Indian region. For characterizing the spatial patterns of fire occurrences, observed fire point patterns were tested against the hypothesis of a complete spatial random (CSR) pattern using three different techniques, the quadrat analysis, nearest neighbor analysis and Ripley's K function. Hierarchical nearest neighboring technique was used to depict the 'hotspots' of fire incidents. Of the different states, highest fire counts were recorded in Madhya Pradesh (14.77%) followed by Gujarat (10.86%), Maharastra (9.92%), Mizoram (7.66%), Jharkhand (6.41%), etc. With respect to the vegetation categories, highest number of fires were recorded in agricultural regions (40.26%) followed by tropical moist deciduous vegetation (12.72), dry deciduous vegetation (11.40%), abandoned slash and burn secondary forests (9.04%), tropical montane forests (8.07%) followed by others. Analysis of fire counts based on elevation and slope range suggested that maximum number of fires occurred in low and medium elevation types and in very low to low-slope categories. Results from three different spatial techniques for spatial pattern suggested clustered pattern in fire events compared to CSR. Most importantly, results from Ripley's K statistic suggested that fire events are highly clustered at a lag-distance of 125 miles. Hierarchical nearest neighboring clustering technique identified significant clusters of fire 'hotspots' in different states in northeast and central India. The implications of these results in fire management and mitigation were discussed. Also, this study highlights the potential of spatial point pattern statistics in environmental monitoring and assessment studies with special reference to fire events in the Indian region.

  15. Micronutrients and macronutrients and parameters of antioxidative ability in saliva of women: inhabitants of Krakow (Poland) in the course of uncomplicated singleton pregnancy.

    PubMed

    Kolarzyk, Emilia; Pietrzycka, Agata; Stepniewski, Marek; Łyszczarz, Justyna; Mendyk, Aleksander; Ostachowska-Gasior, Agnieszka

    2006-01-01

    There were two aims of the present study: (1) to evaluate changes of superoxide dismutase (SOD) activity and total antioxidative status measured as the ferric reducing ability of plasma (FRAP) concentration in saliva of pregnant women during the first, second, and third trimesters of singleton uncomplicated pregnancy and (2) to assess possible relations among SOD, FRAP, and intake of macronutrients and micronutrients in daily nutritional rations (DNRs) during pregnancy. Forty pregnant women aged 27.1+/-5.4 yr (examined group) and 40 healthy women (the control group) were recruited for this study. The relationship between FRAP and SOD in saliva and the intake of macronutrients (proteins, carbohydrates, total fat, saturated fatty acid, monounsaturated fatty acids, polyunsaturated fatty acids), minerals (Na, K, Ca, P, Mg, Fe, Zn, Cu, Mn), and vitamins (A, C, E, B1, B2, B6, PP) in DNR was evaluated by clustering analysis with Ward's grouping method. During pregnancy, FRAP and SOD values were lower than in the controls, but only for FRAP were the differences statistically significant (p < 0.001). For the whole pregnancy period, cluster analysis identified two major clusters for which the differentiating variables were SOD and retinoids intake, but different patterns for each trimester of pregnancy were revealed. The following were concluded: (1) FRAP values were the lowest in the second trimester. It suggests that in this trimester of uncomplicated pregnancy, women might be not fully adapted to increased demands for antioxidative mechanisms. (2) Cluster analysis showed that there were no statistical relationships between the intake of micronutrients and macronutrients in DNRs and the SOD or FRAP level in the saliva of pregnant women.

  16. Clustering change patterns using Fourier transformation with time-course gene expression data.

    PubMed

    Kim, Jaehee

    2011-01-01

    To understand the behavior of genes, it is important to explore how the patterns of gene expression change over a period of time because biologically related gene groups can share the same change patterns. In this study, the problem of finding similar change patterns is induced to clustering with the derivative Fourier coefficients. This work is aimed at discovering gene groups with similar change patterns which share similar biological properties. We developed a statistical model using derivative Fourier coefficients to identify similar change patterns of gene expression. We used a model-based method to cluster the Fourier series estimation of derivatives. We applied our model to cluster change patterns of yeast cell cycle microarray expression data with alpha-factor synchronization. It showed that, as the method clusters with the probability-neighboring data, the model-based clustering with our proposed model yielded biologically interpretable results. We expect that our proposed Fourier analysis with suitably chosen smoothing parameters could serve as a useful tool in classifying genes and interpreting possible biological change patterns.

  17. Wavelet methodology to improve single unit isolation in primary motor cortex cells

    PubMed Central

    Ortiz-Rosario, Alexis; Adeli, Hojjat; Buford, John A.

    2016-01-01

    The proper isolation of action potentials recorded extracellularly from neural tissue is an active area of research in the fields of neuroscience and biomedical signal processing. This paper presents an isolation methodology for neural recordings using the wavelet transform (WT), a statistical thresholding scheme, and the principal component analysis (PCA) algorithm. The effectiveness of five different mother wavelets was investigated: biorthogonal, Daubachies, discrete Meyer, symmetric, and Coifman; along with three different wavelet coefficient thresholding schemes: fixed form threshold, Stein’s unbiased estimate of risk, and minimax; and two different thresholding rules: soft and hard thresholding. The signal quality was evaluated using three different statistical measures: mean-squared error, root-mean squared, and signal to noise ratio. The clustering quality was evaluated using two different statistical measures: isolation distance, and L-ratio. This research shows that the selection of the mother wavelet has a strong influence on the clustering and isolation of single unit neural activity, with the Daubachies 4 wavelet and minimax thresholding scheme performing the best. PMID:25794461

  18. Statistical Clustering and the Contents of the Infant Vocabulary

    ERIC Educational Resources Information Center

    Swingley, Daniel

    2005-01-01

    Infants parse speech into word-sized units according to biases that develop in the first year. One bias, present before the age of 7 months, is to cluster syllables that tend to co-occur. The present computational research demonstrates that this statistical clustering bias could lead to the extraction of speech sequences that are actual words,…

  19. DAFi: A directed recursive data filtering and clustering approach for improving and interpreting data clustering identification of cell populations from polychromatic flow cytometry data.

    PubMed

    Lee, Alexandra J; Chang, Ivan; Burel, Julie G; Lindestam Arlehamn, Cecilia S; Mandava, Aishwarya; Weiskopf, Daniela; Peters, Bjoern; Sette, Alessandro; Scheuermann, Richard H; Qian, Yu

    2018-04-17

    Computational methods for identification of cell populations from polychromatic flow cytometry data are changing the paradigm of cytometry bioinformatics. Data clustering is the most common computational approach to unsupervised identification of cell populations from multidimensional cytometry data. However, interpretation of the identified data clusters is labor-intensive. Certain types of user-defined cell populations are also difficult to identify by fully automated data clustering analysis. Both are roadblocks before a cytometry lab can adopt the data clustering approach for cell population identification in routine use. We found that combining recursive data filtering and clustering with constraints converted from the user manual gating strategy can effectively address these two issues. We named this new approach DAFi: Directed Automated Filtering and Identification of cell populations. Design of DAFi preserves the data-driven characteristics of unsupervised clustering for identifying novel cell subsets, but also makes the results interpretable to experimental scientists through mapping and merging the multidimensional data clusters into the user-defined two-dimensional gating hierarchy. The recursive data filtering process in DAFi helped identify small data clusters which are otherwise difficult to resolve by a single run of the data clustering method due to the statistical interference of the irrelevant major clusters. Our experiment results showed that the proportions of the cell populations identified by DAFi, while being consistent with those by expert centralized manual gating, have smaller technical variances across samples than those from individual manual gating analysis and the nonrecursive data clustering analysis. Compared with manual gating segregation, DAFi-identified cell populations avoided the abrupt cut-offs on the boundaries. DAFi has been implemented to be used with multiple data clustering methods including K-means, FLOCK, FlowSOM, and the ClusterR package. For cell population identification, DAFi supports multiple options including clustering, bisecting, slope-based gating, and reversed filtering to meet various autogating needs from different scientific use cases. © 2018 International Society for Advancement of Cytometry. © 2018 International Society for Advancement of Cytometry.

  20. Applications of Stochastic Analyses for Collaborative Learning and Cognitive Assessment

    DTIC Science & Technology

    2007-04-01

    models (Visser, Maartje, Raijmakers, & Molenaar , 2002). The second part of this paper illustrates two applications of the methods described in the...clustering three-way data sets. Computational Statistics and Data Analysis, 51 (11), 5368–5376. Visser, I., Maartje, E., Raijmakers, E. J., & Molenaar

  1. Exploring relationships between Dairy Herd Improvement monitors of performance and the Transition Cow Index in Wisconsin dairy herds.

    PubMed

    Schultz, K K; Bennett, T B; Nordlund, K V; Döpfer, D; Cook, N B

    2016-09-01

    Transition cow management has been tracked via the Transition Cow Index (TCI; AgSource Cooperative Services, Verona, WI) since 2006. Transition Cow Index was developed to measure the difference between actual and predicted milk yield at first test day to evaluate the relative success of the transition period program. This project aimed to assess TCI in relation to all commonly used Dairy Herd Improvement (DHI) metrics available through AgSource Cooperative Services. Regression analysis was used to isolate variables that were relevant to TCI, and then principal components analysis and network analysis were used to determine the relative strength and relatedness among variables. Finally, cluster analysis was used to segregate herds based on similarity of relevant variables. The DHI data were obtained from 2,131 Wisconsin dairy herds with test-day mean ≥30 cows, which were tested ≥10 times throughout the 2014 calendar year. The original list of 940 DHI variables was reduced through expert-driven selection and regression analysis to 23 variables. The K-means cluster analysis produced 5 distinct clusters. Descriptive statistics were calculated for the 23 variables per cluster grouping. Using principal components analysis, cluster analysis, and network analysis, 4 parameters were isolated as most relevant to TCI; these were energy-corrected milk, 3 measures of intramammary infection (dry cow cure rate, linear somatic cell count score in primiparous cows, and new infection rate), peak ratio, and days in milk at peak milk production. These variables together with cow and newborn calf survival measures form a group of metrics that can be used to assist in the evaluation of overall transition period performance. Copyright © 2016 American Dairy Science Association. Published by Elsevier Inc. All rights reserved.

  2. Astrostatistical Analysis in Solar and Stellar Physics

    NASA Astrophysics Data System (ADS)

    Stenning, David Craig

    This dissertation focuses on developing statistical models and methods to address data-analytic challenges in astrostatistics---a growing interdisciplinary field fostering collaborations between statisticians and astrophysicists. The astrostatistics projects we tackle can be divided into two main categories: modeling solar activity and Bayesian analysis of stellar evolution. These categories from Part I and Part II of this dissertation, respectively. The first line of research we pursue involves classification and modeling of evolving solar features. Advances in space-based observatories are increasing both the quality and quantity of solar data, primarily in the form of high-resolution images. To analyze massive streams of solar image data, we develop a science-driven dimension reduction methodology to extract scientifically meaningful features from images. This methodology utilizes mathematical morphology to produce a concise numerical summary of the magnetic flux distribution in solar "active regions'' that (i) is far easier to work with than the source images, (ii) encapsulates scientifically relevant information in a more informative manner than existing schemes (i.e., manual classification schemes), and (iii) is amenable to sophisticated statistical analyses. In a related line of research, we perform a Bayesian analysis of the solar cycle using multiple proxy variables, such as sunspot numbers. We take advantage of patterns and correlations among the proxy variables to model solar activity using data from proxies that have become available more recently, while also taking advantage of the long history of observations of sunspot numbers. This model is an extension of the Yu et al. (2012) Bayesian hierarchical model for the solar cycle that used the sunspot numbers alone. Since proxies have different temporal coverage, we devise a multiple imputation scheme to account for missing data. We find that incorporating multiple proxies reveals important features of the solar cycle that are missed when the model is fit using only the sunspot numbers. In Part II of this dissertation we focus on two related lines of research involving Bayesian analysis of stellar evolution. We first focus on modeling multiple stellar populations in star clusters. It has long been assumed that all star clusters are comprised of single stellar populations---stars that formed at roughly the same time from a common molecular cloud. However, recent studies have produced evidence that some clusters host multiple populations, which has far-reaching scientific implications. We develop a Bayesian hierarchical model for multiple-population star clusters, extending earlier statistical models of stellar evolution (e.g., van Dyk et al. 2009, Stein et al. 2013). We also devise an adaptive Markov chain Monte Carlo algorithm to explore the complex posterior distribution. We use numerical studies to demonstrate that our method can recover parameters of multiple-population clusters, and also show how model misspecification can be diagnosed. Our model and computational tools are incorporated into an open-source software suite known as BASE-9. We also explore statistical properties of the estimators and determine that the influence of the prior distribution does not diminish with larger sample sizes, leading to non-standard asymptotics. In a final line of research, we present the first-ever attempt to estimate the carbon fraction of white dwarfs. This quantity has important implications for both astrophysics and fundamental nuclear physics, but is currently unknown. We use a numerical study to demonstrate that assuming an incorrect value for the carbon fraction leads to incorrect white-dwarf ages of star clusters. Finally, we present our attempt to estimate the carbon fraction of the white dwarfs in the well-studied star cluster 47 Tucanae.

  3. The Technical and Biological Reproducibility of Matrix-Assisted Laser Desorption Ionization-Time of Flight Mass Spectrometry (MALDI-TOF MS) Based Typing: Employment of Bioinformatics in a Multicenter Study

    PubMed Central

    Oberle, Michael; Wohlwend, Nadia; Jonas, Daniel; Maurer, Florian P.; Jost, Geraldine; Tschudin-Sutter, Sarah; Vranckx, Katleen; Egli, Adrian

    2016-01-01

    Background The technical, biological, and inter-center reproducibility of matrix-assisted laser desorption ionization-time of flight mass spectrometry (MALDI TOF MS) typing data has not yet been explored. The aim of this study is to compare typing data from multiple centers employing bioinformatics using bacterial strains from two past outbreaks and non-related strains. Material/Methods Participants received twelve extended spectrum betalactamase-producing E. coli isolates and followed the same standard operating procedure (SOP) including a full-protein extraction protocol. All laboratories provided visually read spectra via flexAnalysis (Bruker, Germany). Raw data from each laboratory allowed calculating the technical and biological reproducibility between centers using BioNumerics (Applied Maths NV, Belgium). Results Technical and biological reproducibility ranged between 96.8–99.4% and 47.6–94.4%, respectively. The inter-center reproducibility showed a comparable clustering among identical isolates. Principal component analysis indicated a higher tendency to cluster within the same center. Therefore, we used a discriminant analysis, which completely separated the clusters. Next, we defined a reference center and performed a statistical analysis to identify specific peaks to identify the outbreak clusters. Finally, we used a classifier algorithm and a linear support vector machine on the determined peaks as classifier. A validation showed that within the set of the reference center, the identification of the cluster was 100% correct with a large contrast between the score with the correct cluster and the next best scoring cluster. Conclusions Based on the sufficient technical and biological reproducibility of MALDI-TOF MS based spectra, detection of specific clusters is possible from spectra obtained from different centers. However, we believe that a shared SOP and a bioinformatics approach are required to make the analysis robust and reliable. PMID:27798637

  4. Analyzing Protein Clusters on the Plasma Membrane: Application of Spatial Statistical Analysis Methods on Super-Resolution Microscopy Images.

    PubMed

    Paparelli, Laura; Corthout, Nikky; Pavie, Benjamin; Annaert, Wim; Munck, Sebastian

    2016-01-01

    The spatial distribution of proteins within the cell affects their capability to interact with other molecules and directly influences cellular processes and signaling. At the plasma membrane, multiple factors drive protein compartmentalization into specialized functional domains, leading to the formation of clusters in which intermolecule interactions are facilitated. Therefore, quantifying protein distributions is a necessity for understanding their regulation and function. The recent advent of super-resolution microscopy has opened up the possibility of imaging protein distributions at the nanometer scale. In parallel, new spatial analysis methods have been developed to quantify distribution patterns in super-resolution images. In this chapter, we provide an overview of super-resolution microscopy and summarize the factors influencing protein arrangements on the plasma membrane. Finally, we highlight methods for analyzing clusterization of plasma membrane proteins, including examples of their applications.

  5. Typhoid fever acquired in the United States, 1999–2010: epidemiology, microbiology, and use of a space–time scan statistic for outbreak detection

    PubMed Central

    IMANISHI, M.; NEWTON, A. E.; VIEIRA, A. R.; GONZALEZ-AVILES, G.; KENDALL SCOTT, M. E.; MANIKONDA, K.; MAXWELL, T. N.; HALPIN, J. L.; FREEMAN, M. M.; MEDALLA, F.; AYERS, T. L.; DERADO, G.; MAHON, B. E.; MINTZ, E. D.

    2016-01-01

    SUMMARY Although rare, typhoid fever cases acquired in the United States continue to be reported. Detection and investigation of outbreaks in these domestically acquired cases offer opportunities to identify chronic carriers. We searched surveillance and laboratory databases for domestically acquired typhoid fever cases, used a space–time scan statistic to identify clusters, and classified clusters as outbreaks or non-outbreaks. From 1999 to 2010, domestically acquired cases accounted for 18% of 3373 reported typhoid fever cases; their isolates were less often multidrug-resistant (2% vs. 15%) compared to isolates from travel-associated cases. We identified 28 outbreaks and two possible outbreaks within 45 space–time clusters of ⩾2 domestically acquired cases, including three outbreaks involving ⩾2 molecular subtypes. The approach detected seven of the ten outbreaks published in the literature or reported to CDC. Although this approach did not definitively identify any previously unrecognized outbreaks, it showed the potential to detect outbreaks of typhoid fever that may escape detection by routine analysis of surveillance data. Sixteen outbreaks had been linked to a carrier. Every case of typhoid fever acquired in a non-endemic country warrants thorough investigation. Space–time scan statistics, together with shoe-leather epidemiology and molecular subtyping, may improve outbreak detection. PMID:25427666

  6. Typhoid fever acquired in the United States, 1999-2010: epidemiology, microbiology, and use of a space-time scan statistic for outbreak detection.

    PubMed

    Imanishi, M; Newton, A E; Vieira, A R; Gonzalez-Aviles, G; Kendall Scott, M E; Manikonda, K; Maxwell, T N; Halpin, J L; Freeman, M M; Medalla, F; Ayers, T L; Derado, G; Mahon, B E; Mintz, E D

    2015-08-01

    Although rare, typhoid fever cases acquired in the United States continue to be reported. Detection and investigation of outbreaks in these domestically acquired cases offer opportunities to identify chronic carriers. We searched surveillance and laboratory databases for domestically acquired typhoid fever cases, used a space-time scan statistic to identify clusters, and classified clusters as outbreaks or non-outbreaks. From 1999 to 2010, domestically acquired cases accounted for 18% of 3373 reported typhoid fever cases; their isolates were less often multidrug-resistant (2% vs. 15%) compared to isolates from travel-associated cases. We identified 28 outbreaks and two possible outbreaks within 45 space-time clusters of ⩾2 domestically acquired cases, including three outbreaks involving ⩾2 molecular subtypes. The approach detected seven of the ten outbreaks published in the literature or reported to CDC. Although this approach did not definitively identify any previously unrecognized outbreaks, it showed the potential to detect outbreaks of typhoid fever that may escape detection by routine analysis of surveillance data. Sixteen outbreaks had been linked to a carrier. Every case of typhoid fever acquired in a non-endemic country warrants thorough investigation. Space-time scan statistics, together with shoe-leather epidemiology and molecular subtyping, may improve outbreak detection.

  7. Relating N2O emissions during biological nitrogen removal with operating conditions using multivariate statistical techniques.

    PubMed

    Vasilaki, V; Volcke, E I P; Nandi, A K; van Loosdrecht, M C M; Katsou, E

    2018-04-26

    Multivariate statistical analysis was applied to investigate the dependencies and underlying patterns between N 2 O emissions and online operational variables (dissolved oxygen and nitrogen component concentrations, temperature and influent flow-rate) during biological nitrogen removal from wastewater. The system under study was a full-scale reactor, for which hourly sensor data were available. The 15-month long monitoring campaign was divided into 10 sub-periods based on the profile of N 2 O emissions, using Binary Segmentation. The dependencies between operating variables and N 2 O emissions fluctuated according to Spearman's rank correlation. The correlation between N 2 O emissions and nitrite concentrations ranged between 0.51 and 0.78. Correlation >0.7 between N 2 O emissions and nitrate concentrations was observed at sub-periods with average temperature lower than 12 °C. Hierarchical k-means clustering and principal component analysis linked N 2 O emission peaks with precipitation events and ammonium concentrations higher than 2 mg/L, especially in sub-periods characterized by low N 2 O fluxes. Additionally, the highest ranges of measured N 2 O fluxes belonged to clusters corresponding with NO 3 -N concentration less than 1 mg/L in the upstream plug-flow reactor (middle of oxic zone), indicating slow nitrification rates. The results showed that the range of N 2 O emissions partially depends on the prior behavior of the system. The principal component analysis validated the findings from the clustering analysis and showed that ammonium, nitrate, nitrite and temperature explained a considerable percentage of the variance in the system for the majority of the sub-periods. The applied statistical methods, linked the different ranges of emissions with the system variables, provided insights on the effect of operating conditions on N 2 O emissions in each sub-period and can be integrated into N 2 O emissions data processing at wastewater treatment plants. Copyright © 2018. Published by Elsevier Ltd.

  8. SEARCHING FOR BULK MOTIONS IN THE INTRACLUSTER MEDIUM OF MASSIVE, MERGING CLUSTERS WITH CHANDRA CCD DATA

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Liu, Ang; Yu, Heng; Tozzi, Paolo

    2016-04-10

    We search for bulk motions in the intracluster medium (ICM) of massive clusters showing evidence of an ongoing or recent major merger with spatially resolved spectroscopy in Chandra CCD data. We identify a sample of six merging clusters with >150 ks Chandra exposure in the redshift range 0.1 < z < 0.3. By performing X-ray spectral analysis of projected ICM regions selected according to their surface brightness, we obtain the projected redshift maps for all of these clusters. After performing a robust analysis of the statistical and systematic uncertainties in the measured X-ray redshift z{sub X}, we check whether or not themore » global z{sub X} distribution differs from that expected when the ICM is at rest. We find evidence of significant bulk motions at more than 3σ in A2142 and A115, and less than 2σ in A2034 and A520. Focusing on single regions, we identify significant localized velocity differences in all of the merger clusters. We also perform the same analysis on two relaxed clusters with no signatures of recent mergers, finding no signs of bulk motions, as expected. Our results indicate that deep Chandra CCD data enable us to identify the presence of bulk motions at the level of v{sub BM} > 1000 km s{sup −1} in the ICM of massive merging clusters at 0.1 < z < 0.3. Although the CCD spectral resolution is not sufficient for a detailed analysis of the ICM dynamics, Chandra CCD data constitute a key diagnostic tool complementing X-ray bolometers on board future X-ray missions.« less

  9. A Framework for Establishing Standard Reference Scale of Texture by Multivariate Statistical Analysis Based on Instrumental Measurement and Sensory Evaluation.

    PubMed

    Zhi, Ruicong; Zhao, Lei; Xie, Nan; Wang, Houyin; Shi, Bolin; Shi, Jingye

    2016-01-13

    A framework of establishing standard reference scale (texture) is proposed by multivariate statistical analysis according to instrumental measurement and sensory evaluation. Multivariate statistical analysis is conducted to rapidly select typical reference samples with characteristics of universality, representativeness, stability, substitutability, and traceability. The reasonableness of the framework method is verified by establishing standard reference scale of texture attribute (hardness) with Chinese well-known food. More than 100 food products in 16 categories were tested using instrumental measurement (TPA test), and the result was analyzed with clustering analysis, principal component analysis, relative standard deviation, and analysis of variance. As a result, nine kinds of foods were determined to construct the hardness standard reference scale. The results indicate that the regression coefficient between the estimated sensory value and the instrumentally measured value is significant (R(2) = 0.9765), which fits well with Stevens's theory. The research provides reliable a theoretical basis and practical guide for quantitative standard reference scale establishment on food texture characteristics.

  10. The X-ray cluster survey with eRosita: forecasts for cosmology, cluster physics and primordial non-Gaussianity

    NASA Astrophysics Data System (ADS)

    Pillepich, Annalisa; Porciani, Cristiano; Reiprich, Thomas H.

    2012-05-01

    Starting in late 2013, the eRosita telescope will survey the X-ray sky with unprecedented sensitivity. Assuming a detection limit of 50 photons in the (0.5-2.0) keV energy band with a typical exposure time of 1.6 ks, we predict that eRosita will detect ˜9.3 × 104 clusters of galaxies more massive than 5 × 1013 h-1 M⊙, with the currently planned all-sky survey. Their median redshift will be z≃ 0.35. We perform a Fisher-matrix analysis to forecast the constraining power of ? on the Λ cold dark matter (ΛCDM) cosmology and, simultaneously, on the X-ray scaling relations for galaxy clusters. Special attention is devoted to the possibility of detecting primordial non-Gaussianity. We consider two experimental probes: the number counts and the angular clustering of a photon-count limited sample of clusters. We discuss how the cluster sample should be split to optimize the analysis and we show that redshift information of the individual clusters is vital to break the strong degeneracies among the model parameters. For example, performing a 'tomographic' analysis based on photometric-redshift estimates and combining one- and two-point statistics will give marginal 1σ errors of Δσ8≃ 0.036 and ΔΩm≃ 0.012 without priors, and improve the current estimates on the slope of the luminosity-mass relation by a factor of 3. Regarding primordial non-Gaussianity, ? clusters alone will give ΔfNL≃ 9, 36 and 144 for the local, orthogonal and equilateral model, respectively. Measuring redshifts with spectroscopic accuracy would further tighten the constraints by nearly 40 per cent (barring fNL which displays smaller improvements). Finally, combining ? data with the analysis of temperature anisotropies in the cosmic microwave background by the Planck satellite should give sensational constraints on both the cosmology and the properties of the intracluster medium.

  11. Development of an automated energy audit protocol for office buildings

    NASA Astrophysics Data System (ADS)

    Deb, Chirag

    This study aims to enhance the building energy audit process, and bring about reduction in time and cost requirements in the conduction of a full physical audit. For this, a total of 5 Energy Service Companies in Singapore have collaborated and provided energy audit reports for 62 office buildings. Several statistical techniques are adopted to analyse these reports. These techniques comprise cluster analysis and development of prediction models to predict energy savings for buildings. The cluster analysis shows that there are 3 clusters of buildings experiencing different levels of energy savings. To understand the effect of building variables on the change in EUI, a robust iterative process for selecting the appropriate variables is developed. The results show that the 4 variables of GFA, non-air-conditioning energy consumption, average chiller plant efficiency and installed capacity of chillers should be taken for clustering. This analysis is extended to the development of prediction models using linear regression and artificial neural networks (ANN). An exhaustive variable selection algorithm is developed to select the input variables for the two energy saving prediction models. The results show that the ANN prediction model can predict the energy saving potential of a given building with an accuracy of +/-14.8%.

  12. [On studying the social economic aftermath of neirotrauma].

    PubMed

    Potapov, A A; Potapov, N A; Likhterman, L B

    2011-01-01

    To implement probing medical statistic studies on neiro-trauma the cluster analysis technique was applied to classify the regions of the Russian Federation. The characteristics of social climate, demographic and economic indicators and level of medical service are considered. The eleven clusters are selected and combined into four groups. Thereby, due to possible appropriate extrapolation, the epidemiologic studies concerning the prevalence of craniocerebral and backbone cerebrospinal injuries and their aftermath can be simplified and made cheaper to facilitate the assessment of the impact on economy, demography and social climate of the country.

  13. Modeling the Movement of Homicide by Type to Inform Public Health Prevention Efforts

    PubMed Central

    Grady, Sue; Pizarro, Jesenia M.; Melde, Chris

    2015-01-01

    Objectives. We modeled the spatiotemporal movement of hotspot clusters of homicide by motive in Newark, New Jersey, to investigate whether different homicide types have different patterns of clustering and movement. Methods. We obtained homicide data from the Newark Police Department Homicide Unit’s investigative files from 1997 through 2007 (n = 560). We geocoded the address at which each homicide victim was found and recorded the date of and the motive for the homicide. We used cluster detection software to model the spatiotemporal movement of statistically significant homicide clusters by motive, using census tract and month of occurrence as the spatial and temporal units of analysis. Results. Gang-motivated homicides showed evidence of clustering and diffusion through Newark. Additionally, gang-motivated homicide clusters overlapped to a degree with revenge and drug-motivated homicide clusters. Escalating dispute and nonintimate familial homicides clustered; however, there was no evidence of diffusion. Intimate partner and robbery homicides did not cluster. Conclusions. By tracking how homicide types diffuse through communities and determining which places have ongoing or emerging homicide problems by type, we can better inform the deployment of prevention and intervention efforts. PMID:26270315

  14. The Atacama Cosmology Telescope: Cosmology from Galaxy Clusters Detected Via the Sunyaev-Zel'dovich Effect

    NASA Technical Reports Server (NTRS)

    Sehgal, Neelima; Trac, Hy; Acquaviva, Viviana; Ade, Peter A. R.; Aguirre, Paula; Amiri, Mandana; Appel, John W.; Barrientos, L. Felipe; Battistelli, Elia S.; Bond, J. Richard; hide

    2010-01-01

    We present constraints on cosmological parameters based on a sample of Sunyaev-Zel'dovich-selected galaxy clusters detected in a millimeter-wave survey by the Atacama Cosmology Telescope. The cluster sample used in this analysis consists of 9 optically-confirmed high-mass clusters comprising the high-significance end of the total cluster sample identified in 455 square degrees of sky surveyed during 2008 at 148 GHz. We focus on the most massive systems to reduce the degeneracy between unknown cluster astrophysics and cosmology derived from SZ surveys. We describe the scaling relation between cluster mass and SZ signal with a 4-parameter fit. Marginalizing over the values of the parameters in this fit with conservative priors gives (sigma)8 = 0.851 +/- 0.115 and w = -1.14 +/- 0.35 for a spatially-flat wCDM cosmological model with WMAP 7-year priors on cosmological parameters. This gives a modest improvement in statistical uncertainty over WMAP 7-year constraints alone. Fixing the scaling relation between cluster mass and SZ signal to a fiducial relation obtained from numerical simulations and calibrated by X-ray observations, we find (sigma)8 + 0.821 +/- 0.044 and w = -1.05 +/- 0.20. These results are consistent with constraints from WMAP 7 plus baryon acoustic oscillations plus type Ia supernova which give (sigma)8 = 0.802 +/- 0.038 and w = -0.98 +/- 0.053. A stacking analysis of the clusters in this sample compared to clusters simulated assuming the fiducial model also shows good agreement. These results suggest that, given the sample of clusters used here, both the astrophysics of massive clusters and the cosmological parameters derived from them are broadly consistent with current models.

  15. Validating clustering of molecular dynamics simulations using polymer models.

    PubMed

    Phillips, Joshua L; Colvin, Michael E; Newsam, Shawn

    2011-11-14

    Molecular dynamics (MD) simulation is a powerful technique for sampling the meta-stable and transitional conformations of proteins and other biomolecules. Computational data clustering has emerged as a useful, automated technique for extracting conformational states from MD simulation data. Despite extensive application, relatively little work has been done to determine if the clustering algorithms are actually extracting useful information. A primary goal of this paper therefore is to provide such an understanding through a detailed analysis of data clustering applied to a series of increasingly complex biopolymer models. We develop a novel series of models using basic polymer theory that have intuitive, clearly-defined dynamics and exhibit the essential properties that we are seeking to identify in MD simulations of real biomolecules. We then apply spectral clustering, an algorithm particularly well-suited for clustering polymer structures, to our models and MD simulations of several intrinsically disordered proteins. Clustering results for the polymer models provide clear evidence that the meta-stable and transitional conformations are detected by the algorithm. The results for the polymer models also help guide the analysis of the disordered protein simulations by comparing and contrasting the statistical properties of the extracted clusters. We have developed a framework for validating the performance and utility of clustering algorithms for studying molecular biopolymer simulations that utilizes several analytic and dynamic polymer models which exhibit well-behaved dynamics including: meta-stable states, transition states, helical structures, and stochastic dynamics. We show that spectral clustering is robust to anomalies introduced by structural alignment and that different structural classes of intrinsically disordered proteins can be reliably discriminated from the clustering results. To our knowledge, our framework is the first to utilize model polymers to rigorously test the utility of clustering algorithms for studying biopolymers.

  16. Validating clustering of molecular dynamics simulations using polymer models

    PubMed Central

    2011-01-01

    Background Molecular dynamics (MD) simulation is a powerful technique for sampling the meta-stable and transitional conformations of proteins and other biomolecules. Computational data clustering has emerged as a useful, automated technique for extracting conformational states from MD simulation data. Despite extensive application, relatively little work has been done to determine if the clustering algorithms are actually extracting useful information. A primary goal of this paper therefore is to provide such an understanding through a detailed analysis of data clustering applied to a series of increasingly complex biopolymer models. Results We develop a novel series of models using basic polymer theory that have intuitive, clearly-defined dynamics and exhibit the essential properties that we are seeking to identify in MD simulations of real biomolecules. We then apply spectral clustering, an algorithm particularly well-suited for clustering polymer structures, to our models and MD simulations of several intrinsically disordered proteins. Clustering results for the polymer models provide clear evidence that the meta-stable and transitional conformations are detected by the algorithm. The results for the polymer models also help guide the analysis of the disordered protein simulations by comparing and contrasting the statistical properties of the extracted clusters. Conclusions We have developed a framework for validating the performance and utility of clustering algorithms for studying molecular biopolymer simulations that utilizes several analytic and dynamic polymer models which exhibit well-behaved dynamics including: meta-stable states, transition states, helical structures, and stochastic dynamics. We show that spectral clustering is robust to anomalies introduced by structural alignment and that different structural classes of intrinsically disordered proteins can be reliably discriminated from the clustering results. To our knowledge, our framework is the first to utilize model polymers to rigorously test the utility of clustering algorithms for studying biopolymers. PMID:22082218

  17. Identifying and Assessing Interesting Subgroups in a Heterogeneous Population

    PubMed Central

    Lee, Woojoo; Alexeyenko, Andrey; Pernemalm, Maria; Guegan, Justine; Dessen, Philippe; Lazar, Vladimir; Lehtiö, Janne; Pawitan, Yudi

    2015-01-01

    Biological heterogeneity is common in many diseases and it is often the reason for therapeutic failures. Thus, there is great interest in classifying a disease into subtypes that have clinical significance in terms of prognosis or therapy response. One of the most popular methods to uncover unrecognized subtypes is cluster analysis. However, classical clustering methods such as k-means clustering or hierarchical clustering are not guaranteed to produce clinically interesting subtypes. This could be because the main statistical variability—the basis of cluster generation—is dominated by genes not associated with the clinical phenotype of interest. Furthermore, a strong prognostic factor might be relevant for a certain subgroup but not for the whole population; thus an analysis of the whole sample may not reveal this prognostic factor. To address these problems we investigate methods to identify and assess clinically interesting subgroups in a heterogeneous population. The identification step uses a clustering algorithm and to assess significance we use a false discovery rate- (FDR-) based measure. Under the heterogeneity condition the standard FDR estimate is shown to overestimate the true FDR value, but this is remedied by an improved FDR estimation procedure. As illustrations, two real data examples from gene expression studies of lung cancer are provided. PMID:26339613

  18. Optimizing the maximum reported cluster size in the spatial scan statistic for ordinal data.

    PubMed

    Kim, Sehwi; Jung, Inkyung

    2017-01-01

    The spatial scan statistic is an important tool for spatial cluster detection. There have been numerous studies on scanning window shapes. However, little research has been done on the maximum scanning window size or maximum reported cluster size. Recently, Han et al. proposed to use the Gini coefficient to optimize the maximum reported cluster size. However, the method has been developed and evaluated only for the Poisson model. We adopt the Gini coefficient to be applicable to the spatial scan statistic for ordinal data to determine the optimal maximum reported cluster size. Through a simulation study and application to a real data example, we evaluate the performance of the proposed approach. With some sophisticated modification, the Gini coefficient can be effectively employed for the ordinal model. The Gini coefficient most often picked the optimal maximum reported cluster sizes that were the same as or smaller than the true cluster sizes with very high accuracy. It seems that we can obtain a more refined collection of clusters by using the Gini coefficient. The Gini coefficient developed specifically for the ordinal model can be useful for optimizing the maximum reported cluster size for ordinal data and helpful for properly and informatively discovering cluster patterns.

  19. Optimizing the maximum reported cluster size in the spatial scan statistic for ordinal data

    PubMed Central

    Kim, Sehwi

    2017-01-01

    The spatial scan statistic is an important tool for spatial cluster detection. There have been numerous studies on scanning window shapes. However, little research has been done on the maximum scanning window size or maximum reported cluster size. Recently, Han et al. proposed to use the Gini coefficient to optimize the maximum reported cluster size. However, the method has been developed and evaluated only for the Poisson model. We adopt the Gini coefficient to be applicable to the spatial scan statistic for ordinal data to determine the optimal maximum reported cluster size. Through a simulation study and application to a real data example, we evaluate the performance of the proposed approach. With some sophisticated modification, the Gini coefficient can be effectively employed for the ordinal model. The Gini coefficient most often picked the optimal maximum reported cluster sizes that were the same as or smaller than the true cluster sizes with very high accuracy. It seems that we can obtain a more refined collection of clusters by using the Gini coefficient. The Gini coefficient developed specifically for the ordinal model can be useful for optimizing the maximum reported cluster size for ordinal data and helpful for properly and informatively discovering cluster patterns. PMID:28753674

  20. A spatial scan statistic for survival data based on Weibull distribution.

    PubMed

    Bhatt, Vijaya; Tiwari, Neeraj

    2014-05-20

    The spatial scan statistic has been developed as a geographical cluster detection analysis tool for different types of data sets such as Bernoulli, Poisson, ordinal, normal and exponential. We propose a scan statistic for survival data based on Weibull distribution. It may also be used for other survival distributions, such as exponential, gamma, and log normal. The proposed method is applied on the survival data of tuberculosis patients for the years 2004-2005 in Nainital district of Uttarakhand, India. Simulation studies reveal that the proposed method performs well for different survival distribution functions. Copyright © 2013 John Wiley & Sons, Ltd.

  1. Functional Interference Clusters in Cancer Patients With Bone Metastases: A Secondary Analysis of RTOG 9714

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chow, Edward, E-mail: Edward.Chow@sunnybrook.c; James, Jennifer; Barsevick, Andrea

    Purpose: To explore the relationships (clusters) among the functional interference items in the Brief Pain Inventory (BPI) in patients with bone metastases. Methods: Patients enrolled in the Radiation Therapy Oncology Group (RTOG) 9714 bone metastases study were eligible. Patients were assessed at baseline and 4, 8, and 12 weeks after randomization for the palliative radiotherapy with the BPI, which consists of seven functional items: general activity, mood, walking ability, normal work, relations with others, sleep, and enjoyment of life. Principal component analysis with varimax rotation was used to determine the clusters between the functional items at baseline and the follow-up.more » Cronbach's alpha was used to determine the consistency and reliability of each cluster at baseline and follow-up. Results: There were 448 male and 461 female patients, with a median age of 67 years. There were two functional interference clusters at baseline, which accounted for 71% of the total variance. The first cluster (physical interference) included normal work and walking ability, which accounted for 58% of the total variance. The second cluster (psychosocial interference) included relations with others and sleep, which accounted for 13% of the total variance. The Cronbach's alpha statistics were 0.83 and 0.80, respectively. The functional clusters changed at week 12 in responders but persisted through week 12 in nonresponders. Conclusion: Palliative radiotherapy is effective in reducing bone pain. Functional interference component clusters exist in patients treated for bone metastases. These clusters changed over time in this study, possibly attributable to treatment. Further research is needed to examine these effects.« less

  2. Identification of Reliable Components in Multivariate Curve Resolution-Alternating Least Squares (MCR-ALS): a Data-Driven Approach across Metabolic Processes.

    PubMed

    Motegi, Hiromi; Tsuboi, Yuuri; Saga, Ayako; Kagami, Tomoko; Inoue, Maki; Toki, Hideaki; Minowa, Osamu; Noda, Tetsuo; Kikuchi, Jun

    2015-11-04

    There is an increasing need to use multivariate statistical methods for understanding biological functions, identifying the mechanisms of diseases, and exploring biomarkers. In addition to classical analyses such as hierarchical cluster analysis, principal component analysis, and partial least squares discriminant analysis, various multivariate strategies, including independent component analysis, non-negative matrix factorization, and multivariate curve resolution, have recently been proposed. However, determining the number of components is problematic. Despite the proposal of several different methods, no satisfactory approach has yet been reported. To resolve this problem, we implemented a new idea: classifying a component as "reliable" or "unreliable" based on the reproducibility of its appearance, regardless of the number of components in the calculation. Using the clustering method for classification, we applied this idea to multivariate curve resolution-alternating least squares (MCR-ALS). Comparisons between conventional and modified methods applied to proton nuclear magnetic resonance ((1)H-NMR) spectral datasets derived from known standard mixtures and biological mixtures (urine and feces of mice) revealed that more plausible results are obtained by the modified method. In particular, clusters containing little information were detected with reliability. This strategy, named "cluster-aided MCR-ALS," will facilitate the attainment of more reliable results in the metabolomics datasets.

  3. Exploring root symbiotic programs in the model legume Medicago truncatula using EST analysis.

    PubMed

    Journet, Etienne-Pascal; van Tuinen, Diederik; Gouzy, Jérome; Crespeau, Hervé; Carreau, Véronique; Farmer, Mary-Jo; Niebel, Andreas; Schiex, Thomas; Jaillon, Olivier; Chatagnier, Odile; Godiard, Laurence; Micheli, Fabienne; Kahn, Daniel; Gianinazzi-Pearson, Vivienne; Gamas, Pascal

    2002-12-15

    We report on a large-scale expressed sequence tag (EST) sequencing and analysis program aimed at characterizing the sets of genes expressed in roots of the model legume Medicago truncatula during interactions with either of two microsymbionts, the nitrogen-fixing bacterium Sinorhizobium meliloti or the arbuscular mycorrhizal fungus Glomus intraradices. We have designed specific tools for in silico analysis of EST data, in relation to chimeric cDNA detection, EST clustering, encoded protein prediction, and detection of differential expression. Our 21 473 5'- and 3'-ESTs could be grouped into 6359 EST clusters, corresponding to distinct virtual genes, along with 52 498 other M.truncatula ESTs available in the dbEST (NCBI) database that were recruited in the process. These clusters were manually annotated, using a specifically developed annotation interface. Analysis of EST cluster distribution in various M.truncatula cDNA libraries, supported by a refined R test to evaluate statistical significance and by 'electronic northern' representation, enabled us to identify a large number of novel genes predicted to be up- or down-regulated during either symbiotic root interaction. These in silico analyses provide a first global view of the genetic programs for root symbioses in M.truncatula. A searchable database has been built and can be accessed through a public interface.

  4. Clustering Analysis of Antibiograms and Antibiogram Types of Streptococcus agalactiae Strains from Tilapia in China.

    PubMed

    Liu, Chan; Feng, Juan; Zhang, Defeng; Xie, Yundan; Li, Anxing; Wang, Jiangyong; Su, Youlu

    2018-05-11

    In view of the changing antibiotic-resistance profiles of Streptococcus agalactiae from tilapia in China, antimicrobial susceptibilities of 75 S. agalactiae strains were determined by the disc diffusion method, and cluster analyses of the antibiograms and antibiogram types were performed. All strains displayed multidrug resistance (MDR). The antimicrobial-resistance rates were highest (>90%) to aminoglycosides, sulfonamides, pipemidic acid, and norfloxacin, followed by penicillin, ampicillin, and ciprofloxacin (26.7-38.7%); those to furadantin, lincomycin, erythromycin, ofloxacin, tetracycline, and florfenicol were low (<10%), and no resistance to vancomycin, cefalexin, cefoxitin, amoxicillin, medemycin, doxitard, oxytetracycline, rifampin, chloramphenicol, or thiamphenicol was detected. Statistical analysis showed that the resistance rate to ciprofloxacin increased significantly in 2016 (p = 0.009), whereas that to trimethoprim/sulfamethoxazole decreased (p = 0.017). Cluster analyses identified that the strains had 23 antibiogram types (A-W) and clustered in five groups (Groups I-V). The strains with higher antimicrobial resistance mainly clustered in Groups I and II. Our results show that the antibiograms varied with time and by location and that antibiogram types are constantly updating and expanding. Effective measures must be taken to reduce the antimicrobial resistance and spread of MDR strains.

  5. Spatial cluster analysis of nanoscopically mapped serotonin receptors for classification of fixed brain tissue

    NASA Astrophysics Data System (ADS)

    Sams, Michael; Silye, Rene; Göhring, Janett; Muresan, Leila; Schilcher, Kurt; Jacak, Jaroslaw

    2014-01-01

    We present a cluster spatial analysis method using nanoscopic dSTORM images to determine changes in protein cluster distributions within brain tissue. Such methods are suitable to investigate human brain tissue and will help to achieve a deeper understanding of brain disease along with aiding drug development. Human brain tissue samples are usually treated postmortem via standard fixation protocols, which are established in clinical laboratories. Therefore, our localization microscopy-based method was adapted to characterize protein density and protein cluster localization in samples fixed using different protocols followed by common fluorescent immunohistochemistry techniques. The localization microscopy allows nanoscopic mapping of serotonin 5-HT1A receptor groups within a two-dimensional image of a brain tissue slice. These nanoscopically mapped proteins can be confined to clusters by applying the proposed statistical spatial analysis. Selected features of such clusters were subsequently used to characterize and classify the tissue. Samples were obtained from different types of patients, fixed with different preparation methods, and finally stored in a human tissue bank. To verify the proposed method, samples of a cryopreserved healthy brain have been compared with epitope-retrieved and paraffin-fixed tissues. Furthermore, samples of healthy brain tissues were compared with data obtained from patients suffering from mental illnesses (e.g., major depressive disorder). Our work demonstrates the applicability of localization microscopy and image analysis methods for comparison and classification of human brain tissues at a nanoscopic level. Furthermore, the presented workflow marks a unique technological advance in the characterization of protein distributions in brain tissue sections.

  6. The Effect of Cluster Sampling Design in Survey Research on the Standard Error Statistic.

    ERIC Educational Resources Information Center

    Wang, Lin; Fan, Xitao

    Standard statistical methods are used to analyze data that is assumed to be collected using a simple random sampling scheme. These methods, however, tend to underestimate variance when the data is collected with a cluster design, which is often found in educational survey research. The purposes of this paper are to demonstrate how a cluster design…

  7. Use of keyword hierarchies to interpret gene expression patterns.

    PubMed

    Masys, D R; Welsh, J B; Lynn Fink, J; Gribskov, M; Klacansky, I; Corbeil, J

    2001-04-01

    High-density microarray technology permits the quantitative and simultaneous monitoring of thousands of genes. The interpretation challenge is to extract relevant information from this large amount of data. A growing variety of statistical analysis approaches are available to identify clusters of genes that share common expression characteristics, but provide no information regarding the biological similarities of genes within clusters. The published literature provides a potential source of information to assist in interpretation of clustering results. We describe a data mining method that uses indexing terms ('keywords') from the published literature linked to specific genes to present a view of the conceptual similarity of genes within a cluster or group of interest. The method takes advantage of the hierarchical nature of Medical Subject Headings used to index citations in the MEDLINE database, and the registry numbers applied to enzymes.

  8. Neutrino and axion bounds from the globular cluster M5 (NGC 5904).

    PubMed

    Viaux, N; Catelan, M; Stetson, P B; Raffelt, G G; Redondo, J; Valcarce, A A R; Weiss, A

    2013-12-06

    The red-giant branch (RGB) in globular clusters is extended to larger brightness if the degenerate helium core loses too much energy in "dark channels." Based on a large set of archival observations, we provide high-precision photometry for the Galactic globular cluster M5 (NGC 5904), allowing for a detailed comparison between the observed tip of the RGB with predictions based on contemporary stellar evolution theory. In particular, we derive 95% confidence limits of g(ae)<4.3×10(-13) on the axion-electron coupling and μ(ν)<4.5×10(-12)μ(B) (Bohr magneton μ(B)=e/2m(e)) on a neutrino dipole moment, based on a detailed analysis of statistical and systematic uncertainties. The cluster distance is the single largest source of uncertainty and can be improved in the future.

  9. Collected Notes on the Workshop for Pattern Discovery in Large Databases

    NASA Technical Reports Server (NTRS)

    Buntine, Wray (Editor); Delalto, Martha (Editor)

    1991-01-01

    These collected notes are a record of material presented at the Workshop. The core data analysis is addressed that have traditionally required statistical or pattern recognition techniques. Some of the core tasks include classification, discrimination, clustering, supervised and unsupervised learning, discovery and diagnosis, i.e., general pattern discovery.

  10. Classification of frailty using the Kihon checklist: A cluster analysis of older adults in urban areas.

    PubMed

    Kera, Takeshi; Kawai, Hisashi; Yoshida, Hideyo; Hirano, Hirohiko; Kojima, Motonaga; Fujiwara, Yoshinori; Ihara, Kazushige; Obuchi, Shuichi

    2017-01-01

    Frailty is an important predictor of the need for long-term care and hospitalization. Our aim was to categorize frailty in community-dwelling older adults. The present study was carried out in 2011-2013, and consisted of 1380 individuals over 65 years of age. Participants completed the Kihon checklist, which is widely used to assess frailty in Japan, and their physical, cognitive and social function was evaluated. Non-hierarchical cluster analysis was used to statistically categorize frailty. The optimum number of clusters was determined as the point at which the external reference values (instrumental activity of daily living score, grip power, 10-m walk time, body mass index, portable fall risk index, occlusal force and Mini-Mental State Examination score) differed. According to the Kihon checklist, 369 (26.7%) of the 1380 study participants were considered frail. When the cluster number was increased from two to six, the scores in each subdomain of the Kihon checklist significantly differed. The estimated minimum number of clusters was five, and each of the five cluster groups had distinct characteristics. The numbers of participants in cluster groups 1-5 were 105, 78, 62, 71 and 53, respectively. We identified five types of frailty in community-dwelling older adults in Japan: "experience of falling," "pre-frailty," "oral frailty," "housebound" and "severe frailty." Geriatr Gerontol Int 2017; 17: 69-77. © 2016 Japan Geriatrics Society.

  11. Optical spectroscopy and velocity dispersions of galaxy clusters from the SPT-SZ survey

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ruel, J.; Bayliss, M.; Bazin, G.

    2014-09-01

    We present optical spectroscopy of galaxies in clusters detected through the Sunyaev-Zel'dovich (SZ) effect with the South Pole Telescope (SPT). We report our own measurements of 61 spectroscopic cluster redshifts, and 48 velocity dispersions each calculated with more than 15 member galaxies. This catalog also includes 19 dispersions of SPT-observed clusters previously reported in the literature. The majority of the clusters in this paper are SPT-discovered; of these, most have been previously reported in other SPT cluster catalogs, and five are reported here as SPT discoveries for the first time. By performing a resampling analysis of galaxy velocities, we findmore » that unbiased velocity dispersions can be obtained from a relatively small number of member galaxies (≲ 30), but with increased systematic scatter. We use this analysis to determine statistical confidence intervals that include the effect of membership selection. We fit scaling relations between the observed cluster velocity dispersions and mass estimates from SZ and X-ray observables. In both cases, the results are consistent with the scaling relation between velocity dispersion and mass expected from dark-matter simulations. We measure a ∼30% log-normal scatter in dispersion at fixed mass, and a ∼10% offset in the normalization of the dispersion-mass relation when compared to the expectation from simulations, which is within the expected level of systematic uncertainty.« less

  12. A cluster-based approach to selecting representative stimuli from the International Affective Picture System (IAPS) database.

    PubMed

    Constantinescu, Alexandra C; Wolters, Maria; Moore, Adam; MacPherson, Sarah E

    2017-06-01

    The International Affective Picture System (IAPS; Lang, Bradley, & Cuthbert, 2008) is a stimulus database that is frequently used to investigate various aspects of emotional processing. Despite its extensive use, selecting IAPS stimuli for a research project is not usually done according to an established strategy, but rather is tailored to individual studies. Here we propose a standard, replicable method for stimulus selection based on cluster analysis, which re-creates the group structure that is most likely to have produced the valence arousal, and dominance norms associated with the IAPS images. Our method includes screening the database for outliers, identifying a suitable clustering solution, and then extracting the desired number of stimuli on the basis of their level of certainty of belonging to the cluster they were assigned to. Our method preserves statistical power in studies by maximizing the likelihood that the stimuli belong to the cluster structure fitted to them, and by filtering stimuli according to their certainty of cluster membership. In addition, although our cluster-based method is illustrated using the IAPS, it can be extended to other stimulus databases.

  13. Village-based spatio-temporal cluster analysis of the schistosomiasis risk in the Poyang Lake Region, China.

    PubMed

    Xia, Congcong; Bergquist, Robert; Lynn, Henry; Hu, Fei; Lin, Dandan; Hao, Yuwan; Li, Shizhu; Hu, Yi; Zhang, Zhijie

    2017-03-08

    The Poyang Lake Region, one of the major epidemic sites of schistosomiasis in China, remains a severe challenge. To improve our understanding of the current endemic status of schistosomiasis and to better control the transmission of the disease in the Poyang Lake Region, it is important to analyse the clustering pattern of schistosomiasis and detect the hotspots of transmission risk. Based on annual surveillance data, at the village level in this region from 2009 to 2014, spatial and temporal cluster analyses were conducted to assess the pattern of schistosomiasis infection risk among humans through purely spatial (Local Moran's I, Kulldorff and Flexible scan statistic) and space-time scan statistics (Kulldorff). A dramatic decline was found in the infection rate during the study period, which was shown to be maintained at a low level. The number of spatial clusters declined over time and were concentrated in counties around Poyang Lake, including Yugan, Yongxiu, Nanchang, Xingzi, Xinjian, De'an as well as Pengze, situated along the Yangtze River and the most serious area found in this study. Space-time analysis revealed that the clustering time frame appeared between 2009 and 2011 and the most likely cluster with the widest range was particularly concentrated in Pengze County. This study detected areas at high risk for schistosomiasis both in space and time at the village level from 2009 to 2014 in Poyang Lake Region. The high-risk areas are now more concentrated and mainly distributed at the river inflows Poyang Lake and along Yangtze River in Pengze County. It was assumed that the water projects including reservoirs and a recently breached dyke in this area were partly to blame. This study points out that attempts to reduce the negative effects of water projects in China should focus on the Poyang Lake Region.

  14. The halo Boltzmann equation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Biagetti, Matteo; Desjacques, Vincent; Kehagias, Alex

    2016-04-01

    Dark matter halos are the building blocks of the universe as they host galaxies and clusters. The knowledge of the clustering properties of halos is therefore essential for the understanding of the galaxy statistical properties. We derive an effective halo Boltzmann equation which can be used to describe the halo clustering statistics. In particular, we show how the halo Boltzmann equation encodes a statistically biased gravitational force which generates a bias in the peculiar velocities of virialized halos with respect to the underlying dark matter, as recently observed in N-body simulations.

  15. Cluster analysis of cognitive performance in elderly and demented subjects.

    PubMed

    Giaquinto, S; Nolfe, G; Calvani, M

    1985-06-01

    48 elderly normals, 14 demented subjects and 76 young controls were tested for basic cognitive functions. All the tests were quantified and could therefore be subjected to statistical analysis. The results show a difference in the speed of information processing and in memory load between the young controls and elderly normals but the age groups differed in quantitative terms only. Cluster analysis showed that the elderly and the demented formed two distinctly separate groups at the qualitative level, the basic cognitive processes being damaged in the demented group. Age thus appears to be only a risk factor for dementia and not its cause. It is concluded that batteries based on precise and measurable tasks are the most appropriate not only for the study of dementia but for rehabilitation purposes too.

  16. AMMI adjustment for statistical analysis of an international wheat yield trial.

    PubMed

    Crossa, J; Fox, P N; Pfeiffer, W H; Rajaram, S; Gauch, H G

    1991-01-01

    Multilocation trials are important for the CIMMYT Bread Wheat Program in producing high-yielding, adapted lines for a wide range of environments. This study investigated procedures for improving predictive success of a yield trial, grouping environments and genotypes into homogeneous subsets, and determining the yield stability of 18 CIMMYT bread wheats evaluated at 25 locations. Additive Main effects and Multiplicative Interaction (AMMI) analysis gave more precise estimates of genotypic yields within locations than means across replicates. This precision facilitated formation by cluster analysis of more cohesive groups of genotypes and locations for biological interpretation of interactions than occurred with unadjusted means. Locations were clustered into two subsets for which genotypes with positive interactions manifested in high, stable yields were identified. The analyses highlighted superior selections with both broad and specific adaptation.

  17. Analysis of the Seismicity Preceding Large Earthquakes

    NASA Astrophysics Data System (ADS)

    Stallone, A.; Marzocchi, W.

    2016-12-01

    The most common earthquake forecasting models assume that the magnitude of the next earthquake is independent from the past. This feature is probably one of the most severe limitations of the capability to forecast large earthquakes.In this work, we investigate empirically on this specific aspect, exploring whether spatial-temporal variations in seismicity encode some information on the magnitude of the future earthquakes. For this purpose, and to verify the universality of the findings, we consider seismic catalogs covering quite different space-time-magnitude windows, such as the Alto Tiberina Near Fault Observatory (TABOO) catalogue, and the California and Japanese seismic catalog. Our method is inspired by the statistical methodology proposed by Zaliapin (2013) to distinguish triggered and background earthquakes, using the nearest-neighbor clustering analysis in a two-dimension plan defined by rescaled time and space. In particular, we generalize the metric based on the nearest-neighbor to a metric based on the k-nearest-neighbors clustering analysis that allows us to consider the overall space-time-magnitude distribution of k-earthquakes (k-foreshocks) which anticipate one target event (the mainshock); then we analyze the statistical properties of the clusters identified in this rescaled space. In essence, the main goal of this study is to verify if different classes of mainshock magnitudes are characterized by distinctive k-foreshocks distribution. The final step is to show how the findings of this work may (or not) improve the skill of existing earthquake forecasting models.

  18. A study on phenomenology of Dhat syndrome in men in a general medical setting.

    PubMed

    Prakash, Sathya; Sharan, Pratap; Sood, Mamta

    2016-01-01

    "Dhat syndrome" is believed to be a culture-bound syndrome of the Indian subcontinent. Although many studies have been performed, many have methodological limitations and there is a lack of agreement in many areas. The aim is to study the phenomenology of "Dhat syndrome" in men and to explore the possibility of subtypes within this entity. It is a cross-sectional descriptive study conducted at a sex and marriage counseling clinic of a tertiary care teaching hospital in Northern India. An operational definition and assessment instrument for "Dhat syndrome" was developed after taking all concerned stakeholders into account and review of literature. It was applied on 100 patients along with socio-demographic profile, Hamilton Depression Rating Scale, Hamilton Anxiety Rating Scale, Mini International Neuropsychiatric Interview, and Postgraduate Institute Neuroticism Scale. For statistical analysis, descriptive statistics, group comparisons, and Pearson's product moment correlations were carried out. Factor analysis and cluster analysis were done to determine the factor structure and subtypes of "Dhat syndrome." A diagnostic and assessment instrument for "Dhat syndrome" has been developed and the phenomenology in 100 patients has been described. Both the health beliefs scale and associated symptoms scale demonstrated a three-factor structure. The patients with "Dhat syndrome" could be categorized into three clusters based on severity. There appears to be a significant agreement among various stakeholders on the phenomenology of "Dhat syndrome" although some differences exist. "Dhat syndrome" could be subtyped into three clusters based on severity.

  19. Spatio-temporal pattern of sylvatic rabies in the Sultanate of Oman, 2006-2010.

    PubMed

    Hussain, Muhammad Hammad; Ward, Michael P; Body, Mohammed; Al-Rawahi, Abdulmajeed; Wadir, Ali Awlad; Al-Habsi, Saif; Saqib, Muhammad; Ahmed, Mohammed Sayed; Almaawali, Mahir Gharib

    2013-07-01

    Rabies was first reported in the Sultanate of Oman is 1990. We analysed passive surveillance data (444 samples) collected and reported between 2006 and 2010. During this period, between 45 and 75% of samples submitted from suspect animals were subsequently confirmed (fluorescent antibody test, histopathology and reverse transcription PCR) as rabies cases. Overall, 63% of submitted samples were confirmed as rabies cases. The spatial distribution of species-specific cases were similar (centred in north-central Oman with a northeast-southwest distribution), although fox cases had a wider distribution and an east-west orientation. Clustering of cases was detected using interpolation, local spatial autocorrelation and scan statistical analysis. Several local government areas (wilayats) in north-central Oman were identified where higher than expected numbers of laboratory-confirmed rabies cases were reported. For fox rabies, more clusters (local spatial autocorrelation analysis) and a larger clustered area (scan statistical analysis) were detected. In Oman, monthly reports of fox rabies cases were highly correlated (rSP>0.5) with reports of camel, cattle, sheep and goat rabies. The best-fitting ARIMA model included a seasonality component. Fox rabies cases reported 6 months previously best explained rabies reported cases in other animal species. Despite likely reporting bias, results suggest that rabies exists as a sylvatic cycle of transmission in Oman and an opportunity still exists to prevent establishment of dog-mediated rabies. Copyright © 2013 Elsevier B.V. All rights reserved.

  20. A Multivariate Analysis of Galaxy Cluster Properties

    NASA Astrophysics Data System (ADS)

    Ogle, P. M.; Djorgovski, S.

    1993-05-01

    We have assembled from the literature a data base on on 394 clusters of galaxies, with up to 16 parameters per cluster. They include optical and x-ray luminosities, x-ray temperatures, galaxy velocity dispersions, central galaxy and particle densities, optical and x-ray core radii and ellipticities, etc. In addition, derived quantities, such as the mass-to-light ratios and x-ray gas masses are included. Doubtful measurements have been identified, and deleted from the data base. Our goal is to explore the correlations between these parameters, and interpret them in the framework of our understanding of evolution of clusters and large-scale structure, such as the Gott-Rees scaling hierarchy. Among the simple, monovariate correlations we found, the most significant include those between the optical and x-ray luminosities, x-ray temperatures, cluster velocity dispersions, and central galaxy densities, in various mutual combinations. While some of these correlations have been discussed previously in the literature, generally smaller samples of objects have been used. We will also present the results of a multivariate statistical analysis of the data, including a principal component analysis (PCA). Such an approach has not been used previously for studies of cluster properties, even though it is much more powerful and complete than the simple monovariate techniques which are commonly employed. The observed correlations may lead to powerful constraints for theoretical models of formation and evolution of galaxy clusters. P.M.O. was supported by a Caltech graduate fellowship. S.D. acknowledges a partial support from the NASA contract NAS5-31348 and the NSF PYI award AST-9157412.

  1. Cluster Analysis of Time-Dependent Crystallographic Data: Direct Identification of Time-Independent Structural Intermediates

    PubMed Central

    Kostov, Konstantin S.; Moffat, Keith

    2011-01-01

    The initial output of a time-resolved macromolecular crystallography experiment is a time-dependent series of difference electron density maps that displays the time-dependent changes in underlying structure as a reaction progresses. The goal is to interpret such data in terms of a small number of crystallographically refinable, time-independent structures, each associated with a reaction intermediate; to establish the pathways and rate coefficients by which these intermediates interconvert; and thereby to elucidate a chemical kinetic mechanism. One strategy toward achieving this goal is to use cluster analysis, a statistical method that groups objects based on their similarity. If the difference electron density at a particular voxel in the time-dependent difference electron density (TDED) maps is sensitive to the presence of one and only one intermediate, then its temporal evolution will exactly parallel the concentration profile of that intermediate with time. The rationale is therefore to cluster voxels with respect to the shapes of their TDEDs, so that each group or cluster of voxels corresponds to one structural intermediate. Clusters of voxels whose TDEDs reflect the presence of two or more specific intermediates can also be identified. From such groupings one can then infer the number of intermediates, obtain their time-independent difference density characteristics, and refine the structure of each intermediate. We review the principles of cluster analysis and clustering algorithms in a crystallographic context, and describe the application of the method to simulated and experimental time-resolved crystallographic data for the photocycle of photoactive yellow protein. PMID:21244840

  2. Spatial, temporal and spatio-temporal clusters of measles incidence at the county level in Guangxi, China during 2004-2014: flexibly shaped scan statistics.

    PubMed

    Tang, Xianyan; Geater, Alan; McNeil, Edward; Deng, Qiuyun; Dong, Aihu; Zhong, Ge

    2017-04-04

    Outbreaks of measles re-emerged in Guangxi province during 2013-2014, where measles again became a major public health concern. A better understanding of the patterns of measles cases would help in identifying high-risk areas and periods for optimizing preventive strategies, yet these patterns remain largely unknown. Thus, this study aimed to determine the patterns of measles clusters in space, time and space-time at the county level over the period 2004-2014 in Guangxi. Annual data on measles cases and population sizes for each county were obtained from Guangxi CDC and Guangxi Bureau of Statistics, respectively. Epidemic curves and Kulldorff's temporal scan statistics were used to identify seasonal peaks and high-risk periods. Tango's flexible scan statistics were implemented to determine irregular spatial clusters. Spatio-temporal clusters in elliptical cylinder shapes were detected by Kulldorff's scan statistics. Population attributable risk percent (PAR%) of children aged ≤24 months was used to identify regions with a heavy burden of measles. Seasonal peaks occurred between April and June, and a temporal measles cluster was detected in 2014. Spatial clusters were identified in West, Southwest and North Central Guangxi. Three phases of spatio-temporal clusters with high relative risk were detected: Central Guangxi during 2004-2005, Midwest Guangxi in 2007, and West and Southwest Guangxi during 2013-2014. Regions with high PAR% were mainly clustered in West, Southwest, North and Central Guangxi. A temporal uptrend of measles incidence existed in Guangxi between 2010 and 2014, while downtrend during 2004-2009. The hotspots shifted from Central to West and Southwest Guangxi, regions overburdened with measles. Thus, intensifying surveillance of timeliness and completeness of routine vaccination and implementing supplementary immunization activities for measles should prioritized in these regions.

  3. Untangling Magmatic Processes and Hydrothermal Alteration of in situ Superfast Spreading Ocean Crust at ODP/IODP Site 1256 with Fuzzy c-means Cluster Analysis of Rock Magnetic Properties

    NASA Astrophysics Data System (ADS)

    Dekkers, M. J.; Heslop, D.; Herrero-Bervera, E.; Acton, G.; Krasa, D.

    2014-12-01

    Ocean Drilling Program (ODP)/Integrated ODP (IODP) Hole 1256D (6.44.1' N, 91.56.1' W) on the Cocos Plate occurs in 15.2 Ma oceanic crust generated by superfast seafloor spreading. Presently, it is the only drill hole that has sampled all three oceanic crust layers in a tectonically undisturbed setting. Here we interpret down-hole trends in several rock-magnetic parameters with fuzzy c-means cluster analysis, a multivariate statistical technique. The parameters include the magnetization ratio, the coercivity ratio, the coercive force, the low-field susceptibility, and the Curie temperature. By their combined, multivariate, analysis the effects of magmatic and hydrothermal processes can be evaluated. The optimal number of clusters - a key point in the analysis because there is no a priori information on this - was determined through a combination of approaches: by calculation of several cluster validity indices, by testing for coherent cluster distributions on non-linear-map plots, and importantly by testing for stability of the cluster solution from all possible starting points. Here, we consider a solution robust if the cluster allocation is independent of the starting configuration. The five-cluster solution appeared to be robust. Three clusters are distinguished in the extrusive segment of the Hole that express increasing hydrothermal alteration of the lavas. The sheeted dike and gabbro portions are characterized by two clusters, both with higher coercivities than in lava samples. Extensive alteration, however, can obliterate magnetic property differences between lavas, dikes, and gabbros. The imprint of thermochemical alteration on the iron-titanium oxides is only partially related to the porosity of the rocks. All clusters display rock magnetic characteristics in line with a stable NRM. This implies that the entire sampled sequence of ocean crust can contribute to marine magnetic anomalies. Determination of the absolute paleointensity with thermal techniques is not straightforward because of the propensity of oxyexsolution during laboratory heating and/or the presence of intergrowths. The upper part of the extrusive sequence, the granoblastic portion of the dikes, and moderately altered gabbros may contain a comparatively uncontaminated thermoremanent magnetization.

  4. Symptom clusters predict mortality among dialysis patients in Norway: a prospective observational cohort study.

    PubMed

    Amro, Amin; Waldum, Bård; von der Lippe, Nanna; Brekke, Fredrik Barth; Dammen, Toril; Miaskowski, Christine; Os, Ingrid

    2015-01-01

    Patients with end-stage renal disease on dialysis have reduced survival rates compared with the general population. Symptoms are frequent in dialysis patients, and a symptom cluster is defined as two or more related co-occurring symptoms. The aim of this study was to explore the associations between symptom clusters and mortality in dialysis patients. In a prospective observational cohort study of dialysis patients (n = 301), Kidney Disease and Quality of Life Short Form and Beck Depression Inventory questionnaires were administered. To generate symptom clusters, principal component analysis with varimax rotation was used on 11 kidney-specific self-reported physical symptoms. A Beck Depression Inventory score of 16 or greater was defined as clinically significant depressive symptoms. Physical and mental component summary scores were generated from Short Form-36. Multivariate Cox regression analysis was used for the survival analysis, Kaplan-Meier curves and log-rank statistics were applied to compare survival rates between the groups. Three different symptom clusters were identified; one included loading of several uremic symptoms. In multivariate analyses and after adjustment for health-related quality of life and depressive symptoms, the worst perceived quartile of the "uremic" symptom cluster independently predicted all-cause mortality (hazard ratio 2.47, 95% CI 1.44-4.22, P = 0.001) compared with the other quartiles during a follow-up period that ranged from four to 52 months. The two other symptom clusters ("neuromuscular" and "skin") or the individual symptoms did not predict mortality. Clustering of uremic symptoms predicted mortality. Assessing co-occurring symptoms rather than single symptoms may help to identify dialysis patients at high risk for mortality. Copyright © 2015 American Academy of Hospice and Palliative Medicine. Published by Elsevier Inc. All rights reserved.

  5. Cluster mass calibration at high redshift: HST weak lensing analysis of 13 distant galaxy clusters from the South Pole Telescope Sunyaev-Zel'dovich Survey

    NASA Astrophysics Data System (ADS)

    Schrabback, T.; Applegate, D.; Dietrich, J. P.; Hoekstra, H.; Bocquet, S.; Gonzalez, A. H.; von der Linden, A.; McDonald, M.; Morrison, C. B.; Raihan, S. F.; Allen, S. W.; Bayliss, M.; Benson, B. A.; Bleem, L. E.; Chiu, I.; Desai, S.; Foley, R. J.; de Haan, T.; High, F. W.; Hilbert, S.; Mantz, A. B.; Massey, R.; Mohr, J.; Reichardt, C. L.; Saro, A.; Simon, P.; Stern, C.; Stubbs, C. W.; Zenteno, A.

    2018-02-01

    We present an HST/Advanced Camera for Surveys (ACS) weak gravitational lensing analysis of 13 massive high-redshift (zmedian = 0.88) galaxy clusters discovered in the South Pole Telescope (SPT) Sunyaev-Zel'dovich Survey. This study is part of a larger campaign that aims to robustly calibrate mass-observable scaling relations over a wide range in redshift to enable improved cosmological constraints from the SPT cluster sample. We introduce new strategies to ensure that systematics in the lensing analysis do not degrade constraints on cluster scaling relations significantly. First, we efficiently remove cluster members from the source sample by selecting very blue galaxies in V - I colour. Our estimate of the source redshift distribution is based on Cosmic Assembly Near-infrared Deep Extragalactic Legacy Survey (CANDELS) data, where we carefully mimic the source selection criteria of the cluster fields. We apply a statistical correction for systematic photometric redshift errors as derived from Hubble Ultra Deep Field data and verified through spatial cross-correlations. We account for the impact of lensing magnification on the source redshift distribution, finding that this is particularly relevant for shallower surveys. Finally, we account for biases in the mass modelling caused by miscentring and uncertainties in the concentration-mass relation using simulations. In combination with temperature estimates from Chandra we constrain the normalization of the mass-temperature scaling relation ln (E(z)M500c/1014 M⊙) = A + 1.5ln (kT/7.2 keV) to A=1.81^{+0.24}_{-0.14}(stat.) {± } 0.09(sys.), consistent with self-similar redshift evolution when compared to lower redshift samples. Additionally, the lensing data constrain the average concentration of the clusters to c_200c=5.6^{+3.7}_{-1.8}.

  6. Cluster Mass Calibration at High Redshift: HST Weak Lensing Analysis of 13 Distant Galaxy Clusters from the South Pole Telescope Sunyaev-Zel'dovich Survey

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Schrabback, T.; et al.

    We present an HST/ACS weak gravitational lensing analysis of 13 massive high-redshift (z_median=0.88) galaxy clusters discovered in the South Pole Telescope (SPT) Sunyaev-Zel'dovich Survey. This study is part of a larger campaign that aims to robustly calibrate mass-observable scaling relations over a wide range in redshift to enable improved cosmological constraints from the SPT cluster sample. We introduce new strategies to ensure that systematics in the lensing analysis do not degrade constraints on cluster scaling relations significantly. First, we efficiently remove cluster members from the source sample by selecting very blue galaxies in V-I colour. Our estimate of the sourcemore » redshift distribution is based on CANDELS data, where we carefully mimic the source selection criteria of the cluster fields. We apply a statistical correction for systematic photometric redshift errors as derived from Hubble Ultra Deep Field data and verified through spatial cross-correlations. We account for the impact of lensing magnification on the source redshift distribution, finding that this is particularly relevant for shallower surveys. Finally, we account for biases in the mass modelling caused by miscentring and uncertainties in the mass-concentration relation using simulations. In combination with temperature estimates from Chandra we constrain the normalisation of the mass-temperature scaling relation ln(E(z) M_500c/10^14 M_sun)=A+1.5 ln(kT/7.2keV) to A=1.81^{+0.24}_{-0.14}(stat.) +/- 0.09(sys.), consistent with self-similar redshift evolution when compared to lower redshift samples. Additionally, the lensing data constrain the average concentration of the clusters to c_200c=5.6^{+3.7}_{-1.8}.« less

  7. Strong-lensing analysis of A2744 with MUSE and Hubble Frontier Fields images

    NASA Astrophysics Data System (ADS)

    Mahler, G.; Richard, J.; Clément, B.; Lagattuta, D.; Schmidt, K.; Patrício, V.; Soucail, G.; Bacon, R.; Pello, R.; Bouwens, R.; Maseda, M.; Martinez, J.; Carollo, M.; Inami, H.; Leclercq, F.; Wisotzki, L.

    2018-01-01

    We present an analysis of Multi Unit Spectroscopic Explorer (MUSE) observations obtained on the massive Frontier Fields (FFs) cluster A2744. This new data set covers the entire multiply imaged region around the cluster core. The combined catalogue consists of 514 spectroscopic redshifts (with 414 new identifications). We use this redshift information to perform a strong-lensing analysis revising multiple images previously found in the deep FF images, and add three new MUSE-detected multiply imaged systems with no obvious Hubble Space Telescope counterpart. The combined strong-lensing constraints include a total of 60 systems producing 188 images altogether, out of which 29 systems and 83 images are spectroscopically confirmed, making A2744 one of the most well-constrained clusters to date. Thanks to the large amount of spectroscopic redshifts, we model the influence of substructures at larger radii, using a parametrization including two cluster-scale components in the cluster core and several group scale in the outskirts. The resulting model accurately reproduces all the spectroscopic multiple systems, reaching an rms of 0.67 arcsec in the image plane. The large number of MUSE spectroscopic redshifts gives us a robust model, which we estimate reduces the systematic uncertainty on the 2D mass distribution by up to ∼2.5 times the statistical uncertainty in the cluster core. In addition, from a combination of the parametrization and the set of constraints, we estimate the relative systematic uncertainty to be up to 9 per cent at 200 kpc.

  8. Deriving spatial patterns from a novel database of volcanic rock geochemistry in the Virunga Volcanic Province, East African Rift

    NASA Astrophysics Data System (ADS)

    Poppe, Sam; Barette, Florian; Smets, Benoît; Benbakkar, Mhammed; Kervyn, Matthieu

    2016-04-01

    The Virunga Volcanic Province (VVP) is situated within the western branch of the East-African Rift. The geochemistry and petrology of its' volcanic products has been studied extensively in a fragmented manner. They represent a unique collection of silica-undersaturated, ultra-alkaline and ultra-potassic compositions, displaying marked geochemical variations over the area occupied by the VVP. We present a novel spatially-explicit database of existing whole-rock geochemical analyses of the VVP volcanics, compiled from international publications, (post-)colonial scientific reports and PhD theses. In the database, a total of 703 geochemical analyses of whole-rock samples collected from the 1950s until recently have been characterised with a geographical location, eruption source location, analytical results and uncertainty estimates for each of these categories. Comparative box plots and Kruskal-Wallis H tests on subsets of analyses with contrasting ages or analytical methods suggest that the overall database accuracy is consistent. We demonstrate how statistical techniques such as Principal Component Analysis (PCA) and subsequent cluster analysis allow the identification of clusters of samples with similar major-element compositions. The spatial patterns represented by the contrasting clusters show that both the historically active volcanoes represent compositional clusters which can be identified based on their contrasted silica and alkali contents. Furthermore, two sample clusters are interpreted to represent the most primitive, deep magma source within the VVP, different from the shallow magma reservoirs that feed the eight dominant large volcanoes. The samples from these two clusters systematically originate from locations which 1. are distal compared to the eight large volcanoes and 2. mostly coincide with the surface expressions of rift faults or NE-SW-oriented inherited Precambrian structures which were reactivated during rifting. The lava from the Mugogo eruption of 1957 belongs to these primitive clusters and is the only known to have erupted outside the current rift valley in historical times. We thus infer there is a distributed hazard of vent opening susceptibility additional to the susceptibility associated with the main Virunga edifices. This study suggests that the statistical analysis of such geochemical database may help to understand complex volcanic plumbing systems and the spatial distribution of volcanic hazards in active and poorly known volcanic areas such as the Virunga Volcanic Province.

  9. [The occurrence of Echinococcus multilocularis in red foxes in lower Saxony: identification of a high risk area by spatial epidemiological cluster analysis].

    PubMed

    Berke, Olaf; von Keyserlingk, Michael; Broll, Susanne; Kreienbrock, Lothar

    2002-01-01

    There is considerable interest in the spatial distribution of Echinococcus multilocularis in red foxes (Vulpes vulpes L.), because this parasite causes the zoonoses of alveolar echinococcosis which is potentially of high fatality rate. High risk areas are known from France, Switzerland and the Swabian Alb in Germany for a long time. In this work, the spatial scan statistic is introduced as an instrument for identification and localisation of high risk areas, so called disease clusters in spatial epidemiology. The use of the spatial scan statistic along with data about the distribution of the parasite in 5365 red foxes in Lower Saxony, that were collected during 1991 to 1997, led to the identification of another high risk area. The relative risk for this disease cluster is approximated by RR = 5.03 (CI0.95(RR) = [4.27; 6.58]) for the period of 1991 to 1994 and by RR = 4.45 (CI0.95(RR) = [3.53; 5.59]) for the period of 1994 to 1997, respectively.

  10. Salient concerns in using analgesia for cancer pain among outpatients: A cluster analysis study.

    PubMed

    Meghani, Salimah H; Knafl, George J

    2017-02-10

    To identify unique clusters of patients based on their concerns in using analgesia for cancer pain and predictors of the cluster membership. This was a 3-mo prospective observational study ( n = 207). Patients were included if they were adults (≥ 18 years), diagnosed with solid tumors or multiple myelomas, and had at least one prescription of around-the-clock pain medication for cancer or cancer-treatment-related pain. Patients were recruited from two outpatient medical oncology clinics within a large health system in Philadelphia. A choice-based conjoint (CBC) analysis experiment was used to elicit analgesic treatment preferences (utilities). Patients employed trade-offs based on five analgesic attributes (percent relief from analgesics, type of analgesic, type of side-effects, severity of side-effects, out of pocket cost). Patients were clustered based on CBC utilities using novel adaptive statistical methods. Multiple logistic regression was used to identify predictors of cluster membership. The analyses found 4 unique clusters: Most patients made trade-offs based on the expectation of pain relief (cluster 1, 41%). For a subset, the main underlying concern was type of analgesic prescribed, i.e ., opioid vs non-opioid (cluster 2, 11%) and type of analgesic side effects (cluster 4, 21%), respectively. About one in four made trade-offs based on multiple concerns simultaneously including pain relief, type of side effects, and severity of side effects (cluster 3, 28%). In multivariable analysis, to identify predictors of cluster membership, clinical and socioeconomic factors (education, health literacy, income, social support) rather than analgesic attitudes and beliefs were found important; only the belief, i.e ., pain medications can mask changes in health or keep you from knowing what is going on in your body was found significant in predicting two of the four clusters [cluster 1 (-); cluster 4 (+)]. Most patients appear to be driven by a single salient concern in using analgesia for cancer pain. Addressing these concerns, perhaps through real time clinical assessments, may improve patients' analgesic adherence patterns and cancer pain outcomes.

  11. Hierarchical cluster analysis of technical replicates to identify interferents in untargeted mass spectrometry metabolomics.

    PubMed

    Caesar, Lindsay K; Kvalheim, Olav M; Cech, Nadja B

    2018-08-27

    Mass spectral data sets often contain experimental artefacts, and data filtering prior to statistical analysis is crucial to extract reliable information. This is particularly true in untargeted metabolomics analyses, where the analyte(s) of interest are not known a priori. It is often assumed that chemical interferents (i.e. solvent contaminants such as plasticizers) are consistent across samples, and can be removed by background subtraction from blank injections. On the contrary, it is shown here that chemical contaminants may vary in abundance across each injection, potentially leading to their misidentification as relevant sample components. With this metabolomics study, we demonstrate the effectiveness of hierarchical cluster analysis (HCA) of replicate injections (technical replicates) as a methodology to identify chemical interferents and reduce their contaminating contribution to metabolomics models. Pools of metabolites with varying complexity were prepared from the botanical Angelica keiskei Koidzumi and spiked with known metabolites. Each set of pools was analyzed in triplicate and at multiple concentrations using ultraperformance liquid chromatography coupled to mass spectrometry (UPLC-MS). Before filtering, HCA failed to cluster replicates in the data sets. To identify contaminant peaks, we developed a filtering process that evaluated the relative peak area variance of each variable within triplicate injections. These interferent peaks were found across all samples, but did not show consistent peak area from injection to injection, even when evaluating the same chemical sample. This filtering process identified 128 ions that appear to originate from the UPLC-MS system. Data sets collected for a high number of pools with comparatively simple chemical composition were highly influenced by these chemical interferents, as were samples that were analyzed at a low concentration. When chemical interferent masses were removed, technical replicates clustered in all data sets. This work highlights the importance of technical replication in mass spectrometry-based studies, and presents a new application of HCA as a tool for evaluating the effectiveness of data filtering prior to statistical analysis. Copyright © 2018 Elsevier B.V. All rights reserved.

  12. Voxel-based statistical analysis of cerebral blood flow using Tc-99m ECD brain SPECT in patients with traumatic brain injury: group and individual analyses.

    PubMed

    Shin, Yong Beom; Kim, Seong-Jang; Kim, In-Ju; Kim, Yong-Ki; Kim, Dong-Soo; Park, Jae Heung; Yeom, Seok-Ran

    2006-06-01

    Statistical parametric mapping (SPM) was applied to brain perfusion single photon emission computed tomography (SPECT) images in patients with traumatic brain injury (TBI) to investigate regional cerebral abnormalities compared to age-matched normal controls. Thirteen patients with TBI underwent brain perfusion SPECT were included in this study (10 males, three females, mean age 39.8 +/- 18.2, range 21 - 74). SPM2 software implemented in MATLAB 5.3 was used for spatial pre-processing and analysis and to determine the quantitative differences between TBI patients and age-matched normal controls. Three large voxel clusters of significantly decreased cerebral blood perfusion were found in patients with TBI. The largest clusters were area including medial frontal gyrus (voxel number 3642, peak Z-value = 4.31, 4.27, p = 0.000) in both hemispheres. The second largest clusters were areas including cingulated gyrus and anterior cingulate gyrus of left hemisphere (voxel number 381, peak Z-value = 3.67, 3.62, p = 0.000). Other clusters were parahippocampal gyrus (voxel number 173, peak Z-value = 3.40, p = 0.000) and hippocampus (voxel number 173, peak Z-value = 3.23, p = 0.001) in the left hemisphere. The false discovery rate (FDR) was less than 0.04. From this study, group and individual analyses of SPM2 could clearly identify the perfusion abnormalities of brain SPECT in patients with TBI. Group analysis of SPM2 showed hypoperfusion pattern in the areas including medial frontal gyrus of both hemispheres, cingulate gyrus, anterior cingulate gyrus, parahippocampal gyrus and hippocampus in the left hemisphere compared to age-matched normal controls. Also, left parahippocampal gyrus and left hippocampus were additional hypoperfusion areas. However, these findings deserve further investigation on a larger number of patients to be performed to allow a better validation of objective SPM analysis in patients with TBI.

  13. A novel exploratory chemometric approach to environmental monitorring by combining block clustering with Partial Least Square (PLS) analysis

    PubMed Central

    2013-01-01

    Background Given the serious threats posed to terrestrial ecosystems by industrial contamination, environmental monitoring is a standard procedure used for assessing the current status of an environment or trends in environmental parameters. Measurement of metal concentrations at different trophic levels followed by their statistical analysis using exploratory multivariate methods can provide meaningful information on the status of environmental quality. In this context, the present paper proposes a novel chemometric approach to standard statistical methods by combining the Block clustering with Partial least square (PLS) analysis to investigate the accumulation patterns of metals in anthropized terrestrial ecosystems. The present study focused on copper, zinc, manganese, iron, cobalt, cadmium, nickel, and lead transfer along a soil-plant-snai food chain, and the hepatopancreas of the Roman snail (Helix pomatia) was used as a biological end-point of metal accumulation. Results Block clustering deliniates between the areas exposed to industrial and vehicular contamination. The toxic metals have similar distributions in the nettle leaves and snail hepatopancreas. PLS analysis showed that (1) zinc and copper concentrations at the lower trophic levels are the most important latent factors that contribute to metal accumulation in land snails; (2) cadmium and lead are the main determinants of pollution pattern in areas exposed to industrial contamination; (3) at the sites located near roads lead is the most threatfull metal for terrestrial ecosystems. Conclusion There were three major benefits by applying block clustering with PLS for processing the obtained data: firstly, it helped in grouping sites depending on the type of contamination. Secondly, it was valuable for identifying the latent factors that contribute the most to metal accumulation in land snails. Finally, it optimized the number and type of data that are best for monitoring the status of metallic contamination in terrestrial ecosystems exposed to different kinds of anthropic polution. PMID:23987502

  14. Analysis of ligand-protein exchange by Clustering of Ligand Diffusion Coefficient Pairs (CoLD-CoP)

    NASA Astrophysics Data System (ADS)

    Snyder, David A.; Chantova, Mihaela; Chaudhry, Saadia

    2015-06-01

    NMR spectroscopy is a powerful tool in describing protein structures and protein activity for pharmaceutical and biochemical development. This study describes a method to determine weak binding ligands in biological systems by using hierarchic diffusion coefficient clustering of multidimensional data obtained with a 400 MHz Bruker NMR. Comparison of DOSY spectrums of ligands of the chemical library in the presence and absence of target proteins show translational diffusion rates for small molecules upon interaction with macromolecules. For weak binders such as compounds found in fragment libraries, changes in diffusion rates upon macromolecular binding are on the order of the precision of DOSY diffusion measurements, and identifying such subtle shifts in diffusion requires careful statistical analysis. The "CoLD-CoP" (Clustering of Ligand Diffusion Coefficient Pairs) method presented here uses SAHN clustering to identify protein-binders in a chemical library or even a not fully characterized metabolite mixture. We will show how DOSY NMR and the "CoLD-CoP" method complement each other in identifying the most suitable candidates for lysozyme and wheat germ acid phosphatase.

  15. Connecting optical and X-ray tracers of galaxy cluster relaxation

    NASA Astrophysics Data System (ADS)

    Roberts, Ian D.; Parker, Laura C.; Hlavacek-Larrondo, Julie

    2018-04-01

    Substantial effort has been devoted in determining the ideal proxy for quantifying the morphology of the hot intracluster medium in clusters of galaxies. These proxies, based on X-ray emission, typically require expensive, high-quality X-ray observations making them difficult to apply to large surveys of groups and clusters. Here, we compare optical relaxation proxies with X-ray asymmetries and centroid shifts for a sample of Sloan Digital Sky Survey clusters with high-quality, archival X-ray data from Chandra and XMM-Newton. The three optical relaxation measures considered are the shape of the member-galaxy projected velocity distribution - measured by the Anderson-Darling (AD) statistic, the stellar mass gap between the most-massive and second-most-massive cluster galaxy, and the offset between the most-massive galaxy (MMG) position and the luminosity-weighted cluster centre. The AD statistic and stellar mass gap correlate significantly with X-ray relaxation proxies, with the AD statistic being the stronger correlator. Conversely, we find no evidence for a correlation between X-ray asymmetry or centroid shift and the MMG offset. High-mass clusters (Mhalo > 1014.5 M⊙) in this sample have X-ray asymmetries, centroid shifts, and Anderson-Darling statistics which are systematically larger than for low-mass systems. Finally, considering the dichotomy of Gaussian and non-Gaussian clusters (measured by the AD test), we show that the probability of being a non-Gaussian cluster correlates significantly with X-ray asymmetry but only shows a marginal correlation with centroid shift. These results confirm the shape of the radial velocity distribution as a useful proxy for cluster relaxation, which can then be applied to large redshift surveys lacking extensive X-ray coverage.

  16. Development of a model of the tobacco industry's interference with tobacco control programmes

    PubMed Central

    Trochim, W; Stillman, F; Clark, P; Schmitt, C

    2003-01-01

    Objective: To construct a conceptual model of tobacco industry tactics to undermine tobacco control programmes for the purposes of: (1) developing measures to evaluate industry tactics, (2) improving tobacco control planning, and (3) supplementing current or future frameworks used to classify and analyse tobacco industry documents. Design: Web based concept mapping was conducted, including expert brainstorming, sorting, and rating of statements describing industry tactics. Statistical analyses used multidimensional scaling and cluster analysis. Interpretation of the resulting maps was accomplished by an expert panel during a face-to-face meeting. Subjects: 34 experts, selected because of their previous encounters with industry resistance or because of their research into industry tactics, took part in some or all phases of the project. Results: Maps with eight non-overlapping clusters in two dimensional space were developed, with importance ratings of the statements and clusters. Cluster and quadrant labels were agreed upon by the experts. Conclusions: The conceptual maps summarise the tactics used by the industry and their relationships to each other, and suggest a possible hierarchy for measures that can be used in statistical modelling of industry tactics and for review of industry documents. Finally, the maps enable hypothesis of a likely progression of industry reactions as public health programmes become more successful, and therefore more threatening to industry profits. PMID:12773723

  17. First passage times in homogeneous nucleation: Dependence on the total number of particles

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Yvinec, Romain; Bernard, Samuel; Pujo-Menjouet, Laurent

    2016-01-21

    Motivated by nucleation and molecular aggregation in physical, chemical, and biological settings, we present an extension to a thorough analysis of the stochastic self-assembly of a fixed number of identical particles in a finite volume. We study the statistics of times required for maximal clusters to be completed, starting from a pure-monomeric particle configuration. For finite volumes, we extend previous analytical approaches to the case of arbitrary size-dependent aggregation and fragmentation kinetic rates. For larger volumes, we develop a scaling framework to study the first assembly time behavior as a function of the total quantity of particles. We find thatmore » the mean time to first completion of a maximum-sized cluster may have a surprisingly weak dependence on the total number of particles. We highlight how higher statistics (variance, distribution) of the first passage time may nevertheless help to infer key parameters, such as the size of the maximum cluster. Finally, we present a framework to quantify formation of macroscopic sized clusters, which are (asymptotically) very unlikely and occur as a large deviation phenomenon from the mean-field limit. We argue that this framework is suitable to describe phase transition phenomena, as inherent infrequent stochastic processes, in contrast to classical nucleation theory.« less

  18. First passage times in homogeneous nucleation: Dependence on the total number of particles

    NASA Astrophysics Data System (ADS)

    Yvinec, Romain; Bernard, Samuel; Hingant, Erwan; Pujo-Menjouet, Laurent

    2016-01-01

    Motivated by nucleation and molecular aggregation in physical, chemical, and biological settings, we present an extension to a thorough analysis of the stochastic self-assembly of a fixed number of identical particles in a finite volume. We study the statistics of times required for maximal clusters to be completed, starting from a pure-monomeric particle configuration. For finite volumes, we extend previous analytical approaches to the case of arbitrary size-dependent aggregation and fragmentation kinetic rates. For larger volumes, we develop a scaling framework to study the first assembly time behavior as a function of the total quantity of particles. We find that the mean time to first completion of a maximum-sized cluster may have a surprisingly weak dependence on the total number of particles. We highlight how higher statistics (variance, distribution) of the first passage time may nevertheless help to infer key parameters, such as the size of the maximum cluster. Finally, we present a framework to quantify formation of macroscopic sized clusters, which are (asymptotically) very unlikely and occur as a large deviation phenomenon from the mean-field limit. We argue that this framework is suitable to describe phase transition phenomena, as inherent infrequent stochastic processes, in contrast to classical nucleation theory.

  19. Application of radioisotope XRF and thermoluminescence (TL) dating in investigation of pottery from Tell AL-Kasra archaeological site, Syria.

    PubMed

    Abboud, R; Issa, H; Abed-Allah, Y D; Bakraji, E H

    2015-11-01

    Statistical analysis based on chemical composition, using radioisotope X-ray fluorescence, have been applied on 39 ancient pottery fragments coming from the excavation at Tell Al-Kasra archaeological site, Syria. Three groups were defined by applying Cluster and Factor analysis statistical methods. Thermoluminescence (TL) dating was investigated on three sherds taken from the bathroom (hammam) on the site. Multiple aliquot additive dose (MAAD) was used to estimate the paleodose value, and the gamma spectrometry was used to estimate the dose rate. The average age was found to be 715±36 year. Copyright © 2015 Elsevier Ltd. All rights reserved.

  20. Spatiotemporal Analysis of the Malaria Epidemic in Mainland China, 2004-2014.

    PubMed

    Huang, Qiang; Hu, Lin; Liao, Qi-Bin; Xia, Jing; Wang, Qian-Ru; Peng, Hong-Juan

    2017-08-01

    The purpose of this study is to characterize spatiotemporal heterogeneities in malaria distribution at a provincial level and investigate the association between malaria incidence and climate factors from 2004 to 2014 in China to inform current malaria control efforts. National malaria incidence peaked (4.6/100,000) in 2006 and decreased to a very low level (0.21/100,000) in 2014, and the proportion of imported cases increased from 16.2% in 2004 to 98.2% in 2014. Statistical analyses of global and local spatial autocorrelations and purely spatial scan statistics revealed that malaria was localized in Hainan, Anhui, and Yunnan during 2004-2009 and then gradually shifted and clustered in Yunnan after 2010. Purely temporal clusters shortened to less than 5 months during 2012-2014. The two most likely clusters detected using spatiotemporal analysis occurred in Anhui between July 2005 and November 2007 and Yunnan between January 2010 and June 2012. Correlation coefficients for the association between malaria incidence and climate factors sharply decreased after 2010, and there were zero-month lag effects for climate factors during 2010-2014. Overall, the spatiotemporal distribution of malaria in China changed from relatively scattered (2004-2009) to relatively clustered (2010-2014). As the proportion of imported cases increased, the effect of climate factors on malaria incidence has gradually become weaker since 2011. Therefore, new warning systems should be applied to monitor resurgence and outbreaks of malaria in mainland China, and quarantine at borders should be reinforced to control the increasingly trend of imported malaria cases.

  1. A mixture model-based approach to the clustering of microarray expression data.

    PubMed

    McLachlan, G J; Bean, R W; Peel, D

    2002-03-01

    This paper introduces the software EMMIX-GENE that has been developed for the specific purpose of a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic used in conjunction with a threshold on the size of a cluster allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so the use of mixtures of factor analyzers is exploited to reduce effectively the dimension of the feature space of genes. The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes are able to be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets. EMMIX-GENE is available at http://www.maths.uq.edu.au/~gjm/emmix-gene/

  2. Bayesian statistics as a new tool for spectral analysis - I. Application for the determination of basic parameters of massive stars

    NASA Astrophysics Data System (ADS)

    Mugnes, J.-M.; Robert, C.

    2015-11-01

    Spectral analysis is a powerful tool to investigate stellar properties and it has been widely used for decades now. However, the methods considered to perform this kind of analysis are mostly based on iteration among a few diagnostic lines to determine the stellar parameters. While these methods are often simple and fast, they can lead to errors and large uncertainties due to the required assumptions. Here, we present a method based on Bayesian statistics to find simultaneously the best combination of effective temperature, surface gravity, projected rotational velocity, and microturbulence velocity, using all the available spectral lines. Different tests are discussed to demonstrate the strength of our method, which we apply to 54 mid-resolution spectra of field and cluster B stars obtained at the Observatoire du Mont-Mégantic. We compare our results with those found in the literature. Differences are seen which are well explained by the different methods used. We conclude that the B-star microturbulence velocities are often underestimated. We also confirm the trend that B stars in clusters are on average faster rotators than field B stars.

  3. [Temporal-spatial analysis of bacillary dysentery in the Three Gorges Area of China, 2005-2016].

    PubMed

    Zhang, P; Zhang, J; Chang, Z R; Li, Z J

    2018-01-10

    Objective: To analyze the spatial and temporal distributions of bacillary dysentery in Chongqing, Yichang and Enshi (the Three Gorges Area) from 2005 to 2016, and provide evidence for the disease prevention and control. Methods: The incidence data of bacillary dysentery in the Three Gorges Area during this period were collected from National Notifiable Infectious Disease Reporting System. The spatial-temporal scan statistic was conducted with software SaTScan 9.4 and bacillary dysentery clusters were visualized with software ArcGIS 10.3. Results: A total of 126 196 cases were reported in the Three Gorges Area during 2005-2016, with an average incidence rate of 29.67/100 000. The overall incidence was in a downward trend, with an average annual decline rate of 4.74%. Cases occurred all the year round but with an obvious seasonal increase between May and October. Among the reported cases, 44.71% (56 421/126 196) were children under 5-year-old, the cases in children outside child care settings accounted for 41.93% (52 918/126 196) of the total. The incidence rates in districts of Yuzhong, Dadukou, Jiangbei, Shapingba, Jiulongpo, Nanan, Yubei, Chengkou of Chongqing and districts of Xiling and Wujiagang of Yichang city of Hubei province were high, ranging from 60.20/100 000 to 114.81/100 000. Spatial-temporal scan statistic for the spatial and temporal distributions of bacillary dysentery during this period revealed that the temporal distribution was during May-October, and there were 12 class Ⅰ clusters, 35 class Ⅱ clusters, and 9 clusters without statistical significance in counties with high incidence. All the class Ⅰ clusters were in urban area of Chongqing (Yuzhong, Dadukou, Jiangbei, Shapingba, Jiulongpo, Nanan, Beibei, Yubei, Banan) and surrounding counties, and the class Ⅱ clusters transformed from concentrated distribution to scattered distribution. Conclusions: Temporal and spatial cluster of bacillary dysentery incidence existed in the three gorges area during 2005-2016. It is necessary to strengthen the bacillary dysentery prevention and control in urban areas of Chongqing and Yichang.

  4. Benthic foraminiferal assemblages as bio-indicators of metals contamination in sediments, Qarun Lake as a case study, Egypt

    NASA Astrophysics Data System (ADS)

    Abd El Naby, Ahmed; Al Menoufy, Safia; Gad, Ahmed

    2018-03-01

    Qarun Lake, in the Fayoum Depression of the Western Desert of Egypt, lies within the deepest area in the River Nile flood plain. The drainage water in the Qarun Lake is derived from the discharge of the natural and artificial drainage systems in the Fayoum. Mixed domestic and agricultural pollutants, including heavy metals, nitrates, phosphates, sulfates and pesticides, are discharged into Qarun Lake. Forty-six samples, collected from the undisturbed layer of sediments were used for benthic foraminiferal analysis. Concentrations of some selected trace metal elements (Cd, Co, Cr, Cu, Fe, Mn, Ni, Pb, Sr, V, and Zn) were also determined. Statistical analysis of the abiotic variables (Texture distribution of sediments, Physico-chemical parameters, and metals concentrations) and of the biotic variables (distribution of benthic foraminiferal species) were also performed. The Q-mode cluster analysis of benthic foraminiferal distribution has provided evidence that the Qarun Lake can be subdivided into two cluster groups (A and B), reflecting environmental changes in the lake ecosystem. Cluster B can also be subdivided into two sub-clusters (B1 and B2). The presence of only pollution tolerant taxa with higher faunal density and lower diversity and the absence of the other foraminiferal assemblages in cluster A were attributed to the high concentration of trace metal elements and the strong environmental stress at the eastern and central parts of the Qarun Lake.

  5. Pore-scale micro-computed-tomography imaging: Nonwetting-phase cluster-size distribution during drainage and imbibition

    NASA Astrophysics Data System (ADS)

    Georgiadis, A.; Berg, S.; Makurat, A.; Maitland, G.; Ott, H.

    2013-09-01

    We investigated the cluster-size distribution of the residual nonwetting phase in a sintered glass-bead porous medium at two-phase flow conditions, by means of micro-computed-tomography (μCT) imaging with pore-scale resolution. Cluster-size distribution functions and cluster volumes were obtained by image analysis for a range of injected pore volumes under both imbibition and drainage conditions; the field of view was larger than the porosity-based representative elementary volume (REV). We did not attempt to make a definition for a two-phase REV but used the nonwetting-phase cluster-size distribution as an indicator. Most of the nonwetting-phase total volume was found to be contained in clusters that were one to two orders of magnitude larger than the porosity-based REV. The largest observed clusters in fact ranged in volume from 65% to 99% of the entire nonwetting phase in the field of view. As a consequence, the largest clusters observed were statistically not represented and were found to be smaller than the estimated maximum cluster length. The results indicate that the two-phase REV is larger than the field of view attainable by μCT scanning, at a resolution which allows for the accurate determination of cluster connectivity.

  6. Spatial Differentiation of Landscape Values in the Murray River Region of Victoria, Australia

    NASA Astrophysics Data System (ADS)

    Zhu, Xuan; Pfueller, Sharron; Whitelaw, Paul; Winter, Caroline

    2010-05-01

    This research advances the understanding of the location of perceived landscape values through a statistically based approach to spatial analysis of value densities. Survey data were obtained from a sample of people living in and using the Murray River region, Australia, where declining environmental quality prompted a reevaluation of its conservation status. When densities of 12 perceived landscape values were mapped using geographic information systems (GIS), valued places clustered along the entire river bank and in associated National/State Parks and reserves. While simple density mapping revealed high value densities in various locations, it did not indicate what density of a landscape value could be regarded as a statistically significant hotspot or distinguish whether overlapping areas of high density for different values indicate identical or adjacent locations. A spatial statistic Getis-Ord Gi* was used to indicate statistically significant spatial clusters of high value densities or “hotspots”. Of 251 hotspots, 40% were for single non-use values, primarily spiritual, therapeutic or intrinsic. Four hotspots had 11 landscape values. Two, lacking economic value, were located in ecologically important river red gum forests and two, lacking wilderness value, were near the major towns of Echuca-Moama and Albury-Wodonga. Hotspots for eight values showed statistically significant associations with another value. There were high associations between learning and heritage values while economic and biological diversity values showed moderate associations with several other direct and indirect use values. This approach may improve confidence in the interpretation of spatial analysis of landscape values by enhancing understanding of value relationships.

  7. Weak Lensing Peaks in Simulated Light-Cones: Investigating the Coupling between Dark Matter and Dark Energy

    NASA Astrophysics Data System (ADS)

    Giocoli, Carlo; Moscardini, Lauro; Baldi, Marco; Meneghetti, Massimo; Metcalf, Robert B.

    2018-05-01

    In this paper, we study the statistical properties of weak lensing peaks in light-cones generated from cosmological simulations. In order to assess the prospects of such observable as a cosmological probe, we consider simulations that include interacting Dark Energy (hereafter DE) models with coupling term between DE and Dark Matter. Cosmological models that produce a larger population of massive clusters have more numerous high signal-to-noise peaks; among models with comparable numbers of clusters those with more concentrated haloes produce more peaks. The most extreme model under investigation shows a difference in peak counts of about 20% with respect to the reference ΛCDM model. We find that peak statistics can be used to distinguish a coupling DE model from a reference one with the same power spectrum normalisation. The differences in the expansion history and the growth rate of structure formation are reflected in their halo counts, non-linear scale features and, through them, in the properties of the lensing peaks. For a source redshift distribution consistent with the expectations of future space-based wide field surveys, we find that typically seventy percent of the cluster population contributes to weak-lensing peaks with signal-to-noise ratios larger than two, and that the fraction of clusters in peaks approaches one-hundred percent for haloes with redshift z ≤ 0.5. Our analysis demonstrates that peak statistics are an important tool for disentangling DE models by accurately tracing the structure formation processes as a function of the cosmic time.

  8. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Murugesan, Sugeerth; Bouchard, Kristofer; Chang, Edward

    There exists a need for effective and easy-to-use software tools supporting the analysis of complex Electrocorticography (ECoG) data. Understanding how epileptic seizures develop or identifying diagnostic indicators for neurological diseases require the in-depth analysis of neural activity data from ECoG. Such data is multi-scale and is of high spatio-temporal resolution. Comprehensive analysis of this data should be supported by interactive visual analysis methods that allow a scientist to understand functional patterns at varying levels of granularity and comprehend its time-varying behavior. We introduce a novel multi-scale visual analysis system, ECoG ClusterFlow, for the detailed exploration of ECoG data. Our systemmore » detects and visualizes dynamic high-level structures, such as communities, derived from the time-varying connectivity network. The system supports two major views: 1) an overview summarizing the evolution of clusters over time and 2) an electrode view using hierarchical glyph-based design to visualize the propagation of clusters in their spatial, anatomical context. We present case studies that were performed in collaboration with neuroscientists and neurosurgeons using simulated and recorded epileptic seizure data to demonstrate our system's effectiveness. ECoG ClusterFlow supports the comparison of spatio-temporal patterns for specific time intervals and allows a user to utilize various clustering algorithms. Neuroscientists can identify the site of seizure genesis and its spatial progression during various the stages of a seizure. Our system serves as a fast and powerful means for the generation of preliminary hypotheses that can be used as a basis for subsequent application of rigorous statistical methods, with the ultimate goal being the clinical treatment of epileptogenic zones.« less

  9. The use of multicomponent statistical analysis in hydrogeological environmental research.

    PubMed

    Lambrakis, Nicolaos; Antonakos, Andreas; Panagopoulos, George

    2004-04-01

    The present article examines the possibilities of investigating NO(3)(-) spread in aquifers by applying multicomponent statistical methods (factor, cluster and discriminant analysis) on hydrogeological, hydrochemical, and environmental parameters. A 4-R-Mode factor model determined from the analysis showed its useful role in investigating hydrogeological parameters affecting NO(3)(-) concentration, such as its dilution by upcoming groundwater of the recharge areas. The relationship between NO(3)(-) concentration and agricultural activities can be determined sufficiently by the first factor which relies on NO(3)(-) and SO(4)(2-) of the same origin-that of agricultural fertilizers. The other three factors of R-Mode analysis are not connected directly to the NO(3)(-) problem. They do however, by extracting the role of the unsaturated zone, show an interesting relationship between organic matter content, thickness and saturated hydraulic conductivity. The application of Hirerarchical Cluster Analysis, based on all possible combinations of classification method, showed two main groups of samples. The first group comprises samples from the edges and the second from the central part of the study area. By the application of Discriminant Analysis it was shown that NO(3)(-) and SO(4)(2-) ions are the most significant variables in the discriminant function. Therefore, the first group is considered to comprise all samples from areas not influenced by fertilizers lying on the edges of contaminating activities such as crop cultivation, while the second comprises all the other samples.

  10. ParallABEL: an R library for generalized parallelization of genome-wide association studies.

    PubMed

    Sangket, Unitsa; Mahasirimongkol, Surakameth; Chantratita, Wasun; Tandayya, Pichaya; Aulchenko, Yurii S

    2010-04-29

    Genome-Wide Association (GWA) analysis is a powerful method for identifying loci associated with complex traits and drug response. Parts of GWA analyses, especially those involving thousands of individuals and consuming hours to months, will benefit from parallel computation. It is arduous acquiring the necessary programming skills to correctly partition and distribute data, control and monitor tasks on clustered computers, and merge output files. Most components of GWA analysis can be divided into four groups based on the types of input data and statistical outputs. The first group contains statistics computed for a particular Single Nucleotide Polymorphism (SNP), or trait, such as SNP characterization statistics or association test statistics. The input data of this group includes the SNPs/traits. The second group concerns statistics characterizing an individual in a study, for example, the summary statistics of genotype quality for each sample. The input data of this group includes individuals. The third group consists of pair-wise statistics derived from analyses between each pair of individuals in the study, for example genome-wide identity-by-state or genomic kinship analyses. The input data of this group includes pairs of SNPs/traits. The final group concerns pair-wise statistics derived for pairs of SNPs, such as the linkage disequilibrium characterisation. The input data of this group includes pairs of individuals. We developed the ParallABEL library, which utilizes the Rmpi library, to parallelize these four types of computations. ParallABEL library is not only aimed at GenABEL, but may also be employed to parallelize various GWA packages in R. The data set from the North American Rheumatoid Arthritis Consortium (NARAC) includes 2,062 individuals with 545,080, SNPs' genotyping, was used to measure ParallABEL performance. Almost perfect speed-up was achieved for many types of analyses. For example, the computing time for the identity-by-state matrix was linearly reduced from approximately eight hours to one hour when ParallABEL employed eight processors. Executing genome-wide association analysis using the ParallABEL library on a computer cluster is an effective way to boost performance, and simplify the parallelization of GWA studies. ParallABEL is a user-friendly parallelization of GenABEL.

  11. The influence of academic examinations on energy and nutrient intake in male university students.

    PubMed

    Barker, Margo E; Blain, Richard J; Russell, Jean M

    2015-09-25

    Taking examinations is central to student experience at University and may cause psychological stress. Although stress is recognised to impact on food intake, the effects of undertaking examinations on students' dietary intake have not been well characterised. The purpose of this study was to assess how students' energy and nutrient intake may alter during examination periods. The study design was a within-subject comparison of students' energy and nutrient intake during an examination period contrasted with that outside an examination period (baseline). A total of 20 male students from the University of Sheffield completed an automated photographic 4-d dietary record alongside four 24-h recalls in each time period. Daily energy and nutrient intake was estimated for each student by time period and change in energy and nutrient intake calculated. Intakes at baseline were compared to UK dietary recommendations. Cluster analysis categorised students according to their change in energy intake between baseline and the examination period. Non-parametric statistical tests identified differences by cluster. Baseline intakes did not meet recommendations for energy, non-milk extrinsic sugars, non-starch polysaccharide and sodium. Three defined clusters of students were identified: Cluster D who decreased daily energy intake by 12.06 MJ (n = 5), Cluster S who had similar energy intakes (n = 13) and Cluster I who substantially increased energy intake by 6.37 MJ (n = 2) between baseline and examination period. There were statistically significant differences (all p < 0.05) in change in intake of protein, carbohydrate, calcium and sodium between clusters. Cluster D recorded greater energy, carbohydrate and protein intakes than Cluster I at baseline. The majority of students were dietary resilient. Students who demonstrated hypophagia in the examination period had a high energy and nutrient intake at baseline, conversely those who showed hyperphagia had a low energy and nutrient intake. These patterns require confirmation in studies including women, but if confirmed, there is need to address some students' poor food choice especially during examinations.

  12. Removal of impulse noise clusters from color images with local order statistics

    NASA Astrophysics Data System (ADS)

    Ruchay, Alexey; Kober, Vitaly

    2017-09-01

    This paper proposes a novel algorithm for restoring images corrupted with clusters of impulse noise. The noise clusters often occur when the probability of impulse noise is very high. The proposed noise removal algorithm consists of detection of bulky impulse noise in three color channels with local order statistics followed by removal of the detected clusters by means of vector median filtering. With the help of computer simulation we show that the proposed algorithm is able to effectively remove clustered impulse noise. The performance of the proposed algorithm is compared in terms of image restoration metrics with that of common successful algorithms.

  13. Statistical Exploration of Electronic Structure of Molecules from Quantum Monte-Carlo Simulations

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Prabhat, Mr; Zubarev, Dmitry; Lester, Jr., William A.

    In this report, we present results from analysis of Quantum Monte Carlo (QMC) simulation data with the goal of determining internal structure of a 3N-dimensional phase space of an N-electron molecule. We are interested in mining the simulation data for patterns that might be indicative of the bond rearrangement as molecules change electronic states. We examined simulation output that tracks the positions of two coupled electrons in the singlet and triplet states of an H2 molecule. The electrons trace out a trajectory, which was analyzed with a number of statistical techniques. This project was intended to address the following scientificmore » questions: (1) Do high-dimensional phase spaces characterizing electronic structure of molecules tend to cluster in any natural way? Do we see a change in clustering patterns as we explore different electronic states of the same molecule? (2) Since it is hard to understand the high-dimensional space of trajectories, can we project these trajectories to a lower dimensional subspace to gain a better understanding of patterns? (3) Do trajectories inherently lie in a lower-dimensional manifold? Can we recover that manifold? After extensive statistical analysis, we are now in a better position to respond to these questions. (1) We definitely see clustering patterns, and differences between the H2 and H2tri datasets. These are revealed by the pamk method in a fairly reliable manner and can potentially be used to distinguish bonded and non-bonded systems and get insight into the nature of bonding. (2) Projecting to a lower dimensional subspace ({approx}4-5) using PCA or Kernel PCA reveals interesting patterns in the distribution of scalar values, which can be related to the existing descriptors of electronic structure of molecules. Also, these results can be immediately used to develop robust tools for analysis of noisy data obtained during QMC simulations (3) All dimensionality reduction and estimation techniques that we tried seem to indicate that one needs 4 or 5 components to account for most of the variance in the data, hence this 5D dataset does not necessarily lie on a well-defined, low dimensional manifold. In terms of specific clustering techniques, K-means was generally useful in exploring the dataset. The partition around medoids (pam) technique produced the most definitive results for our data showing distinctive patterns for both a sample of the complete data and time-series. The gap statistic with tibshirani criteria did not provide any distinction across the 2 dataset. The gap statistic w/DandF criteria, Model based clustering and hierarchical modeling simply failed to run on our datasets. Thankfully, the vanilla PCA technique was successful in handling our entire dataset. PCA revealed some interesting patterns for the scalar value distribution. Kernel PCA techniques (vanilladot, RBF, Polynomial) and MDS failed to run on the entire dataset, or even a significant fraction of the dataset, and we resorted to creating an explicit feature map followed by conventional PCA. Clustering using K-means and PAM in the new basis set seems to produce promising results. Understanding the new basis set in the scientific context of the problem is challenging, and we are currently working to further examine and interpret the results.« less

  14. Multi-Parent Clustering Algorithms from Stochastic Grammar Data Models

    NASA Technical Reports Server (NTRS)

    Mjoisness, Eric; Castano, Rebecca; Gray, Alexander

    1999-01-01

    We introduce a statistical data model and an associated optimization-based clustering algorithm which allows data vectors to belong to zero, one or several "parent" clusters. For each data vector the algorithm makes a discrete decision among these alternatives. Thus, a recursive version of this algorithm would place data clusters in a Directed Acyclic Graph rather than a tree. We test the algorithm with synthetic data generated according to the statistical data model. We also illustrate the algorithm using real data from large-scale gene expression assays.

  15. The MUSE-Wide survey: detection of a clustering signal from Lyman α emitters in the range 3 < z < 6

    NASA Astrophysics Data System (ADS)

    Diener, C.; Wisotzki, L.; Schmidt, K. B.; Herenz, E. C.; Urrutia, T.; Garel, T.; Kerutt, J.; Saust, R. L.; Bacon, R.; Cantalupo, S.; Contini, T.; Guiderdoni, B.; Marino, R. A.; Richard, J.; Schaye, J.; Soucail, G.; Weilbacher, P. M.

    2017-11-01

    We present a clustering analysis of a sample of 238 Ly α emitters at redshift 3 ≲ z ≲ 6 from the MUSE-Wide survey. This survey mosaics extragalactic legacy fields with 1h MUSE pointings to detect statistically relevant samples of emission line galaxies. We analysed the first year observations from MUSE-Wide making use of the clustering signal in the line-of-sight direction. This method relies on comparing pair-counts at close redshifts for a fixed transverse distance and thus exploits the full potential of the redshift range covered by our sample. A clear clustering signal with a correlation length of r0=2.9^{+1.0}_{-1.1} Mpc (comoving) is detected. Whilst this result is based on only about a quarter of the full survey size, it already shows the immense potential of MUSE for efficiently observing and studying the clustering of Ly α emitters.

  16. Detecting subtle hydrochemical anomalies with multivariate statistics: an example from homogeneous groundwaters in the Great Artesian Basin, Australia

    NASA Astrophysics Data System (ADS)

    O'Shea, Bethany; Jankowski, Jerzy

    2006-12-01

    The major ion composition of Great Artesian Basin groundwater in the lower Namoi River valley is relatively homogeneous in chemical composition. Traditional graphical techniques have been combined with multivariate statistical methods to determine whether subtle differences in the chemical composition of these waters can be delineated. Hierarchical cluster analysis and principal components analysis were successful in delineating minor variations within the groundwaters of the study area that were not visually identified in the graphical techniques applied. Hydrochemical interpretation allowed geochemical processes to be identified in each statistically defined water type and illustrated how these groundwaters differ from one another. Three main geochemical processes were identified in the groundwaters: ion exchange, precipitation, and mixing between waters from different sources. Both statistical methods delineated an anomalous sample suspected of being influenced by magmatic CO2 input. The use of statistical methods to complement traditional graphical techniques for waters appearing homogeneous is emphasized for all investigations of this type. Copyright

  17. Linnorm: improved statistical analysis for single cell RNA-seq expression data

    PubMed Central

    Yip, Shun H.; Wang, Panwen; Kocher, Jean-Pierre A.; Sham, Pak Chung

    2017-01-01

    Abstract Linnorm is a novel normalization and transformation method for the analysis of single cell RNA sequencing (scRNA-seq) data. Linnorm is developed to remove technical noises and simultaneously preserve biological variations in scRNA-seq data, such that existing statistical methods can be improved. Using real scRNA-seq data, we compared Linnorm with existing normalization methods, including NODES, SAMstrt, SCnorm, scran, DESeq and TMM. Linnorm shows advantages in speed, technical noise removal and preservation of cell heterogeneity, which can improve existing methods in the discovery of novel subtypes, pseudo-temporal ordering of cells, clustering analysis, etc. Linnorm also performs better than existing DEG analysis methods, including BASiCS, NODES, SAMstrt, Seurat and DESeq2, in false positive rate control and accuracy. PMID:28981748

  18. Phenotypic mapping of metabolic profiles using self-organizing maps of high-dimensional mass spectrometry data.

    PubMed

    Goodwin, Cody R; Sherrod, Stacy D; Marasco, Christina C; Bachmann, Brian O; Schramm-Sapyta, Nicole; Wikswo, John P; McLean, John A

    2014-07-01

    A metabolic system is composed of inherently interconnected metabolic precursors, intermediates, and products. The analysis of untargeted metabolomics data has conventionally been performed through the use of comparative statistics or multivariate statistical analysis-based approaches; however, each falls short in representing the related nature of metabolic perturbations. Herein, we describe a complementary method for the analysis of large metabolite inventories using a data-driven approach based upon a self-organizing map algorithm. This workflow allows for the unsupervised clustering, and subsequent prioritization of, correlated features through Gestalt comparisons of metabolic heat maps. We describe this methodology in detail, including a comparison to conventional metabolomics approaches, and demonstrate the application of this method to the analysis of the metabolic repercussions of prolonged cocaine exposure in rat sera profiles.

  19. Accounting for Multiple Births in Neonatal and Perinatal Trials: Systematic Review and Case Study

    PubMed Central

    Hibbs, Anna Maria; Black, Dennis; Palermo, Lisa; Cnaan, Avital; Luan, Xianqun; Truog, William E; Walsh, Michele C; Ballard, Roberta A

    2010-01-01

    Objectives To determine the prevalence in the neonatal literature of statistical approaches accounting for the unique clustering patterns of multiple births. To explore the sensitivity of an actual trial to several analytic approaches to multiples. Methods A systematic review of recent perinatal trials assessed the prevalence of studies accounting for clustering of multiples. The NO CLD trial served as a case study of the sensitivity of the outcome to several statistical strategies. We calculated odds ratios using non-clustered (logistic regression) and clustered (generalized estimating equations, multiple outputation) analyses. Results In the systematic review, most studies did not describe the randomization of twins and did not account for clustering. Of those studies that did, exclusion of multiples and generalized estimating equations were the most common strategies. The NO CLD study included 84 infants with a sibling enrolled in the study. Multiples were more likely than singletons to be white and were born to older mothers (p<0.01). Analyses that accounted for clustering were statistically significant; analyses assuming independence were not. Conclusions The statistical approach to multiples can influence the odds ratio and width of confidence intervals, thereby affecting the interpretation of a study outcome. A minority of perinatal studies address this issue. PMID:19969305

  20. Quantitative classification of a historic northern Wisconsin (U.S.A.) landscape: mapping forests at regional scales

    Treesearch

    Lisa A. Schulte; David J. Mladenoff; Erik V. Nordheim

    2002-01-01

    We developed a quantitative and replicable classification system to improve understanding of historical composition and structure within northern Wisconsin's forests. The classification system was based on statistical cluster analysis and two forest metrics, relative dominance (% basal area) and relative importance (mean of relative dominance and relative density...

Top