Sample records for statistical methods cluster

  1. Cluster mass inference via random field theory.

    PubMed

    Zhang, Hui; Nichols, Thomas E; Johnson, Timothy D

    2009-01-01

    Cluster extent and voxel intensity are two widely used statistics in neuroimaging inference. Cluster extent is sensitive to spatially extended signals, while voxel intensity is better for intense but focal signals. To leverage the strengths of both statistics, several nonparametric permutation methods have been proposed to combine the two. Simulation studies have shown that, of the different cluster permutation methods, the cluster mass statistic is generally the best. However, to date, no parametric cluster mass inference is available. In this paper, we propose a cluster mass inference method based on random field theory (RFT). We develop the method for Gaussian images, evaluate it on Gaussian and Gaussianized t-statistic images, and investigate its statistical properties via simulation studies and real data. Simulation results show that the method is valid under the null hypothesis and can be more powerful than cluster extent inference. Further, analyses of a single-subject and a group fMRI dataset demonstrate better power than traditional cluster size inference, and good accuracy relative to a gold-standard permutation test.
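    The cluster mass idea can be sketched with a toy permutation-style analogue (not the paper's parametric RFT method): threshold a smooth statistic image, sum the suprathreshold excess within each connected cluster, and compare the maximum cluster mass against a null built from pure-noise images. Image size, smoothing, and the threshold below are illustrative choices.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

def cluster_masses(stat_img, threshold):
    """Sum of suprathreshold excess within each connected cluster."""
    labels, n = ndimage.label(stat_img > threshold)
    if n == 0:
        return np.array([0.0])
    return ndimage.sum(stat_img - threshold, labels, index=np.arange(1, n + 1))

def smooth_noise(shape, sigma=2.0):
    """Unit-variance smoothed Gaussian noise image."""
    f = ndimage.gaussian_filter(rng.standard_normal(shape), sigma)
    return f / f.std()

# Toy "statistic image": smooth unit-variance noise plus a focal signal
img = smooth_noise((64, 64))
img[28:36, 28:36] += 1.5

thr = 1.0
observed = cluster_masses(img, thr).max()

# Null distribution of the maximum cluster mass from pure-noise images
null = [cluster_masses(smooth_noise((64, 64)), thr).max() for _ in range(200)]
p = (1 + sum(v >= observed for v in null)) / (1 + len(null))
print(f"max cluster mass = {observed:.1f}, p = {p:.3f}")
```

    The same loop with cluster sizes instead of masses gives the cluster extent statistic the abstract compares against.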

  2. A spatial scan statistic for multiple clusters.

    PubMed

    Li, Xiao-Zhou; Wang, Jin-Feng; Yang, Wei-Zhong; Li, Zhong-Jie; Lai, Sheng-Jie

    2011-10-01

    Spatial scan statistics are commonly used for geographical disease surveillance and cluster detection. When multiple clusters coexist in the study area, they become difficult to detect because of their shadowing effect on one another. The recently proposed sequential method showed better power for detecting the second, weaker cluster, but did not improve detection of the first, stronger cluster, which is the more important of the two. We propose a new extension of the spatial scan statistic for detecting multiple clusters. By constructing two or more clusters in the alternative hypothesis, the proposed method accounts for other coexisting clusters during detection and evaluation. Its performance is compared to that of the sequential method through an intensive simulation study, in which it shows better power both in rejecting the null hypothesis and in accurately detecting the coexisting clusters. In a real study of hand-foot-mouth disease data in Pingdu city, a true cluster town is successfully detected by the proposed method; the standard method cannot establish its statistical significance because of another cluster's shadowing effect. Copyright © 2011 Elsevier Inc. All rights reserved.
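    For concreteness, a minimal sketch of a Kulldorff-style Poisson scan statistic with Monte Carlo inference, reduced to contiguous windows over a 1-D sequence of regions; the shadowing problem the authors address arises when more than one such window carries excess risk. Region counts, the planted outbreak, and the number of replicates are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def poisson_llr(c, e, C):
    """Kulldorff's Poisson log-likelihood ratio for a window with c observed
    and e expected cases, out of C total cases (expected counts scaled to C)."""
    if c <= e or c == 0:
        return 0.0
    inside = c * np.log(c / e)
    outside = (C - c) * np.log((C - c) / (C - e)) if C > c else 0.0
    return inside + outside

def scan(cases, expected):
    """Best LLR over all contiguous windows of regions."""
    C = cases.sum()
    best = 0.0
    for i in range(len(cases)):
        for j in range(i + 1, len(cases) + 1):
            best = max(best, poisson_llr(cases[i:j].sum(), expected[i:j].sum(), C))
    return best

# 20 regions with equal expectations; plant an outbreak in regions 5-7
expected = np.full(20, 10.0)
cases = rng.poisson(expected)
cases[5:8] += rng.poisson(3 * expected[5:8])

scaled = expected * cases.sum() / expected.sum()  # expected scaled to total cases
obs = scan(cases, scaled)

# Monte Carlo null: redistribute all cases proportionally to expectations
null = []
for _ in range(99):
    sim = rng.multinomial(cases.sum(), expected / expected.sum())
    null.append(scan(sim, scaled))
p = (1 + sum(v >= obs for v in null)) / 100
print(f"scan LLR = {obs:.1f}, p = {p:.2f}")
```

    The multiple-cluster extension described above would instead place two or more windows in the alternative hypothesis, so the LLR of each candidate is evaluated net of the other coexisting clusters.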

  3. A spatial scan statistic for nonisotropic two-level risk cluster.

    PubMed

    Li, Xiao-Zhou; Wang, Jin-Feng; Yang, Wei-Zhong; Li, Zhong-Jie; Lai, Sheng-Jie

    2012-01-30

    Spatial scan statistic methods are commonly used for geographical disease surveillance and cluster detection. The standard spatial scan statistic does not model any variability in the underlying risks of subregions belonging to a detected cluster. For a multilevel risk cluster, the isotonic spatial scan statistic can model a centralized high-risk kernel within the cluster. Because variations in disease risk are anisotropic, owing to differing social, economic, or transport factors, the real high-risk kernel does not necessarily occupy the center of the whole cluster area. We propose a spatial scan statistic for a nonisotropic two-level risk cluster, which can detect a whole cluster and a noncentralized high-risk kernel within it simultaneously. The performance of the three methods was evaluated through an intensive simulation study. Our proposed nonisotropic two-level method showed better power and geographical precision in two-level risk cluster scenarios, especially for a noncentralized high-risk kernel. The method is illustrated on the hand-foot-mouth disease data from Pingdu City, Shandong, China, in May 2009, in comparison with the two other methods. In this practical study, the nonisotropic two-level method was the only one to precisely detect a high-risk area within a detected whole cluster. Copyright © 2011 John Wiley & Sons, Ltd.

  4. Testing prediction methods: Earthquake clustering versus the Poisson model

    USGS Publications Warehouse

    Michael, A.J.

    1997-01-01

    Testing earthquake prediction methods requires statistical techniques that compare observed success to random chance. One technique is to produce simulated earthquake catalogs and measure the relative success of predicting real and simulated earthquakes. The accuracy of these tests depends on the validity of the statistical model used to simulate the earthquakes. This study tests the effect of clustering in the statistical earthquake model on the results. Three simulation models were used to produce significance levels for a VLF earthquake prediction method. As the degree of simulated clustering increases, the statistical significance drops. Hence, the use of a seismicity model with insufficient clustering can lead to overly optimistic results. A successful method must pass the statistical tests with a model that fully replicates the observed clustering. However, a method can be rejected based on tests with a model that contains insufficient clustering. U.S. copyright. Published in 1997 by the American Geophysical Union.

  5. A scan statistic for binary outcome based on hypergeometric probability model, with an application to detecting spatial clusters of Japanese encephalitis.

    PubMed

    Zhao, Xing; Zhou, Xiao-Hua; Feng, Zijian; Guo, Pengfei; He, Hongyan; Zhang, Tao; Duan, Lei; Li, Xiaosong

    2013-01-01

    As a useful tool for geographical cluster detection of events, the spatial scan statistic is widely applied in many fields and plays an increasingly important role. The classic version of the spatial scan statistic for a binary outcome was developed by Kulldorff, based on the Bernoulli or the Poisson probability model. In this paper, we apply the hypergeometric probability model to construct the likelihood function under the null hypothesis. Compared with existing methods, this is an alternative, indirect way to identify the potential cluster, with the test statistic taken as the extreme value of the likelihood function. As in Kulldorff's methods, we adopt a Monte Carlo test of significance. Both methods are applied to detecting spatial clusters of Japanese encephalitis in Sichuan province, China, in 2009, and the detected clusters are identical. A simulation on independent benchmark data indicates that the test statistic based on the hypergeometric model outperforms Kulldorff's statistics for clusters of high population density or large size; otherwise, Kulldorff's statistics are superior.

  6. A note on the kappa statistic for clustered dichotomous data.

    PubMed

    Zhou, Ming; Yang, Zhao

    2014-06-30

    The kappa statistic is widely used to assess the agreement between two raters. Motivated by a simulation-based cluster bootstrap method for calculating the variance of the kappa statistic for clustered physician-patient dichotomous data, we investigate its special correlation structure and develop a new, simple, and efficient data generation algorithm. For clustered physician-patient dichotomous data, based on the delta method and the special covariance structure, we propose a semi-parametric variance estimator for the kappa statistic. An extensive Monte Carlo simulation study evaluates the performance of the new proposal and five existing methods with respect to the empirical coverage probability, root-mean-square error, and average width of the 95% confidence interval for the kappa statistic. The variance estimator ignoring the dependence within a cluster is generally inappropriate, while the variance estimators from the new proposal, the bootstrap-based methods, and the sampling-based delta method perform reasonably well for at least a moderately large number of clusters (e.g., K ≥ 50). The new proposal and the sampling-based delta method provide convenient tools for efficient computation and non-simulation-based alternatives to the existing bootstrap-based methods. Moreover, the new proposal has acceptable performance even when the number of clusters is as small as K = 25. To illustrate the practical application of all the methods, one psychiatric research dataset and two simulated clustered physician-patient dichotomous datasets are analyzed. Copyright © 2014 John Wiley & Sons, Ltd.
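    A cluster bootstrap variance for the kappa statistic, one of the existing methods the authors compare against, can be sketched as follows. The simulated physician-patient data, rater accuracies, and replicate count are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(2)

def kappa(a, b):
    """Cohen's kappa for two dichotomous raters."""
    a, b = np.asarray(a), np.asarray(b)
    po = np.mean(a == b)                                  # observed agreement
    p1, p2 = a.mean(), b.mean()
    pe = p1 * p2 + (1 - p1) * (1 - p2)                    # chance agreement
    return (po - pe) / (1 - pe)

# Simulated clustered data: 50 physicians (clusters), 10 patients each, with a
# cluster-level random effect inducing within-cluster correlation
K, n = 50, 10
clusters = []
for u in rng.normal(0, 1, K):
    truth = rng.random(n) < 1 / (1 + np.exp(-u))          # cluster-specific prevalence
    r1 = np.where(rng.random(n) < 0.9, truth, ~truth)     # rater 1, 90% accurate
    r2 = np.where(rng.random(n) < 0.9, truth, ~truth)     # rater 2, 90% accurate
    clusters.append((r1, r2))

a = np.concatenate([c[0] for c in clusters])
b = np.concatenate([c[1] for c in clusters])
k_hat = kappa(a, b)

# Cluster bootstrap: resample whole clusters with replacement
boot = []
for _ in range(500):
    idx = rng.integers(0, K, K)
    boot.append(kappa(np.concatenate([clusters[i][0] for i in idx]),
                      np.concatenate([clusters[i][1] for i in idx])))
se = np.std(boot, ddof=1)
print(f"kappa = {k_hat:.3f}, cluster-bootstrap SE = {se:.3f}")
```

    Resampling clusters rather than individual patients is what preserves the within-cluster dependence that a naive variance estimator ignores.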

  7. Symptom Clusters in Advanced Cancer Patients: An Empirical Comparison of Statistical Methods and the Impact on Quality of Life.

    PubMed

    Dong, Skye T; Costa, Daniel S J; Butow, Phyllis N; Lovell, Melanie R; Agar, Meera; Velikova, Galina; Teckle, Paulos; Tong, Allison; Tebbutt, Niall C; Clarke, Stephen J; van der Hoek, Kim; King, Madeleine T; Fayers, Peter M

    2016-01-01

    Symptom clusters in advanced cancer can influence patient outcomes. There is large heterogeneity in the methods used to identify symptom clusters. To investigate the consistency of symptom cluster composition in advanced cancer patients using different statistical methodologies for all patients across five primary cancer sites, and to examine which clusters predict functional status, a global assessment of health and global quality of life. Principal component analysis and exploratory factor analysis (with different rotation and factor selection methods) and hierarchical cluster analysis (with different linkage and similarity measures) were used on a data set of 1562 advanced cancer patients who completed the European Organization for the Research and Treatment of Cancer Quality of Life Questionnaire-Core 30. Four clusters consistently formed for many of the methods and cancer sites: tense-worry-irritable-depressed (emotional cluster), fatigue-pain, nausea-vomiting, and concentration-memory (cognitive cluster). The emotional cluster was a stronger predictor of overall quality of life than the other clusters. Fatigue-pain was a stronger predictor of overall health than the other clusters. The cognitive cluster and fatigue-pain predicted physical functioning, role functioning, and social functioning. The four identified symptom clusters were consistent across statistical methods and cancer types, although there were some noteworthy differences. Statistical derivation of symptom clusters is in need of greater methodological guidance. A psychosocial pathway in the management of symptom clusters may improve quality of life. Biological mechanisms underpinning symptom clusters need to be delineated by future research. A framework for evidence-based screening, assessment, treatment, and follow-up of symptom clusters in advanced cancer is essential. Copyright © 2016 American Academy of Hospice and Palliative Medicine. Published by Elsevier Inc. All rights reserved.

  8. Performance of cancer cluster Q-statistics for case-control residential histories

    PubMed Central

    Sloan, Chantel D.; Jacquez, Geoffrey M.; Gallagher, Carolyn M.; Ward, Mary H.; Raaschou-Nielsen, Ole; Nordsborg, Rikke Baastrup; Meliker, Jaymie R.

    2012-01-01

    Few investigations of health event clustering have evaluated residential mobility, though causative exposures for chronic diseases such as cancer often occur long before diagnosis. Recently developed Q-statistics incorporate human mobility into disease cluster investigations by quantifying space- and time-dependent nearest neighbor relationships. Using residential histories from two cancer case-control studies, we created simulated clusters to examine Q-statistic performance. Results suggest that the intersection of cases with significant clustering over their life course, Qi, with cases that are constituents of significant local clusters at given times, Qit, yielded the best performance, which improved with increasing cluster size. Upon comparison, a larger proportion of true positives were detected with Kulldorff's spatial scan method if the time of clustering was provided. We recommend using Q-statistics to identify when and where clustering may have occurred, followed by the scan method to localize the candidate clusters. Future work should investigate the generalizability of these findings. PMID:23149326

  9. Cluster size statistic and cluster mass statistic: two novel methods for identifying changes in functional connectivity between groups or conditions.

    PubMed

    Ing, Alex; Schwarzbauer, Christian

    2014-01-01

    Functional connectivity has become an increasingly important area of research in recent years. At a typical spatial resolution, approximately 300 million connections link each voxel in the brain with every other. This pattern of connectivity is known as the functional connectome. Connectivity is often compared between experimental groups and conditions. Standard methods used to control the Type I error rate are likely to be insensitive when comparisons are carried out across the whole connectome, due to the huge number of statistical tests involved. To address this problem, two new cluster-based methods, the cluster size statistic (CSS) and the cluster mass statistic (CMS), are introduced to control the familywise error rate across all connectivity values. These methods operate within a statistical framework similar to the cluster-based methods used in conventional task-based fMRI. Both methods are data-driven, permutation-based, and require minimal statistical assumptions. Here, the performance of each procedure is evaluated in a receiver operating characteristic (ROC) analysis utilising a simulated dataset. The relative sensitivity of each method is also tested on real data: BOLD (blood oxygen level dependent) fMRI scans were carried out on twelve subjects under normal conditions and during the hypercapnic state (induced through the inhalation of 6% CO2 in 21% O2 and 73% N2). Both CSS and CMS detected significant changes in connectivity between the normal and hypercapnic states. A familywise error correction carried out at the individual connection level detected no significant changes in connectivity.

  10. Cluster Size Statistic and Cluster Mass Statistic: Two Novel Methods for Identifying Changes in Functional Connectivity Between Groups or Conditions

    PubMed Central

    Ing, Alex; Schwarzbauer, Christian

    2014-01-01

    Functional connectivity has become an increasingly important area of research in recent years. At a typical spatial resolution, approximately 300 million connections link each voxel in the brain with every other. This pattern of connectivity is known as the functional connectome. Connectivity is often compared between experimental groups and conditions. Standard methods used to control the Type I error rate are likely to be insensitive when comparisons are carried out across the whole connectome, due to the huge number of statistical tests involved. To address this problem, two new cluster-based methods, the cluster size statistic (CSS) and the cluster mass statistic (CMS), are introduced to control the familywise error rate across all connectivity values. These methods operate within a statistical framework similar to the cluster-based methods used in conventional task-based fMRI. Both methods are data-driven, permutation-based, and require minimal statistical assumptions. Here, the performance of each procedure is evaluated in a receiver operating characteristic (ROC) analysis utilising a simulated dataset. The relative sensitivity of each method is also tested on real data: BOLD (blood oxygen level dependent) fMRI scans were carried out on twelve subjects under normal conditions and during the hypercapnic state (induced through the inhalation of 6% CO2 in 21% O2 and 73% N2). Both CSS and CMS detected significant changes in connectivity between the normal and hypercapnic states. A familywise error correction carried out at the individual connection level detected no significant changes in connectivity. PMID:24906136

  11. A scan statistic to extract causal gene clusters from case-control genome-wide rare CNV data.

    PubMed

    Nishiyama, Takeshi; Takahashi, Kunihiko; Tango, Toshiro; Pinto, Dalila; Scherer, Stephen W; Takami, Satoshi; Kishino, Hirohisa

    2011-05-26

    Several statistical tests have been developed for analyzing genome-wide association data by incorporating gene pathway information in terms of gene sets. Using these methods, hundreds of gene sets are typically tested, and the tested gene sets often overlap. This overlapping greatly increases the probability of generating false positives, and the results obtained are difficult to interpret, particularly when many gene sets show statistical significance. We propose a flexible statistical framework to circumvent these problems. Inspired by spatial scan statistics for detecting clustering of disease occurrence in the field of epidemiology, we developed a scan statistic to extract disease-associated gene clusters from a whole gene pathway. Extracting one or a few significant gene clusters from a global pathway limits the overall false positive probability, which results in increased statistical power, and facilitates the interpretation of test results. In the present study, we applied our method to genome-wide association data for rare copy-number variations, which have been strongly implicated in common diseases. Application of our method to a simulated dataset demonstrated the high accuracy of this method in detecting disease-associated gene clusters in a whole gene pathway. The scan statistic approach proposed here shows a high level of accuracy in detecting gene clusters in a whole gene pathway. This study has provided a sound statistical framework for analyzing genome-wide rare CNV data by incorporating topological information on the gene pathway.

  12. Detecting Genomic Clustering of Risk Variants from Sequence Data: Cases vs. Controls

    PubMed Central

    Schaid, Daniel J.; Sinnwell, Jason P.; McDonnell, Shannon K.; Thibodeau, Stephen N.

    2013-01-01

    As the ability to measure dense genetic markers approaches the limit of the DNA sequence itself, taking advantage of possible clustering of genetic variants in, and around, a gene would benefit genetic association analyses and likely provide biological insights. The greatest benefit might be realized when multiple rare variants cluster in a functional region. Several statistical tests have been developed, one of which is based on the popular Kulldorff scan statistic for spatial clustering of disease. We extended another popular spatial clustering method, Tango's statistic, to genomic sequence data. An advantage of Tango's method is that it is rapid to compute, and when a single test statistic is computed, its distribution is well approximated by a scaled chi-square distribution, making computation of p-values very rapid. We compared the Type I error rates and power of several clustering statistics, as well as the omnibus sequence kernel association test (SKAT). Although our version of Tango's statistic, which we call the "Kernel Distance" statistic, took approximately half the time to compute compared with the Kulldorff scan statistic, it had slightly less power than the scan statistic. Our results showed that the Ionita-Laza version of Kulldorff's scan statistic had the greatest power over a range of clustering scenarios. PMID:23842950
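    A Tango-style "kernel distance" statistic of the kind extended here can be sketched on toy data: the case-control difference in variant frequency profiles, smoothed by a distance kernel over genomic positions, with label-swap permutations for the p-value (the paper instead exploits a scaled chi-square approximation). Positions, the kernel bandwidth lam, and the planted excess are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)

# Variant positions and per-variant minor allele counts in cases/controls
pos = np.sort(rng.integers(0, 100_000, 60))
case = rng.poisson(1.0, 60)
ctrl = rng.poisson(1.0, 60)
case[25:30] += rng.poisson(3.0, 5)     # clustered excess among cases

lam = 5_000.0                          # kernel bandwidth in bp (illustrative)
A = np.exp(-np.abs(pos[:, None] - pos[None, :]) / lam)

def kernel_stat(c1, c2):
    """Quadratic-form distance between case and control frequency profiles."""
    d = c1 / c1.sum() - c2 / c2.sum()
    return float(d @ A @ d)

obs = kernel_stat(case, ctrl)

# Permutation null: redistribute each site's total count between the groups
tot = case + ctrl
null = []
for _ in range(199):
    pc = rng.binomial(tot, 0.5)
    null.append(kernel_stat(pc, tot - pc))
p = (1 + sum(v >= obs for v in null)) / 200
print(f"kernel distance = {obs:.4f}, p = {p:.3f}")
```

    Because the exponential kernel A is positive definite, the statistic is always positive, and nearby co-varying sites reinforce each other, which is what makes the statistic sensitive to positional clustering.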

  13. Spatial scan statistics for detection of multiple clusters with arbitrary shapes.

    PubMed

    Lin, Pei-Sheng; Kung, Yi-Hung; Clayton, Murray

    2016-12-01

    In applying scan statistics for public health research, it would be valuable to develop a detection method for multiple clusters that accommodates spatial correlation and covariate effects in an integrated model. In this article, we connect the concepts of the likelihood ratio (LR) scan statistic and the quasi-likelihood (QL) scan statistic to provide a series of detection procedures sufficiently flexible to apply to clusters of arbitrary shape. First, we use an independent scan model for detection of clusters and then a variogram tool to examine the existence of spatial correlation and regional variation based on residuals of the independent scan model. When the estimate of regional variation is significantly different from zero, a mixed QL estimating equation is developed to estimate coefficients of geographic clusters and covariates. We use the Benjamini-Hochberg procedure (1995) to find a threshold for p-values to address the multiple testing problem. A quasi-deviance criterion is used to regroup the estimated clusters to find geographic clusters with arbitrary shapes. We conduct simulations to compare the performance of the proposed method with other scan statistics. For illustration, the method is applied to enterovirus data from Taiwan. © 2016, The International Biometric Society.

  14. Comparisons of non-Gaussian statistical models in DNA methylation analysis.

    PubMed

    Ma, Zhanyu; Teschendorff, Andrew E; Yu, Hong; Taghia, Jalil; Guo, Jun

    2014-06-16

    As a key regulatory mechanism of gene expression, DNA methylation patterns are widely altered in many complex genetic diseases, including cancer. DNA methylation is naturally quantified by bounded support data; therefore, it is non-Gaussian distributed. In order to capture such properties, we introduce some non-Gaussian statistical models to perform dimension reduction on DNA methylation data. Afterwards, non-Gaussian statistical model-based unsupervised clustering strategies are applied to cluster the data. Comparisons and analysis of different dimension reduction strategies and unsupervised clustering methods are presented. Experimental results show that the non-Gaussian statistical model-based methods are superior to the conventional Gaussian distribution-based method. They are meaningful tools for DNA methylation analysis. Moreover, among several non-Gaussian methods, the one that captures the bounded nature of DNA methylation data reveals the best clustering performance.
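    One way to respect the bounded (0, 1) support, in the spirit of the non-Gaussian models compared above, is a two-component beta mixture. The EM-style loop below uses weighted method-of-moments beta estimates in the M-step, a simplification of full maximum likelihood; the data and all settings are simulated illustrations, not the paper's models.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(3)

def mom_beta(x, w):
    """Weighted method-of-moments estimates of beta shape parameters."""
    m = np.average(x, weights=w)
    v = np.average((x - m) ** 2, weights=w)
    common = m * (1 - m) / v - 1
    return max(m * common, 0.1), max((1 - m) * common, 0.1)

# Simulated beta-values: unmethylated vs. methylated CpG sites
x = np.concatenate([rng.beta(2, 10, 300), rng.beta(10, 2, 200)])

# Initialize responsibilities with a hard median split, then iterate
lo = (x < np.median(x)).astype(float)
resp = np.vstack([lo, 1 - lo])
for _ in range(50):
    pi = resp.sum(axis=1) / len(x)                        # M-step: mixing weights
    params = [mom_beta(x, resp[k]) for k in range(2)]     # M-step: beta shapes
    dens = np.vstack([pi[k] * beta.pdf(x, *params[k]) for k in range(2)])
    resp = dens / dens.sum(axis=0)                        # E-step
labels = resp.argmax(axis=0)
print("component means:", sorted(a / (a + b) for a, b in params))
```

    A Gaussian mixture on the same data would assign nonzero density outside (0, 1), which is the mismatch the bounded-support models above are designed to avoid.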

  15. Comparisons of Non-Gaussian Statistical Models in DNA Methylation Analysis

    PubMed Central

    Ma, Zhanyu; Teschendorff, Andrew E.; Yu, Hong; Taghia, Jalil; Guo, Jun

    2014-01-01

    As a key regulatory mechanism of gene expression, DNA methylation patterns are widely altered in many complex genetic diseases, including cancer. DNA methylation is naturally quantified by bounded support data; therefore, it is non-Gaussian distributed. In order to capture such properties, we introduce some non-Gaussian statistical models to perform dimension reduction on DNA methylation data. Afterwards, non-Gaussian statistical model-based unsupervised clustering strategies are applied to cluster the data. Comparisons and analysis of different dimension reduction strategies and unsupervised clustering methods are presented. Experimental results show that the non-Gaussian statistical model-based methods are superior to the conventional Gaussian distribution-based method. They are meaningful tools for DNA methylation analysis. Moreover, among several non-Gaussian methods, the one that captures the bounded nature of DNA methylation data reveals the best clustering performance. PMID:24937687

  16. Scalable Integrated Region-Based Image Retrieval Using IRM and Statistical Clustering.

    ERIC Educational Resources Information Center

    Wang, James Z.; Du, Yanping

    Statistical clustering is critical in designing scalable image retrieval systems. This paper presents a scalable algorithm for indexing and retrieving images based on region segmentation. The method uses statistical clustering on region features and IRM (Integrated Region Matching), a measure developed to evaluate overall similarity between images…

  17. Developing appropriate methods for cost-effectiveness analysis of cluster randomized trials.

    PubMed

    Gomes, Manuel; Ng, Edmond S-W; Grieve, Richard; Nixon, Richard; Carpenter, James; Thompson, Simon G

    2012-01-01

    Cost-effectiveness analyses (CEAs) may use data from cluster randomized trials (CRTs), where the unit of randomization is the cluster, not the individual. However, most studies use analytical methods that ignore clustering. This article compares alternative statistical methods for accommodating clustering in CEAs of CRTs. Our simulation study compared the performance of statistical methods for CEAs of CRTs with 2 treatment arms. The study considered a method that ignored clustering--seemingly unrelated regression (SUR) without a robust standard error (SE)--and 4 methods that recognized clustering--SUR and generalized estimating equations (GEEs), both with robust SE, a "2-stage" nonparametric bootstrap (TSB) with shrinkage correction, and a multilevel model (MLM). The base case assumed CRTs with moderate numbers of balanced clusters (20 per arm) and normally distributed costs. Other scenarios included CRTs with few clusters, imbalanced cluster sizes, and skewed costs. Performance was reported as bias, root mean squared error (rMSE), and confidence interval (CI) coverage for estimating incremental net benefits (INBs). We also compared the methods in a case study. Each method reported low levels of bias. Without the robust SE, SUR gave poor CI coverage (base case: 0.89 v. nominal level: 0.95). The MLM and TSB performed well in each scenario (CI coverage, 0.92-0.95). With few clusters, the GEE and SUR (with robust SE) had coverage below 0.90. In the case study, the mean INBs were similar across all methods, but ignoring clustering underestimated statistical uncertainty and the value of further research. MLMs and the TSB are appropriate analytical methods for CEAs of CRTs with the characteristics described. SUR and GEE are not recommended for studies with few clusters.
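    The cost of ignoring clustering can be sketched with a toy simulation: a naive i.i.d. standard error for the incremental cost versus a simple cluster-level analysis on cluster means (a stand-in for the MLM and TSB approaches studied; cluster counts, cost parameters, and variance components are illustrative assumptions, not the paper's scenarios).

```python
import numpy as np

rng = np.random.default_rng(4)

K, n = 20, 30          # clusters per arm, patients per cluster
between_sd = 5.0       # cluster-level cost variation (induces the ICC)
within_sd = 10.0       # patient-level cost variation

def simulate_arm(mean_cost):
    """K clusters of n patient costs, with a shared cluster effect."""
    effects = rng.normal(0, between_sd, K)
    return np.array([rng.normal(mean_cost + u, within_sd, n) for u in effects])

control = simulate_arm(100.0)
treated = simulate_arm(110.0)  # true incremental cost = 10

# Naive analysis: treat all patients as independent
diff = treated.mean() - control.mean()
naive_se = np.sqrt(treated.var(ddof=1) / treated.size
                   + control.var(ddof=1) / control.size)

# Cluster-level analysis: compare cluster means (respects the design)
t_means, c_means = treated.mean(axis=1), control.mean(axis=1)
cluster_se = np.sqrt(t_means.var(ddof=1) / K + c_means.var(ddof=1) / K)

print(f"diff = {diff:.1f}, naive SE = {naive_se:.2f}, cluster SE = {cluster_se:.2f}")
```

    With a nonzero intracluster correlation the naive SE is markedly too small, which is the mechanism behind the poor confidence interval coverage reported for SUR without a robust SE.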

  18. Manipulating measurement scales in medical statistical analysis and data mining: A review of methodologies

    PubMed Central

    Marateb, Hamid Reza; Mansourian, Marjan; Adibi, Peyman; Farina, Dario

    2014-01-01

    Background: selecting the correct statistical test and data mining method depends highly on the measurement scale of the data, the type of variables, and the purpose of the analysis. Different measurement scales are studied in detail, and statistical comparison, modeling, and data mining methods are discussed using several medical examples. We present two ordinal-variable clustering examples, as a more challenging case, using the Wisconsin Breast Cancer Data (WBCD). Ordinal-to-interval scale conversion example: a breast cancer database of nine 10-level ordinal variables for 683 patients was analyzed by two ordinal-scale clustering methods. The performance of the clustering methods was assessed by comparison with the gold-standard groups of malignant and benign cases that had been identified by clinical tests. Results: the sensitivity and accuracy of the two clustering methods were 98% and 96%, respectively; their specificity was comparable. Conclusion: using an appropriate clustering algorithm based on the measurement scale of the variables in the study yields high performance. Moreover, descriptive and inferential statistics, in addition to the modeling approach, must be selected based on the scale of the variables. PMID:24672565

  19. Developing Appropriate Methods for Cost-Effectiveness Analysis of Cluster Randomized Trials

    PubMed Central

    Gomes, Manuel; Ng, Edmond S.-W.; Nixon, Richard; Carpenter, James; Thompson, Simon G.

    2012-01-01

    Aim. Cost-effectiveness analyses (CEAs) may use data from cluster randomized trials (CRTs), where the unit of randomization is the cluster, not the individual. However, most studies use analytical methods that ignore clustering. This article compares alternative statistical methods for accommodating clustering in CEAs of CRTs. Methods. Our simulation study compared the performance of statistical methods for CEAs of CRTs with 2 treatment arms. The study considered a method that ignored clustering—seemingly unrelated regression (SUR) without a robust standard error (SE)—and 4 methods that recognized clustering—SUR and generalized estimating equations (GEEs), both with robust SE, a “2-stage” nonparametric bootstrap (TSB) with shrinkage correction, and a multilevel model (MLM). The base case assumed CRTs with moderate numbers of balanced clusters (20 per arm) and normally distributed costs. Other scenarios included CRTs with few clusters, imbalanced cluster sizes, and skewed costs. Performance was reported as bias, root mean squared error (rMSE), and confidence interval (CI) coverage for estimating incremental net benefits (INBs). We also compared the methods in a case study. Results. Each method reported low levels of bias. Without the robust SE, SUR gave poor CI coverage (base case: 0.89 v. nominal level: 0.95). The MLM and TSB performed well in each scenario (CI coverage, 0.92–0.95). With few clusters, the GEE and SUR (with robust SE) had coverage below 0.90. In the case study, the mean INBs were similar across all methods, but ignoring clustering underestimated statistical uncertainty and the value of further research. Conclusions. MLMs and the TSB are appropriate analytical methods for CEAs of CRTs with the characteristics described. SUR and GEE are not recommended for studies with few clusters. PMID:22016450

  20. The Effect of Cluster Sampling Design in Survey Research on the Standard Error Statistic.

    ERIC Educational Resources Information Center

    Wang, Lin; Fan, Xitao

    Standard statistical methods are used to analyze data that is assumed to be collected using a simple random sampling scheme. These methods, however, tend to underestimate variance when the data is collected with a cluster design, which is often found in educational survey research. The purposes of this paper are to demonstrate how a cluster design…

  21. Cluster-level statistical inference in fMRI datasets: The unexpected behavior of random fields in high dimensions.

    PubMed

    Bansal, Ravi; Peterson, Bradley S

    2018-06-01

    Identifying regional effects of interest in MRI datasets usually entails testing a priori hypotheses across many thousands of brain voxels, requiring control for false positive findings in these multiple hypotheses testing. Recent studies have suggested that parametric statistical methods may have incorrectly modeled functional MRI data, thereby leading to higher false positive rates than their nominal rates. Nonparametric methods for statistical inference when conducting multiple statistical tests, in contrast, are thought to produce false positives at the nominal rate, which has thus led to the suggestion that previously reported studies should reanalyze their fMRI data using nonparametric tools. To understand better why parametric methods may yield excessive false positives, we assessed their performance when applied both to simulated datasets of 1D, 2D, and 3D Gaussian Random Fields (GRFs) and to 710 real-world, resting-state fMRI datasets. We showed that both the simulated 2D and 3D GRFs and the real-world data contain a small percentage (<6%) of very large clusters (on average 60 times larger than the average cluster size), which were not present in 1D GRFs. These unexpectedly large clusters were deemed statistically significant using parametric methods, leading to empirical familywise error rates (FWERs) as high as 65%: the high empirical FWERs were not a consequence of parametric methods failing to model spatial smoothness accurately, but rather of these very large clusters that are inherently present in smooth, high-dimensional random fields. In fact, when discounting these very large clusters, the empirical FWER for parametric methods was 3.24%. Furthermore, even an empirical FWER of 65% would yield on average less than one of those very large clusters in each brain-wide analysis. 
Nonparametric methods, in contrast, estimated distributions from those large clusters, and therefore, by construction, rejected the large clusters as false positives at the nominal FWERs. Those rejected clusters were outlying values in the distribution of cluster size but could not be distinguished from true positive findings without further analyses, including assessing whether fMRI signal in those regions correlates with other clinical, behavioral, or cognitive measures. Rejecting the large clusters, however, significantly reduced the statistical power of nonparametric methods in detecting true findings compared with parametric methods, which would have detected most true findings that are essential for making valid biological inferences in MRI data. Parametric analyses, in contrast, detected most true findings while generating relatively few false positives: on average, less than one of those very large clusters would be deemed a true finding in each brain-wide analysis. We therefore recommend the continued use of parametric methods that model nonstationary smoothness for cluster-level, familywise control of false positives, particularly when using a Cluster Defining Threshold of 2.5 or higher, and subsequently assessing rigorously the biological plausibility of the findings, even for large clusters. Finally, because nonparametric methods yielded a large reduction in statistical power to detect true positive findings, we conclude that the modest reduction in false positive findings that nonparametric analyses afford does not warrant a re-analysis of previously published fMRI studies using nonparametric techniques. Copyright © 2018 Elsevier Inc. All rights reserved.

  2. Application of Scan Statistics to Detect Suicide Clusters in Australia

    PubMed Central

    Cheung, Yee Tak Derek; Spittal, Matthew J.; Williamson, Michelle Kate; Tung, Sui Jay; Pirkis, Jane

    2013-01-01

Background Suicide clustering occurs when multiple suicide incidents take place in a small area and/or within a short period of time. Despite multi-national research attention and particular efforts in preparing guidelines for tackling suicide clusters, the broader epidemiology of suicide clustering remains unclear. This study aimed to develop techniques for using scan statistics to detect clusters, with the detection of suicide clusters in Australia as an example. Methods and Findings Scan statistics were applied to detect clusters among suicides occurring between 2004 and 2008. Parameter settings were varied and the scan area was changed to remedy shortcomings in existing methods. In total, 243 suicides out of 10,176 (2.4%) were identified as belonging to 15 suicide clusters. These clusters were mainly located in the Northern Territory, the northern part of Western Australia, and the northern part of Queensland. Among the 15 clusters, 4 (26.7%) were detected by both the national and state cluster detections, 8 (53.3%) were detected only by the state cluster detection, and 3 (20%) only by the national cluster detection. Conclusions These findings illustrate that the majority of spatial-temporal clusters of suicide were located in the inland northern areas, which have socio-economic deprivation and higher proportions of indigenous people. Discrepancies between national and state/territory cluster detection by scan statistics were due to the contrast in underlying suicide rates across states/territories. Performing both small-area and large-area analyses and applying multiple parameter settings may yield the maximum benefit for exploring clusters. PMID:23342098

  3. Evaluation of the Gini Coefficient in Spatial Scan Statistics for Detecting Irregularly Shaped Clusters

    PubMed Central

    Kim, Jiyu; Jung, Inkyung

    2017-01-01

    Spatial scan statistics with circular or elliptic scanning windows are commonly used for cluster detection in various applications, such as the identification of geographical disease clusters from epidemiological data. It has been pointed out that the method may have difficulty in correctly identifying non-compact, arbitrarily shaped clusters. In this paper, we evaluated the Gini coefficient for detecting irregularly shaped clusters through a simulation study. The Gini coefficient, the use of which in spatial scan statistics was recently proposed, is a criterion measure for optimizing the maximum reported cluster size. Our simulation study results showed that using the Gini coefficient works better than the original spatial scan statistic for identifying irregularly shaped clusters, by reporting an optimized and refined collection of clusters rather than a single larger cluster. We have provided a real data example that seems to support the simulation results. We think that using the Gini coefficient in spatial scan statistics can be helpful for the detection of irregularly shaped clusters. PMID:28129368
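The Gini coefficient itself is straightforward to compute. The sketch below is a simplification: the published criterion builds the coefficient from the observed and expected cases of the reported clusters via a Lorenz curve, whereas this hypothetical example applies a plain Gini coefficient to cluster case counts to illustrate the intuition that a refined collection with one sharp cluster scores higher than an evenly split, washed-out one.

```python
def gini(values):
    """Gini coefficient of a list of non-negative values (0 = perfect equality)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula based on cumulative weights of the sorted values.
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * cum) / (n * total) - (n + 1.0) / n

# Hypothetical case counts of the non-overlapping clusters reported at two
# maximum-cluster-size settings; the setting with the higher Gini coefficient
# concentrates the excess cases in a more refined collection of clusters.
concentrated = [50, 3, 2]    # one sharp cluster plus small ones
diffuse = [20, 19, 16]       # cases spread evenly across large clusters
print(gini(concentrated) > gini(diffuse))  # True
```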

  4. Managing Clustered Data Using Hierarchical Linear Modeling

    ERIC Educational Resources Information Center

    Warne, Russell T.; Li, Yan; McKyer, E. Lisako J.; Condie, Rachel; Diep, Cassandra S.; Murano, Peter S.

    2012-01-01

Researchers in nutrition often use cluster or multistage sampling to gather participants for their studies. These sampling methods often produce violations of the assumption of data independence that most traditional statistics share. Hierarchical linear modeling is a statistical method that can overcome violations of the independence…

  5. Semisupervised Clustering by Iterative Partition and Regression with Neuroscience Applications

    PubMed Central

    Qian, Guoqi; Wu, Yuehua; Ferrari, Davide; Qiao, Puxue; Hollande, Frédéric

    2016-01-01

Regression clustering is a statistical learning and data mining method that mixes unsupervised and supervised learning and finds application in a wide range of fields, including artificial intelligence and neuroscience. It performs unsupervised learning when it clusters the data according to their respective unobserved regression hyperplanes, and supervised learning when it fits regression hyperplanes to the corresponding data clusters. Applying regression clustering in practice requires a means of determining the underlying number of clusters in the data, finding the cluster label of each data point, and estimating the regression coefficients of the model. In this paper, we review the estimation and selection issues in regression clustering with regard to least squares and robust statistical methods. We also provide a model-selection-based technique to determine the number of regression clusters underlying the data, and develop a computing procedure for regression clustering estimation and selection. Finally, simulation studies are presented for assessing the procedure, together with the analysis of a real data set on RGB cell marking in neuroscience to illustrate and interpret the method. PMID:27212939
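A minimal sketch of the iterative partition-and-regression loop for lines in the plane, with hypothetical data; the paper treats general hyperplanes, robust fitting, and selection of the number of clusters, none of which this sketch attempts.

```python
def fit_line(pts):
    """Ordinary least-squares fit y = slope*x + intercept.
    Assumes the cluster keeps at least two distinct x values."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts)
    sxy = sum((x - mx) * (y - my) for x, y in pts)
    slope = sxy / sxx
    return slope, my - slope * mx

def regression_cluster(pts, k=2, iters=20):
    """Iterative partition and regression: assign each point to the line
    with the smallest residual, refit the lines, repeat until stable."""
    labels = [i % k for i in range(len(pts))]   # simple deterministic start
    for _ in range(iters):
        lines = [fit_line([p for p, g in zip(pts, labels) if g == j])
                 for j in range(k)]
        new = [min(range(k),
                   key=lambda j: abs(y - (lines[j][0] * x + lines[j][1])))
               for x, y in pts]
        if new == labels:
            break
        labels = new
    return labels, lines

# Hypothetical data from two parallel lines, y = 2x and y = 2x + 20.
pts = [(x, 2 * x) for x in range(1, 6)] + [(x, 2 * x + 20) for x in range(1, 6)]
labels, lines = regression_cluster(pts)
print(labels)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```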

  6. Local multiplicity adjustment for the spatial scan statistic using the Gumbel distribution.

    PubMed

    Gangnon, Ronald E

    2012-03-01

    The spatial scan statistic is an important and widely used tool for cluster detection. It is based on the simultaneous evaluation of the statistical significance of the maximum likelihood ratio test statistic over a large collection of potential clusters. In most cluster detection problems, there is variation in the extent of local multiplicity across the study region. For example, using a fixed maximum geographic radius for clusters, urban areas typically have many overlapping potential clusters, whereas rural areas have relatively few. The spatial scan statistic does not account for local multiplicity variation. We describe a previously proposed local multiplicity adjustment based on a nested Bonferroni correction and propose a novel adjustment based on a Gumbel distribution approximation to the distribution of a local scan statistic. We compare the performance of all three statistics in terms of power and a novel unbiased cluster detection criterion. These methods are then applied to the well-known New York leukemia dataset and a Wisconsin breast cancer incidence dataset. © 2011, The International Biometric Society.
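To illustrate the kind of Gumbel approximation discussed here, the sketch below moment-fits a Gumbel distribution to a hypothetical sample of local scan-statistic maxima and evaluates an upper-tail p-value; the paper's actual local multiplicity adjustment is more involved than this.

```python
import math

def gumbel_fit_moments(sample):
    """Method-of-moments fit of a Gumbel distribution to a sample of maxima."""
    n = len(sample)
    mean = sum(sample) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    beta = sd * math.sqrt(6.0) / math.pi     # scale
    mu = mean - 0.5772156649 * beta          # location (Euler-Mascheroni const.)
    return mu, beta

def gumbel_upper_p(x, mu, beta):
    """P(M > x) under the fitted Gumbel distribution."""
    return 1.0 - math.exp(-math.exp(-(x - mu) / beta))

# Hypothetical local maxima of a scan statistic from null replications.
null_maxima = [1.0, 2.0, 3.0, 4.0, 5.0]
mu, beta = gumbel_fit_moments(null_maxima)
print(round(gumbel_upper_p(6.0, mu, beta), 3))  # 0.048
```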

  7. Local multiplicity adjustment for the spatial scan statistic using the Gumbel distribution

    PubMed Central

    Gangnon, Ronald E.

    2011-01-01

    Summary The spatial scan statistic is an important and widely used tool for cluster detection. It is based on the simultaneous evaluation of the statistical significance of the maximum likelihood ratio test statistic over a large collection of potential clusters. In most cluster detection problems, there is variation in the extent of local multiplicity across the study region. For example, using a fixed maximum geographic radius for clusters, urban areas typically have many overlapping potential clusters, while rural areas have relatively few. The spatial scan statistic does not account for local multiplicity variation. We describe a previously proposed local multiplicity adjustment based on a nested Bonferroni correction and propose a novel adjustment based on a Gumbel distribution approximation to the distribution of a local scan statistic. We compare the performance of all three statistics in terms of power and a novel unbiased cluster detection criterion. These methods are then applied to the well-known New York leukemia dataset and a Wisconsin breast cancer incidence dataset. PMID:21762118

  8. Cluster detection methods applied to the Upper Cape Cod cancer data.

    PubMed

    Ozonoff, Al; Webster, Thomas; Vieira, Veronica; Weinberg, Janice; Ozonoff, David; Aschengrau, Ann

    2005-09-15

A variety of statistical methods have been suggested to assess the degree and/or the location of spatial clustering of disease cases. However, there is relatively little in the literature devoted to comparison and critique of different methods. Most of the available comparative studies rely on simulated data rather than real data sets. We have chosen three methods currently used for examining spatial disease patterns: the M-statistic of Bonetti and Pagano; the Generalized Additive Model (GAM) method as applied by Webster; and Kulldorff's spatial scan statistic. We apply these statistics to analyze breast cancer data from the Upper Cape Cancer Incidence Study using three different latency assumptions. The three different latency assumptions produced three different spatial patterns of cases and controls. For 20 year latency, all three methods generally concur. However, for 15 year latency and no latency assumptions, the methods produce different results when testing for global clustering. Comparative analyses of real data sets by different statistical methods provide insight into directions for further research. We suggest a research program designed around examining real data sets to guide focused investigation of relevant features using simulated data, for the purpose of understanding how to interpret statistical methods applied to epidemiological data with a spatial component.

  9. Spatial temporal clustering for hotspot using kulldorff scan statistic method (KSS): A case in Riau Province

    NASA Astrophysics Data System (ADS)

    Hudjimartsu, S. A.; Djatna, T.; Ambarwari, A.; Apriliantono

    2017-01-01

Forest fires in Indonesia occur frequently in the dry season, and almost all of them are caused by human activity. Their impacts include the loss of biodiversity, pollution hazards, and harm to the economies of surrounding communities. Preventing fires requires suitable methods, one of which is spatial-temporal clustering. Spatial-temporal clustering groups the data so that the resulting groupings can be used as initial information for fire prevention. To analyze the fires, hotspot data are used as an early indicator of fire spots. Hotspot data, which have spatial and temporal dimensions, can be processed using spatial-temporal clustering with the Kulldorff Scan Statistic (KSS). This research demonstrates the effectiveness of the KSS method for clustering spatial hotspots in a case study of Riau Province, producing two types of clusters: a most likely cluster and secondary clusters. These clusters can be used as early fire-warning information.
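The core of the Kulldorff scan statistic referenced here is the Poisson log-likelihood ratio scored for each candidate space-time window; the window with the highest ratio becomes the most likely cluster. A minimal sketch of that ratio under a Poisson model, with hypothetical counts rather than the study's data:

```python
import math

def poisson_llr(n, expected, total_cases):
    """Kulldorff's Poisson log-likelihood ratio for a candidate window with
    n observed cases and `expected` expected cases, out of total_cases in the
    whole study region.  Returns 0 when the window shows no excess."""
    if n <= expected or n == 0:
        return 0.0
    rest = total_cases - n
    llr = n * math.log(n / expected)
    if rest > 0:
        llr += rest * math.log(rest / (total_cases - expected))
    return llr

# Hypothetical window: 10 hotspots observed where 5 were expected, out of 100.
print(round(poisson_llr(10, 5.0, 100), 3))  # 2.065
```

The window maximizing this ratio is the most likely cluster; windows with the next-highest non-overlapping ratios are the secondary clusters.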

  10. Statistical analysis of activation and reaction energies with quasi-variational coupled-cluster theory

    NASA Astrophysics Data System (ADS)

    Black, Joshua A.; Knowles, Peter J.

    2018-06-01

    The performance of quasi-variational coupled-cluster (QV) theory applied to the calculation of activation and reaction energies has been investigated. A statistical analysis of results obtained for six different sets of reactions has been carried out, and the results have been compared to those from standard single-reference methods. In general, the QV methods lead to increased activation energies and larger absolute reaction energies compared to those obtained with traditional coupled-cluster theory.

  11. A flexible spatial scan statistic with a restricted likelihood ratio for detecting disease clusters.

    PubMed

    Tango, Toshiro; Takahashi, Kunihiko

    2012-12-30

Spatial scan statistics are widely used tools for the detection of disease clusters. In particular, the circular spatial scan statistic proposed by Kulldorff (1997) has been utilized in a wide variety of epidemiological studies and disease surveillance. However, as it cannot detect noncircular, irregularly shaped clusters, many authors have proposed different spatial scan statistics, including the elliptic version of Kulldorff's scan statistic. The flexible spatial scan statistic proposed by Tango and Takahashi (2005) has also been used for detecting irregularly shaped clusters. However, this method imposes a practical limitation of a maximum of 30 nearest neighbors for searching candidate clusters because of its heavy computational load. In this paper, we present a flexible spatial scan statistic implemented with a restricted likelihood ratio proposed by Tango (2008) that (1) eliminates the limitation of 30 nearest neighbors and (2) requires much less computational time than the original flexible spatial scan statistic. As a side effect, Monte Carlo simulation shows that it detects clusters of any shape reasonably well as the relative risk of the cluster becomes large. We illustrate the proposed spatial scan statistic with data on mortality from cerebrovascular disease in the Tokyo Metropolitan area, Japan. Copyright © 2012 John Wiley & Sons, Ltd.

  12. On the blind use of statistical tools in the analysis of globular cluster stars

    NASA Astrophysics Data System (ADS)

    D'Antona, Francesca; Caloi, Vittoria; Tailo, Marco

    2018-04-01

    As with most data analysis methods, the Bayesian method must be handled with care. We show that its application to determine stellar evolution parameters within globular clusters can lead to paradoxical results if used without the necessary precautions. This is a cautionary tale on the use of statistical tools for big data analysis.

  13. The statistical average of optical properties for alumina particle cluster in aircraft plume

    NASA Astrophysics Data System (ADS)

    Li, Jingying; Bai, Lu; Wu, Zhensen; Guo, Lixin

    2018-04-01

We establish a lognormal-distribution model for the monomer radius and number of alumina particle clusters in an aircraft plume. Based on the Multi-Sphere T-Matrix (MSTM) theory, we provide a method for finding the statistical average of the optical properties of alumina particle clusters in the plume, analyze the effect of different distributions and different detection wavelengths on these statistical averages, and compare the statistical average optical properties under the alumina particle cluster model established in this study with those under three simplified alumina particle models. The calculation results show that the monomer number of an alumina particle cluster and its size distribution have a considerable effect on its statistical average optical properties. The statistical averages at common detection wavelengths exhibit obvious differences, which strongly affect the modeling of the IR and UV radiation properties of the plume. Compared with the three simplified models, the alumina particle cluster model presented herein features both higher extinction and scattering efficiencies. Therefore, an accurate description of the scattering properties of alumina particles in an aircraft plume is of great significance in the study of plume radiation properties.

  14. Generalising Ward's Method for Use with Manhattan Distances.

    PubMed

    Strauss, Trudie; von Maltitz, Michael Johan

    2017-01-01

The claim that Ward's linkage algorithm in hierarchical clustering is limited to use with Euclidean distances is investigated. In this paper, Ward's clustering algorithm is generalised for use with the l1 norm, or Manhattan, distance. We argue that this generalisation of Ward's linkage method is theoretically sound and provide an example where it outperforms the method using Euclidean distances. As an application, we perform statistical analyses on languages using methods normally applied to biology and genetic classification. We aim to quantify differences in character traits between languages and use a statistical language signature based on relative bi-gram (sequence of two letters) frequencies to calculate a distance matrix between 32 Indo-European languages. We then use Ward's method of hierarchical clustering to classify the languages, using both the Euclidean and the Manhattan distance. Results obtained with the different distance metrics are compared to show that Ward's algorithm's characteristic of minimising intra-cluster variation and maximising inter-cluster variation is not violated when using the Manhattan metric.
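The generalisation rests on applying Ward's Lance-Williams update to squared dissimilarities computed from any metric. A minimal pure-Python sketch with the Manhattan distance follows; it is an illustration of the recurrence, not the authors' implementation.

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def ward_linkage(points, dist=manhattan):
    """Agglomerative clustering with Ward's Lance-Williams update applied to
    an arbitrary dissimilarity (here Manhattan).  Returns the merge history
    as a list of (cluster_id_a, cluster_id_b) pairs."""
    clusters = {i: 1 for i in range(len(points))}   # cluster id -> size
    # Squared pairwise dissimilarities between current clusters.
    d2 = {(i, j): dist(points[i], points[j]) ** 2
          for i in range(len(points)) for j in range(i + 1, len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        (i, j), _ = min(d2.items(), key=lambda kv: kv[1])
        ni, nj = clusters.pop(i), clusters.pop(j)
        for k, nk in clusters.items():
            dki = d2.pop((min(i, k), max(i, k)))
            dkj = d2.pop((min(j, k), max(j, k)))
            # Ward's Lance-Williams update on squared dissimilarities.
            d2[(k, next_id)] = ((ni + nk) * dki + (nj + nk) * dkj
                                - nk * d2[(i, j)]) / (ni + nj + nk)
        del d2[(i, j)]
        merges.append((i, j))
        clusters[next_id] = ni + nj
        next_id += 1
    return merges

pts = [(0, 0), (0, 1), (10, 10), (10, 12)]
print(ward_linkage(pts))  # [(0, 1), (2, 3), (4, 5)]
```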

  15. Dynamic evolution of nearby galaxy clusters

    NASA Astrophysics Data System (ADS)

    Biernacka, M.; Flin, P.

    2011-06-01

A study of the evolution of 377 rich ACO clusters with redshift z<0.2 is presented. The data concerning galaxies in the investigated clusters were obtained using FOCAS packages applied to Digital Sky Survey I. The 377 galaxy clusters constitute a statistically uniform sample to which visual galaxy/star reclassifications were applied. Cluster shape within 2.0 h-1 Mpc from the adopted cluster centre (the mean and the median of all galaxy coordinates, the position of the brightest and of the third brightest galaxy in the cluster) was determined through its ellipticity calculated using two methods: the covariance ellipse method (hereafter CEM) and the method based on Minkowski functionals (hereafter MFM). We investigated the dependence of ellipticity on the radius of the circular annuli in which it was calculated, varying the radius from 0.5 to 2 Mpc in steps of 0.25 Mpc. By performing Monte Carlo simulations, we generated clusters to which the two ellipticity methods were applied. We found that the covariance ellipse method works better than the method based on Minkowski functionals, and that the ellipticity distributions differ between the two methods. Using the ellipticity-redshift relation, we investigated the possibility of cluster evolution in the low-redshift Universe. The correlation of cluster ellipticities with redshifts is undoubtedly an indicator of structural evolution. Using Student's t statistic, we found a statistically significant correlation between ellipticity and redshift at the significance level of α = 0.95. One of the two shape determination methods showed ellipticity growing with redshift, while the other gave the opposite result. Monte Carlo simulations showed that only ellipticities calculated at the distance of 1.5 Mpc from the cluster centre in the Minkowski functional method are robust enough to be taken into account, but for that radius we did not find any relation between e and z.
Since CEM pointed towards the existence of the e(z) relation, we conclude that such an effect is real though rather weak. A detailed study of the e(z) relation showed that the observed relation is nonlinear, and the number of elongated structures grows rapidly for z>0.14.
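The covariance ellipse method reduces to the eigenvalues of the 2x2 covariance matrix of projected galaxy positions: the semi-axes of the ellipse scale with the square roots of the eigenvalues. A minimal sketch with hypothetical coordinates, assuming the common definition e = 1 - b/a:

```python
import math

def covariance_ellipticity(points):
    """Ellipticity e = 1 - b/a of the covariance ellipse of 2-D positions,
    where a and b are the square roots of the covariance eigenvalues."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Eigenvalues of the 2x2 covariance matrix via trace and determinant.
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    root = math.sqrt(max(tr * tr / 4 - det, 0.0))
    lam_max, lam_min = tr / 2 + root, tr / 2 - root
    return 1.0 - math.sqrt(lam_min / lam_max)

# Hypothetical elongated "cluster": spread twice as wide in x as in y.
pts = [(2, 0), (-2, 0), (0, 1), (0, -1)]
print(covariance_ellipticity(pts))  # 0.5
```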

  16. A spatial scan statistic for compound Poisson data.

    PubMed

    Rosychuk, Rhonda J; Chang, Hsing-Ming

    2013-12-20

    The topic of spatial cluster detection gained attention in statistics during the late 1980s and early 1990s. Effort has been devoted to the development of methods for detecting spatial clustering of cases and events in the biological sciences, astronomy and epidemiology. More recently, research has examined detecting clusters of correlated count data associated with health conditions of individuals. Such a method allows researchers to examine spatial relationships of disease-related events rather than just incident or prevalent cases. We introduce a spatial scan test that identifies clusters of events in a study region. Because an individual case may have multiple (repeated) events, we base the test on a compound Poisson model. We illustrate our method for cluster detection on emergency department visits, where individuals may make multiple disease-related visits. Copyright © 2013 John Wiley & Sons, Ltd.

  17. Kappa statistic for clustered matched-pair data.

    PubMed

    Yang, Zhao; Zhou, Ming

    2014-07-10

The kappa statistic is widely used to assess the agreement between two procedures for independent matched-pair data. For matched-pair data collected in clusters, we propose, on the basis of the delta method and sampling techniques, a nonparametric variance estimator for the kappa statistic that requires neither a within-cluster correlation structure nor distributional assumptions. The results of an extensive Monte Carlo simulation study demonstrate that the proposed kappa statistic provides consistent estimation, and the proposed variance estimator behaves reasonably well for at least a moderately large number of clusters (e.g., K ≥50). Compared with the variance estimator ignoring dependence within a cluster, the proposed variance estimator performs better in maintaining the nominal coverage probability when the intra-cluster correlation is fair (ρ ≥0.3), with more pronounced improvement as ρ increases further. To illustrate the practical application of the proposed estimator, we analyze two real data examples of clustered matched-pair data. Copyright © 2014 John Wiley & Sons, Ltd.
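For reference, the underlying kappa point estimate from a 2x2 matched-pair agreement table is simple to compute; the paper's contribution is the clustered variance estimator, which this sketch with hypothetical counts does not attempt.

```python
def kappa_2x2(a, b, c, d):
    """Cohen's kappa from a 2x2 matched-pair agreement table:
    a = both procedures positive, b = first positive / second negative,
    c = first negative / second positive, d = both negative."""
    n = a + b + c + d
    po = (a + d) / n                                     # observed agreement
    pe = ((a + b) * (a + c) + (c + d) * (b + d)) / n**2  # chance agreement
    return (po - pe) / (1 - pe)

# Hypothetical table: 40 both-positive, 10 + 5 discordant, 45 both-negative.
print(round(kappa_2x2(40, 10, 5, 45), 3))  # 0.7
```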

  18. Spatio-Temporal Clustering of Monitoring Network

    NASA Astrophysics Data System (ADS)

    Hussain, I.; Pilz, J.

    2009-04-01

Pakistan has much diversity in the seasonal variation of different locations. Some areas are deserts and remain very hot and waterless; the coastal areas along the Arabian Sea, for example, have a very warm season and little rainfall. Other areas are covered with mountains and have very low temperatures and heavy rainfall, for instance the Karakoram ranges. The most important variables that have an impact on the climate are temperature, precipitation, humidity, wind speed and elevation. Furthermore, it is hard to find homogeneous regions in Pakistan with respect to climate variation. Identifying homogeneous regions in Pakistan can be useful in many respects: it can help in predicting the climate of the sub-regions and in optimizing the number of monitoring sites. In the earlier literature, no one has tried to identify homogeneous regions of Pakistan with respect to climate variation, and there are only a few papers on spatio-temporal clustering of monitoring networks. Steinhaus (1956) presented the well-known K-means clustering method, which identifies a predefined number of clusters by iteratively assigning points to the nearest cluster centroid. Castro et al. (1997) developed a genetic heuristic algorithm to solve medoid-based clustering; their method, based on genetic recombination using random assorting recombination, is appropriate for clustering attributes with genetic characteristics. Sap and Awan (2005) presented a robust weighted kernel K-means algorithm incorporating spatial constraints for clustering climate data. Their algorithm can effectively handle noise, outliers and auto-correlation in spatial data, allowing effective and efficient data analysis by exploring patterns and structures in the data. Soltani and Modarres (2006) used hierarchical and divisive cluster analysis to categorize patterns of rainfall in Iran.
They considered rainfall at only twenty-eight monitoring sites and concluded that eight clusters existed. Soltani and Modarres (2006) classified the sites using only the average rainfall of each site; they did not consider time replications or spatial coordinates. Kerby et al. (2007) proposed a spatial clustering method based on the likelihood function, taking account of the geographic locations through the variance-covariance matrix. Their proposed method works like hierarchical clustering methods; however, it is inappropriate for time-replicated data and does not perform well for a large number of sites. Tuia et al. (2008) used scan statistics to identify spatio-temporal clusters of fire sequences in the Tuscany region of Italy. The scan statistics clustering method was developed by Kulldorff (1997) to detect spatio-temporal clusters in epidemiology and assess their significance; it applies only to univariate discrete stochastic random variables. In this paper we use a very simple approach to spatio-temporal clustering which can create separable and homogeneous clusters. Most clustering methods are based on Euclidean distances, but geographic coordinates are spherical coordinates, and estimating Euclidean distances from spherical coordinates is inappropriate. As a transformation from geographic coordinates to rectangular (D-plane) coordinates we use the Lambert projection method. The partition-around-medoids clustering method is applied to the data including the D-plane coordinates. Ordinary kriging is taken as a validity measure for the precipitation data. The kriging results for the clusters are more accurate and have less variation compared with those for the complete monitoring network precipitation data. References Castro, V.E. and Murray, A.T. (1997). Spatial Clustering with Data Mining with Genetic Algorithms. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.56.8573 Kaufman, L. and Rousseeuw, P.J. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley Series in Probability and Mathematical Statistics, New York. Kulldorff, M. (1997). A spatial scan statistic. Commun. Stat.-Theor. Math. 26(6), 1481-1496. Kerby, A., Marx, D., Samal, A. and Adamchuck, V. (2007). Spatial Clustering Using the Likelihood Function. Seventh IEEE International Conference on Data Mining - Workshops. Steinhaus, H. (1956). Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci., Cl. III, vol. IV, 801-804. Snyder, J.P. (1987). Map Projection: A Working Manual. U.S. Geological Survey Professional Paper 1395. Washington, DC: U.S. Government Printing Office, pp. 104-110. Sap, M.N. and Awan, A.M. (2005). Finding Spatio-Temporal Patterns in Climate Data Using Clustering. Proceedings of the International Conference on Cyberworlds (CW'05). Soltani, S. and Modarres, R. (2006). Classification of Spatio-Temporal Pattern of Rainfall in Iran: Using Hierarchical and Divisive Cluster Analysis. Journal of Spatial Hydrology, Vol. 6, No. 2. Tuia, D., Ratle, F., Lasaponara, R., Telesca, L. and Kanevski, M. (2008). Scan Statistics Analysis for Forest Fire Clusters. Commun. in Nonlinear Science and Numerical Simulation 13, 1689-1694.
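The partition-around-medoids idea used above can be sketched on a precomputed distance matrix. This is a greedy build phase only, with hypothetical 1-D coordinates standing in for projected station locations; the full PAM algorithm of Kaufman and Rousseeuw adds a swap phase.

```python
def k_medoids_greedy(dist, k):
    """Greedy build phase of a PAM-style clustering on a precomputed
    symmetric distance matrix (list of lists).  Each added medoid is the
    point that most reduces the total point-to-nearest-medoid distance."""
    n = len(dist)
    medoids = []
    for _ in range(k):
        best, best_cost = None, None
        for cand in range(n):
            if cand in medoids:
                continue
            trial = medoids + [cand]
            cost = sum(min(dist[p][m] for m in trial) for p in range(n))
            if best_cost is None or cost < best_cost:
                best, best_cost = cand, cost
        medoids.append(best)
    labels = [min(medoids, key=lambda m: dist[p][m]) for p in range(n)]
    return medoids, labels

# Hypothetical projected station coordinates forming two groups.
xs = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in xs] for a in xs]
print(k_medoids_greedy(dist, 2))  # ([2, 4], [2, 2, 2, 4, 4, 4])
```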

  19. A statistical method (cross-validation) for bone loss region detection after spaceflight

    PubMed Central

    Zhao, Qian; Li, Wenjun; Li, Caixia; Chu, Philip W.; Kornak, John; Lang, Thomas F.

    2010-01-01

Astronauts experience bone loss after long spaceflight missions. Identifying specific regions that undergo the greatest losses (e.g. the proximal femur) could reveal information about the processes of bone loss in disuse and disease. Detecting such regions, however, remains an open problem. This paper focuses on statistical methods for detecting such regions. We perform statistical parametric mapping to obtain t-maps of changes in the images, and propose a new cross-validation method to select an optimal suprathreshold for forming clusters of pixels. Once these candidate clusters are formed, we use permutation testing of longitudinal labels to identify significant changes. PMID:20632144

  20. WordCluster: detecting clusters of DNA words and genomic elements

    PubMed Central

    2011-01-01

Background Many k-mers (or DNA words) and genomic elements are known to be spatially clustered in the genome. Well established examples are genes, TFBSs, CpG dinucleotides, microRNA genes and ultra-conserved non-coding regions. Currently, no algorithm exists to find these clusters in a statistically comprehensible way. The detection of clustering often relies on densities and sliding-window approaches or arbitrarily chosen distance thresholds. Results We introduce here an algorithm to detect clusters of DNA words (k-mers), or any other genomic element, based on the distance between consecutive copies and an assigned statistical significance. We implemented the method in a web server connected to a MySQL backend, which also determines the co-localization with gene annotations. We demonstrate the usefulness of this approach by detecting the clusters of CAG/CTG (cytosine contexts that can be methylated in undifferentiated cells), showing that the degree of methylation varies drastically between the inside and outside of the clusters. As another example, we used WordCluster to search for statistically significant clusters of olfactory receptor (OR) genes in the human genome. Conclusions WordCluster seems to predict biologically meaningful clusters of DNA words (k-mers) and genomic entities. The implementation of the method is available as a web server at http://bioinfo2.ugr.es/wordCluster/wordCluster.php, which includes additional features such as the detection of co-localization with gene regions and an annotation enrichment tool for functional analysis of the overlapped genes. PMID:21261981
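A minimal sketch of the gap-based idea: group consecutive copies of a word whenever the distance between them stays below a threshold. WordCluster assigns a statistical significance to these gaps under a null model; the fixed `max_gap` and the positions here are hypothetical stand-ins.

```python
def word_clusters(positions, max_gap, min_copies=3):
    """Group genomic positions of a k-mer into clusters wherever the gap
    between consecutive copies is at most max_gap; report clusters with at
    least min_copies members as (start, end, copy_count) tuples."""
    positions = sorted(positions)
    clusters, run = [], [positions[0]]
    for prev, cur in zip(positions, positions[1:]):
        if cur - prev <= max_gap:
            run.append(cur)
        else:
            if len(run) >= min_copies:
                clusters.append((run[0], run[-1], len(run)))
            run = [cur]
    if len(run) >= min_copies:
        clusters.append((run[0], run[-1], len(run)))
    return clusters

# Hypothetical CAG positions: a dense run near 1000, scattered copies elsewhere.
pos = [120, 980, 1000, 1018, 1040, 1065, 5000, 9000]
print(word_clusters(pos, max_gap=50))  # [(980, 1065, 5)]
```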

  1. Evaluating and implementing temporal, spatial, and spatio-temporal methods for outbreak detection in a local syndromic surveillance system.

    PubMed

    Mathes, Robert W; Lall, Ramona; Levin-Rector, Alison; Sell, Jessica; Paladini, Marc; Konty, Kevin J; Olson, Don; Weiss, Don

    2017-01-01

    The New York City Department of Health and Mental Hygiene has operated an emergency department syndromic surveillance system since 2001, using temporal and spatial scan statistics run on a daily basis for cluster detection. Since the system was originally implemented, a number of new methods have been proposed for use in cluster detection. We evaluated six temporal and four spatial/spatio-temporal detection methods using syndromic surveillance data spiked with simulated injections. The algorithms were compared on several metrics, including sensitivity, specificity, positive predictive value, coherence, and timeliness. We also evaluated each method's implementation, programming time, run time, and the ease of use. Among the temporal methods, at a set specificity of 95%, a Holt-Winters exponential smoother performed the best, detecting 19% of the simulated injects across all shapes and sizes, followed by an autoregressive moving average model (16%), a generalized linear model (15%), a modified version of the Early Aberration Reporting System's C2 algorithm (13%), a temporal scan statistic (11%), and a cumulative sum control chart (<2%). Of the spatial/spatio-temporal methods we tested, a spatial scan statistic detected 3% of all injects, a Bayes regression found 2%, and a generalized linear mixed model and a space-time permutation scan statistic detected none at a specificity of 95%. Positive predictive value was low (<7%) for all methods. Overall, the detection methods we tested did not perform well in identifying the temporal and spatial clusters of cases in the inject dataset. The spatial scan statistic, our current method for spatial cluster detection, performed slightly better than the other tested methods across different inject magnitudes and types. Furthermore, we found the scan statistics, as applied in the SaTScan software package, to be the easiest to program and implement for daily data analysis.
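Among the temporal methods evaluated, the cumulative sum control chart is the simplest to sketch. Below is a minimal one-sided CUSUM with hypothetical counts and parameters, not the study's implementation: daily excesses over the baseline (minus a slack value) accumulate, and an alarm fires once the sum crosses a threshold.

```python
def cusum_alarms(counts, baseline, slack=0.5, threshold=4.0):
    """One-sided CUSUM control chart: accumulate excesses over
    baseline + slack and flag the days on which the running sum
    exceeds the threshold."""
    s, alarms = 0.0, []
    for day, x in enumerate(counts):
        s = max(0.0, s + (x - baseline - slack))
        if s > threshold:
            alarms.append(day)
    return alarms

# Hypothetical daily syndrome counts with an injected outbreak on days 7-9;
# the sum stays elevated for a while after the outbreak ends.
counts = [3, 2, 4, 3, 3, 2, 3, 8, 9, 8, 3, 2]
print(cusum_alarms(counts, baseline=3.0))  # [7, 8, 9, 10, 11]
```

In practice the chart is reset after an alarm is investigated, so the trailing alarm days would not recur.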

  2. Evaluating and implementing temporal, spatial, and spatio-temporal methods for outbreak detection in a local syndromic surveillance system

    PubMed Central

    Lall, Ramona; Levin-Rector, Alison; Sell, Jessica; Paladini, Marc; Konty, Kevin J.; Olson, Don; Weiss, Don

    2017-01-01

    The New York City Department of Health and Mental Hygiene has operated an emergency department syndromic surveillance system since 2001, using temporal and spatial scan statistics run on a daily basis for cluster detection. Since the system was originally implemented, a number of new methods have been proposed for use in cluster detection. We evaluated six temporal and four spatial/spatio-temporal detection methods using syndromic surveillance data spiked with simulated injections. The algorithms were compared on several metrics, including sensitivity, specificity, positive predictive value, coherence, and timeliness. We also evaluated each method’s implementation, programming time, run time, and the ease of use. Among the temporal methods, at a set specificity of 95%, a Holt-Winters exponential smoother performed the best, detecting 19% of the simulated injects across all shapes and sizes, followed by an autoregressive moving average model (16%), a generalized linear model (15%), a modified version of the Early Aberration Reporting System’s C2 algorithm (13%), a temporal scan statistic (11%), and a cumulative sum control chart (<2%). Of the spatial/spatio-temporal methods we tested, a spatial scan statistic detected 3% of all injects, a Bayes regression found 2%, and a generalized linear mixed model and a space-time permutation scan statistic detected none at a specificity of 95%. Positive predictive value was low (<7%) for all methods. Overall, the detection methods we tested did not perform well in identifying the temporal and spatial clusters of cases in the inject dataset. The spatial scan statistic, our current method for spatial cluster detection, performed slightly better than the other tested methods across different inject magnitudes and types. Furthermore, we found the scan statistics, as applied in the SaTScan software package, to be the easiest to program and implement for daily data analysis. PMID:28886112

  3. The smart cluster method. Adaptive earthquake cluster identification and analysis in strong seismic regions

    NASA Astrophysics Data System (ADS)

    Schaefer, Andreas M.; Daniell, James E.; Wenzel, Friedemann

    2017-07-01

    Earthquake clustering is an essential part of almost any statistical analysis of spatial and temporal properties of seismic activity. The nature of earthquake clusters and subsequent declustering of earthquake catalogues plays a crucial role in determining the magnitude-dependent earthquake return period and its respective spatial variation for probabilistic seismic hazard assessment. This study introduces the Smart Cluster Method (SCM), a new methodology to identify earthquake clusters, which uses an adaptive point process for spatio-temporal cluster identification. It utilises the magnitude-dependent spatio-temporal earthquake density to adjust the search properties, subsequently analyses the identified clusters to determine directional variation and adjusts its search space with respect to directional properties. In the case of rapid subsequent ruptures like the 1992 Landers sequence or the 2010-2011 Darfield-Christchurch sequence, a reclassification procedure is applied to disassemble subsequent ruptures using near-field searches, nearest neighbour classification and temporal splitting. The method is capable of identifying and classifying earthquake clusters in space and time. It has been tested and validated using earthquake data from California and New Zealand. A total of more than 1500 clusters have been found in both regions since 1980 with Mmin = 2.0. Utilising the knowledge of cluster classification, the method has been adjusted to provide an earthquake declustering algorithm, which has been compared to existing methods. Its performance is comparable to established methodologies. The analysis of earthquake clustering statistics leads to various new and updated correlation functions, e.g. for ratios between mainshock and strongest aftershock and general aftershock activity metrics.

  4. Ion induced electron emission statistics under Agm- cluster bombardment of Ag

    NASA Astrophysics Data System (ADS)

    Breuers, A.; Penning, R.; Wucher, A.

    2018-05-01

    The electron emission from a polycrystalline silver surface under bombardment with Agm- cluster ions (m = 1, 2, 3) is investigated in terms of ion induced kinetic excitation. The electron yield γ is determined directly by a current measurement method on the one hand and implicitly by the analysis of the electron emission statistics on the other hand. Successful measurements of the electron emission spectra ensure a deeper understanding of the ion induced kinetic electron emission process, with particular emphasis on the effect of the projectile cluster size on the yield as well as on the emission statistics. The results allow a quantitative comparison to computer simulations performed for silver atoms and clusters impinging onto a silver surface.

  5. Dissociation kinetics of metal clusters on multiple electronic states including electronic level statistics into the vibronic soup

    NASA Astrophysics Data System (ADS)

    Shvartsburg, Alexandre A.; Siu, K. W. Michael

    2001-06-01

    Modeling the delayed dissociation of clusters has been, over the last decade, a frontline development area in chemical physics. It is of fundamental interest how statistical kinetics methods previously validated for regular molecules and atomic nuclei may apply to clusters, as this would help to understand the transferability of statistical models for disintegration of complex systems across various classes of physical objects. From a practical perspective, accurate simulation of unimolecular decomposition is critical for the extraction of true thermochemical values from measurements on the decay of energized clusters. Metal clusters are particularly challenging because of the multitude of low-lying electronic states that are coupled to vibrations. This has previously been accounted for by assuming the average electronic structure of a conducting cluster, approximated by the levels of an electron in a cavity. While this provides a reasonable time-averaged description, it ignores the distribution of instantaneous electronic structures in a "boiling" cluster around that average. Here we set up a new treatment that incorporates the statistical distribution of electronic levels around the average picture using random matrix theory. This approach faithfully reflects the completely chaotic "vibronic soup" nature of hot metal clusters. We found that the consideration of electronic level statistics significantly promotes electronic excitation and thus increases the magnitude of its effect. As this excitation always depresses the decay rates, the inclusion of level statistics results in slower dissociation of metal clusters.
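    The random-matrix ingredient used here can be illustrated numerically: eigenvalue spacings of a random real symmetric (GOE) matrix show level repulsion, i.e. very small spacings are strongly suppressed relative to uncorrelated (Poisson) levels. The matrix size, seed, and the crude mean-based unfolding below are my assumptions for a quick demonstration, not the paper's treatment.

```python
import numpy as np

rng = np.random.default_rng(7)

n = 400
a = rng.normal(size=(n, n))
h = (a + a.T) / 2.0                  # real symmetric (GOE-like) matrix
ev = np.linalg.eigvalsh(h)

bulk = ev[n // 4: 3 * n // 4]        # central part of the spectrum
s = np.diff(bulk)
s = s / s.mean()                     # crude unfolding to unit mean spacing

# Poisson (uncorrelated) levels would give P(s < 0.1) ~ 0.095;
# GOE level repulsion suppresses this by an order of magnitude.
frac_tiny = float(np.mean(s < 0.1))
```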

  6. Visualizing statistical significance of disease clusters using cartograms.

    PubMed

    Kronenfeld, Barry J; Wong, David W S

    2017-05-15

    Health officials and epidemiological researchers often use maps of disease rates to identify potential disease clusters. Because these maps exaggerate the prominence of low-density districts and hide potential clusters in urban (high-density) areas, many researchers have used density-equalizing maps (cartograms) as a basis for epidemiological mapping. However, there are no existing guidelines for the visual assessment of statistical uncertainty on such maps. To address this shortcoming, we develop techniques for visual determination of statistical significance of clusters spanning one or more districts on a cartogram. We developed the techniques within a geovisual analytics framework that does not rely on automated significance testing, and can therefore facilitate visual analysis to detect clusters that automated techniques might miss. On a cartogram of the at-risk population, the statistical significance of a disease cluster can be determined from the rate, area and shape of the cluster under standard hypothesis testing scenarios. We develop formulae to determine, for a given rate, the area required for statistical significance of a priori and a posteriori designated regions under certain test assumptions. Uniquely, our approach enables dynamic inference of aggregate regions formed by combining individual districts. The method is implemented in interactive tools that provide choropleth mapping, automated legend construction and dynamic search tools to facilitate cluster detection and assessment of the validity of tested assumptions. A case study of leukemia incidence analysis in California demonstrates the ability to visually distinguish between statistically significant and insignificant regions. The proposed geovisual analytics approach enables intuitive visual assessment of statistical significance of arbitrarily defined regions on a cartogram. 
Our research prompts a broader discussion of the role of geovisual exploratory analyses in disease mapping and the appropriate framework for visually assessing the statistical significance of spatial clusters.
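    The flavor of the "area required for significance" calculation can be sketched as follows. On a density-equalizing cartogram, area is proportional to the at-risk population, so the expected case count plays the role of area; for a region whose observed rate exceeds the background by a given ratio, one can search for the smallest expected count at which a simple one-sided Poisson test reaches significance. The one-sided test and the integer search step are my assumptions, not the authors' exact formulae.

```python
from math import ceil, exp

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), via the complementary CDF."""
    term, cdf = exp(-lam), 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)
    return 1.0 - cdf

def min_expected_cases(rate_ratio, alpha=0.05):
    """Smallest expected count (~ cartogram area) at which a region whose
    observed rate is rate_ratio times the background becomes significant."""
    lam = 1
    while poisson_sf(ceil(rate_ratio * lam), lam) >= alpha:
        lam += 1
    return lam
```

    Higher elevated rates need less cartogram area to register as significant, which is the visual rule the tool encodes.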

  7. Comparison of a non-stationary voxelation-corrected cluster-size test with TFCE for group-Level MRI inference.

    PubMed

    Li, Huanjie; Nickerson, Lisa D; Nichols, Thomas E; Gao, Jia-Hong

    2017-03-01

    Two powerful methods for statistical inference on MRI brain images have been proposed recently: a non-stationary voxelation-corrected cluster-size test (CST) based on random field theory, and threshold-free cluster enhancement (TFCE), which calculates the level of local support for a cluster and then uses permutation testing for inference. Unlike other statistical approaches, these two methods do not rest on the assumptions of a uniform and high degree of spatial smoothness of the statistic image. Thus, they are strongly recommended for group-level fMRI analysis compared to other statistical methods. In this work, the non-stationary voxelation-corrected CST and TFCE methods for group-level analysis were evaluated for both stationary and non-stationary images under varying smoothness levels, degrees of freedom and signal-to-noise ratios. Our results suggest that both methods provide adequate control for the number of voxel-wise statistical tests being performed during inference on fMRI data, and both are superior to current CSTs implemented in popular MRI data analysis software packages. However, TFCE is more sensitive and stable for group-level analysis of VBM data. Thus, the voxelation-corrected CST approach may confer some advantages by being computationally less demanding for fMRI data analysis than TFCE with permutation testing and by also being applicable for single-subject fMRI analyses, while the TFCE approach is advantageous for VBM data. Hum Brain Mapp 38:1269-1280, 2017. © 2016 Wiley Periodicals, Inc.
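    The TFCE idea itself is compact: each voxel's score integrates, over all thresholds h below its height, the extent of its supra-threshold cluster weighted as extent^E · h^H. A minimal 1-D discretization (the standard E = 0.5, H = 2 weights; the toy signal is invented) shows why a broad moderate cluster accumulates more support than an equally tall isolated spike:

```python
import numpy as np

def tfce_1d(stat, dh=0.05, E=0.5, H=2.0):
    """Discretized TFCE: for each threshold h, add extent**E * h**H * dh
    to every position inside its contiguous supra-threshold run."""
    out = np.zeros(len(stat))
    for h in np.arange(dh, stat.max() + dh, dh):
        above = stat >= h
        i = 0
        while i < len(stat):
            if above[i]:
                j = i
                while j < len(stat) and above[j]:
                    j += 1
                out[i:j] += (j - i) ** E * h ** H * dh
                i = j
            else:
                i += 1
    return out

stat = np.zeros(50)
stat[5:15] = 3.0      # broad cluster, extent 10
stat[30] = 3.0        # isolated single-voxel spike of equal height
score = tfce_1d(stat)
```

    Inference then proceeds by permutation on the maximum enhanced score, which is the computationally demanding part the abstract refers to.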

  8. Magnification Bias in Gravitational Arc Statistics

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Caminha, G. B.; Estrada, J.; Makler, M.

    2013-08-29

    The statistics of gravitational arcs in galaxy clusters is a powerful probe of cluster structure and may provide complementary cosmological constraints. Despite recent progress, discrepancies still remain between modelling and observations of arc abundance, especially regarding the redshift distribution of strong lensing clusters. Moreover, fast "semi-analytic" methods still have to incorporate the success obtained with simulations. In this paper we discuss the contribution of the magnification to gravitational arc statistics. Although lensing conserves surface brightness, the magnification increases the signal-to-noise ratio of the arcs, enhancing their detectability. We present an approach to include this and other observational effects in semi-analytic calculations for arc statistics. The cross section for arc formation (σ) is computed through a semi-analytic method based on the ratio of the eigenvalues of the magnification tensor. Using this approach we obtained the scaling of σ with respect to the magnification, and other parameters, allowing for a fast computation of the cross section. We apply this method to evaluate the expected number of arcs per cluster using an elliptical Navarro-Frenk-White matter distribution. Our results show that the magnification has a strong effect on the arc abundance, enhancing the fraction of arcs, moving the peak of the arc fraction to higher redshifts, and softening its decrease at high redshifts. We argue that the effect of magnification should be included in arc statistics modelling and that it could help to reconcile arc statistics predictions with the observational data.

  9. Sampling in health geography: reconciling geographical objectives and probabilistic methods. An example of a health survey in Vientiane (Lao PDR)

    PubMed Central

    Vallée, Julie; Souris, Marc; Fournet, Florence; Bochaton, Audrey; Mobillion, Virginie; Peyronnie, Karine; Salem, Gérard

    2007-01-01

    Background Geographical objectives and probabilistic methods are difficult to reconcile in a unique health survey. Probabilistic methods focus on individuals to provide estimates of a variable's prevalence with a certain precision, while geographical approaches emphasise the selection of specific areas to study interactions between spatial characteristics and health outcomes. A sample selected from a small number of specific areas creates statistical challenges: the observations are not independent at the local level, and this results in poor statistical validity at the global level. Therefore, it is difficult to construct a sample that is appropriate for both geographical and probability methods. Methods We used a two-stage selection procedure with a first non-random stage of selection of clusters. Instead of randomly selecting clusters, we deliberately chose a group of clusters, which as a whole would contain all the variation in health measures in the population. As there was no health information available before the survey, we selected a priori determinants that can influence the spatial homogeneity of the health characteristics. This method yields a distribution of variables in the sample that closely resembles that in the overall population, something that cannot be guaranteed with randomly-selected clusters, especially if the number of selected clusters is small. In this way, we were able to survey specific areas while minimising design effects and maximising statistical precision. Application We applied this strategy in a health survey carried out in Vientiane, Lao People's Democratic Republic. We selected well-known health determinants with unequal spatial distribution within the city: nationality and literacy. We deliberately selected a combination of clusters whose distribution of nationality and literacy is similar to the distribution in the general population. 
Conclusion This paper describes the conceptual reasoning behind the construction of the survey sample and shows that it can be advantageous to choose clusters using reasoned hypotheses, based on both probability and geographical approaches, in contrast to a conventional, random cluster selection strategy. PMID:17543100
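    The deliberate (non-random) first-stage selection can be sketched as a small combinatorial search: choose the combination of clusters whose pooled distribution of the a priori determinants best matches the city-wide distribution. The per-cluster shares below, the equal cluster sizes, and the L1 mismatch measure are invented for illustration; the paper's actual selection weighed real census-style data.

```python
from itertools import combinations

# Hypothetical per-cluster shares of the two a priori determinants
# (literacy, nationality); clusters assumed equal-sized for simplicity.
clusters = {
    "A": (0.90, 0.90), "B": (0.50, 0.30), "C": (0.70, 0.60),
    "D": (0.80, 0.70), "E": (0.60, 0.50), "F": (0.40, 0.40),
}
population = (0.70, 0.60)   # city-wide shares

def mismatch(names):
    """L1 distance between the pooled shares of the chosen clusters
    and the population shares."""
    lit = sum(clusters[n][0] for n in names) / len(names)
    nat = sum(clusters[n][1] for n in names) / len(names)
    return abs(lit - population[0]) + abs(nat - population[1])

# Deliberately pick the 3-cluster combination that best mirrors the city.
best = min(combinations(clusters, 3), key=mismatch)
```

    A random draw of three clusters gives no such guarantee, which is exactly the design-effect argument the abstract makes.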

  10. A Comparison of Single Sample and Bootstrap Methods to Assess Mediation in Cluster Randomized Trials

    ERIC Educational Resources Information Center

    Pituch, Keenan A.; Stapleton, Laura M.; Kang, Joo Youn

    2006-01-01

    A Monte Carlo study examined the statistical performance of single sample and bootstrap methods that can be used to test and form confidence interval estimates of indirect effects in two cluster randomized experimental designs. The designs were similar in that they featured random assignment of clusters to one of two treatment conditions and…
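    The core machinery being compared can be sketched for the simplest case: a percentile bootstrap confidence interval for an indirect effect a·b, where a is the treatment-to-mediator path and b the mediator-to-outcome path adjusting for treatment. This toy version resamples individuals and ignores the clustering that the ERIC study is actually about; the simulated effect sizes and sample size are my assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated mediation chain: treatment x -> mediator m -> outcome y.
n = 300
x = rng.integers(0, 2, n).astype(float)
m = 0.5 * x + rng.normal(0.0, 1.0, n)
y = 0.6 * m + rng.normal(0.0, 1.0, n)   # true indirect effect = 0.5 * 0.6 = 0.3

def indirect_effect(x, m, y):
    a = np.polyfit(x, m, 1)[0]                       # x -> m path
    X = np.column_stack([np.ones_like(x), m, x])
    b = np.linalg.lstsq(X, y, rcond=None)[0][1]      # m -> y path, given x
    return a * b

boot = []
for _ in range(1000):
    idx = rng.integers(0, n, n)                      # resample with replacement
    boot.append(indirect_effect(x[idx], m[idx], y[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])            # percentile bootstrap CI
```

    In the cluster randomized setting, resampling would have to respect cluster membership (cf. record 14 below), which is precisely where single-sample and bootstrap methods start to diverge.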

  11. Testing for X-Ray–SZ Differences and Redshift Evolution in the X-Ray Morphology of Galaxy Clusters

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nurgaliev, D.; McDonald, M.; Benson, B. A.

    We present a quantitative study of the X-ray morphology of galaxy clusters, as a function of their detection method and redshift. We analyze two separate samples of galaxy clusters: a sample of 36 clusters at 0.35 < z < 0.9 selected in the X-ray with the ROSAT PSPC 400 deg² survey, and a sample of 90 clusters at 0.25 < z < 1.2 selected via the Sunyaev–Zel’dovich (SZ) effect with the South Pole Telescope. Clusters from both samples have similar-quality Chandra observations, which allow us to quantify their X-ray morphologies via two distinct methods: centroid shifts (w) and photon asymmetry (A_phot). The latter technique provides nearly unbiased morphology estimates for clusters spanning a broad range of redshift and data quality. We further compare the X-ray morphologies of X-ray- and SZ-selected clusters with those of simulated clusters. We do not find a statistically significant difference in the measured X-ray morphology of X-ray and SZ-selected clusters over the redshift range probed by these samples, suggesting that the two are probing similar populations of clusters. We find that the X-ray morphologies of simulated clusters are statistically indistinguishable from those of X-ray- or SZ-selected clusters, implying that the most important physics for dictating the large-scale gas morphology (outside of the core) is well-approximated in these simulations. Finally, we find no statistically significant redshift evolution in the X-ray morphology (both for observed and simulated clusters), over the range of z ~ 0.3 to z ~ 1, seemingly in contradiction with the redshift-dependent halo merger rate predicted by simulations.

  12. Testing for X-Ray–SZ Differences and Redshift Evolution in the X-Ray Morphology of Galaxy Clusters

    DOE PAGES

    Nurgaliev, D.; McDonald, M.; Benson, B. A.; ...

    2017-05-16

    We present a quantitative study of the X-ray morphology of galaxy clusters, as a function of their detection method and redshift. We analyze two separate samples of galaxy clusters: a sample of 36 clusters at 0.35 < z < 0.9 selected in the X-ray with the ROSAT PSPC 400 deg² survey, and a sample of 90 clusters at 0.25 < z < 1.2 selected via the Sunyaev–Zel’dovich (SZ) effect with the South Pole Telescope. Clusters from both samples have similar-quality Chandra observations, which allow us to quantify their X-ray morphologies via two distinct methods: centroid shifts (w) and photon asymmetry (A_phot). The latter technique provides nearly unbiased morphology estimates for clusters spanning a broad range of redshift and data quality. We further compare the X-ray morphologies of X-ray- and SZ-selected clusters with those of simulated clusters. We do not find a statistically significant difference in the measured X-ray morphology of X-ray and SZ-selected clusters over the redshift range probed by these samples, suggesting that the two are probing similar populations of clusters. We find that the X-ray morphologies of simulated clusters are statistically indistinguishable from those of X-ray- or SZ-selected clusters, implying that the most important physics for dictating the large-scale gas morphology (outside of the core) is well-approximated in these simulations. Finally, we find no statistically significant redshift evolution in the X-ray morphology (both for observed and simulated clusters), over the range of z ~ 0.3 to z ~ 1, seemingly in contradiction with the redshift-dependent halo merger rate predicted by simulations.
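    One of the two morphology statistics, the centroid shift w, is simple enough to sketch: it is the scatter of flux-weighted centroids, computed in apertures of increasing radius, about the emission peak, normalized by the largest aperture. The toy images, aperture radii, and normalization convention below are my assumptions for a self-contained demonstration.

```python
import numpy as np

def centroid_shift(img, peak, radii):
    """Scatter of flux centroids in growing circular apertures about the
    emission peak, normalized by the largest aperture radius."""
    ys, xs = np.indices(img.shape)
    offsets = []
    for r in radii:
        mask = (xs - peak[0]) ** 2 + (ys - peak[1]) ** 2 <= r ** 2
        flux = img * mask
        total = flux.sum()
        cx = (flux * xs).sum() / total
        cy = (flux * ys).sum() / total
        offsets.append(np.hypot(cx - peak[0], cy - peak[1]))
    return np.std(offsets, ddof=1) / max(radii)

# Toy surface-brightness maps: a relaxed cluster vs. one with a subcluster.
ys, xs = np.indices((64, 64))
relaxed = np.exp(-((xs - 32) ** 2 + (ys - 32) ** 2) / 50.0)
disturbed = relaxed + 0.4 * np.exp(-((xs - 45) ** 2 + (ys - 32) ** 2) / 50.0)
radii = [10, 15, 20, 25, 30]
w_relaxed = centroid_shift(relaxed, (32, 32), radii)
w_disturbed = centroid_shift(disturbed, (32, 32), radii)
```

    A symmetric, relaxed cluster yields w ≈ 0; an off-center subcluster drags the centroid by an aperture-dependent amount, so w grows with disturbance.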

  13. Resemblance profiles as clustering decision criteria: Estimating statistical power, error, and correspondence for a hypothesis test for multivariate structure.

    PubMed

    Kilborn, Joshua P; Jones, David L; Peebles, Ernst B; Naar, David F

    2017-04-01

    Clustering data continues to be a highly active area of data analysis, and resemblance profiles are being incorporated into ecological methodologies as a hypothesis testing-based approach to clustering multivariate data. However, these new clustering techniques have not been rigorously tested to determine the performance variability based on the algorithm's assumptions or any underlying data structures. Here, we use simulation studies to estimate the statistical error rates for the hypothesis test for multivariate structure based on dissimilarity profiles (DISPROF). We concurrently tested a widely used algorithm that employs the unweighted pair group method with arithmetic mean (UPGMA) to estimate the proficiency of clustering with DISPROF as a decision criterion. We simulated unstructured multivariate data from different probability distributions with increasing numbers of objects and descriptors, and grouped data with increasing overlap, overdispersion for ecological data, and correlation among descriptors within groups. Using simulated data, we measured the resolution and correspondence of clustering solutions achieved by DISPROF with UPGMA against the reference grouping partitions used to simulate the structured test datasets. Our results highlight the dynamic interactions between dataset dimensionality, group overlap, and the properties of the descriptors within a group (i.e., overdispersion or correlation structure) that are relevant to resemblance profiles as a clustering criterion for multivariate data. These methods are particularly useful for multivariate ecological datasets that benefit from distance-based statistical analyses. We propose guidelines for using DISPROF as a clustering decision tool that will help future users avoid potential pitfalls during the application of methods and the interpretation of results.
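    The clustering side of the pairing tested here, UPGMA (average linkage), can be sketched in a few lines; DISPROF's permutation test on dissimilarity profiles, which decides whether each merge reflects real structure, is not reproduced. The 1-D toy data are invented, and real use would operate on an ecological dissimilarity matrix.

```python
def upgma_merges(points):
    """Average-linkage (UPGMA) agglomeration on 1-D observations, recording
    each merge as (cluster_a, cluster_b, average inter-cluster distance)."""
    clusters = [[i] for i in range(len(points))]

    def avg_dist(a, b):
        return sum(abs(points[i] - points[j]) for i in a for j in b) / (len(a) * len(b))

    merges = []
    while len(clusters) > 1:
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: avg_dist(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((sorted(clusters[i]), sorted(clusters[j]),
                       avg_dist(clusters[i], clusters[j])))
        clusters = ([c for k, c in enumerate(clusters) if k not in (i, j)]
                    + [clusters[i] + clusters[j]])
    return merges

merges = upgma_merges([0.0, 0.1, 5.0, 5.1])
```

    DISPROF's role, in the paper's design, is to stop this agglomeration at merges whose dissimilarity profile is indistinguishable from unstructured data.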

  14. Bootstrap-based methods for estimating standard errors in Cox's regression analyses of clustered event times.

    PubMed

    Xiao, Yongling; Abrahamowicz, Michal

    2010-03-30

    We propose two bootstrap-based methods to correct the standard errors (SEs) from Cox's model for within-cluster correlation of right-censored event times. The cluster-bootstrap method resamples, with replacement, only the clusters, whereas the two-step bootstrap method resamples (i) the clusters, and (ii) individuals within each selected cluster, with replacement. In simulations, we evaluate both methods and compare them with the existing robust variance estimator and the shared gamma frailty model, which are available in statistical software packages. We simulate clustered event time data, with latent cluster-level random effects, which are ignored in the conventional Cox's model. For cluster-level covariates, both proposed bootstrap methods yield accurate SEs, and type I error rates, and acceptable coverage rates, regardless of the true random effects distribution, and avoid serious variance under-estimation by conventional Cox-based standard errors. However, the two-step bootstrap method over-estimates the variance for individual-level covariates. We also apply the proposed bootstrap methods to obtain confidence bands around flexible estimates of time-dependent effects in a real-life analysis of cluster event times.
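    The cluster-bootstrap resampling scheme is easy to sketch. Here a simple mean stands in for the Cox coefficient, and the toy data, cluster counts, and variance parameters are all invented for illustration; the point is only that resampling whole clusters preserves the within-cluster correlation that a naive individual-level bootstrap destroys.

```python
import random

random.seed(1)

# Toy clustered data: 20 clusters of 5 observations sharing a latent effect.
clusters = []
for _ in range(20):
    u = random.gauss(0, 1)                      # cluster-level random effect
    clusters.append([u + random.gauss(0, 0.5) for _ in range(5)])

def statistic(groups):
    flat = [v for g in groups for v in g]
    return sum(flat) / len(flat)                # stand-in for a Cox coefficient

def boot_se(draw, b=500):
    reps = [statistic(draw()) for _ in range(b)]
    mean = sum(reps) / b
    return (sum((r - mean) ** 2 for r in reps) / (b - 1)) ** 0.5

# Cluster bootstrap: resample whole clusters with replacement.
se_cluster = boot_se(lambda: [random.choice(clusters) for _ in clusters])

# Naive bootstrap: resample individuals, ignoring the correlation.
flat = [v for g in clusters for v in g]
se_naive = boot_se(lambda: [[random.choice(flat)] for _ in flat])
```

    The naive standard error is markedly too small for cluster-level quantities, which is the variance under-estimation the abstract warns about.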

  15. Text grouping in patent analysis using adaptive K-means clustering algorithm

    NASA Astrophysics Data System (ADS)

    Shanie, Tiara; Suprijadi, Jadi; Zulhanif

    2017-03-01

    Patents are a form of intellectual property. Patent analysis is essential for understanding the current development of technology in individual countries and worldwide. This study uses patent documents on green tea retrieved from the Espacenet server. Patent documents related to tea technology are numerous, which makes information retrieval (IR) difficult for users. It is therefore necessary to categorize the documents into groups according to the related terms they contain. This study applies statistical text mining to patent title data in two phases: a data preparation phase and a data analysis phase. The data preparation phase uses text mining methods, and the data analysis phase applies statistics, specifically a cluster analysis algorithm, the adaptive K-means clustering algorithm. Based on the maximum silhouette value, the analysis produced 87 clusters, associated with fifteen related terms, that can be utilized for information retrieval.

  16. Sampling in health geography: reconciling geographical objectives and probabilistic methods. An example of a health survey in Vientiane (Lao PDR).

    PubMed

    Vallée, Julie; Souris, Marc; Fournet, Florence; Bochaton, Audrey; Mobillion, Virginie; Peyronnie, Karine; Salem, Gérard

    2007-06-01

    Geographical objectives and probabilistic methods are difficult to reconcile in a unique health survey. Probabilistic methods focus on individuals to provide estimates of a variable's prevalence with a certain precision, while geographical approaches emphasise the selection of specific areas to study interactions between spatial characteristics and health outcomes. A sample selected from a small number of specific areas creates statistical challenges: the observations are not independent at the local level, and this results in poor statistical validity at the global level. Therefore, it is difficult to construct a sample that is appropriate for both geographical and probability methods. We used a two-stage selection procedure with a first non-random stage of selection of clusters. Instead of randomly selecting clusters, we deliberately chose a group of clusters, which as a whole would contain all the variation in health measures in the population. As there was no health information available before the survey, we selected a priori determinants that can influence the spatial homogeneity of the health characteristics. This method yields a distribution of variables in the sample that closely resembles that in the overall population, something that cannot be guaranteed with randomly-selected clusters, especially if the number of selected clusters is small. In this way, we were able to survey specific areas while minimising design effects and maximising statistical precision. We applied this strategy in a health survey carried out in Vientiane, Lao People's Democratic Republic. We selected well-known health determinants with unequal spatial distribution within the city: nationality and literacy. We deliberately selected a combination of clusters whose distribution of nationality and literacy is similar to the distribution in the general population. 
This paper describes the conceptual reasoning behind the construction of the survey sample and shows that it can be advantageous to choose clusters using reasoned hypotheses, based on both probability and geographical approaches, in contrast to a conventional, random cluster selection strategy.

  17. A method of using cluster analysis to study statistical dependence in multivariate data

    NASA Technical Reports Server (NTRS)

    Borucki, W. J.; Card, D. H.; Lyle, G. C.

    1975-01-01

    A technique is presented that uses both cluster analysis and a Monte Carlo significance test of clusters to discover associations between variables in multidimensional data. The method is applied to an example of a noisy function in three-dimensional space, to a sample from a mixture of three bivariate normal distributions, and to the well-known Fisher's Iris data.
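    The technique's two ingredients can be sketched with a simplified clustering statistic: measure how tightly the observed points cluster (here, mean nearest-neighbor distance), then build a Monte Carlo null by permuting one variable to destroy any association and compare. The statistic, the two-clump toy data, and the permutation count are my assumptions, not the paper's specific procedure.

```python
import math
import random

random.seed(3)

# Two variables associated through a two-clump mixture.
pts = ([(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(20)]
       + [(random.gauss(5, 0.3), random.gauss(5, 0.3)) for _ in range(20)])

def mean_nn_dist(points):
    """Mean distance from each point to its nearest neighbor."""
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        total += min(math.hypot(x1 - x2, y1 - y2)
                     for j, (x2, y2) in enumerate(points) if j != i)
    return total / len(points)

observed = mean_nn_dist(pts)

# Monte Carlo null: permute one coordinate to break the association.
ys = [y for _, y in pts]
n_perm, count = 99, 0
for _ in range(n_perm):
    random.shuffle(ys)
    permuted = [(x, y) for (x, _), y in zip(pts, ys)]
    if mean_nn_dist(permuted) <= observed:
        count += 1
p_value = (count + 1) / (n_perm + 1)
```

    A small p-value says the observed clustering is tighter than expected under independence, i.e. the variables are associated.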

  18. Spatial cluster detection for repeatedly measured outcomes while accounting for residential history.

    PubMed

    Cook, Andrea J; Gold, Diane R; Li, Yi

    2009-10-01

    Spatial cluster detection has become an important methodology in quantifying the effect of hazardous exposures. Previous methods have focused on cross-sectional outcomes that are binary or continuous. There are virtually no spatial cluster detection methods proposed for longitudinal outcomes. This paper proposes a new spatial cluster detection method for repeated outcomes using cumulative geographic residuals. A major advantage of this method is its ability to readily incorporate information on study participants' relocation, which most cluster detection statistics cannot. Application of these methods is illustrated using the Home Allergens and Asthma prospective cohort study, analyzing the relationship between environmental exposures and a repeatedly measured outcome, the occurrence of wheeze in the previous 6 months, while taking into account mobile locations.

  19. Identification and characterization of earthquake clusters: a comparative analysis for selected sequences in Italy

    NASA Astrophysics Data System (ADS)

    Peresan, Antonella; Gentili, Stefania

    2017-04-01

    Identification and statistical characterization of seismic clusters may provide useful insights about the features of seismic energy release and their relation to physical properties of the crust within a given region. Moreover, a number of studies based on spatio-temporal analysis of main-shock occurrence require preliminary declustering of the earthquake catalogs. Since various methods, relying on different physical/statistical assumptions, may lead to diverse classifications of earthquakes into main events and related events, we aim to investigate the classification differences among declustering techniques. Accordingly, a formal selection and comparative analysis of earthquake clusters is carried out for the most relevant earthquakes in North-Eastern Italy, as reported in the local OGS-CRS bulletins, compiled at the National Institute of Oceanography and Experimental Geophysics since 1977. The comparison is then extended to selected earthquake sequences associated with a different seismotectonic setting, namely to events that occurred in the region struck by the recent Central Italy destructive earthquakes, making use of INGV data. Various techniques, ranging from classical space-time window methods to ad hoc manual identification of aftershocks, are applied for detection of earthquake clusters. In particular, a statistical method based on nearest-neighbor distances of events in the space-time-energy domain is considered. Results from cluster identification by the nearest-neighbor method turn out to be quite robust with respect to the time span of the input catalogue, as well as to the minimum magnitude cutoff. The identified clusters for the largest events reported in North-Eastern Italy since 1977 are well consistent with those reported in earlier studies, which were aimed at detailed manual aftershock identification. 
The study shows that the data-driven approach, based on the nearest-neighbor distances, can be satisfactorily applied to decompose the seismic catalog into background seismicity and individual sequences of earthquake clusters, also in areas characterized by moderate seismic activity, where the standard declustering techniques may turn out to be rather gross approximations. Building on these results, the main statistical features of seismic clusters are explored, including the complex interdependence of related events, with the aim of characterizing the space-time patterns of earthquake occurrence in North-Eastern Italy and capturing their basic differences with Central Italy sequences.
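    The nearest-neighbor proximity underlying this family of methods (in the style of Zaliapin-type analyses) combines time, distance, and parent magnitude into a single number, eta = dt · dr^d · 10^(-b·m_parent); small eta marks a likely aftershock pair. The b-value, fractal dimension d, and the toy catalog below are illustrative choices; a real analysis fits these to the catalog and finds the clustered/background split from the bimodal eta distribution.

```python
import math

B, D = 1.0, 1.6   # Gutenberg-Richter b-value and fractal dimension (typical choices)

def proximity(parent, child):
    """Space-time-magnitude proximity; smaller values suggest a cluster link."""
    t1, x1, y1, m1 = parent
    t2, x2, y2, _ = child
    dt = t2 - t1
    if dt <= 0:
        return math.inf               # a parent must precede its child
    dr = math.hypot(x2 - x1, y2 - y1)
    return dt * max(dr, 1e-3) ** D * 10 ** (-B * m1)

# Toy catalog: (time [days], x [km], y [km], magnitude)
catalog = [
    (0.0,    0.0,   0.0, 6.0),   # mainshock
    (0.5,    2.0,   1.0, 3.5),   # nearby aftershock
    (1.2,    1.0,  -2.0, 3.0),   # nearby aftershock
    (30.0, 300.0, 250.0, 4.0),   # distant, later background event
]

def nearest_parent(k):
    return min(range(k), key=lambda i: proximity(catalog[i], catalog[k]))

etas = {k: proximity(catalog[nearest_parent(k)], catalog[k]) for k in range(1, 4)}
```

    The aftershocks sit many orders of magnitude below the background event in eta, which is what makes the decomposition into sequences and background seismicity workable.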

  20. Linear regression models and k-means clustering for statistical analysis of fNIRS data.

    PubMed

    Bonomini, Viola; Zucchelli, Lucia; Re, Rebecca; Ieva, Francesca; Spinelli, Lorenzo; Contini, Davide; Paganoni, Anna; Torricelli, Alessandro

    2015-02-01

    We propose a new algorithm, based on a linear regression model, to statistically estimate the hemodynamic activations in fNIRS data sets. The main concern guiding the algorithm development was the minimization of assumptions and approximations made on the data set for the application of statistical tests. Further, we propose a K-means method to cluster fNIRS data (i.e. channels) as activated or not activated. The methods were validated both on simulated and in vivo fNIRS data. A time domain (TD) fNIRS technique was preferred because of its high performance in discriminating between cortical activation and superficial physiological changes. However, the proposed method is also applicable to continuous wave or frequency domain fNIRS data sets.

  1. Linear regression models and k-means clustering for statistical analysis of fNIRS data

    PubMed Central

    Bonomini, Viola; Zucchelli, Lucia; Re, Rebecca; Ieva, Francesca; Spinelli, Lorenzo; Contini, Davide; Paganoni, Anna; Torricelli, Alessandro

    2015-01-01

    We propose a new algorithm, based on a linear regression model, to statistically estimate the hemodynamic activations in fNIRS data sets. The main concern guiding the algorithm development was the minimization of assumptions and approximations made on the data set for the application of statistical tests. Further, we propose a K-means method to cluster fNIRS data (i.e. channels) as activated or not activated. The methods were validated on both simulated and in vivo fNIRS data. A time domain (TD) fNIRS technique was preferred because of its high performance in discriminating cortical activation from superficial physiological changes. However, the proposed method is also applicable to continuous wave or frequency domain fNIRS data sets. PMID:25780751
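    The channel-labelling step can be illustrated with a small K-means sketch. The one-dimensional "activation amplitudes" below are synthetic stand-ins for the regression estimates, and the simple implementation is an assumption for illustration, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical per-channel activation amplitudes from a regression step:
# 15 non-activated channels near 0, 5 activated channels near 1.
amps = np.concatenate([rng.normal(0.0, 0.1, 15),
                       rng.normal(1.0, 0.1, 5)]).reshape(-1, 1)

def kmeans(X, k=2, iters=50, seed=0):
    """Lloyd's algorithm for 1-D features (absolute-distance assignment)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(np.abs(X - centers.T), axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

labels, centers = kmeans(amps, k=2)
# Call the cluster with the larger mean amplitude "activated".
act_cluster = int(np.argmax(centers[:, 0]))
activated = labels == act_cluster
```

    With well-separated amplitudes the two clusters recover the activated/non-activated split directly; real fNIRS features would feed the same machinery.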

  2. A new method to search for high-redshift clusters using photometric redshifts

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Castignani, G.; Celotti, A.; Chiaberge, M.

    2014-09-10

    We describe a new method (Poisson probability method, PPM) to search for high-redshift galaxy clusters and groups by using photometric redshift information and galaxy number counts. The method relies on Poisson statistics and is primarily introduced to search for megaparsec-scale environments around a specific beacon. The PPM is tailored to both the properties of the FR I radio galaxies in the Chiaberge et al. sample, which are selected within the COSMOS survey, and to the specific data set used. We test the efficiency of our method of searching for cluster candidates against simulations. Two different approaches are adopted. (1) We use two z ∼ 1 X-ray detected cluster candidates found in the COSMOS survey and we shift them to higher redshift up to z = 2. We find that the PPM detects the cluster candidates up to z = 1.5, and it correctly estimates both the redshift and size of the two clusters. (2) We simulate spherically symmetric clusters of different size and richness, and we locate them at different redshifts (i.e., z = 1.0, 1.5, and 2.0) in the COSMOS field. We find that the PPM detects the simulated clusters within the considered redshift range with a statistical 1σ redshift accuracy of ∼0.05. The PPM is an efficient alternative method for high-redshift cluster searches that may also be applied to both present and future wide field surveys such as SDSS Stripe 82, LSST, and Euclid. Accurate photometric redshifts and a survey depth similar or better than that of COSMOS (e.g., I < 25) are required.
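    The Poisson statistics at the core of such counts-in-cells detection reduce to a survival probability: how likely is it to observe at least the counted number of galaxies given the background expectation? A minimal sketch with illustrative numbers (the background expectation of 4.2 and the count of 12 are hypothetical, not values from the paper):

```python
from math import exp, factorial

def poisson_sf(n_obs, lam):
    """P(N >= n_obs) for N ~ Poisson(lam): one minus the CDF up to n_obs - 1."""
    return 1.0 - sum(exp(-lam) * lam**k / factorial(k) for k in range(n_obs))

# Hypothetical cell: 4.2 galaxies expected from the background in the
# photo-z slice, 12 observed around the beacon.
p = poisson_sf(12, 4.2)
```

    A small `p` flags the cell as a significant overdensity; the PPM applies this idea over regions and redshift slices around each beacon.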

  3. Estimating multilevel logistic regression models when the number of clusters is low: a comparison of different statistical software procedures.

    PubMed

    Austin, Peter C

    2010-04-22

    Multilevel logistic regression models are increasingly being used to analyze clustered data in medical, public health, epidemiological, and educational research. Procedures for estimating the parameters of such models are available in many statistical software packages. There is currently little evidence on the minimum number of clusters necessary to reliably fit multilevel regression models. We conducted a Monte Carlo study to compare the performance of different statistical software procedures for estimating multilevel logistic regression models when the number of clusters was low. We examined procedures available in BUGS, HLM, R, SAS, and Stata. We found that there were qualitative differences in the performance of different software procedures for estimating multilevel logistic models when the number of clusters was low. Among the likelihood-based procedures, estimation methods based on adaptive Gauss-Hermite approximations to the likelihood (glmer in R and xtlogit in Stata) or adaptive Gaussian quadrature (Proc NLMIXED in SAS) tended to have superior performance for estimating variance components when the number of clusters was small, compared to software procedures based on penalized quasi-likelihood. However, only Bayesian estimation with BUGS allowed for accurate estimation of variance components when there were fewer than 10 clusters. For all statistical software procedures, estimation of variance components tended to be poor when there were only five subjects per cluster, regardless of the number of clusters.

  4. Hydrometeor classification through statistical clustering of polarimetric radar measurements: a semi-supervised approach

    NASA Astrophysics Data System (ADS)

    Besic, Nikola; Ventura, Jordi Figueras i.; Grazioli, Jacopo; Gabella, Marco; Germann, Urs; Berne, Alexis

    2016-09-01

    Polarimetric radar-based hydrometeor classification is the procedure of identifying different types of hydrometeors by exploiting polarimetric radar observations. The main drawback of the existing supervised classification methods, mostly based on fuzzy logic, is a significant dependency on a presumed electromagnetic behaviour of different hydrometeor types. Namely, the results of the classification largely rely upon the quality of scattering simulations. When it comes to the unsupervised approach, it lacks the constraints related to the hydrometeor microphysics. The idea of the proposed method is to compensate for these drawbacks by combining the two approaches in a way that microphysical hypotheses can, to a degree, adjust the content of the classes obtained statistically from the observations. This is done by means of an iterative approach, performed offline, which, in a statistical framework, examines clustered representative polarimetric observations by comparing them to the presumed polarimetric properties of each hydrometeor class. Aside from comparing, a routine alters the content of clusters by encouraging further statistical clustering in case of non-identification. By merging all identified clusters, the multi-dimensional polarimetric signatures of various hydrometeor types are obtained for each of the studied representative datasets, i.e. for each radar system of interest. These are depicted by sets of centroids which are then employed in operational labelling of different hydrometeors. The method has been applied on three C-band datasets, each acquired by different operational radar from the MeteoSwiss Rad4Alp network, as well as on two X-band datasets acquired by two research mobile radars. The results are discussed through a comparative analysis which includes a corresponding supervised and unsupervised approach, emphasising the operational potential of the proposed method.

  5. Unusual clustering of coefficients of variation in published articles from a medical biochemistry department in India.

    PubMed

    Hudes, Mark L; McCann, Joyce C; Ames, Bruce N

    2009-03-01

    A simple statistical method is described to test whether data are consistent with minimum statistical variability expected in a biological experiment. The method is applied to data presented in data tables in a subset of 84 articles among more than 200 published by 3 investigators in a small medical biochemistry department at a major university in India and to 29 "control" articles selected by key word PubMed searches. Major conclusions include: 1) unusual clustering of coefficients of variation (CVs) was observed for data from the majority of articles analyzed that were published by the 3 investigators from 2000-2007; unusual clustering was not observed for data from any of their articles examined that were published between 1992 and 1999; and 2) among a group of 29 control articles retrieved by PubMed key word, title, or title/abstract searches, unusually clustered CVs were observed in 3 articles. Two of these articles were coauthored by 1 of the 3 investigators, and 1 was from the same university but a different department. We are unable to offer a statistical or biological explanation for the unusual clustering observed.
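    The quantity under scrutiny is easy to reproduce. A minimal sketch with invented (mean, SD) table entries and an illustrative tightness threshold; the actual test in the article is more elaborate than this spread check:

```python
import statistics

# Hypothetical table of (mean, SD) entries as reported in an article.
rows = [(10.2, 0.51), (8.7, 0.44), (15.3, 0.76), (21.8, 1.09), (5.9, 0.30)]

cvs = [100.0 * sd / mean for mean, sd in rows]   # coefficient of variation, %
spread = statistics.pstdev(cvs)

# CVs from independent biological measurements normally vary; a set of CVs
# packed into a very narrow band (tiny spread) is the red flag described.
suspiciously_clustered = spread < 0.5            # threshold is illustrative
```

    Here every CV sits near 5%, so the spread is far below what independent biological replicates would typically produce.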

  6. Spike sorting using locality preserving projection with gap statistics and landmark-based spectral clustering.

    PubMed

    Nguyen, Thanh; Khosravi, Abbas; Creighton, Douglas; Nahavandi, Saeid

    2014-12-30

    Understanding neural functions requires knowledge from analysing electrophysiological data. The process of assigning spikes of a multichannel signal into clusters, called spike sorting, is one of the important problems in such analysis. There have been various automated spike sorting techniques with both advantages and disadvantages regarding accuracy and computational costs. Therefore, developing spike sorting methods that are highly accurate and computationally inexpensive is always a challenge in biomedical engineering practice. An automatic unsupervised spike sorting method is proposed in this paper. The method uses features extracted by the locality preserving projection (LPP) algorithm. These features afterwards serve as inputs for the landmark-based spectral clustering (LSC) method. Gap statistics (GS) is employed to evaluate the number of clusters before the LSC can be performed. The proposed LPP-LSC is a highly accurate and computationally inexpensive spike sorting approach. LPP spike features are very discriminative and thereby boost the performance of clustering methods. Furthermore, the LSC method exhibits its efficiency when integrated with the cluster evaluator GS. The proposed method's accuracy is approximately 13% superior to that of the benchmark combination of wavelet transformation and superparamagnetic clustering (WT-SPC). Additionally, LPP-LSC computing time is six times less than that of the WT-SPC. LPP-LSC thus demonstrates a win-win spike sorting solution meeting both accuracy and computational cost criteria. LPP and LSC are linear algorithms that help reduce the computational burden, and their combination can therefore be applied to real-time spike analysis. Copyright © 2014 Elsevier B.V. All rights reserved.

  7. Spectral gene set enrichment (SGSE).

    PubMed

    Frost, H Robert; Li, Zhigang; Moore, Jason H

    2015-03-03

    Gene set testing is typically performed in a supervised context to quantify the association between groups of genes and a clinical phenotype. In many cases, however, a gene set-based interpretation of genomic data is desired in the absence of a phenotype variable. Although methods exist for unsupervised gene set testing, they predominantly compute enrichment relative to clusters of the genomic variables with performance strongly dependent on the clustering algorithm and number of clusters. We propose a novel method, spectral gene set enrichment (SGSE), for unsupervised competitive testing of the association between gene sets and empirical data sources. SGSE first computes the statistical association between gene sets and principal components (PCs) using our principal component gene set enrichment (PCGSE) method. The overall statistical association between each gene set and the spectral structure of the data is then computed by combining the PC-level p-values using the weighted Z-method with weights set to the PC variance scaled by Tracy-Widom test p-values. Using simulated data, we show that the SGSE algorithm can accurately recover spectral features from noisy data. To illustrate the utility of our method on real data, we demonstrate the superior performance of the SGSE method relative to standard cluster-based techniques for testing the association between MSigDB gene sets and the variance structure of microarray gene expression data. Unsupervised gene set testing can provide important information about the biological signal held in high-dimensional genomic data sets. Because it uses the association between gene sets and sample PCs to generate a measure of unsupervised enrichment, the SGSE method is independent of cluster or network creation algorithms and, most importantly, is able to utilize the statistical significance of PC eigenvalues to ignore elements of the data most likely to represent noise.
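    The weighted Z-method used to combine PC-level p-values is standard Stouffer-style pooling. A minimal sketch with hypothetical p-values and weights; the Tracy-Widom scaling of the weights described above is omitted here:

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()

def weighted_z(pvalues, weights):
    """Stouffer's weighted Z-method: combine one-sided p-values into one."""
    zs = [nd.inv_cdf(1.0 - p) for p in pvalues]
    z = sum(w * zi for w, zi in zip(weights, zs)) / sqrt(sum(w * w for w in weights))
    return 1.0 - nd.cdf(z)

# Hypothetical per-PC enrichment p-values, weighted by (scaled) PC variance.
p_combined = weighted_z([0.01, 0.20, 0.60], [5.0, 2.0, 0.5])
```

    Strong enrichment on a high-variance PC dominates the combined result, which is exactly the behaviour the weighting is meant to produce.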

  8. Method of identifying clusters representing statistical dependencies in multivariate data

    NASA Technical Reports Server (NTRS)

    Borucki, W. J.; Card, D. H.; Lyle, G. C.

    1975-01-01

    Approach is first to cluster and then to compute spatial boundaries for resulting clusters. Next step is to compute, from set of Monte Carlo samples obtained from scrambled data, estimates of probabilities of obtaining at least as many points within boundaries as were actually observed in original data.
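    The scrambling test described above can be sketched directly: count points inside a cluster boundary, then ask how often scrambled data capture at least as many. The point pattern, boundary box, and trial count below are all illustrative:

```python
import random

random.seed(42)

# Hypothetical 2-D data: 30 background points plus 15 packed into a small box.
pts = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(30)]
pts += [(random.uniform(4.0, 4.6), random.uniform(7.0, 7.6)) for _ in range(15)]

box = (4.0, 4.6, 7.0, 7.6)                  # boundary computed for one cluster

def count_in(points, box):
    x0, x1, y0, y1 = box
    return sum(x0 <= x <= x1 and y0 <= y <= y1 for x, y in points)

observed = count_in(pts, box)

# Scramble: redraw the same number of points uniformly over the region and
# ask how often the boundary captures at least `observed` of them.
trials, hits = 999, 0
for _ in range(trials):
    scrambled = [(random.uniform(0, 10), random.uniform(0, 10))
                 for _ in range(len(pts))]
    if count_in(scrambled, box) >= observed:
        hits += 1
p_value = (hits + 1) / (trials + 1)
```

    A tiny `p_value` indicates the observed concentration of points inside the boundary is very unlikely under scrambled (spatially random) data.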

  9. Cluster Analysis of Minnesota School Districts. A Research Report.

    ERIC Educational Resources Information Center

    Cleary, James

    The term "cluster analysis" refers to a set of statistical methods that classify entities with similar profiles of scores on a number of measured dimensions, in order to create empirically based typologies. A 1980 Minnesota House Research Report employed cluster analysis to categorize school districts according to their relative mixtures…

  10. *K-means and cluster models for cancer signatures.

    PubMed

    Kakushadze, Zura; Yu, Willie

    2017-09-01

    We present the *K-means clustering algorithm and source code by expanding statistical clustering methods applied in https://ssrn.com/abstract=2802753 to quantitative finance. *K-means is statistically deterministic without requiring the specification of initial centers. We apply *K-means to extracting cancer signatures from genome data without using nonnegative matrix factorization (NMF). *K-means' computational cost is a fraction of NMF's. Using 1389 published samples for 14 cancer types, we find that 3 cancers (liver cancer, lung cancer and renal cell carcinoma) stand out and do not have cluster-like structures. Two clusters have especially high within-cluster correlations with 11 other cancers indicating common underlying structures. Our approach opens a novel avenue for studying such structures. *K-means is universal and can be applied in other fields. We discuss some potential applications in quantitative finance.

  11. Coordinate based random effect size meta-analysis of neuroimaging studies.

    PubMed

    Tench, C R; Tanasescu, Radu; Constantinescu, C S; Auer, D P; Cottam, W J

    2017-06-01

    Low power in neuroimaging studies can make them difficult to interpret, and coordinate based meta-analysis (CBMA) may go some way to mitigating this issue. CBMA has been used in many analyses to detect where published functional MRI or voxel-based morphometry studies testing similar hypotheses report significant summary results (coordinates) consistently. Only the reported coordinates and possibly t statistics are analysed, and statistical significance of clusters is determined by coordinate density. Here a method of performing coordinate based random effect size meta-analysis and meta-regression is introduced. The algorithm (ClusterZ) analyses both coordinates and reported t statistic or Z score, standardised by the number of subjects. Statistical significance is determined not by coordinate density, but by random effects meta-analyses of reported effects performed cluster-wise using standard statistical methods and taking account of censoring inherent in the published summary results. Type 1 error control is achieved using the false cluster discovery rate (FCDR), which is based on the false discovery rate. This controls both the family wise error rate under the null hypothesis that coordinates are randomly drawn from a standard stereotaxic space, and the proportion of significant clusters that are expected under the null. Such control is necessary to avoid propagating and even amplifying the very issues motivating the meta-analysis in the first place. ClusterZ is demonstrated on both numerically simulated data and on real data from reports of grey matter loss in multiple sclerosis (MS) and syndromes suggestive of MS, and of painful stimulus in healthy controls. The software implementation is available to download and use freely. Copyright © 2017 Elsevier Inc. All rights reserved.
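    The cluster-wise random-effects pooling can be illustrated with a standard DerSimonian-Laird estimator. The effect sizes and variances below are hypothetical, and ClusterZ's censoring correction is not modelled here:

```python
# Hypothetical cluster-wise effects: standardised effect y_i with variance v_i
ys = [0.42, 0.61, 0.35, 0.50, 0.28]
vs = [0.010, 0.020, 0.015, 0.012, 0.030]

def dersimonian_laird(ys, vs):
    """Random-effects pooled estimate via the DerSimonian-Laird tau^2."""
    k = len(ys)
    w = [1.0 / v for v in vs]                               # fixed-effect weights
    ybar = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    q = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, ys)) # heterogeneity Q
    c = sum(w) - sum(wi * wi for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                      # between-study variance
    wstar = [1.0 / (v + tau2) for v in vs]                  # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(wstar, ys)) / sum(wstar)
    var = 1.0 / sum(wstar)
    return pooled, var, tau2

pooled, var, tau2 = dersimonian_laird(ys, vs)
```

    The pooled effect and its variance are what a cluster-wise significance test would be built on; censoring of sub-threshold effects, which ClusterZ accounts for, would bias this naive estimator.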

  12. PyClone: statistical inference of clonal population structure in cancer.

    PubMed

    Roth, Andrew; Khattra, Jaswinder; Yap, Damian; Wan, Adrian; Laks, Emma; Biele, Justina; Ha, Gavin; Aparicio, Samuel; Bouchard-Côté, Alexandre; Shah, Sohrab P

    2014-04-01

    We introduce PyClone, a statistical model for inference of clonal population structures in cancers. PyClone is a Bayesian clustering method for grouping sets of deeply sequenced somatic mutations into putative clonal clusters while estimating their cellular prevalences and accounting for allelic imbalances introduced by segmental copy-number changes and normal-cell contamination. Single-cell sequencing validation demonstrates PyClone's accuracy.

  13. Geovisual analytics to enhance spatial scan statistic interpretation: an analysis of U.S. cervical cancer mortality

    PubMed Central

    Chen, Jin; Roth, Robert E; Naito, Adam T; Lengerich, Eugene J; MacEachren, Alan M

    2008-01-01

    Background: Kulldorff's spatial scan statistic and its software implementation – SaTScan – are widely used for detecting and evaluating geographic clusters. However, two issues make using the method and interpreting its results non-trivial: (1) the method lacks cartographic support for understanding the clusters in geographic context and (2) results from the method are sensitive to parameter choices related to cluster scaling (abbreviated as scaling parameters), but the system provides no direct support for making these choices. We employ both established and novel geovisual analytics methods to address these issues and to enhance the interpretation of SaTScan results. We demonstrate our geovisual analytics approach in a case study analysis of cervical cancer mortality in the U.S.
    Results: We address the first issue by providing an interactive visual interface to support the interpretation of SaTScan results. Our research to address the second issue prompted a broader discussion about the sensitivity of SaTScan results to parameter choices. Sensitivity has two components: (1) the method can identify clusters that, while being statistically significant, have heterogeneous contents comprised of both high-risk and low-risk locations and (2) the method can identify clusters that are unstable in location and size as the spatial scan scaling parameter is varied. To investigate cluster result stability, we conducted multiple SaTScan runs with systematically selected parameters. The results, when scanning a large spatial dataset (e.g., U.S. data aggregated by county), demonstrate that no single spatial scan scaling value is known to be optimal to identify clusters that exist at different scales; instead, multiple scans that vary the parameters are necessary. We introduce a novel method of measuring and visualizing reliability that facilitates identification of homogeneous clusters that are stable across analysis scales. Finally, we propose a logical approach to proceed through the analysis of SaTScan results.
    Conclusion: The geovisual analytics approach described in this manuscript facilitates the interpretation of spatial cluster detection methods by providing cartographic representation of SaTScan results and by providing visualization methods and tools that support selection of SaTScan parameters. Our methods distinguish between heterogeneous and homogeneous clusters and assess the stability of clusters across analytic scales.
    Method: We analyzed the cervical cancer mortality data for the United States aggregated by county between 2000 and 2004. We ran SaTScan on the dataset fifty times with different parameter choices. Our geovisual analytics approach couples SaTScan with our visual analytic platform, allowing users to interactively explore and compare SaTScan results produced by different parameter choices. The Standardized Mortality Ratio and reliability scores are visualized for all the counties to identify stable, homogeneous clusters. We evaluated our analysis result by comparing it to that produced by other independent techniques including the Empirical Bayes Smoothing and Kafadar spatial smoother methods. The geovisual analytics approach introduced here is developed and implemented in our Java-based Visual Inquiry Toolkit. PMID:18992163

  14. Optimizing the maximum reported cluster size in the spatial scan statistic for ordinal data.

    PubMed

    Kim, Sehwi; Jung, Inkyung

    2017-01-01

    The spatial scan statistic is an important tool for spatial cluster detection. There have been numerous studies on scanning window shapes. However, little research has been done on the maximum scanning window size or maximum reported cluster size. Recently, Han et al. proposed to use the Gini coefficient to optimize the maximum reported cluster size. However, the method has been developed and evaluated only for the Poisson model. We adopt the Gini coefficient to be applicable to the spatial scan statistic for ordinal data to determine the optimal maximum reported cluster size. Through a simulation study and application to a real data example, we evaluate the performance of the proposed approach. With some sophisticated modification, the Gini coefficient can be effectively employed for the ordinal model. The Gini coefficient most often picked the optimal maximum reported cluster sizes that were the same as or smaller than the true cluster sizes with very high accuracy. It seems that we can obtain a more refined collection of clusters by using the Gini coefficient. The Gini coefficient developed specifically for the ordinal model can be useful for optimizing the maximum reported cluster size for ordinal data and helpful for properly and informatively discovering cluster patterns.
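    The Gini coefficient itself is straightforward to compute. A minimal sketch over hypothetical cluster log-likelihood ratios (the quantity to which the coefficient is applied here is an assumption for illustration):

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2.0 * cum) / (n * sum(xs)) - (n + 1.0) / n

# Hypothetical log-likelihood ratios of candidate clusters reported under
# one choice of maximum reported cluster size.
g = gini([12.0, 3.1, 2.8, 0.9, 0.5])
```

    Comparing `g` across candidate maximum sizes, a higher Gini indicates the reported collection is dominated by a few strong clusters rather than diluted by many weak ones, which is the basis for picking the optimal maximum size.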

  15. Optimizing the maximum reported cluster size in the spatial scan statistic for ordinal data

    PubMed Central

    Kim, Sehwi

    2017-01-01

    The spatial scan statistic is an important tool for spatial cluster detection. There have been numerous studies on scanning window shapes. However, little research has been done on the maximum scanning window size or maximum reported cluster size. Recently, Han et al. proposed to use the Gini coefficient to optimize the maximum reported cluster size. However, the method has been developed and evaluated only for the Poisson model. We adopt the Gini coefficient to be applicable to the spatial scan statistic for ordinal data to determine the optimal maximum reported cluster size. Through a simulation study and application to a real data example, we evaluate the performance of the proposed approach. With some sophisticated modification, the Gini coefficient can be effectively employed for the ordinal model. The Gini coefficient most often picked the optimal maximum reported cluster sizes that were the same as or smaller than the true cluster sizes with very high accuracy. It seems that we can obtain a more refined collection of clusters by using the Gini coefficient. The Gini coefficient developed specifically for the ordinal model can be useful for optimizing the maximum reported cluster size for ordinal data and helpful for properly and informatively discovering cluster patterns. PMID:28753674

  16. Point process statistics in atom probe tomography.

    PubMed

    Philippe, T; Duguay, S; Grancher, G; Blavette, D

    2013-09-01

    We present a review of spatial point processes as statistical models that we have designed for the analysis and treatment of atom probe tomography (APT) data. As a major advantage, these methods do not require sampling. The mean distance to nearest neighbour is an attractive approach to exhibit a non-random atomic distribution. A χ(2) test based on distance distributions to nearest neighbour has been developed to detect deviation from randomness. Best-fit methods based on first nearest neighbour distance (1 NN method) and pair correlation function are presented and compared to assess the chemical composition of tiny clusters. Delaunay tessellation for cluster selection has been also illustrated. These statistical tools have been applied to APT experiments on microelectronics materials. Copyright © 2012 Elsevier B.V. All rights reserved.
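    The mean nearest-neighbour distance test can be sketched in two dimensions, where the expectation under complete spatial randomness has a simple closed form. The point pattern is synthetic and the brute-force distance computation is for illustration only (APT data are three-dimensional and much larger):

```python
import numpy as np

rng = np.random.default_rng(3)
lam = 200 / 1.0                      # intensity: 200 points on the unit square
pts = rng.uniform(0, 1, (200, 2))

# Distance from each point to its nearest neighbour (brute force).
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
np.fill_diagonal(d, np.inf)
nn = d.min(axis=1)

# Under complete spatial randomness in 2-D, E[d_1NN] = 1 / (2 * sqrt(lam)).
expected = 1.0 / (2.0 * np.sqrt(lam))
ratio = nn.mean() / expected         # ~1 for random data, < 1 when clustered
```

    A ratio well below 1 signals clustering (atoms closer together than chance), which is then followed up with the chi-squared and best-fit analyses the review describes.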

  17. Geovisual analytics to enhance spatial scan statistic interpretation: an analysis of U.S. cervical cancer mortality.

    PubMed

    Chen, Jin; Roth, Robert E; Naito, Adam T; Lengerich, Eugene J; Maceachren, Alan M

    2008-11-07

    Kulldorff's spatial scan statistic and its software implementation - SaTScan - are widely used for detecting and evaluating geographic clusters. However, two issues make using the method and interpreting its results non-trivial: (1) the method lacks cartographic support for understanding the clusters in geographic context and (2) results from the method are sensitive to parameter choices related to cluster scaling (abbreviated as scaling parameters), but the system provides no direct support for making these choices. We employ both established and novel geovisual analytics methods to address these issues and to enhance the interpretation of SaTScan results. We demonstrate our geovisual analytics approach in a case study analysis of cervical cancer mortality in the U.S. We address the first issue by providing an interactive visual interface to support the interpretation of SaTScan results. Our research to address the second issue prompted a broader discussion about the sensitivity of SaTScan results to parameter choices. Sensitivity has two components: (1) the method can identify clusters that, while being statistically significant, have heterogeneous contents comprised of both high-risk and low-risk locations and (2) the method can identify clusters that are unstable in location and size as the spatial scan scaling parameter is varied. To investigate cluster result stability, we conducted multiple SaTScan runs with systematically selected parameters. The results, when scanning a large spatial dataset (e.g., U.S. data aggregated by county), demonstrate that no single spatial scan scaling value is known to be optimal to identify clusters that exist at different scales; instead, multiple scans that vary the parameters are necessary. We introduce a novel method of measuring and visualizing reliability that facilitates identification of homogeneous clusters that are stable across analysis scales. 
Finally, we propose a logical approach to proceed through the analysis of SaTScan results. The geovisual analytics approach described in this manuscript facilitates the interpretation of spatial cluster detection methods by providing cartographic representation of SaTScan results and by providing visualization methods and tools that support selection of SaTScan parameters. Our methods distinguish between heterogeneous and homogeneous clusters and assess the stability of clusters across analytic scales. We analyzed the cervical cancer mortality data for the United States aggregated by county between 2000 and 2004. We ran SaTScan on the dataset fifty times with different parameter choices. Our geovisual analytics approach couples SaTScan with our visual analytic platform, allowing users to interactively explore and compare SaTScan results produced by different parameter choices. The Standardized Mortality Ratio and reliability scores are visualized for all the counties to identify stable, homogeneous clusters. We evaluated our analysis result by comparing it to that produced by other independent techniques including the Empirical Bayes Smoothing and Kafadar spatial smoother methods. The geovisual analytics approach introduced here is developed and implemented in our Java-based Visual Inquiry Toolkit.

  18. Detection of Clostridium difficile infection clusters, using the temporal scan statistic, in a community hospital in southern Ontario, Canada, 2006-2011.

    PubMed

    Faires, Meredith C; Pearl, David L; Ciccotelli, William A; Berke, Olaf; Reid-Smith, Richard J; Weese, J Scott

    2014-05-12

    In hospitals, Clostridium difficile infection (CDI) surveillance relies on unvalidated guidelines or threshold criteria to identify outbreaks. This can result in false-positive and -negative cluster alarms. The application of statistical methods to identify and understand CDI clusters may be a useful alternative or complement to standard surveillance techniques. The objectives of this study were to investigate the utility of the temporal scan statistic for detecting CDI clusters and determine if there are significant differences in the rate of CDI cases by month, season, and year in a community hospital. Bacteriology reports of patients identified with a CDI from August 2006 to February 2011 were collected. For patients detected with CDI from March 2010 to February 2011, stool specimens were obtained. Clostridium difficile isolates were characterized by ribotyping and investigated for the presence of toxin genes by PCR. CDI clusters were investigated using a retrospective temporal scan test statistic. Statistically significant clusters were compared to known CDI outbreaks within the hospital. A negative binomial regression model was used to identify associations between year, season, month and the rate of CDI cases. Overall, 86 CDI cases were identified. Eighteen specimens were analyzed and nine ribotypes were classified, with ribotype 027 (n = 6) being the most prevalent. The temporal scan statistic identified significant CDI clusters at the hospital (n = 5), service (n = 6), and ward (n = 4) levels (P ≤ 0.05). Three clusters were concordant with the one C. difficile outbreak identified by hospital personnel. Two clusters were identified as potential outbreaks. The negative binomial model indicated that years 2007-2010 (P ≤ 0.05) had decreased CDI rates compared to 2006 and that spring had an increased CDI rate compared to the fall (P = 0.023).
Application of the temporal scan statistic identified several clusters, including potential outbreaks not detected by hospital personnel. The identification of time periods with decreased or increased CDI rates may have been a result of specific hospital events. Understanding the clustering of CDIs can aid in the interpretation of surveillance data and lead to the development of better early detection systems.
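    A retrospective temporal scan of this kind can be sketched as a sliding-window Poisson likelihood-ratio test with Monte Carlo inference. The weekly counts and injected outbreak below are simulated, not the hospital's data, and the window limit is an assumed parameter:

```python
import numpy as np

rng = np.random.default_rng(7)
weeks = 100
counts = rng.poisson(1.0, weeks)                # baseline ~1 case/week
counts[40:44] += rng.poisson(8.0, 4)            # injected outbreak, weeks 40-43

def scan_stat(counts, max_len=8):
    """Max Poisson log-likelihood ratio over all windows up to max_len weeks."""
    total, T = counts.sum(), len(counts)
    best, best_win = 0.0, None
    for L in range(1, max_len + 1):
        for s in range(T - L + 1):
            c = counts[s:s + L].sum()
            e = total * L / T                   # expected under a flat rate
            if c > e:
                llr = (c * np.log(c / e)
                       + (total - c) * np.log((total - c) / (total - e)))
                if llr > best:
                    best, best_win = llr, (s, s + L)
    return best, best_win

obs, win = scan_stat(counts)
# Monte Carlo: redistribute the same total uniformly over the weeks under
# the null, and compare the maximum statistic.
sims = [scan_stat(rng.multinomial(counts.sum(), np.ones(weeks) / weeks))[0]
        for _ in range(99)]
p_value = (1 + sum(s >= obs for s in sims)) / 100
```

    The maximizing window recovers the injected outbreak period, and the Monte Carlo p-value plays the role of the significance test applied to the hospital, service, and ward series.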

  19. Kappa statistic for the clustered dichotomous responses from physicians and patients

    PubMed Central

    Kang, Chaeryon; Qaqish, Bahjat; Monaco, Jane; Sheridan, Stacey L.; Cai, Jianwen

    2013-01-01

    The bootstrap method for estimating the standard error of the kappa statistic in the presence of clustered data is evaluated. Such data arise, for example, in assessing agreement between physicians and their patients regarding their understanding of the physician-patient interaction and discussions. We propose a computationally efficient procedure for generating correlated dichotomous responses for physicians and assigned patients for simulation studies. The simulation result demonstrates that the proposed bootstrap method produces a better estimate of the standard error and better coverage performance, compared to the asymptotic standard error estimate that ignores dependence among patients within physicians, given at least a moderately large number of clusters. An example of an application to a coronary heart disease prevention study is presented. PMID:23533082
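    The cluster bootstrap for the kappa standard error can be sketched as follows. The physician/patient responses are simulated under a hypothetical agreement rate, and resampling is done over whole physician clusters so within-physician dependence is preserved:

```python
import random

random.seed(5)

# Hypothetical data: clusters[i] holds (physician_answer, patient_answer)
# dichotomous pairs for the patients assigned to physician i.
clusters = []
for _ in range(30):
    pairs = []
    for _ in range(random.randint(3, 6)):
        doc = random.random() < 0.6
        pat = doc if random.random() < 0.8 else not doc  # ~80% agreement
        pairs.append((int(doc), int(pat)))
    clusters.append(pairs)

def kappa(pairs):
    """Cohen's kappa for paired dichotomous ratings."""
    n = len(pairs)
    po = sum(a == b for a, b in pairs) / n               # observed agreement
    pa = sum(a for a, _ in pairs) / n                    # marginal "yes" rates
    pb = sum(b for _, b in pairs) / n
    pe = pa * pb + (1 - pa) * (1 - pb)                   # chance agreement
    return (po - pe) / (1 - pe)

all_pairs = [p for c in clusters for p in c]
k_hat = kappa(all_pairs)

# Cluster bootstrap: resample physicians (whole clusters) with replacement.
boots = []
for _ in range(500):
    sample = [clusters[random.randrange(len(clusters))]
              for _ in range(len(clusters))]
    boots.append(kappa([p for c in sample for p in c]))
mean_b = sum(boots) / len(boots)
se_kappa = (sum((b - mean_b) ** 2 for b in boots) / (len(boots) - 1)) ** 0.5
```

    Resampling clusters rather than individual pairs is what makes the standard error honest about the within-physician correlation the asymptotic estimate ignores.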

  20. A Survey of Popular R Packages for Cluster Analysis

    ERIC Educational Resources Information Center

    Flynt, Abby; Dean, Nema

    2016-01-01

    Cluster analysis is a set of statistical methods for discovering new group/class structure when exploring data sets. This article reviews the following popular libraries/commands in the R software language for applying different types of cluster analysis: from the stats library, the kmeans, and hclust functions; the mclust library; the poLCA…

  1. Using Cluster Analysis for Data Mining in Educational Technology Research

    ERIC Educational Resources Information Center

    Antonenko, Pavlo D.; Toy, Serkan; Niederhauser, Dale S.

    2012-01-01

    Cluster analysis is a group of statistical methods that has great potential for analyzing the vast amounts of web server-log data to understand student learning from hyperlinked information resources. In this methodological paper we provide an introduction to cluster analysis for educational technology researchers and illustrate its use through…

  2. The writer independent online handwriting recognition system frog on hand and cluster generative statistical dynamic time warping.

    PubMed

    Bahlmann, Claus; Burkhardt, Hans

    2004-03-01

In this paper, we give a comprehensive description of our writer-independent online handwriting recognition system frog on hand. The focus of this work is the presentation of the classification/training approach, which we call cluster generative statistical dynamic time warping (CSDTW). CSDTW is a general, scalable, HMM-based method for variable-sized, sequential data that holistically combines cluster analysis and statistical sequence modeling. It can handle general classification problems that rely on this sequential type of data, e.g., speech recognition, genome processing, or robotics. In contrast to previous approaches, clustering and statistical sequence modeling are embedded in a single feature space and use a closely related distance measure. We show character recognition experiments of frog on hand using CSDTW on the UNIPEN online handwriting database. The recognition accuracy is significantly higher than reported results of other handwriting recognition systems. Finally, we describe the real-time implementation of frog on hand on a Linux Compaq iPAQ embedded device.

  3. Applications of modern statistical methods to analysis of data in physical science

    NASA Astrophysics Data System (ADS)

    Wicker, James Eric

Modern methods of statistical and computational analysis offer solutions to dilemmas confronting researchers in physical science. Although the ideas behind modern statistical and computational analysis methods were originally introduced in the 1970s, most scientists still rely on methods written during the early era of computing. These researchers, who analyze increasingly voluminous and multivariate data sets, need modern analysis methods to extract the best results from their studies. The first section of this work showcases applications of modern linear regression. Since the 1960s, many researchers in spectroscopy have used classical stepwise regression techniques to derive molecular constants. However, problems with thresholds of entry and exit for model variables plague this analysis method. Other criticisms of this kind of stepwise procedure include its inefficient searching method, the order in which variables enter or leave the model, and problems with overfitting data. We implement an information scoring technique that overcomes the assumptions inherent in the stepwise regression process to calculate molecular model parameters. We believe that this kind of information-based model evaluation can be applied to more general analysis situations in physical science. The second section proposes new methods of multivariate cluster analysis. The K-means algorithm and the EM algorithm, introduced in the 1960s and 1970s respectively, formed the basis of multivariate cluster analysis methodology for many years. However, several shortcomings of these methods include strong dependence on initial seed values and inaccurate results when the data seriously depart from hypersphericity. We propose new cluster analysis methods based on genetic algorithms that overcome the strong dependence on initial seed values.
In addition, we propose a generalization of the Genetic K-means algorithm which can accurately identify clusters with complex hyperellipsoidal covariance structures. We then use this new algorithm in a genetic algorithm based Expectation-Maximization process that can accurately calculate parameters describing complex clusters in a mixture model routine. Using the accuracy of this GEM algorithm, we assign information scores to cluster calculations in order to best identify the number of mixture components in a multivariate data set. We will showcase how these algorithms can be used to process multivariate data from astronomical observations.
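The information-scoring step in the last paragraph (score mixture fits to choose the number of components) can be illustrated with a deliberately minimal one-dimensional EM and a BIC comparison. This is a toy stand-in under stated assumptions, not the author's genetic-algorithm-based GEM procedure:

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=200):
    """Minimal EM for a 1-D Gaussian mixture; returns the final log-likelihood."""
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))   # deterministic spread-out init
    sd = np.full(k, x.std())
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
        resp = dens / dens.sum(axis=1, keepdims=True)        # E-step
        nk = resp.sum(axis=0) + 1e-12                        # M-step
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sd = np.maximum(np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk), 1e-3)
    dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))
    return np.log(dens.sum(axis=1)).sum()

def bic(loglik, n_params, n):
    """Bayesian information criterion: lower is better."""
    return n_params * np.log(n) - 2.0 * loglik

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)])
# a k-component 1-D Gaussian mixture has 3k - 1 free parameters
scores = {k: bic(em_gmm_1d(x, k), 3 * k - 1, len(x)) for k in (1, 2, 3)}
best_k = min(scores, key=scores.get)
```

On this synthetic two-component sample, the penalty term makes three components score worse than two even though the three-component fit has a slightly higher likelihood, which is the trade-off information scores are designed to arbitrate.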

  4. Penalized likelihood and multi-objective spatial scans for the detection and inference of irregular clusters

    PubMed Central

    2010-01-01

Background Irregularly shaped spatial clusters are difficult to delineate. A cluster found by an algorithm often spreads through large portions of the map, impacting its geographical meaning. Penalized likelihood methods for Kulldorff's spatial scan statistics have been used to control the excessive freedom of the shape of clusters. Penalty functions based on cluster geometry and non-connectivity have been proposed recently. Another approach involves the use of a multi-objective algorithm to maximize two objectives: the spatial scan statistics and the geometric penalty function. Results & Discussion We present a novel scan statistic algorithm employing a function based on the graph topology to penalize the presence of under-populated disconnection nodes in candidate clusters, the disconnection nodes cohesion function. A disconnection node is defined as a region within a cluster, such that its removal disconnects the cluster. By applying this function, the most geographically meaningful clusters are sifted through the immense set of possible irregularly shaped candidate cluster solutions. To evaluate the statistical significance of solutions for multi-objective scans, a statistical approach based on the concept of attainment function is used. In this paper we compared different penalized likelihoods employing the geometric and non-connectivity regularity functions and the novel disconnection nodes cohesion function. We also built multi-objective scans using those three functions and compared them with the previous penalized likelihood scans. An application is presented using comprehensive state-wide data for Chagas' disease in puerperal women in Minas Gerais state, Brazil. Conclusions We show that, compared to the other single-objective algorithms, multi-objective scans present better performance, regarding power, sensitivity and positive predictive value. 
The multi-objective non-connectivity scan is faster and better suited for the detection of moderately irregularly shaped clusters. The multi-objective cohesion scan is most effective for the detection of highly irregularly shaped clusters. PMID:21034451
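The penalized-likelihood idea can be made concrete. The core quantity is Kulldorff's Poisson log-likelihood ratio for a candidate zone, which a penalized scan multiplies by a shape-regularity factor; the simple circle-based compactness penalty below is a standard illustrative choice, not this paper's more elaborate cohesion function:

```python
import numpy as np

def poisson_scan_llr(c, E, C):
    """Kulldorff log-likelihood ratio for a candidate zone under the Poisson model.
    c: observed cases in the zone, E: expected cases in the zone, C: total cases."""
    if c <= E:
        return 0.0   # only zones with excess risk are of interest
    return c * np.log(c / E) + (C - c) * np.log((C - c) / (C - E))

def compactness(area, perimeter):
    """Geometric compactness penalty in (0, 1]; equals 1 for a circle."""
    return 4 * np.pi * area / perimeter ** 2

# penalized score for a hypothetical irregular candidate cluster
score = poisson_scan_llr(30, 15.0, 200) * compactness(10.0, 14.0)
```

Sprawling, tentacular candidate zones have low compactness, so their likelihood ratio is discounted; this is exactly the "excessive freedom of shape" control the Background paragraph describes.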

  5. Application of microarray analysis on computer cluster and cloud platforms.

    PubMed

    Bernau, C; Boulesteix, A-L; Knaus, J

    2013-01-01

Analysis of recent high-dimensional biological data tends to be computationally intensive as many common approaches such as resampling or permutation tests require the basic statistical analysis to be repeated many times. A crucial advantage of these methods is that they can be easily parallelized due to the computational independence of the resampling or permutation iterations, which has led many statistics departments to establish their own computer clusters. An alternative is to rent computing resources in the cloud, e.g. at Amazon Web Services. In this article we analyze whether a selection of statistical projects, recently implemented at our department, can be efficiently realized on these cloud resources. Moreover, we illustrate an opportunity to combine computer cluster and cloud resources. In order to compare the efficiency of computer cluster and cloud implementations and their respective parallelizations we use microarray analysis procedures and compare their runtimes on the different platforms. Amazon Web Services provide various instance types which meet the particular needs of the different statistical projects we analyzed in this paper. Moreover, the network capacity is sufficient and the parallelization is comparable in efficiency to standard computer cluster implementations. Our results suggest that many statistical projects can be efficiently realized on cloud resources. It is important to mention, however, that workflows can change substantially as a result of a shift from computer cluster to cloud computing.
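The embarrassing parallelism described above (independent permutation iterations) can be sketched with the standard library's executor interface; on a cluster or in the cloud the same batching pattern applies, with local workers replaced by nodes. The function names and the simple mean-difference statistic are assumptions for illustration:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def perm_batch(args):
    """One worker: run a batch of label permutations and return null statistics."""
    x, y, n_perm, seed = args
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    out = np.empty(n_perm)
    for i in range(n_perm):
        rng.shuffle(pooled)
        out[i] = pooled[: len(x)].mean() - pooled[len(x):].mean()
    return out

def parallel_perm_test(x, y, n_perm=4000, n_workers=4):
    """Two-sided permutation test for a difference in means, split into
    independent batches that could run on separate machines."""
    observed = x.mean() - y.mean()
    batches = [(x, y, n_perm // n_workers, s) for s in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        null = np.concatenate(list(ex.map(perm_batch, batches)))
    # p-value with the usual +1 correction
    return (np.sum(np.abs(null) >= abs(observed)) + 1) / (len(null) + 1)
```

Each batch gets its own seed and touches no shared mutable state, so swapping the thread pool for a process pool, a batch scheduler, or rented cloud instances changes only the dispatch layer, not the statistics.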

  6. Zonation in the deep benthic megafauna : Application of a general test.

    PubMed

    Gardiner, Frederick P; Haedrich, Richard L

    1978-01-01

A test based on Maxwell-Boltzmann statistics, instead of the formerly suggested but inappropriate Bose-Einstein statistics (Pielou and Routledge, 1976), examines the distribution of the boundaries of species' ranges distributed along a gradient, and indicates whether they are random or clustered (zoned). The test is most useful as a preliminary to the application of more instructive but less statistically rigorous methods such as cluster analysis. The test indicates zonation is marked in the deep benthic megafauna living between 200 and 3000 m, but below 3000 m little zonation may be found.

  7. New output improvements for CLASSY

    NASA Technical Reports Server (NTRS)

    Rassbach, M. E. (Principal Investigator)

    1981-01-01

    Additional output data and formats for the CLASSY clustering algorithm were developed. Four such aids to the CLASSY user are described. These are: (1) statistical measures; (2) special map types; (3) formats for standard output; and (4) special cluster display method.

  8. Global, local and focused geographic clustering for case-control data with residential histories

    PubMed Central

    Jacquez, Geoffrey M; Kaufmann, Andy; Meliker, Jaymie; Goovaerts, Pierre; AvRuskin, Gillian; Nriagu, Jerome

    2005-01-01

Background This paper introduces a new approach for evaluating clustering in case-control data that accounts for residential histories. Although many statistics have been proposed for assessing local, focused and global clustering in health outcomes, few, if any, exist for evaluating clusters when individuals are mobile. Methods Local, global and focused tests for residential histories are developed based on sets of matrices of nearest neighbor relationships that reflect the changing topology of cases and controls. Exposure traces are defined that account for the latency between exposure and disease manifestation, and that use exposure windows whose duration may vary. Several of the methods so derived are applied to evaluate clustering of residential histories in a case-control study of bladder cancer in southeastern Michigan. These data are still being collected and the analysis is conducted for demonstration purposes only. Results Statistically significant clustering of residential histories of cases was found but is likely due to delayed reporting of cases by one of the hospitals participating in the study. Conclusion Data with residential histories are preferable when causative exposures and disease latencies occur on a long enough time span that human mobility matters. To analyze such data, methods are needed that take residential histories into account. PMID:15784151

  9. Logo image clustering based on advanced statistics

    NASA Astrophysics Data System (ADS)

    Wei, Yi; Kamel, Mohamed; He, Yiwei

    2007-11-01

    In recent years, there has been a growing interest in the research of image content description techniques. Among those, image clustering is one of the most frequently discussed topics. Similar to image recognition, image clustering is also a high-level representation technique. However it focuses on the coarse categorization rather than the accurate recognition. Based on wavelet transform (WT) and advanced statistics, the authors propose a novel approach that divides various shaped logo images into groups according to the external boundary of each logo image. Experimental results show that the presented method is accurate, fast and insensitive to defects.

  10. Cluster and propensity based approximation of a network

    PubMed Central

    2013-01-01

    Background The models in this article generalize current models for both correlation networks and multigraph networks. Correlation networks are widely applied in genomics research. In contrast to general networks, it is straightforward to test the statistical significance of an edge in a correlation network. It is also easy to decompose the underlying correlation matrix and generate informative network statistics such as the module eigenvector. However, correlation networks only capture the connections between numeric variables. An open question is whether one can find suitable decompositions of the similarity measures employed in constructing general networks. Multigraph networks are attractive because they support likelihood based inference. Unfortunately, it is unclear how to adjust current statistical methods to detect the clusters inherent in many data sets. Results Here we present an intuitive and parsimonious parametrization of a general similarity measure such as a network adjacency matrix. The cluster and propensity based approximation (CPBA) of a network not only generalizes correlation network methods but also multigraph methods. In particular, it gives rise to a novel and more realistic multigraph model that accounts for clustering and provides likelihood based tests for assessing the significance of an edge after controlling for clustering. We present a novel Majorization-Minimization (MM) algorithm for estimating the parameters of the CPBA. To illustrate the practical utility of the CPBA of a network, we apply it to gene expression data and to a bi-partite network model for diseases and disease genes from the Online Mendelian Inheritance in Man (OMIM). 
Conclusions The CPBA of a network is theoretically appealing since a) it generalizes correlation and multigraph network methods, b) it improves likelihood based significance tests for edge counts, c) it directly models higher-order relationships between clusters, and d) it suggests novel clustering algorithms. The CPBA of a network is implemented in Fortran 95 and bundled in the freely available R package PropClust. PMID:23497424

  11. Subtyping of Children with Developmental Dyslexia via Bootstrap Aggregated Clustering and the Gap Statistic: Comparison with the Double-Deficit Hypothesis

    ERIC Educational Resources Information Center

    King, Wayne M.; Giess, Sally A.; Lombardino, Linda J.

    2007-01-01

    Background: The marked degree of heterogeneity in persons with developmental dyslexia has motivated the investigation of possible subtypes. Attempts have proceeded both from theoretical models of reading and the application of unsupervised learning (clustering) methods. Previous cluster analyses of data obtained from persons with reading…
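The gap statistic named in the title compares the within-cluster dispersion of the data with its expectation under a uniform reference distribution. A simplified sketch follows (uniform reference over the bounding box, k-means from SciPy, and raw within-cluster sums of squares; the full Tibshirani et al. procedure also uses the reference spread to pick k):

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def log_wk(X, k, seed):
    """log of pooled within-cluster sum of squared distances to centroids."""
    np.random.seed(seed)                    # kmeans2 draws from the global NumPy RNG
    cent, lab = kmeans2(X, k, minit="++")
    return np.log(sum(((X[lab == j] - cent[j]) ** 2).sum() for j in range(k)))

def gap(X, k, n_ref=20, seed=0):
    """Gap(k): reference E[log Wk] minus observed log Wk; larger = stronger structure."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    rng = np.random.default_rng(seed)
    ref = [log_wk(rng.uniform(lo, hi, X.shape), k, seed + 1 + b) for b in range(n_ref)]
    return np.mean(ref) - log_wk(X, k, seed)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
gaps = {k: gap(X, k) for k in (1, 2, 3)}
```

On two tight synthetic blobs, the gap at k = 2 is much larger than at k = 1, because real cluster structure shrinks the observed dispersion far below what uniform noise allows.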

  12. Accounting for Multiple Births in Neonatal and Perinatal Trials: Systematic Review and Case Study

    PubMed Central

    Hibbs, Anna Maria; Black, Dennis; Palermo, Lisa; Cnaan, Avital; Luan, Xianqun; Truog, William E; Walsh, Michele C; Ballard, Roberta A

    2010-01-01

Objectives To determine the prevalence in the neonatal literature of statistical approaches accounting for the unique clustering patterns of multiple births, and to explore the sensitivity of an actual trial to several analytic approaches to multiples. Methods A systematic review of recent perinatal trials assessed the prevalence of studies accounting for clustering of multiples. The NO CLD trial served as a case study of the sensitivity of the outcome to several statistical strategies. We calculated odds ratios using non-clustered (logistic regression) and clustered (generalized estimating equations, multiple outputation) analyses. Results In the systematic review, most studies did not describe the randomization of twins and did not account for clustering. Of those studies that did, exclusion of multiples and generalized estimating equations were the most common strategies. The NO CLD study included 84 infants with a sibling enrolled in the study. Multiples were more likely than singletons to be white and were born to older mothers (p<0.01). Analyses that accounted for clustering were statistically significant; analyses assuming independence were not. Conclusions The statistical approach to multiples can influence the odds ratio and width of confidence intervals, thereby affecting the interpretation of a study outcome. A minority of perinatal studies address this issue. PMID:19969305

  13. Sulfur in Cometary Dust

    NASA Technical Reports Server (NTRS)

    Fomenkova, M. N.

    1997-01-01

    The computer-intensive project consisted of the analysis and synthesis of existing data on composition of comet Halley dust particles. The main objective was to obtain a complete inventory of sulfur containing compounds in the comet Halley dust by building upon the existing classification of organic and inorganic compounds and applying a variety of statistical techniques for cluster and cross-correlational analyses. A student hired for this project wrote and tested the software to perform cluster analysis. The following tasks were carried out: (1) selecting the data from existing database for the proposed project; (2) finding access to a standard library of statistical routines for cluster analysis; (3) reformatting the data as necessary for input into the library routines; (4) performing cluster analysis and constructing hierarchical cluster trees using three methods to define the proximity of clusters; (5) presenting the output results in different formats to facilitate the interpretation of the obtained cluster trees; (6) selecting groups of data points common for all three trees as stable clusters. We have also considered the chemistry of sulfur in inorganic compounds.

  14. Kappa statistic for clustered dichotomous responses from physicians and patients.

    PubMed

    Kang, Chaeryon; Qaqish, Bahjat; Monaco, Jane; Sheridan, Stacey L; Cai, Jianwen

    2013-09-20

The bootstrap method for estimating the standard error of the kappa statistic in the presence of clustered data is evaluated. Such data arise, for example, in assessing agreement between physicians and their patients regarding their understanding of the physician-patient interaction and discussions. We propose a computationally efficient procedure for generating correlated dichotomous responses for physicians and assigned patients for simulation studies. The simulation results demonstrate that the proposed bootstrap method produces a better estimate of the standard error and better coverage compared with the asymptotic standard-error estimate that ignores dependence among patients within physicians, provided the number of clusters is at least moderately large. We present an example of an application to a coronary heart disease prevention study. Copyright © 2013 John Wiley & Sons, Ltd.

  15. Minimal spanning tree algorithm for γ-ray source detection in sparse photon images: cluster parameters and selection strategies

    DOE PAGES

    Campana, R.; Bernieri, E.; Massaro, E.; ...

    2013-05-22

The minimal spanning tree (MST) algorithm is a graph-theoretical cluster-finding method. We previously applied it to γ-ray bidimensional images, showing that it is quite sensitive in finding faint sources. Possible sources are associated with the regions where the photon arrival directions clusterize. MST selects clusters starting from a particular “tree” connecting all the points of the image, performing a cut based on the angular distance between photons, and keeping groups with a number of events higher than a given threshold. In this paper, we show how a further filtering, based on some parameters linked to the cluster properties, can be applied to reduce spurious detections. We find that the most efficient parameter for this secondary selection is the magnitude M of a cluster, defined as the product of its number of events by its clustering degree. We test the sensitivity of the method by means of simulated and real Fermi-Large Area Telescope (LAT) fields. Our results show that √M is strongly correlated with other statistical significance parameters, derived from a wavelet based algorithm and maximum likelihood (ML) analysis, and that it can be used as a good estimator of statistical significance of MST detections. Finally, we apply the method to a 2-year LAT image at energies higher than 3 GeV, and we show the presence of new clusters, likely associated with BL Lac objects.
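The MST-plus-cut procedure can be sketched with SciPy's graph routines: build the tree over all points, drop edges longer than the separation cut, and keep connected components above the event threshold. The function and parameter names are illustrative, and the magnitude-based secondary selection from the paper is omitted:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial.distance import pdist, squareform

def mst_clusters(points, cut, min_events=5):
    """Build the MST of the points, remove edges longer than `cut`,
    and return the connected components with at least `min_events` points."""
    d = squareform(pdist(points))
    mst = minimum_spanning_tree(d).toarray()
    mst[mst > cut] = 0.0                       # cut long edges
    n_comp, labels = connected_components(mst != 0, directed=False)
    return [np.flatnonzero(labels == c) for c in range(n_comp)
            if np.sum(labels == c) >= min_events]
```

In the γ-ray application the distance would be the angular separation between photon arrival directions; isolated background photons end up as small components and are discarded by the threshold.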

  16. Comparison of cluster-based and source-attribution methods for estimating transmission risk using large HIV sequence databases.

    PubMed

    Le Vu, Stéphane; Ratmann, Oliver; Delpech, Valerie; Brown, Alison E; Gill, O Noel; Tostevin, Anna; Fraser, Christophe; Volz, Erik M

    2018-06-01

Phylogenetic clustering of HIV sequences from a random sample of patients can reveal epidemiological transmission patterns, but interpretation is hampered by limited theoretical support, and the statistical properties of clustering analysis remain poorly understood. Alternatively, source attribution methods allow fitting of HIV transmission models and thereby quantify aspects of disease transmission. A simulation study was conducted to assess error rates of clustering methods for detecting transmission risk factors. We modeled HIV epidemics among men who have sex with men and generated phylogenies comparable to those that can be obtained from HIV surveillance data in the UK. Clustering and source attribution approaches were applied to evaluate their ability to identify patient attributes as transmission risk factors. We find that commonly used methods show a misleading association between cluster size or odds of clustering and covariates that are correlated with time since infection, regardless of their influence on transmission. Clustering methods usually have higher error rates and lower sensitivity than source attribution methods for identifying transmission risk factors, but neither approach provides robust estimates of transmission risk ratios. Source attribution can alleviate drawbacks of phylogenetic clustering, but formal population genetic modeling may be required to estimate quantitative transmission risk factors. Copyright © 2017 The Authors. Published by Elsevier B.V. All rights reserved.

  17. Statistical Significance for Hierarchical Clustering

    PubMed Central

    Kimes, Patrick K.; Liu, Yufeng; Hayes, D. Neil; Marron, J. S.

    2017-01-01

    Summary Cluster analysis has proved to be an invaluable tool for the exploratory and unsupervised analysis of high dimensional datasets. Among methods for clustering, hierarchical approaches have enjoyed substantial popularity in genomics and other fields for their ability to simultaneously uncover multiple layers of clustering structure. A critical and challenging question in cluster analysis is whether the identified clusters represent important underlying structure or are artifacts of natural sampling variation. Few approaches have been proposed for addressing this problem in the context of hierarchical clustering, for which the problem is further complicated by the natural tree structure of the partition, and the multiplicity of tests required to parse the layers of nested clusters. In this paper, we propose a Monte Carlo based approach for testing statistical significance in hierarchical clustering which addresses these issues. The approach is implemented as a sequential testing procedure guaranteeing control of the family-wise error rate. Theoretical justification is provided for our approach, and its power to detect true clustering structure is illustrated through several simulation studies and applications to two cancer gene expression datasets. PMID:28099990

  18. Cluster analysis as a prediction tool for pregnancy outcomes.

    PubMed

    Banjari, Ines; Kenjerić, Daniela; Šolić, Krešimir; Mandić, Milena L

    2015-03-01

Considering specific physiology changes during gestation and thinking of pregnancy as a "critical window", classification of pregnant women at early pregnancy can be considered crucial. The paper demonstrates the use of a method based on an approach from intelligent data mining, cluster analysis. Cluster analysis is a statistical method that makes it possible to group individuals based on sets of identifying variables. The method was chosen in order to determine the possibility of classifying pregnant women at early pregnancy and to analyze unknown correlations between different variables so that certain outcomes could be predicted. 222 pregnant women from two general obstetric offices were recruited. The main focus was on characteristics of these pregnant women: their age, pre-pregnancy body mass index (BMI) and haemoglobin value. Cluster analysis gained a 94.1% classification accuracy rate with three branches or groups of pregnant women showing statistically significant correlations with pregnancy outcomes. The results show that pregnant women of both older age and higher pre-pregnancy BMI have a significantly higher incidence of delivering a baby of higher birth weight but gain significantly less weight during pregnancy. Their babies are also longer, and these women have a significantly higher probability of complications during pregnancy (gestosis) and a higher probability of induced or caesarean delivery. We can conclude that the cluster analysis method can appropriately classify pregnant women at early pregnancy to predict certain outcomes.

  19. Clustering Scientific Publications Based on Citation Relations: A Systematic Comparison of Different Methods.

    PubMed

    Šubelj, Lovro; van Eck, Nees Jan; Waltman, Ludo

    2016-01-01

    Clustering methods are applied regularly in the bibliometric literature to identify research areas or scientific fields. These methods are for instance used to group publications into clusters based on their relations in a citation network. In the network science literature, many clustering methods, often referred to as graph partitioning or community detection techniques, have been developed. Focusing on the problem of clustering the publications in a citation network, we present a systematic comparison of the performance of a large number of these clustering methods. Using a number of different citation networks, some of them relatively small and others very large, we extensively study the statistical properties of the results provided by different methods. In addition, we also carry out an expert-based assessment of the results produced by different methods. The expert-based assessment focuses on publications in the field of scientometrics. Our findings seem to indicate that there is a trade-off between different properties that may be considered desirable for a good clustering of publications. Overall, map equation methods appear to perform best in our analysis, suggesting that these methods deserve more attention from the bibliometric community.

  20. Clustering Scientific Publications Based on Citation Relations: A Systematic Comparison of Different Methods

    PubMed Central

    Šubelj, Lovro; van Eck, Nees Jan; Waltman, Ludo

    2016-01-01

    Clustering methods are applied regularly in the bibliometric literature to identify research areas or scientific fields. These methods are for instance used to group publications into clusters based on their relations in a citation network. In the network science literature, many clustering methods, often referred to as graph partitioning or community detection techniques, have been developed. Focusing on the problem of clustering the publications in a citation network, we present a systematic comparison of the performance of a large number of these clustering methods. Using a number of different citation networks, some of them relatively small and others very large, we extensively study the statistical properties of the results provided by different methods. In addition, we also carry out an expert-based assessment of the results produced by different methods. The expert-based assessment focuses on publications in the field of scientometrics. Our findings seem to indicate that there is a trade-off between different properties that may be considered desirable for a good clustering of publications. Overall, map equation methods appear to perform best in our analysis, suggesting that these methods deserve more attention from the bibliometric community. PMID:27124610

  1. Clustering of change patterns using Fourier coefficients.

    PubMed

    Kim, Jaehee; Kim, Haseong

    2008-01-15

To understand the behavior of genes, it is important to explore how the patterns of gene expression change over a time period because biologically related gene groups can share the same change patterns. Many clustering algorithms have been proposed to group observation data. However, because of the complexity of the underlying functions there have not been many studies on grouping data based on change patterns. In this study, the problem of finding similar change patterns reduces to clustering on the derivative Fourier coefficients. The sample Fourier coefficients not only provide information about the underlying functions, but also reduce the dimension. In addition, as their limiting distribution is a multivariate normal, a model-based clustering method incorporating statistical properties would be appropriate. This work is aimed at discovering gene groups with similar change patterns that share similar biological properties. We developed a statistical model using derivative Fourier coefficients to identify similar change patterns of gene expression. We used a model-based method to cluster the Fourier series estimation of derivatives. The model-based method is advantageous over other methods in our proposed model because the sample Fourier coefficients asymptotically follow the multivariate normal distribution. Change patterns are automatically estimated with the Fourier representation in our model. Our model was tested in simulations and on real gene data sets. The simulation results showed that the model-based clustering method with the sample Fourier coefficients has a lower clustering error rate than K-means clustering. Even when the number of repeated time points was small, the same results were obtained. We also applied our model to cluster change patterns of yeast cell cycle microarray expression data with alpha-factor synchronization. 
It showed that, as the method clusters with the probability-neighboring data, the model-based clustering with our proposed model yielded biologically interpretable results. We expect that our proposed Fourier analysis with suitably chosen smoothing parameters could serve as a useful tool in classifying genes and interpreting possible biological change patterns. The R program is available upon request.
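The core dimension-reduction step described above (represent each series by a few Fourier coefficients, then cluster the coefficient vectors) can be sketched as follows; plain k-means stands in for the paper's model-based clustering, and all data are synthetic:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def fourier_features(series, n_coef=3):
    """First n_coef complex Fourier coefficients beyond the mean, flattened
    to real features; each length-40 series becomes 2 * n_coef numbers."""
    F = np.fft.rfft(series, axis=1)[:, 1:n_coef + 1]
    return np.column_stack([F.real, F.imag])

t = np.linspace(0, 2 * np.pi, 40, endpoint=False)
rng = np.random.default_rng(0)
rising = np.sin(t) + rng.normal(0, 0.1, (30, 40))      # one change pattern
falling = -np.sin(t) + rng.normal(0, 0.1, (30, 40))    # the opposite pattern
X = fourier_features(np.vstack([rising, falling]))

np.random.seed(0)                     # kmeans2 draws from the global NumPy RNG
_, labels = kmeans2(X, 2, minit="++")
```

Opposite change patterns differ in the sign of the leading coefficient, so the two groups separate cleanly in coefficient space even though each raw series is 40-dimensional and noisy.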

  2. Spatiotemporal clusters of malaria cases at village level, northwest Ethiopia.

    PubMed

    Alemu, Kassahun; Worku, Alemayehu; Berhane, Yemane; Kumie, Abera

    2014-06-06

Malaria attacks are not evenly distributed in space and time. In highland areas with low endemicity, malaria transmission is highly variable and malaria acquisition risk for individuals is unevenly distributed even within a neighbourhood. Characterizing the spatiotemporal distribution of malaria cases in high-altitude villages is necessary to prioritize the risk areas and facilitate interventions. Spatial scan statistics using the Bernoulli method were employed to identify spatial and temporal clusters of malaria in high-altitude villages. Daily malaria data were collected, using a passive surveillance system, from patients visiting local health facilities. Georeference data were collected at villages using hand-held global positioning system devices and linked to patient data. A Bernoulli model using Bayesian approaches and Markov chain Monte Carlo (MCMC) methods was used to identify the effects of factors on spatial clusters of malaria cases. The deviance information criterion (DIC) was used to assess the goodness-of-fit of the different models; the smaller the DIC, the better the model fit. Malaria cases were clustered in both space and time in high-altitude villages. Spatial scan statistics identified a total of 56 spatial clusters of malaria in high-altitude villages. Of these, 39 were the most likely clusters (LLR = 15.62, p < 0.00001) and 17 were secondary clusters (LLR = 7.05, p < 0.03). The significant most likely temporal malaria clusters were detected between August and December (LLR = 17.87, p < 0.001). Travel away from home, male sex, and age above 15 years had statistically significant effects on malaria clusters in high-altitude villages. The study identified spatial clusters of malaria cases occurring at high elevation villages within the district. A patient who travelled away from home to a malaria-endemic area might be the most probable source of malaria infection in a high-altitude village. 
Malaria interventions in high altitude villages should address factors associated with malaria clustering.
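The Bernoulli scan statistic employed above can be illustrated with a minimal sketch of Kulldorff's log-likelihood ratio for a single candidate zone. The counts here are made up, and significance in practice comes from Monte Carlo replications over random zones, which are omitted:

```python
from math import log

def bernoulli_scan_llr(c, n, C, N):
    """Kulldorff's Bernoulli log-likelihood ratio for one candidate zone.

    c: cases inside the zone, n: points inside the zone,
    C: total cases,           N: total points.
    Returns 0 when the zone is not a hot spot (inside rate <= outside rate).
    """
    if c / n <= (C - c) / (N - n):
        return 0.0

    def xlogx(k, m):  # k * log(k/m), with the convention 0 * log(0) = 0
        return k * log(k / m) if k > 0 else 0.0

    alt = (xlogx(c, n) + xlogx(n - c, n)
           + xlogx(C - c, N - n) + xlogx((N - n) - (C - c), N - n))
    null = xlogx(C, N) + xlogx(N - C, N)
    return alt - null

# A zone holding 10 of its 20 points as cases, against 20 cases in 100 points overall
llr = bernoulli_scan_llr(c=10, n=20, C=20, N=100)
```

The most likely cluster is the zone maximizing this ratio; secondary clusters are lower-ranked maxima.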

  3. Scoring clustering solutions by their biological relevance.

    PubMed

    Gat-Viks, I; Sharan, R; Shamir, R

    2003-12-12

A central step in the analysis of gene expression data is the identification of groups of genes that exhibit similar expression patterns. Clustering gene expression data into homogeneous groups was shown to be instrumental in functional annotation, tissue classification, regulatory motif identification, and other applications. Although there is a rich literature on clustering algorithms for gene expression analysis, very few works addressed the systematic comparison and evaluation of clustering results. Typically, different clustering algorithms yield different clustering solutions on the same data, and there is no agreed-upon guideline for choosing among them. We developed a novel statistically based method for assessing a clustering solution according to prior biological knowledge. Our method can be used to compare different clustering solutions or to optimize the parameters of a clustering algorithm. The method is based on projecting vectors of biological attributes of the clustered elements onto the real line, such that the ratio of between-group and within-group variance estimators is maximized. The projected data are then scored using a non-parametric analysis of variance test, and the score's confidence is evaluated. We validate our approach using simulated data and show that our scoring method outperforms several extant methods, including the separation to homogeneity ratio and the silhouette measure. We apply our method to evaluate results of several clustering methods on yeast cell-cycle gene expression data. The software is available from the authors upon request.
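The scoring idea described above — projecting attribute vectors onto the real line so that the between-group to within-group variance ratio is maximized, then applying a non-parametric test — can be sketched for two clusters using a Fisher-style projection and the Kruskal-Wallis test. The attribute data are synthetic; this illustrates the idea, not the authors' software:

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)
# Hypothetical biological attribute vectors (e.g. annotation scores)
# for the elements of two clusters
a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(40, 2))
b = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(40, 2))

# Fisher direction: maximizes between-group over within-group variance
sw = np.cov(a.T) * (len(a) - 1) + np.cov(b.T) * (len(b) - 1)
w = np.linalg.solve(sw, a.mean(axis=0) - b.mean(axis=0))

# Project onto the real line and score with a non-parametric ANOVA
stat, pval = kruskal(a @ w, b @ w)
```

A clustering whose groups align with the biological attributes yields a small p-value; competing solutions can be ranked by this score.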

  4. Detection of Anomalies in Hydrometric Data Using Artificial Intelligence Techniques

    NASA Astrophysics Data System (ADS)

    Lauzon, N.; Lence, B. J.

    2002-12-01

    This work focuses on the detection of anomalies in hydrometric data sequences, such as 1) outliers, which are individual data having statistical properties that differ from those of the overall population; 2) shifts, which are sudden changes over time in the statistical properties of the historical records of data; and 3) trends, which are systematic changes over time in the statistical properties. For the purpose of the design and management of water resources systems, it is important to be aware of these anomalies in hydrometric data, for they can induce a bias in the estimation of water quantity and quality parameters. These anomalies may be viewed as specific patterns affecting the data, and therefore pattern recognition techniques can be used for identifying them. However, the number of possible patterns is very large for each type of anomaly and consequently large computing capacities are required to account for all possibilities using the standard statistical techniques, such as cluster analysis. Artificial intelligence techniques, such as the Kohonen neural network and fuzzy c-means, are clustering techniques commonly used for pattern recognition in several areas of engineering and have recently begun to be used for the analysis of natural systems. They require much less computing capacity than the standard statistical techniques, and therefore are well suited for the identification of outliers, shifts and trends in hydrometric data. This work constitutes a preliminary study, using synthetic data representing hydrometric data that can be found in Canada. The analysis of the results obtained shows that the Kohonen neural network and fuzzy c-means are reasonably successful in identifying anomalies. This work also addresses the problem of uncertainties inherent to the calibration procedures that fit the clusters to the possible patterns for both the Kohonen neural network and fuzzy c-means. 
Indeed, for the same database, different sets of clusters can be established with these calibration procedures. A simple method for analyzing uncertainties associated with the Kohonen neural network and fuzzy c-means is developed here. The method combines the results from several sets of clusters, either from the Kohonen neural network or fuzzy c-means, so as to provide an overall diagnosis as to the identification of outliers, shifts and trends. The results indicate an improvement in the performance for identifying anomalies when the method of combining cluster sets is used, compared with when only one cluster set is used.
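A minimal fuzzy c-means implementation conveys the clustering step used above. The data are synthetic; the Kohonen network and the calibration-uncertainty combination procedure are omitted:

```python
import numpy as np

def fuzzy_c_means(x, n_clusters=2, m=2.0, n_iter=100, seed=0):
    """Plain fuzzy c-means: returns (centers, membership matrix u)."""
    rng = np.random.default_rng(seed)
    u = rng.random((len(x), n_clusters))
    u /= u.sum(axis=1, keepdims=True)            # memberships sum to 1
    for _ in range(n_iter):
        um = u ** m
        centers = (um.T @ x) / um.sum(axis=0)[:, None]
        d = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        d = np.maximum(d, 1e-12)                 # avoid division by zero
        inv = d ** (-2.0 / (m - 1.0))
        u = inv / inv.sum(axis=1, keepdims=True)
    return centers, u

# Two well-separated synthetic "patterns"
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(5.0, 0.1, (20, 2))])
centers, u = fuzzy_c_means(x)
```

Unlike hard clustering, each data sequence receives a graded membership in every pattern, which is what makes combining several cluster sets into an overall diagnosis straightforward.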

  5. Energy spectra of X-ray clusters of galaxies

    NASA Technical Reports Server (NTRS)

    Avni, Y.

    1976-01-01

    A procedure for estimating the ranges of parameters that describe the spectra of X-rays from clusters of galaxies is presented. The applicability of the method is proved by statistical simulations of cluster spectra; such a proof is necessary because of the nonlinearity of the spectral functions. Implications for the spectra of the Perseus, Coma, and Virgo clusters are discussed. The procedure can be applied in more general problems of parameter estimation.

  6. Postgraduate Taught Portfolio Review--The Cluster Approach, Non-Subject-Based Grouping of Courses and Relevant Performance Indicators

    ERIC Educational Resources Information Center

    Konstantinidis-Pereira, Alicja

    2018-01-01

    This paper summarises a new method of grouping postgraduate taught (PGT) courses introduced at Oxford Brookes University as a part of a Portfolio Review. Instead of classifying courses by subject, the new cluster approach uses statistical methods to group the courses based on factors including flexibility of study options, level of specialisation,…

  7. DIMM-SC: a Dirichlet mixture model for clustering droplet-based single cell transcriptomic data.

    PubMed

    Sun, Zhe; Wang, Ting; Deng, Ke; Wang, Xiao-Feng; Lafyatis, Robert; Ding, Ying; Hu, Ming; Chen, Wei

    2018-01-01

Single cell transcriptome sequencing (scRNA-Seq) has become a revolutionary tool to study cellular and molecular processes at single cell resolution. Among existing technologies, the recently developed droplet-based platform enables efficient parallel processing of thousands of single cells with direct counting of transcript copies using Unique Molecular Identifier (UMI). Despite these technological advances, statistical methods and computational tools are still lacking for analyzing droplet-based scRNA-Seq data. Particularly, model-based approaches for clustering large-scale single cell transcriptomic data are still under-explored. We developed DIMM-SC, a Dirichlet Mixture Model for clustering droplet-based Single Cell transcriptomic data. This approach explicitly models UMI count data from scRNA-Seq experiments and characterizes variations across different cell clusters via a Dirichlet mixture prior. We performed comprehensive simulations to evaluate DIMM-SC and compared it with existing clustering methods such as K-means, CellTree and Seurat. In addition, we analyzed public scRNA-Seq datasets with known cluster labels and in-house scRNA-Seq datasets from a study of systemic sclerosis with prior biological knowledge to benchmark and validate DIMM-SC. Both simulation studies and real data applications demonstrated that overall, DIMM-SC achieves substantially improved clustering accuracy and much lower clustering variability compared to other existing clustering methods. More importantly, as a model-based approach, DIMM-SC is able to quantify the clustering uncertainty for each single cell, facilitating rigorous statistical inference and biological interpretations, which are typically unavailable from existing clustering methods. DIMM-SC has been implemented in a user-friendly R package with a detailed tutorial available on www.pitt.edu/~wec47/singlecell.html. wei.chen@chp.edu or hum@ccf.org. Supplementary data are available at Bioinformatics online.
© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
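The core modelling idea — scoring a cell's UMI count vector under cluster-specific Dirichlet-multinomial distributions — can be sketched as follows. The counts and concentration parameters are toy values, not the DIMM-SC package itself:

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_multinomial_loglik(x, alpha):
    """Log-likelihood of a UMI count vector x under a Dirichlet-multinomial
    with concentration parameters alpha (one component per gene).

    The multinomial coefficient is omitted: it is constant across clusters
    and so does not affect cluster assignment."""
    x = np.asarray(x, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    n, a0 = x.sum(), alpha.sum()
    return (gammaln(a0) - gammaln(a0 + n)
            + np.sum(gammaln(alpha + x) - gammaln(alpha)))

# A cell whose counts concentrate on gene 0 fits the matching cluster better
cell = [8, 1, 1]
ll_match = dirichlet_multinomial_loglik(cell, [10.0, 1.0, 1.0])
ll_other = dirichlet_multinomial_loglik(cell, [1.0, 1.0, 10.0])
```

Normalizing such log-likelihoods across clusters gives per-cell posterior membership probabilities, which is how a mixture model quantifies clustering uncertainty.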

  8. Degree-based statistic and center persistency for brain connectivity analysis.

    PubMed

    Yoo, Kwangsun; Lee, Peter; Chung, Moo K; Sohn, William S; Chung, Sun Ju; Na, Duk L; Ju, Daheen; Jeong, Yong

    2017-01-01

Brain connectivity analyses have been widely performed to investigate the organization and functioning of the brain, or to observe changes in neurological or psychiatric conditions. However, connectivity analysis inevitably introduces the problem of mass-univariate hypothesis testing. Although several cluster-wise correction methods have been suggested to address this problem and have been shown to provide high sensitivity, these approaches fundamentally have two drawbacks: the lack of spatial specificity (localization power) and the arbitrariness of an initial cluster-forming threshold. In this study, we propose a novel method, degree-based statistic (DBS), performing cluster-wise inference. DBS is designed to overcome the above-mentioned two shortcomings. From a network perspective, a few brain regions are of critical importance and considered to play pivotal roles in network integration. Reflecting this notion, DBS defines a cluster as a set of edges that share one ending node. This definition enables the efficient detection of clusters and their center nodes. Furthermore, a new cluster measure, center persistency (CP), was introduced. We demonstrated the efficiency of DBS with a known "ground truth" simulation, then applied DBS to two experimental datasets and showed that it successfully detects the persistent clusters. In conclusion, by adopting a graph theoretical concept of degrees and borrowing the concept of persistence from algebraic topology, DBS could sensitively identify clusters with centric nodes that would play pivotal roles in an effect of interest. DBS is potentially widely applicable to variable cognitive or clinical situations and allows us to obtain statistically reliable and easily interpretable results. Hum Brain Mapp 38:165-181, 2017. © 2016 Wiley Periodicals, Inc.

  9. Applying the Anderson-Darling test to suicide clusters: evidence of contagion at U. S. universities?

    PubMed

    MacKenzie, Donald W

    2013-01-01

Suicide clusters at Cornell University and the Massachusetts Institute of Technology (MIT) prompted popular and expert speculation of suicide contagion. However, some clustering is to be expected in any random process. This work tested whether suicide clusters at these two universities differed significantly from those expected under a homogeneous Poisson process, in which suicides occur randomly and independently of one another. Suicide dates were collected for MIT and Cornell for 1990-2012. The Anderson-Darling statistic was used to test the goodness-of-fit of the intervals between suicides to the distribution expected under the Poisson process. Suicides at MIT were consistent with the homogeneous Poisson process, while those at Cornell showed clustering inconsistent with such a process (p = .05). The Anderson-Darling test provides a statistically powerful means to identify suicide clustering in small samples. Practitioners can use this method to test for clustering in relevant communities. The difference in clustering behavior between the two institutions suggests that more institutions should be studied to determine the prevalence of suicide clustering in universities and its causes.
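The testing idea can be sketched with SciPy's Anderson-Darling implementation: under a homogeneous Poisson process, the intervals between events are exponentially distributed, so what is tested is the fit of the observed intervals to an exponential distribution. The event dates below are synthetic, not the study's data:

```python
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(42)
# Under a homogeneous Poisson process, event times are uniform on the
# observation window and the gaps between them are (approximately) exponential
event_days = np.sort(rng.uniform(0, 365 * 20, size=30))
intervals = np.diff(event_days)

# Anderson-Darling goodness-of-fit of the intervals to an exponential law
res = anderson(intervals, dist='expon')
```

If `res.statistic` exceeds the critical value paired with the 5% significance level, the homogeneous Poisson hypothesis is rejected, i.e. the events cluster more (or less) than chance allows.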

  10. Case-control geographic clustering for residential histories accounting for risk factors and covariates.

    PubMed

    Jacquez, Geoffrey M; Meliker, Jaymie R; Avruskin, Gillian A; Goovaerts, Pierre; Kaufmann, Andy; Wilson, Mark L; Nriagu, Jerome

    2006-08-03

Methods for analyzing space-time variation in risk in case-control studies typically ignore residential mobility. We develop an approach for analyzing case-control data for mobile individuals and apply it to study bladder cancer in 11 counties in southeastern Michigan. At this time data collection is incomplete and no inferences should be drawn - we analyze these data to demonstrate the novel methods. Global, local and focused clustering of residential histories for 219 cases and 437 controls is quantified using time-dependent nearest neighbor relationships. Business address histories for 268 industries that release known or suspected bladder cancer carcinogens are analyzed. A logistic model accounting for smoking, gender, age, race and education specifies the probability of being a case, and is incorporated into the cluster randomization procedures. Sensitivity of clustering to the definition of the proximity metric is assessed for k = 1 to 75 nearest neighbors. Global clustering is partly explained by the covariates but remains statistically significant at 12 of the 14 levels of k considered. After accounting for the covariates, 26 local clusters are found in Lapeer, Ingham, Oakland and Jackson counties, with the clusters in Ingham and Oakland counties appearing in 1950 and persisting to the present. Statistically significant focused clusters are found about the business address histories of 22 industries located in Oakland (19 clusters), Ingham (2) and Jackson (1) counties. Clusters in central and southeastern Oakland County appear in the 1930s and persist to the present day. These methods provide a systematic approach for evaluating a series of increasingly realistic alternative hypotheses regarding the sources of excess risk. So long as selection of cases and controls is population-based and not geographically biased, these tools can provide insights into geographic risk factors that were not specifically assessed in the case-control study design.

  11. Extracting Aggregation Free Energies of Mixed Clusters from Simulations of Small Systems: Application to Ionic Surfactant Micelles.

    PubMed

    Zhang, X; Patel, L A; Beckwith, O; Schneider, R; Weeden, C J; Kindt, J T

    2017-11-14

    Micelle cluster distributions from molecular dynamics simulations of a solvent-free coarse-grained model of sodium octyl sulfate (SOS) were analyzed using an improved method to extract equilibrium association constants from small-system simulations containing one or two micelle clusters at equilibrium with free surfactants and counterions. The statistical-thermodynamic and mathematical foundations of this partition-enabled analysis of cluster histograms (PEACH) approach are presented. A dramatic reduction in computational time for analysis was achieved through a strategy similar to the selector variable method to circumvent the need for exhaustive enumeration of the possible partitions of surfactants and counterions into clusters. Using statistics from a set of small-system (up to 60 SOS molecules) simulations as input, equilibrium association constants for micelle clusters were obtained as a function of both number of surfactants and number of associated counterions through a global fitting procedure. The resulting free energies were able to accurately predict micelle size and charge distributions in a large (560 molecule) system. The evolution of micelle size and charge with SOS concentration as predicted by the PEACH-derived free energies and by a phenomenological four-parameter model fit, along with the sensitivity of these predictions to variations in cluster definitions, are analyzed and discussed.

  12. Using spatial analysis to demonstrate the heterogeneity of the cardiovascular drug-prescribing pattern in Taiwan

    PubMed Central

    2011-01-01

    Background Geographic Information Systems (GIS) combined with spatial analytical methods could be helpful in examining patterns of drug use. Little attention has been paid to geographic variation of cardiovascular prescription use in Taiwan. The main objective was to use local spatial association statistics to test whether or not the cardiovascular medication-prescribing pattern is homogenous across 352 townships in Taiwan. Methods The statistical methods used were the global measures of Moran's I and Local Indicators of Spatial Association (LISA). While Moran's I provides information on the overall spatial distribution of the data, LISA provides information on types of spatial association at the local level. LISA statistics can also be used to identify influential locations in spatial association analysis. The major classes of prescription cardiovascular drugs were taken from Taiwan's National Health Insurance Research Database (NHIRD), which has a coverage rate of over 97%. The dosage of each prescription was converted into defined daily doses to measure the consumption of each class of drugs. Data were analyzed with ArcGIS and GeoDa at the township level. Results The LISA statistics showed an unusual use of cardiovascular medications in the southern townships with high local variation. Patterns of drug use also showed more low-low spatial clusters (cold spots) than high-high spatial clusters (hot spots), and those low-low associations were clustered in the rural areas. Conclusions The cardiovascular drug prescribing patterns were heterogeneous across Taiwan. In particular, a clear pattern of north-south disparity exists. Such spatial clustering helps prioritize the target areas that require better education concerning drug use. PMID:21609462
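Global Moran's I, the first of the two statistics above, reduces to a short computation once a spatial weights matrix is chosen. The four-township example below is a toy; LISA decomposes the same cross-product into per-location contributions:

```python
import numpy as np

def morans_i(values, w):
    """Global Moran's I for values observed at n locations.

    w: n x n spatial weights matrix with zero diagonal.
    Positive I indicates spatial clustering of similar values;
    the expectation under no spatial association is -1/(n-1)."""
    z = values - values.mean()
    s0 = w.sum()
    return (len(values) / s0) * (z @ w @ z) / (z @ z)

# Four townships on a line, neighbours adjacent; high drug use clusters on the left
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
use = np.array([10.0, 10.0, 1.0, 1.0])
i_stat = morans_i(use, w)
```

Here the two high-use and two low-use townships sit next to their own kind, so I is positive (1/3), well above the null expectation of -1/3 for n = 4.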

  13. SparRec: An effective matrix completion framework of missing data imputation for GWAS

    NASA Astrophysics Data System (ADS)

    Jiang, Bo; Ma, Shiqian; Causey, Jason; Qiao, Linbo; Hardin, Matthew Price; Bitts, Ian; Johnson, Daniel; Zhang, Shuzhong; Huang, Xiuzhen

    2016-10-01

Genome-wide association studies present computational challenges for missing data imputation, as advances in genotyping technologies generate datasets of large sample size with sample sets genotyped on multiple SNP chips. We present a new framework SparRec (Sparse Recovery) for imputation, with the following properties: (1) The optimization models of SparRec, based on low-rank and low number of co-clusters of matrices, are different from current statistical methods. While our low-rank matrix completion (LRMC) model is similar to Mendel-Impute, our matrix co-clustering factorization (MCCF) model is completely new. (2) SparRec, like other matrix completion methods, can flexibly be applied to missing data imputation for large meta-analysis with different cohorts genotyped on different sets of SNPs, even when there is no reference panel. This kind of meta-analysis is very challenging for current statistics-based methods. (3) SparRec has consistent performance and achieves high recovery accuracy even when the missing data rate is as high as 90%. Compared with Mendel-Impute, our low-rank based method achieves similar accuracy and efficiency, while the co-clustering based method has advantages in running time. The testing results show that SparRec has significant advantages and competitive performance over other state-of-the-art statistical methods, including Beagle and fastPhase.

  14. [Cluster analysis applicability to fitness evaluation of cosmonauts on long-term missions of the International space station].

    PubMed

    Egorov, A D; Stepantsov, V I; Nosovskiĭ, A M; Shipov, A A

    2009-01-01

Cluster analysis was applied to evaluate locomotion training (running and running intermingled with walking) of 13 cosmonauts on long-term ISS missions by the parameters of duration (min), distance (m) and intensity (km/h). Based on the results of analyses, the cosmonauts were distributed into three steady groups of 2, 5 and 6 persons. Distance and speed showed a statistically significant rise (p < 0.03) from group 1 to group 3. Duration of physical locomotion training was not statistically different in the groups (p = 0.125). Therefore, cluster analysis is an adequate method of evaluating fitness of cosmonauts on long-term missions.

  15. Taxonomy and clustering in collaborative systems: The case of the on-line encyclopedia Wikipedia

    NASA Astrophysics Data System (ADS)

    Capocci, A.; Rao, F.; Caldarelli, G.

    2008-01-01

In this paper we investigate the nature and structure of the relation between imposed classifications and real clustering in a particular case of a scale-free network given by the on-line encyclopedia Wikipedia. We find a statistical similarity in the distributions of community sizes both by using the top-down approach of the categories division present in the archive and in the bottom-up procedure of community detection given by an algorithm based on the spectral properties of the graph. Despite the statistically similar behaviour, the two methods provide a rather different division of the articles, thereby signaling that the presence of power laws is a general feature of these systems and cannot be used as a benchmark to evaluate the suitability of a clustering method.

  16. Water quality analysis of the Rapur area, Andhra Pradesh, South India using multivariate techniques

    NASA Astrophysics Data System (ADS)

    Nagaraju, A.; Sreedhar, Y.; Thejaswi, A.; Sayadi, Mohammad Hossein

    2017-10-01

The groundwater samples from Rapur area were collected from different sites to evaluate the major ion chemistry. Such a large volume of data can lead to difficulties in the integration, interpretation, and representation of the results. Two multivariate statistical methods, hierarchical cluster analysis (HCA) and factor analysis (FA), were applied to evaluate their usefulness to classify and identify geochemical processes controlling groundwater geochemistry. Four statistically significant clusters were obtained from 30 sampling stations. This resulted in two important clusters, viz. cluster 1 (pH, Si, CO3, Mg, SO4, Ca, K, HCO3, alkalinity, Na, Na + K, Cl, and hardness) and cluster 2 (EC and TDS), which are released to the study area from different sources. The application of different multivariate statistical techniques, such as principal component analysis (PCA), assists in the interpretation of complex data matrices for a better understanding of water quality of a study area. From PCA, it is clear that the first factor (factor 1), which accounted for 36.2% of the total variance, had high positive loadings on EC, Mg, Cl, TDS, and hardness. Based on the PCA scores, four significant cluster groups of sampling locations were detected on the basis of similarity of their water quality.

  17. Accounting for isotopic clustering in Fourier transform mass spectrometry data analysis for clinical diagnostic studies.

    PubMed

    Kakourou, Alexia; Vach, Werner; Nicolardi, Simone; van der Burgt, Yuri; Mertens, Bart

    2016-10-01

    Mass spectrometry based clinical proteomics has emerged as a powerful tool for high-throughput protein profiling and biomarker discovery. Recent improvements in mass spectrometry technology have boosted the potential of proteomic studies in biomedical research. However, the complexity of the proteomic expression introduces new statistical challenges in summarizing and analyzing the acquired data. Statistical methods for optimally processing proteomic data are currently a growing field of research. In this paper we present simple, yet appropriate methods to preprocess, summarize and analyze high-throughput MALDI-FTICR mass spectrometry data, collected in a case-control fashion, while dealing with the statistical challenges that accompany such data. The known statistical properties of the isotopic distribution of the peptide molecules are used to preprocess the spectra and translate the proteomic expression into a condensed data set. Information on either the intensity level or the shape of the identified isotopic clusters is used to derive summary measures on which diagnostic rules for disease status allocation will be based. Results indicate that both the shape of the identified isotopic clusters and the overall intensity level carry information on the class outcome and can be used to predict the presence or absence of the disease.

  18. Bringing Clouds into Our Lab! - The Influence of Turbulence on the Early Stage Rain Droplets

    NASA Astrophysics Data System (ADS)

    Yavuz, Mehmet Altug; Kunnen, Rudie; Heijst, Gertjan; Clercx, Herman

    2015-11-01

    We are investigating a droplet-laden flow in an air-filled turbulence chamber, forced by speaker-driven air jets. The speakers are running in a random manner; yet they allow us to control and define the statistics of the turbulence. We study the motion of droplets with tunable size (Stokes numbers ~ 0.13 - 9) in a turbulent flow, mimicking the early stages of raindrop formation. 3D Particle Tracking Velocimetry (PTV) together with Laser Induced Fluorescence (LIF) methods are chosen as the experimental method to track the droplets and collect data for statistical analysis. Thereby it is possible to study the spatial distribution of the droplets in turbulence using the so-called Radial Distribution Function (RDF), a statistical measure to quantify the clustering of particles. Additionally, 3D-PTV technique allows us to measure velocity statistics of the droplets and the influence of the turbulence on droplet trajectories, both individually and collectively. In this contribution, we will present the clustering probability quantified by the RDF for different Stokes numbers. We will explain the physics underlying the influence of turbulence on droplet cluster behavior. This study supported by FOM/NWO Netherlands.
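The pair-counting step underlying the RDF can be sketched as follows; a full g(r) additionally normalizes these counts by the expectation for uniformly distributed particles of the same number density, so that clustering shows up as g(r) > 1 at small separations. The particle positions are toy values:

```python
import numpy as np

def pair_distance_histogram(pos, bins):
    """Histogram of pairwise particle distances: the raw ingredient of the
    radial distribution function g(r) before volume/density normalization."""
    n = len(pos)
    i, j = np.triu_indices(n, k=1)              # each unordered pair once
    d = np.linalg.norm(pos[i] - pos[j], axis=1)
    counts, edges = np.histogram(d, bins=bins)
    return counts, edges

# Two tight droplet pairs far apart: excess pair counts at small separation
pos = np.array([[0.0, 0.0, 0.0], [0.05, 0.0, 0.0],
                [10.0, 0.0, 0.0], [10.05, 0.0, 0.0]])
counts, edges = pair_distance_histogram(pos, bins=[0.0, 0.1, 5.0, 15.0])
```

Of the six pairs, the two intra-pair distances land in the smallest bin and the four inter-pair distances in the largest, the signature a normalized RDF would turn into a peak at small r.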

  19. ICAP: An Interactive Cluster Analysis Procedure for analyzing remotely sensed data. [to classify the radiance data to produce a thematic map

    NASA Technical Reports Server (NTRS)

    Wharton, S. W.

    1980-01-01

An Interactive Cluster Analysis Procedure (ICAP) was developed to derive classifier training statistics from remotely sensed data. The algorithm interfaces the rapid numerical processing capacity of a computer with the human ability to integrate qualitative information. Control of the clustering process alternates between the algorithm, which creates new centroids and forms clusters, and the analyst, who evaluates the clusters and may elect to modify their structure. Clusters can be deleted or lumped pairwise, or new centroids can be added. A summary of the cluster statistics can be requested to facilitate cluster manipulation. The ICAP was implemented in APL (A Programming Language), an interactive computer language. The flexibility of the algorithm was evaluated using data from different LANDSAT scenes to simulate two situations: one in which the analyst is assumed to have no prior knowledge about the data and wishes to have the clusters formed more or less automatically; and the other in which the analyst is assumed to have some knowledge about the data structure and wishes to use that information to closely supervise the clustering process. For comparison, an existing clustering method was also applied to the two data sets.

  20. Genome-scale cluster analysis of replicated microarrays using shrinkage correlation coefficient.

    PubMed

    Yao, Jianchao; Chang, Chunqi; Salmi, Mari L; Hung, Yeung Sam; Loraine, Ann; Roux, Stanley J

    2008-06-18

Currently, clustering with some form of correlation coefficient as the gene similarity metric has become a popular method for profiling genomic data. The Pearson correlation coefficient and the standard deviation (SD)-weighted correlation coefficient are the two most widely used correlations as the similarity metrics in clustering microarray data. However, these two correlations are not optimal for analyzing replicated microarray data generated by most laboratories. An effective correlation coefficient is needed to provide statistically sufficient analysis of replicated microarray data. In this study, we describe a novel correlation coefficient, shrinkage correlation coefficient (SCC), that fully exploits the similarity between the replicated microarray experimental samples. The methodology considers both the number of replicates and the variance within each experimental group in clustering expression data, and provides a robust statistical estimation of the error of replicated microarray data. The value of SCC is revealed by its comparison with two other correlation coefficients that are currently the most widely used (Pearson correlation coefficient and SD-weighted correlation coefficient) using statistical measures on both synthetic expression data as well as real gene expression data from Saccharomyces cerevisiae. Two leading clustering methods, hierarchical and k-means clustering were applied for the comparison. The comparison indicated that using SCC achieves better clustering performance. Applying SCC-based hierarchical clustering to the replicated microarray data obtained from germinating spores of the fern Ceratopteris richardii, we discovered two clusters of genes with shared expression patterns during spore germination. Functional analysis suggested that some of the genetic mechanisms that control germination in such diverse plant lineages as mosses and angiosperms are also conserved among ferns.
This study shows that SCC is an alternative to the Pearson correlation coefficient and the SD-weighted correlation coefficient, and is particularly useful for clustering replicated microarray data. This computational approach should be generally useful for proteomic data or other high-throughput analysis methodology.

  1. Hierarchical Bayesian modelling of gene expression time series across irregularly sampled replicates and clusters.

    PubMed

    Hensman, James; Lawrence, Neil D; Rattray, Magnus

    2013-08-20

Time course data from microarrays and high-throughput sequencing experiments require simple, computationally efficient and powerful statistical models to extract meaningful biological signal, and for tasks such as data fusion and clustering. Existing methodologies fail to capture either the temporal or replicated nature of the experiments, and often impose constraints on the data collection process, such as regularly spaced samples, or similar sampling schema across replications. We propose hierarchical Gaussian processes as a general model of gene expression time-series, with application to a variety of problems. In particular, we illustrate the method's capacity for missing data imputation, data fusion and clustering. The method can impute data which is missing both systematically and at random: in a hold-out test on real data, performance is significantly better than commonly used imputation methods. The method's ability to model inter- and intra-cluster variance leads to more biologically meaningful clusters. The approach removes the necessity for evenly spaced samples, an advantage illustrated on a developmental Drosophila dataset with irregular replications. The hierarchical Gaussian process model provides an excellent statistical basis for several gene-expression time-series tasks. It has only a few additional parameters over a regular GP, has negligible additional complexity, is easily implemented and can be integrated into several existing algorithms. Our experiments were implemented in python, and are available from the authors' website: http://staffwww.dcs.shef.ac.uk/people/J.Hensman/.
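The imputation mechanism — a Gaussian-process posterior mean evaluated at unobserved times — can be sketched for a single profile. This is a plain GP with an RBF kernel on made-up sample times; the paper's hierarchical cluster/replicate covariance structure is omitted:

```python
import numpy as np

def rbf(a, b, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior_mean(t_obs, y_obs, t_new, noise=1e-4):
    """GP regression posterior mean: the imputation used at missing times."""
    k = rbf(t_obs, t_obs) + noise * np.eye(len(t_obs))
    k_star = rbf(t_new, t_obs)
    return k_star @ np.linalg.solve(k, y_obs)

# Irregularly sampled expression profile; impute at an unobserved time
t = np.array([0.0, 0.7, 1.1, 2.5, 4.0])
y = np.sin(t)                     # stand-in for an expression trajectory
y_hat = gp_posterior_mean(t, y, np.array([2.0]))
```

Because the covariance is a function of time differences, nothing in this construction requires the samples to be evenly spaced, which is the property the abstract highlights.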

  2. Statistical framework and noise sensitivity of the amplitude radial correlation contrast method.

    PubMed

    Kipervaser, Zeev Gideon; Pelled, Galit; Goelman, Gadi

    2007-09-01

    A statistical framework for the amplitude radial correlation contrast (RCC) method, which integrates a conventional pixel threshold approach with cluster-size statistics, is presented. The RCC method uses functional MRI (fMRI) data to group neighboring voxels in terms of their degree of temporal cross correlation and compares coherences in different brain states (e.g., stimulation OFF vs. ON). By defining the RCC correlation map as the difference between two RCC images, the map distribution of two OFF states is shown to be normal, enabling the definition of the pixel cutoff. The empirical cluster-size null distribution obtained after the application of the pixel cutoff is used to define a cluster-size cutoff that allows 5% false positives. Assuming that the fMRI signal equals the task-induced response plus noise, an analytical expression of amplitude-RCC dependency on noise is obtained and used to define the pixel threshold. In vivo and ex vivo data obtained during rat forepaw electric stimulation are used to fine-tune this threshold. Calculating the spatial coherences within in vivo and ex vivo images shows enhanced coherence in the in vivo data, but no dependency on the anesthesia method, magnetic field strength, or depth of anesthesia, strengthening the generality of the proposed cutoffs. Copyright (c) 2007 Wiley-Liss, Inc.

  3. Extracting Galaxy Cluster Gas Inhomogeneity from X-Ray Surface Brightness: A Statistical Approach and Application to Abell 3667

    NASA Astrophysics Data System (ADS)

    Kawahara, Hajime; Reese, Erik D.; Kitayama, Tetsu; Sasaki, Shin; Suto, Yasushi

    2008-11-01

    Our previous analysis indicates that small-scale fluctuations in the intracluster medium (ICM) from cosmological hydrodynamic simulations follow the lognormal probability density function. In order to test the lognormal nature of the ICM directly against X-ray observations of galaxy clusters, we develop a method of extracting statistical information about the three-dimensional properties of the fluctuations from the two-dimensional X-ray surface brightness. We first create a set of synthetic clusters with lognormal fluctuations around their mean profile given by spherical isothermal β-models, later considering polytropic temperature profiles as well. Performing mock observations of these synthetic clusters, we find that the resulting X-ray surface brightness fluctuations also follow the lognormal distribution fairly well. Systematic analysis of the synthetic clusters provides an empirical relation between the three-dimensional density fluctuations and the two-dimensional X-ray surface brightness. We analyze Chandra observations of the galaxy cluster Abell 3667, and find that its X-ray surface brightness fluctuations follow the lognormal distribution. While the lognormal model was originally motivated by cosmological hydrodynamic simulations, this is the first observational confirmation of the lognormal signature in a real cluster. Finally we check the synthetic cluster results against clusters from cosmological hydrodynamic simulations. As a result of the complex structure exhibited by simulated clusters, the empirical relation between the two- and three-dimensional fluctuation properties calibrated with synthetic clusters when applied to simulated clusters shows large scatter. Nevertheless we are able to reproduce the true value of the fluctuation amplitude of simulated clusters within a factor of 2 from their two-dimensional X-ray surface brightness alone. Our current methodology combined with existing observational data is useful in describing and inferring the statistical properties of the three-dimensional inhomogeneity in galaxy clusters.
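
    The lognormal signature can be probed by comparing skewness before and after taking logarithms: lognormal fluctuations are strongly right-skewed, while their logs should be close to symmetric. A toy sketch with synthetic fluctuations (the generator and the scale parameter are our choices, not values from the paper):

    ```python
    import math
    import random
    import statistics

    def skewness(xs):
        # Sample skewness: third standardized moment.
        m = statistics.fmean(xs)
        s = statistics.pstdev(xs)
        return sum(((x - m) / s) ** 3 for x in xs) / len(xs)

    # Lognormal density fluctuations: delta = exp(g) with g Gaussian, so
    # log(delta) should look Gaussian while delta itself is right-skewed.
    rng = random.Random(1)
    delta = [math.exp(rng.gauss(0.0, 0.5)) for _ in range(20000)]
    raw_skew = skewness(delta)                         # clearly positive
    log_skew = skewness([math.log(d) for d in delta])  # near zero
    ```

    The same comparison applied to observed surface brightness fluctuations is one quick diagnostic of the lognormality tested in the paper.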

  4. R package to estimate intracluster correlation coefficient with confidence interval for binary data.

    PubMed

    Chakraborty, Hrishikesh; Hossain, Akhtar

    2018-03-01

    The Intracluster Correlation Coefficient (ICC) is a major parameter of interest in cluster randomized trials that measures the degree to which responses within the same cluster are correlated. Several types of ICC estimators and their confidence intervals (CIs) have been suggested in the literature for binary data. Studies have compared the relative weaknesses and advantages of ICC estimators and their CIs for binary data and suggested situations where one is advantageous in practical research. Commonly used statistical computing systems currently facilitate estimation of only a very few variants of the ICC and its CI. To address the limitations of current statistical packages, we developed an R package, ICCbin, to facilitate estimating the ICC and its CI for binary responses using different methods. The ICCbin package is designed to provide estimates of the ICC in 16 different ways, including analysis of variance methods, moments-based estimation, direct probabilistic methods, correlation-based estimation, and a resampling method. The CI of the ICC is estimated using 5 different methods. The package also generates cluster binary data using an exchangeable correlation structure. ICCbin provides two functions for users: rcbin() generates cluster binary data, and iccbin() estimates the ICC and its CI. Users can choose the appropriate ICC and CI estimates from the wide selection in the outputs. The R package ICCbin offers a flexible and easy-to-use way to generate cluster binary data and to estimate the ICC and its CI for binary responses using different methods. The package ICCbin is freely available for use with R from the CRAN repository (https://cran.r-project.org/package=ICCbin). We believe that this package can be a very useful tool for researchers designing cluster randomized trials with binary outcomes. Copyright © 2017 Elsevier B.V. All rights reserved.
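
    One of the analysis-of-variance estimators mentioned here is compact enough to sketch. This is the generic textbook one-way ANOVA ICC for unequal cluster sizes, written in Python for illustration; it is not code from the ICCbin package:

    ```python
    def icc_anova(clusters):
        # clusters: list of lists of 0/1 responses, one inner list per cluster.
        k = len(clusters)
        N = sum(len(c) for c in clusters)
        grand = sum(sum(c) for c in clusters) / N
        # Between- and within-cluster mean squares (one-way ANOVA on 0/1 data).
        msb = sum(len(c) * (sum(c) / len(c) - grand) ** 2
                  for c in clusters) / (k - 1)
        msw = sum(sum((x - sum(c) / len(c)) ** 2 for x in c)
                  for c in clusters) / (N - k)
        # Adjusted average cluster size for unequal clusters.
        m0 = (N - sum(len(c) ** 2 for c in clusters) / N) / (k - 1)
        return (msb - msw) / (msb + (m0 - 1) * msw)
    ```

    Perfectly homogeneous clusters (every response within a cluster identical) give an ICC of 1, and the estimate shrinks toward 0 as within-cluster disagreement grows.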

  5. Structural parameters of young star clusters: fractal analysis

    NASA Astrophysics Data System (ADS)

    Hetem, A.

    2017-07-01

    A unified view of star formation in the Universe demands detailed and in-depth studies of young star clusters. This work extends our previous study of fractal statistics estimated for a sample of young stellar clusters (Gregorio-Hetem et al. 2015, MNRAS 448, 2504). The structural properties can lead to significant conclusions about the early stages of cluster formation: 1) virial conditions can be used to distinguish warm collapse; 2) bound or unbound behaviour can lead to conclusions about expansion; and 3) fractal statistics are correlated with dynamical evolution and age. The error-bar estimation technique most used in the literature is to adopt inferential methods (like bootstrap) to estimate deviation and variance, which are valid only for an artificially generated cluster. In this paper, we expanded the number of studied clusters in order to enhance the investigation of cluster properties and dynamical evolution. The structural parameters were compared with fractal statistics and reveal that, in the clusters' radial density profiles, the mean separation of the stars tends to increase with the average surface density. The sample can be divided into two groups showing different dynamic behaviour, but they share the same dynamical evolution, since the entire sample was revealed to consist of expanding objects, for which the substructures do not seem to have been completely erased. These results are in agreement with simulations adopting low surface densities and supervirial conditions.

  6. Optimization method of superpixel analysis for multi-contrast Jones matrix tomography (Conference Presentation)

    NASA Astrophysics Data System (ADS)

    Miyazawa, Arata; Hong, Young-Joo; Makita, Shuichi; Kasaragod, Deepa K.; Miura, Masahiro; Yasuno, Yoshiaki

    2017-02-01

    Local statistics are widely utilized for quantification and image processing of OCT. For example, the local mean is used to reduce speckle, and the local variation of the polarization state (degree of polarization uniformity, DOPU) is used to visualize melanin. Conventionally, these statistics are calculated in a rectangular kernel whose size is uniform over the image. However, the fixed size and shape of the kernel result in a tradeoff between image sharpness and statistical accuracy. A superpixel is a cluster of pixels generated by grouping image pixels based on spatial proximity and similarity of signal values. Superpixels have variable sizes and flexible shapes which preserve the tissue structure. Here we demonstrate a new superpixel method which is tailored for multifunctional Jones matrix OCT (JM-OCT). This new method forms superpixels by clustering image pixels in a six-dimensional (6-D) feature space (two spatial dimensions and four dimensions of optical features). All image pixels are clustered based on their spatial proximity and optical feature similarity. The optical features are scattering, OCT-A, birefringence and DOPU. The method is applied to retinal OCT. The generated superpixels preserve tissue structures such as retinal layers, sclera, vessels, and the retinal pigment epithelium. Hence, a superpixel can be utilized as a local statistics kernel that is more suitable than a uniform rectangular kernel. The superpixelized image can also be used for further image processing and analysis; since it reduces the number of pixels to be analyzed, it reduces the computational cost of such processing.
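
    The joint spatial-plus-feature clustering can be illustrated with a single SLIC-style assignment step: each pixel joins the nearest seed under a distance that mixes feature similarity with spatial proximity. A toy sketch in which pixels carry one feature instead of the four optical features, and the compactness weight is our choice:

    ```python
    def assign_superpixels(pixels, seeds, weight=0.01):
        # pixels/seeds: tuples (x, y, *features). Each pixel is assigned to the
        # seed nearest in a joint distance; `weight` trades spatial compactness
        # against feature purity (our parameter, not JM-OCT's).
        labels = []
        for p in pixels:
            def joint(s):
                d_sp = (p[0] - s[0]) ** 2 + (p[1] - s[1]) ** 2
                d_ft = sum((a - b) ** 2 for a, b in zip(p[2:], s[2:]))
                return d_ft + weight * d_sp
            labels.append(min(range(len(seeds)), key=lambda i: joint(seeds[i])))
        return labels
    ```

    With a small weight, a pixel that sits spatially near one seed but shares the other seed's signal value follows the signal, which is how superpixel boundaries come to trace tissue structure instead of a fixed grid.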

  7. Wildfire cluster detection using space-time scan statistics

    NASA Astrophysics Data System (ADS)

    Tonini, M.; Tuia, D.; Ratle, F.; Kanevski, M.

    2009-04-01

    The aim of the present study is to identify spatio-temporal clusters of fire sequences using space-time scan statistics. These statistical methods are specifically designed to detect clusters and assess their significance. Basically, scan statistics work by comparing the set of events occurring inside a scanning window (or a space-time cylinder for spatio-temporal data) with those that lie outside. Windows of increasing size scan the zone across space and time: the likelihood ratio is calculated for each window (comparing the ratio of observed cases over expected inside and outside), and the window with the maximum value is taken as the most probable cluster, and so on. Under the null hypothesis of spatial and temporal randomness, these events are distributed according to a known discrete-state random process (Poisson or Bernoulli) whose parameters can be estimated. Given this assumption, it is possible to test whether or not the null hypothesis holds in a specific area. In order to deal with fire data, the space-time permutation scan statistic has been applied, since it does not require explicit specification of the population at risk in each cylinder. The case study is daily fire detection in Florida using the Moderate Resolution Imaging Spectroradiometer (MODIS) active fire product during the period 2003-2006. As a result, statistically significant clusters have been identified. Performing the analyses over the entire time frame, three out of the five most likely clusters were identified in the forest areas in the north of the state; the other two clusters cover a large zone in the south, corresponding to agricultural land and the prairies in the Everglades. Furthermore, the analyses have been performed separately for each of the four years to assess whether the wildfires recur each year during the same period. It emerges that clusters of forest fires are more frequent in the hot seasons (spring and summer), while in the southern areas they are present throughout the whole year. Analysis of the fire distribution to evaluate whether fires are statistically more frequent in some areas and/or some periods of the year can be useful to support fire management and to focus prevention measures.
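
    The window comparison described above can be made concrete with Kulldorff's Poisson likelihood ratio; a minimal sketch with toy counts (the function names and numbers are ours, and the Monte Carlo significance step is omitted):

    ```python
    import math

    def poisson_llr(c_in, e_in, c_total):
        # Kulldorff-style Poisson log-likelihood ratio for one scanning window:
        # compare observed vs expected counts inside against the complement
        # outside the window.
        if c_in <= e_in:
            return 0.0  # only an excess inside counts as a candidate cluster
        c_out, e_out = c_total - c_in, c_total - e_in
        llr = c_in * math.log(c_in / e_in)
        if c_out > 0:
            llr += c_out * math.log(c_out / e_out)
        return llr

    def most_likely_cluster(windows, c_total):
        # windows: iterable of (window_id, observed_inside, expected_inside).
        # The window maximizing the likelihood ratio is the most probable
        # cluster; in practice its p-value comes from Monte Carlo replication.
        return max(windows, key=lambda w: poisson_llr(w[1], w[2], c_total))
    ```

    A window with 30 observed against 10 expected dominates one with a mild excess, matching the intuition that the scan keeps the strongest anomaly.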

  8. A comparison of performance of automatic cloud coverage assessment algorithm for Formosat-2 image using clustering-based and spatial thresholding methods

    NASA Astrophysics Data System (ADS)

    Hsu, Kuo-Hsien

    2012-11-01

    A Formosat-2 image is a kind of high-spatial-resolution (2-meter GSD) remote sensing satellite data, which includes one panchromatic band and four multispectral bands (blue, green, red, near-infrared). An essential step in the daily processing of received Formosat-2 images is to estimate the cloud statistic of an image using the Automatic Cloud Coverage Assessment (ACCA) algorithm. The cloud statistic of the image is subsequently recorded as important metadata for the image product catalog. In this paper, we propose an ACCA method with two consecutive stages: pre-processing and post-processing analysis. For pre-processing analysis, unsupervised K-means classification, Sobel's method, a thresholding method, non-cloudy pixel reexamination, and a cross-band filter method are implemented in sequence for cloud statistic determination. For post-processing analysis, the box-counting fractal method is implemented. In other words, the cloud statistic is first determined via pre-processing analysis, and the correctness of the cloud statistic for the different spectral bands is then cross-examined qualitatively and quantitatively via post-processing analysis. The selection of an appropriate thresholding method is critical to the result of the ACCA method. Therefore, in this work, we first conduct a series of experiments on clustering-based and spatial thresholding methods, including Otsu's, Local Entropy (LE), Joint Entropy (JE), Global Entropy (GE), and Global Relative Entropy (GRE) methods, for performance comparison. The results show that Otsu's and the GE methods both perform better than the others for Formosat-2 images. Additionally, our proposed ACCA method, using Otsu's method as the thresholding method, successfully extracted the cloudy pixels of Formosat-2 images for accurate cloud statistic estimation.
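
    Otsu's method, which the comparison favours, picks the threshold that maximizes between-class variance over a histogram of pixel values. A generic plain-Python sketch (the binning scheme and parameter names are ours):

    ```python
    def otsu_threshold(values, n_bins=256):
        # Histogram the values, then test every bin boundary and keep the one
        # maximizing between-class variance (equivalently, minimizing the
        # within-class variance of the two resulting classes).
        lo, hi = min(values), max(values)
        width = (hi - lo) / n_bins or 1.0
        hist = [0] * n_bins
        for v in values:
            hist[min(int((v - lo) / width), n_bins - 1)] += 1
        total = len(values)
        total_sum = sum(i * h for i, h in enumerate(hist))
        best_t, best_var, w0, sum0 = 0, -1.0, 0, 0.0
        for t in range(n_bins):
            w0 += hist[t]
            if w0 == 0:
                continue
            w1 = total - w0
            if w1 == 0:
                break
            sum0 += t * hist[t]
            mu0, mu1 = sum0 / w0, (total_sum - sum0) / w1
            between = w0 * w1 * (mu0 - mu1) ** 2
            if between > best_var:
                best_var, best_t = between, t
        return lo + (best_t + 1) * width  # threshold in original units
    ```

    On a cleanly bimodal band (e.g., bright cloud against darker land), the returned threshold falls between the two modes, separating candidate cloudy pixels from the background.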

  9. A ground truth based comparative study on clustering of gene expression data.

    PubMed

    Zhu, Yitan; Wang, Zuyi; Miller, David J; Clarke, Robert; Xuan, Jianhua; Hoffman, Eric P; Wang, Yue

    2008-05-01

    Given the variety of available clustering methods for gene expression data analysis, it is important to develop an appropriate and rigorous validation scheme to assess the performance and limitations of the most widely used clustering algorithms. In this paper, we present a ground truth based comparative study on the functionality, accuracy, and stability of five data clustering methods, namely hierarchical clustering, K-means clustering, self-organizing maps, standard finite normal mixture fitting, and a caBIG toolkit (VIsual Statistical Data Analyzer--VISDA), tested on sample clustering of seven published microarray gene expression datasets and one synthetic dataset. We examined the performance of these algorithms in both data-sufficient and data-insufficient cases using quantitative performance measures, including cluster number detection accuracy and mean and standard deviation of partition accuracy. The experimental results showed that VISDA, an interactive coarse-to-fine maximum likelihood fitting algorithm, is a solid performer on most of the datasets, while K-means clustering and self-organizing maps optimized by the mean squared compactness criterion generally produce more stable solutions than the other methods.

  10. State estimation and prediction using clustered particle filters.

    PubMed

    Lee, Yoonsang; Majda, Andrew J

    2016-12-20

    Particle filtering is an essential tool to improve uncertain model predictions by incorporating noisy observational data from complex systems including non-Gaussian features. A class of particle filters, clustered particle filters, is introduced for high-dimensional nonlinear systems, which uses relatively few particles compared with the standard particle filter. The clustered particle filter captures non-Gaussian features of the true signal, which are typical in complex nonlinear dynamical systems such as geophysical systems. The method is also robust in the difficult regime of high-quality sparse and infrequent observations. The key features of the clustered particle filtering are coarse-grained localization through the clustering of the state variables and particle adjustment to stabilize the method; each observation affects only neighbor state variables through clustering and particles are adjusted to prevent particle collapse due to high-quality observations. The clustered particle filter is tested for the 40-dimensional Lorenz 96 model with several dynamical regimes including strongly non-Gaussian statistics. The clustered particle filter shows robust skill in both achieving accurate filter results and capturing non-Gaussian statistics of the true signal. It is further extended to multiscale data assimilation, which provides the large-scale estimation by combining a cheap reduced-order forecast model and mixed observations of the large- and small-scale variables. This approach enables the use of a larger number of particles due to the computational savings in the forecast model. The multiscale clustered particle filter is tested for one-dimensional dispersive wave turbulence using a forecast model with model errors.
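
    The forecast-weight-resample cycle underlying any bootstrap-style particle filter can be sketched for a scalar state. This is the generic filter only, without the clustering-based localization and particle adjustment that define the authors' method; all names and noise levels are ours:

    ```python
    import math
    import random

    def pf_step(particles, obs, obs_noise, forecast, rng):
        # One assimilation cycle: (1) forecast each particle with the model,
        # (2) weight by the Gaussian observation likelihood, (3) resample so
        # the ensemble concentrates on likely states.
        forecasted = [forecast(p, rng) for p in particles]
        weights = [math.exp(-0.5 * ((obs - f) / obs_noise) ** 2)
                   for f in forecasted]
        total = sum(weights) or 1.0  # guard against total-weight underflow
        return rng.choices(forecasted,
                           weights=[w / total for w in weights],
                           k=len(particles))
    ```

    Repeated cycles pull an initially misplaced ensemble onto the observed state; the collapse onto a few particles under high-quality observations, visible even in this toy, is exactly what the paper's particle adjustment is designed to prevent.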

  11. State estimation and prediction using clustered particle filters

    PubMed Central

    Lee, Yoonsang; Majda, Andrew J.

    2016-01-01

    Particle filtering is an essential tool to improve uncertain model predictions by incorporating noisy observational data from complex systems including non-Gaussian features. A class of particle filters, clustered particle filters, is introduced for high-dimensional nonlinear systems, which uses relatively few particles compared with the standard particle filter. The clustered particle filter captures non-Gaussian features of the true signal, which are typical in complex nonlinear dynamical systems such as geophysical systems. The method is also robust in the difficult regime of high-quality sparse and infrequent observations. The key features of the clustered particle filtering are coarse-grained localization through the clustering of the state variables and particle adjustment to stabilize the method; each observation affects only neighbor state variables through clustering and particles are adjusted to prevent particle collapse due to high-quality observations. The clustered particle filter is tested for the 40-dimensional Lorenz 96 model with several dynamical regimes including strongly non-Gaussian statistics. The clustered particle filter shows robust skill in both achieving accurate filter results and capturing non-Gaussian statistics of the true signal. It is further extended to multiscale data assimilation, which provides the large-scale estimation by combining a cheap reduced-order forecast model and mixed observations of the large- and small-scale variables. This approach enables the use of a larger number of particles due to the computational savings in the forecast model. The multiscale clustered particle filter is tested for one-dimensional dispersive wave turbulence using a forecast model with model errors. PMID:27930332

  12. Accounting for noise when clustering biological data.

    PubMed

    Sloutsky, Roman; Jimenez, Nicolas; Swamidass, S Joshua; Naegle, Kristen M

    2013-07-01

    Clustering is a powerful and commonly used technique that organizes and elucidates the structure of biological data. Clustering data from gene expression, metabolomics and proteomics experiments has proven useful for deriving a variety of insights, such as the shared regulation or function of biochemical components within networks. However, experimental measurements of biological processes are subject to substantial noise, stemming from both technical and biological variability, and most clustering algorithms are sensitive to this noise. In this article, we explore several methods of accounting for noise when analyzing biological data sets through clustering. Using a toy data set and two different case studies, gene expression and protein phosphorylation, we demonstrate the sensitivity of clustering algorithms to noise. Several methods of accounting for this noise can be used to establish when clustering results can be trusted. These methods span a range of assumptions about the statistical properties of the noise and can therefore be applied to virtually any biological data source.
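
    One common way to establish when clustering results can be trusted is to re-cluster many noise-perturbed copies of the data and keep only pairings that persist. A toy consensus sketch (the minimal k-means, the Gaussian noise model and all parameters are our choices, not the article's specific methods):

    ```python
    import random

    def two_means(points, iters=15):
        # Minimal k-means with k=2 on equal-length numeric vectors.
        def d2(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b))
        c0, c1 = points[0], points[-1]
        labels = [0] * len(points)
        for _ in range(iters):
            labels = [0 if d2(p, c0) <= d2(p, c1) else 1 for p in points]
            for k in (0, 1):
                grp = [p for p, l in zip(points, labels) if l == k]
                if grp:
                    cen = [sum(col) / len(grp) for col in zip(*grp)]
                    if k == 0:
                        c0 = cen
                    else:
                        c1 = cen
        return labels

    def consensus_matrix(profiles, noise_sd, n_draws=50, seed=0):
        # Re-cluster many noise-perturbed copies of the data and record how
        # often each pair of items lands in the same cluster; stable pairs
        # stay near 1, noise-driven pairings drop toward 0.
        rng = random.Random(seed)
        n = len(profiles)
        together = [[0] * n for _ in range(n)]
        for _ in range(n_draws):
            noisy = [[v + rng.gauss(0.0, noise_sd) for v in p]
                     for p in profiles]
            labels = two_means(noisy)
            for i in range(n):
                for j in range(n):
                    if labels[i] == labels[j]:
                        together[i][j] += 1
        return [[c / n_draws for c in row] for row in together]
    ```

    Reading the matrix row-wise shows which co-memberships survive the measurement noise and which are artifacts of a single noisy realization.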

  13. Identifying Clusters of Active Transportation Using Spatial Scan Statistics

    PubMed Central

    Huang, Lan; Stinchcomb, David G.; Pickle, Linda W.; Dill, Jennifer; Berrigan, David

    2009-01-01

    Background There is an intense interest in the possibility that neighborhood characteristics influence active transportation such as walking or biking. The purpose of this paper is to illustrate how a spatial cluster identification method can evaluate the geographic variation of active transportation and identify neighborhoods with unusually high/low levels of active transportation. Methods Self-reported walking/biking prevalence, demographic characteristics, street connectivity variables, and neighborhood socioeconomic data were collected from respondents to the 2001 California Health Interview Survey (CHIS; N=10,688) in Los Angeles County (LAC) and San Diego County (SDC). Spatial scan statistics were used to identify clusters of high or low prevalence (with and without age-adjustment) and the quantity of time spent walking and biking. The data, a subset from the 2001 CHIS, were analyzed in 2007–2008. Results Geographic clusters of significantly high or low prevalence of walking and biking were detected in LAC and SDC. Structural variables such as street connectivity and shorter block lengths are consistently associated with higher levels of active transportation, but associations between active transportation and socioeconomic variables at the individual and neighborhood levels are mixed. Only one cluster with less time spent walking and biking among walkers/bikers was detected in LAC, and this was of borderline significance. Age-adjustment affects the clustering pattern of walking/biking prevalence in LAC, but not in SDC. Conclusions The use of spatial scan statistics to identify significant clustering of health behaviors such as active transportation adds to the more traditional regression analysis that examines associations between behavior and environmental factors by identifying specific geographic areas with unusual levels of the behavior independent of predefined administrative units. PMID:19589451

  14. Could the clinical interpretability of subgroups detected using clustering methods be improved by using a novel two-stage approach?

    PubMed

    Kent, Peter; Stochkendahl, Mette Jensen; Christensen, Henrik Wulff; Kongsted, Alice

    2015-01-01

    Recognition of homogeneous subgroups of patients can usefully improve prediction of their outcomes and the targeting of treatment. There are a number of research approaches that have been used to recognise homogeneity in such subgroups and to test their implications. One approach is to use statistical clustering techniques, such as Cluster Analysis or Latent Class Analysis, to detect latent relationships between patient characteristics. Influential patient characteristics can come from diverse domains of health, such as pain, activity limitation, physical impairment, social role participation, psychological factors, biomarkers and imaging. However, such 'whole person' research may result in data-driven subgroups that are complex, difficult to interpret and challenging to recognise clinically. This paper describes a novel approach to applying statistical clustering techniques that may improve the clinical interpretability of derived subgroups and reduce sample size requirements. This approach involves clustering in two sequential stages. The first stage involves clustering within health domains and therefore requires creating as many clustering models as there are health domains in the available data. This first stage produces scoring patterns within each domain. The second stage involves clustering using the scoring patterns from each health domain (from the first stage) to identify subgroups across all domains. We illustrate this using chest pain data from the baseline presentation of 580 patients. The new two-stage clustering resulted in two subgroups that approximated the classic textbook descriptions of musculoskeletal chest pain and atypical angina chest pain. The traditional single-stage clustering resulted in five clusters that were also clinically recognisable but displayed less distinct differences. In this paper, a new approach to using clustering techniques to identify clinically useful subgroups of patients is suggested. This approach has potential benefits but requires broad testing, in multiple patient samples, to determine its clinical value; research designs, statistical methods and outcome metrics suitable for performing that testing are also described. The usefulness of the approach is likely to be context-specific, depending on the characteristics of the available data and the research question being asked of it.
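
    The two-stage recipe (cluster within each health domain, then cluster patients on their per-domain scoring patterns) can be sketched as follows. Here stage 1 is a minimal 1-D 2-means and stage 2 simply groups identical label patterns, a deliberate simplification of the latent-class machinery the paper uses:

    ```python
    def two_means_1d(values, iters=20):
        # Minimal 1-D k-means with k=2; returns a 0/1 label per value.
        c0, c1 = min(values), max(values)
        labels = [0] * len(values)
        for _ in range(iters):
            labels = [0 if abs(v - c0) <= abs(v - c1) else 1 for v in values]
            g0 = [v for v, l in zip(values, labels) if l == 0] or [c0]
            g1 = [v for v, l in zip(values, labels) if l == 1] or [c1]
            c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
        return labels

    def two_stage_cluster(domains):
        # domains: dict of health-domain name -> patient scores (same order).
        # Stage 1: cluster each health domain separately.
        stage1 = {name: two_means_1d(scores) for name, scores in domains.items()}
        # Stage 2: cluster patients on their per-domain label patterns; here
        # patients with identical patterns form a subgroup.
        names = sorted(stage1)
        patterns = list(zip(*(stage1[n] for n in names)))
        unique = sorted(set(patterns))
        return [unique.index(p) for p in patterns]
    ```

    Because stage 2 sees only one label per domain rather than every raw variable, the resulting subgroups are described in domain-level terms ("high pain, high psychological distress"), which is the interpretability gain the paper is after.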

  15. A cluster-based approach to selecting representative stimuli from the International Affective Picture System (IAPS) database.

    PubMed

    Constantinescu, Alexandra C; Wolters, Maria; Moore, Adam; MacPherson, Sarah E

    2017-06-01

    The International Affective Picture System (IAPS; Lang, Bradley, & Cuthbert, 2008) is a stimulus database that is frequently used to investigate various aspects of emotional processing. Despite its extensive use, selecting IAPS stimuli for a research project is not usually done according to an established strategy, but rather is tailored to individual studies. Here we propose a standard, replicable method for stimulus selection based on cluster analysis, which re-creates the group structure that is most likely to have produced the valence, arousal, and dominance norms associated with the IAPS images. Our method includes screening the database for outliers, identifying a suitable clustering solution, and then extracting the desired number of stimuli on the basis of their level of certainty of belonging to the cluster they were assigned to. Our method preserves statistical power in studies by maximizing the likelihood that the stimuli belong to the cluster structure fitted to them, and by filtering stimuli according to their certainty of cluster membership. In addition, although our cluster-based method is illustrated using the IAPS, it can be extended to other stimulus databases.
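
    The extraction step, keeping the stimuli that most certainly belong to their assigned cluster, can be sketched given per-stimulus distances to cluster centres. The softmax certainty below is our stand-in for the model-based membership probabilities the authors derive; names and numbers are illustrative:

    ```python
    import math

    def select_by_certainty(distances, n_per_cluster):
        # distances: per-item lists of distances to each cluster centre.
        # Certainty of membership: softmax over negative distances, so a
        # stimulus close to one centre and far from the others scores near 1.
        chosen = {k: [] for k in range(len(distances[0]))}
        for idx, d in enumerate(distances):
            weights = [math.exp(-x) for x in d]
            s = sum(weights)
            probs = [w / s for w in weights]
            k = max(range(len(probs)), key=probs.__getitem__)
            chosen[k].append((probs[k], idx))
        # Keep the n most certain stimuli within each cluster.
        return {k: [i for _, i in sorted(v, reverse=True)[:n_per_cluster]]
                for k, v in chosen.items()}
    ```

    Filtering to high-certainty members is what keeps the selected stimulus sets well separated in valence/arousal/dominance space.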

  16. Model-based clustering for RNA-seq data.

    PubMed

    Si, Yaqing; Liu, Peng; Li, Pinghua; Brutnell, Thomas P

    2014-01-15

    RNA-seq technology has been widely adopted as an attractive alternative to microarray-based methods to study global gene expression. However, robust statistical tools to analyze these complex datasets are still lacking. By grouping genes with similar expression profiles across treatments, cluster analysis provides insight into gene functions and networks, and hence is an important technique for RNA-seq data analysis. In this manuscript, we derive clustering algorithms based on appropriate probability models for RNA-seq data. An expectation-maximization (EM) algorithm and two stochastic variants of the EM algorithm are described. In addition, a strategy for initialization based on likelihood is proposed to improve the clustering algorithms. Moreover, we present a model-based hybrid-hierarchical clustering method to generate a tree structure that allows visualization of relationships among clusters as well as flexibility in choosing the number of clusters. Results from both simulation studies and analysis of a maize RNA-seq dataset show that our proposed methods provide better clustering results than alternative methods such as the K-means algorithm and hierarchical clustering methods that are not based on probability models. An R package, MBCluster.Seq, has been developed to implement our proposed algorithms. This R package provides fast computation and is publicly available at http://www.r-project.org
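
    The EM iteration for a Poisson mixture, the simplest probability model appropriate for count data such as RNA-seq, can be sketched as follows. This generic two-component version is our illustration, not the MBCluster.Seq implementation (which models per-gene, per-treatment effects and offers stochastic EM variants):

    ```python
    import math

    def poisson_logpmf(k, lam):
        # log P(X = k) for X ~ Poisson(lam).
        return k * math.log(lam) - lam - math.lgamma(k + 1)

    def em_poisson_mixture(counts, lambdas, iters=50):
        # EM for a K-component Poisson mixture: the E-step computes each
        # count's responsibilities, the M-step re-estimates rates and weights.
        K = len(lambdas)
        weights = [1.0 / K] * K
        for _ in range(iters):
            resp = []
            for x in counts:
                num = [w * math.exp(poisson_logpmf(x, l))
                       for w, l in zip(weights, lambdas)]
                s = sum(num) or 1e-300  # guard against underflow
                resp.append([n / s for n in num])
            for k in range(K):
                nk = sum(r[k] for r in resp)
                weights[k] = nk / len(counts)
                lambdas[k] = sum(r[k] * x for r, x in zip(resp, counts)) / nk
        return lambdas, weights
    ```

    With well-separated groups of counts, the fitted rates converge to the group means and the responsibilities give soft cluster assignments, which is the core of model-based clustering.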

  17. Network Data: Statistical Theory and New Models

    DTIC Science & Technology

    2016-02-17

    During this period of review, Bin Yu worked on many thrusts of high-dimensional statistical theory and methodology. Her research covered a wide range of topics in statistics, including analysis and methods for spectral clustering for sparse and structured networks [2,7,8,21], sparse modeling (e.g., Lasso) [4,10,11,17,18,19], statistical guarantees for the EM algorithm [3], and statistical analysis of algorithm leveraging.

  18. A scoping review of spatial cluster analysis techniques for point-event data.

    PubMed

    Fritz, Charles E; Schuurman, Nadine; Robertson, Colin; Lear, Scott

    2013-05-01

    Spatial cluster analysis is a uniquely interdisciplinary endeavour, and so it is important to communicate and disseminate ideas, innovations, best practices and challenges across practitioners, applied epidemiology researchers and spatial statisticians. In this research we conducted a scoping review to systematically search peer-reviewed journal databases for research that has employed spatial cluster analysis methods on individual-level, address location, or x and y coordinate derived data. To illustrate the thematic issues raised by our results, methods were tested using a dataset where known clusters existed. Point pattern methods, spatial clustering and cluster detection tests, and a locally weighted spatial regression model were most commonly used for individual-level, address location data (n = 29). The spatial scan statistic was the most popular method for address location data (n = 19). Six themes were identified relating to the application of spatial cluster analysis methods and subsequent analyses, which we recommend researchers consider: exploratory analysis, visualization, spatial resolution, aetiology, scale and spatial weights. It is our intention that researchers seeking direction for using spatial cluster analysis methods consider the caveats and strengths of each approach, but also explore the numerous other methods available for this type of analysis. Applied spatial epidemiology researchers and practitioners should give special consideration to applying multiple tests to a dataset. Future research should focus on developing frameworks for selecting appropriate methods and the corresponding spatial weighting schemes.

  19. Familial clustering of overweight and obesity among schoolchildren in northern China

    PubMed Central

    Li, Zengning; Luo, Bin; Du, Limei; Hu, Huanyu; Xie, Ying

    2014-01-01

    Background: We aimed to study the prevalence of overweight and obesity and to assess its familial clustering among schoolchildren in northern China. Methods: A cross-sectional study was conducted on 95,292 schoolchildren in northern China to investigate the prevalence of overweight and obesity. A group of overweight and obese children (n = 450) was selected using a cluster sampling method. Answers from a questionnaire on their and their families’ nutrition and behaviors were recorded and analyzed statistically. Results: The prevalence of overweight and obesity in schoolchildren was 27.4% and 13.2%, respectively. Both overweight and obesity were significantly more prevalent in boys than in girls. The prevalence of familial clustering of overweight and obesity was 75.3% and 20.3%, respectively. The prevalence of overweight in first-generation (parents) and second-generation (grandparents) relatives was 54.6% and 53.1%, respectively. There was a linear trend correlating age with the rates of overweight and obesity. The association between familial clustering of obesity and family income reached statistical significance. Conclusion: The prevalence of overweight and obesity was extremely high, especially among boys and their fathers. Evidence of familial clustering of overweight and obesity among schoolchildren and their parental family members in northern China is emerging. PMID:25664106

  20. Selection of the Maximum Spatial Cluster Size of the Spatial Scan Statistic by Using the Maximum Clustering Set-Proportion Statistic.

    PubMed

    Ma, Yue; Yin, Fei; Zhang, Tao; Zhou, Xiaohua Andrew; Li, Xiaosong

    2016-01-01

    Spatial scan statistics are widely used in various fields. The performance of these statistics is influenced by parameters, such as maximum spatial cluster size, and can be improved by parameter selection using performance measures. Current performance measures are based on the presence of clusters and are thus inapplicable to data sets without known clusters. In this work, we propose a novel overall performance measure called maximum clustering set-proportion (MCS-P), which is based on the likelihood of the union of detected clusters and the applied dataset. MCS-P was compared with existing performance measures in a simulation study to select the maximum spatial cluster size. Results of other performance measures, such as sensitivity and misclassification, suggest that the spatial scan statistic achieves accurate results in most scenarios with the maximum spatial cluster sizes selected using MCS-P. Given that previously known clusters are not required in the proposed strategy, selection of the optimal maximum cluster size with MCS-P can improve the performance of the scan statistic in applications without identified clusters.

  2. Performance map of a cluster detection test using extended power

    PubMed Central

    2013-01-01

    Background Conventional power studies possess limited ability to assess the performance of cluster detection tests. In particular, they cannot evaluate the accuracy of the cluster location, which is essential in such assessments. Furthermore, they usually estimate power for one or a few particular alternative hypotheses and thus cannot assess performance over an entire region. Takahashi and Tango developed the concept of extended power that indicates both the rate of null hypothesis rejection and the accuracy of the cluster location. We propose a systematic assessment method, using extended power, to produce a map showing the performance of cluster detection tests over an entire region. Methods To explore the behavior of a cluster detection test on identical cluster types at any possible location, we successively applied four different sets of spatial and epidemiological parameters. These parameter sets determined four cluster collections, each covering the entire study region. We simulated 1,000 datasets for each cluster and analyzed them with Kulldorff’s spatial scan statistic. From the area under the extended power curve, we constructed a map for each parameter set showing the performance of the test across the entire region. Results Consistent with previous studies, the performance of the spatial scan statistic increased with the baseline incidence of disease, the size of the at-risk population and the strength of the cluster (i.e., the relative risk). Performance was heterogeneous, however, even for very similar clusters (i.e., similar with respect to the aforementioned factors), suggesting the influence of other factors. Conclusions The area under the extended power curve is a single measure of performance and, although needing further exploration, it is suitable to conduct a systematic spatial evaluation of performance. The performance map we propose enables epidemiologists to assess cluster detection tests across an entire study region. PMID:24156765

  3. Methodological approaches in analysing observational data: A practical example on how to address clustering and selection bias.

    PubMed

    Trutschel, Diana; Palm, Rebecca; Holle, Bernhard; Simon, Michael

    2017-11-01

    Because not every scientific question on effectiveness can be answered with randomised controlled trials, research methods that minimise bias in observational studies are required. Two major concerns influence the internal validity of effect estimates: selection bias and clustering. Hence, to reduce the bias of the effect estimates, more sophisticated statistical methods are needed. We introduce statistical approaches such as propensity score matching and mixed models into a representative real-world analysis, and present the implementation in the statistical software R so that the results can be reproduced. We perform a two-level analytic strategy to address the problems of bias and clustering: (i) generalised models with different abilities to adjust for dependencies are used to analyse binary data and (ii) the genetic matching and covariate adjustment methods are used to adjust for selection bias. Hence, we analyse the data from two population samples, the sample produced by the matching method and the full sample. The different analysis methods in this article present different results but still point in the same direction. In our example, the estimate of the probability of receiving a case conference is higher in the treatment group than in the control group. Both strategies, genetic matching and covariate adjustment, have their limitations but complement each other to provide the whole picture. The statistical approaches were feasible for reducing bias but were nevertheless limited by the sample used. For each study and obtained sample, the pros and cons of the different methods have to be weighed. Copyright © 2017 The Author(s). Published by Elsevier Ltd. All rights reserved.

  4. The spatial clustering of obesity: does the built environment matter?

    PubMed

    Huang, R; Moudon, A V; Cook, A J; Drewnowski, A

    2015-12-01

    Obesity rates in the USA show distinct geographical patterns. The present study used spatial cluster detection methods and individual-level data to locate obesity clusters and to analyse them in relation to the neighbourhood built environment. The 2008-2009 Seattle Obesity Study provided data on the self-reported height, weight, and sociodemographic characteristics of 1602 King County adults. Home addresses were geocoded. Clusters of high or low body mass index were identified using Anselin's Local Moran's I and a spatial scan statistic with regression models that searched for unmeasured neighbourhood-level factors from residuals, adjusting for measured individual-level covariates. Spatially continuous values of objectively measured features of the local neighbourhood built environment (SmartMaps) were constructed for seven variables obtained from tax rolls and commercial databases. Both the Local Moran's I and a spatial scan statistic identified similar spatial concentrations of obesity. High and low obesity clusters were attenuated after adjusting for age, gender, race, education and income, and they disappeared once neighbourhood residential property values and residential density were included in the model. Using individual-level data to detect obesity clusters with two cluster detection methods, the present study showed that the spatial concentration of obesity was wholly explained by neighbourhood composition and socioeconomic characteristics. These characteristics may serve to more precisely locate obesity prevention and intervention programmes. © 2014 The British Dietetic Association Ltd.
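
    Anselin's Local Moran's I, one of the two cluster detectors used above, has a compact standard form. The sketch below is illustrative only: the study's SmartMaps variables, covariate adjustment, and permutation-based significance testing are not reproduced, and the values and weights matrix are hypothetical.

```python
def local_morans_i(values, weights):
    """Anselin's Local Moran's I for each areal unit i:
        I_i = z_i * sum_j(w_ij * z_j) / (sum_k z_k**2 / n)
    where z are deviations from the overall mean and w is a spatial weights
    matrix with w_ii = 0. Large positive I_i flags a unit that resembles its
    neighbours (a high-high or low-low cluster core); significance is usually
    judged by conditional permutation, omitted here.
    """
    n = len(values)
    m = sum(values) / n
    z = [v - m for v in values]
    s2 = sum(zi * zi for zi in z) / n
    return [z[i] * sum(weights[i][j] * z[j] for j in range(n)) / s2
            for i in range(n)]

# Four units on a line with rook (adjacency) weights: two high-BMI units
# next to each other, then two low-BMI units next to each other.
bmi = [10.0, 10.0, 0.0, 0.0]
w = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
print(local_morans_i(bmi, w))  # [1.0, 0.0, 0.0, 1.0]
```

    The two interior units each border one high and one low neighbour, so their local statistics cancel to zero; only the end units score as cluster cores.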

  5. Cluster Analysis in Nursing Research: An Introduction, Historical Perspective, and Future Directions.

    PubMed

    Dunn, Heather; Quinn, Laurie; Corbridge, Susan J; Eldeirawi, Kamal; Kapella, Mary; Collins, Eileen G

    2017-05-01

    The use of cluster analysis in the nursing literature is limited to the creation of classifications of homogeneous groups and the discovery of new relationships. As such, it is important to provide clarity regarding its use and potential. The purpose of this article is to provide an introduction to distance-based, partitioning-based, and model-based cluster analysis methods commonly utilized in the nursing literature, provide a brief historical overview on the use of cluster analysis in nursing literature, and provide suggestions for future research. An electronic search included three bibliographic databases, PubMed, CINAHL and Web of Science. Key terms were cluster analysis and nursing. The use of cluster analysis in the nursing literature is increasing and expanding. The increased use of cluster analysis in the nursing literature is positioning this statistical method to result in insights that have the potential to change clinical practice.

  6. Determining the Optimal Number of Clusters with the Clustergram

    NASA Technical Reports Server (NTRS)

    Fluegemann, Joseph K.; Davies, Misty D.; Aguirre, Nathan D.

    2011-01-01

    Cluster analysis aids research in many different fields, from business to biology to aerospace. It consists of using statistical techniques to group objects in large sets of data into meaningful classes. However, this process of ordering data points presents much uncertainty because it involves several steps, many of which are subject to researcher judgment as well as inconsistencies depending on the specific data type and research goals. These steps include the method used to cluster the data, the variables on which the cluster analysis will be operating, the number of resulting clusters, and parts of the interpretation process. In most cases, the number of clusters must be guessed or estimated before employing the clustering method. Many remedies have been proposed, but none is unassailable and certainly not for all data types. Thus, the aim of current research for better techniques of determining the number of clusters is generally confined to demonstrating that the new technique excels other methods in performance for several disparate data types. Our research makes use of a new cluster-number-determination technique based on the clustergram: a graph that shows how the number of objects in the cluster and the cluster mean (the ordinate) change with the number of clusters (the abscissa). We use the features of the clustergram to make the best determination of the cluster-number.
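
    The clustergram coordinates described above (cluster means and sizes on the ordinate, number of clusters on the abscissa) can be computed with any base clustering routine. A minimal sketch, assuming one-dimensional data and a basic Lloyd's k-means; the original work is not tied to this choice of clustering method or data.

```python
import random
from statistics import mean

def kmeans_1d(xs, k, iters=50, seed=0):
    """Basic Lloyd's k-means on 1-D data; returns sorted (mean, size) pairs."""
    rng = random.Random(seed)
    centers = rng.sample(xs, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k), key=lambda c: abs(x - centers[c]))
            groups[j].append(x)
        # keep the old center if a group ends up empty
        centers = [mean(g) if g else centers[j] for j, g in enumerate(groups)]
    return sorted((mean(g), len(g)) for g in groups if g)

def clustergram(xs, max_k):
    """For k = 1..max_k, record cluster means and sizes -- the quantities
    plotted against k when drawing a clustergram."""
    return {k: kmeans_1d(xs, k) for k in range(1, max_k + 1)}

xs = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]
for k, summaries in clustergram(xs, 3).items():
    print(k, summaries)
```

    Reading the resulting table, the k beyond which cluster means stop separating cleanly (here, beyond k = 2) is a candidate for the cluster number.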

  7. Robustness of serial clustering of extra-tropical cyclones to the choice of tracking method

    NASA Astrophysics Data System (ADS)

    Pinto, Joaquim G.; Ulbrich, Sven; Karremann, Melanie K.; Stephenson, David B.; Economou, Theodoros; Shaffrey, Len C.

    2016-04-01

    Cyclone families are a frequent synoptic weather feature in the Euro-Atlantic area in winter. Given appropriate large-scale conditions, the occurrence of such series (clusters) of storms may lead to large socio-economic impacts and cumulative losses. Recent studies analyzing Reanalysis data using single cyclone tracking methods have shown that serial clustering of cyclones occurs on both flanks and downstream regions of the North Atlantic storm track. This study explores the sensitivity of serial clustering to the choice of tracking method. With this aim, the IMILAST cyclone track database based on ERA-interim data is analysed. Clustering is estimated by the dispersion (ratio of variance to mean) of winter (DJF) cyclones passages near each grid point over the Euro-Atlantic area. Results indicate that while the general pattern of clustering is identified for all methods, there are considerable differences in detail. This can primarily be attributed to the differences in the variance of cyclone counts between the methods, which range up to one order of magnitude. Nevertheless, clustering over the Eastern North Atlantic and Western Europe can be identified for all methods and can thus be generally considered as a robust feature. The statistical links between large-scale patterns like the NAO and clustering are obtained for all methods, though with different magnitudes. We conclude that the occurrence of cyclone clustering over the Eastern North Atlantic and Western Europe is largely independent from the choice of tracking method and hence from the definition of a cyclone.
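
    The clustering measure in this abstract, the dispersion of cyclone counts, is simply the variance-to-mean ratio of winter passage counts at a grid point. A minimal sketch; the counts below are illustrative, not taken from the IMILAST database.

```python
from statistics import mean, pvariance

def dispersion_index(counts):
    """Variance-to-mean ratio of per-winter cyclone counts. For a Poisson
    (serially independent) process the ratio is ~1; values above 1 indicate
    clustering (overdispersion), values below 1 a more regular series."""
    return pvariance(counts) / mean(counts)

print(dispersion_index([3, 3, 3, 3]))        # perfectly regular series: 0.0
print(dispersion_index([0, 0, 8, 0, 0, 8]))  # bursty winters: well above 1
```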

  8. Identifying and characterizing hepatitis C virus hotspots in Massachusetts: a spatial epidemiological approach.

    PubMed

    Stopka, Thomas J; Goulart, Michael A; Meyers, David J; Hutcheson, Marga; Barton, Kerri; Onofrey, Shauna; Church, Daniel; Donahue, Ashley; Chui, Kenneth K H

    2017-04-20

    Hepatitis C virus (HCV) infections have increased during the past decade but little is known about geographic clustering patterns. We used a unique analytical approach, combining geographic information systems (GIS), spatial epidemiology, and statistical modeling to identify and characterize HCV hotspots, statistically significant clusters of census tracts with elevated HCV counts and rates. We compiled sociodemographic and HCV surveillance data (n = 99,780 cases) for Massachusetts census tracts (n = 1464) from 2002 to 2013. We used a five-step spatial epidemiological approach, calculating incremental spatial autocorrelations and Getis-Ord Gi* statistics to identify clusters. We conducted logistic regression analyses to determine factors associated with the HCV hotspots. We identified nine HCV clusters, with the largest in Boston, New Bedford/Fall River, Worcester, and Springfield (p < 0.05). In multivariable analyses, we found that HCV hotspots were independently and positively associated with the percent of the population that was Hispanic (adjusted odds ratio [AOR]: 1.07; 95% confidence interval [CI]: 1.04, 1.09) and the percent of households receiving food stamps (AOR: 1.83; 95% CI: 1.22, 2.74). HCV hotspots were independently and negatively associated with the percent of the population that were high school graduates or higher (AOR: 0.91; 95% CI: 0.89, 0.93) and the percent of the population in the "other" race/ethnicity category (AOR: 0.88; 95% CI: 0.85, 0.91). We identified locations where HCV clusters were a concern, and where enhanced HCV prevention, treatment, and care can help combat the HCV epidemic in Massachusetts. GIS, spatial epidemiological and statistical analyses provided a rigorous approach to identify hotspot clusters of disease, which can inform public health policy and intervention targeting. 
Further studies that incorporate spatiotemporal cluster analyses, Bayesian spatial and geostatistical models, spatially weighted regression analyses, and assessment of associations between HCV clustering and the built environment are needed to expand upon our combined spatial epidemiological and statistical methods.

  9. Clustering performance comparison using K-means and expectation maximization algorithms.

    PubMed

    Jung, Yong Gyu; Kang, Min Soo; Heo, Jun

    2014-11-14

    Clustering is an important means of data mining based on separating data categories by similar features. Unlike classification algorithms, clustering belongs to the unsupervised type of algorithms. Two representatives of the clustering algorithms are the K-means and the expectation maximization (EM) algorithms. Linear regression analysis was extended to the category-type dependent variable, while logistic regression was achieved using a linear combination of independent variables. To predict the possibility of occurrence of an event, a statistical approach is used. However, the classification of all data by means of logistic regression analysis cannot guarantee the accuracy of the results. In this paper, the logistic regression analysis is applied to EM clusters and the K-means clustering method for quality assessment of red wine, and a method is proposed for ensuring the accuracy of the classification results.
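
    The key contrast between the two algorithms named above is hard versus soft assignment: K-means attaches each point to exactly one cluster, while EM for a Gaussian mixture weights every point by a responsibility. A minimal one-dimensional, two-component EM sketch for illustration only; the paper's red-wine analysis is not reproduced.

```python
import math

def em_gmm_1d(xs, iters=50):
    """Fit a two-component 1-D Gaussian mixture by expectation maximization.
    E-step: compute soft responsibilities; M-step: weighted parameter updates.
    (K-means is essentially the hard-assignment limit of this procedure.)"""
    mu = [min(xs), max(xs)]   # crude initialization at the data extremes
    sd = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [w[k] * math.exp(-0.5 * ((x - mu[k]) / sd[k]) ** 2) / sd[k]
                 for k in range(2)]
            s = sum(p)
            resp.append([pk / s for pk in p])
        # M-step: responsibility-weighted mean, spread, and mixing weight
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            sd[k] = max(1e-6, math.sqrt(
                sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk))
            w[k] = nk / len(xs)
    return mu, sd, w

mu, sd, w = em_gmm_1d([0.0, 0.2, -0.2, 4.0, 4.2, 3.8])
print(sorted(mu))  # the component means settle near the two data clumps
```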

  10. The capacity limitations of orientation summary statistics

    PubMed Central

    Attarha, Mouna; Moore, Cathleen M.

    2015-01-01

    The simultaneous–sequential method was used to test the processing capacity of establishing mean orientation summaries. Four clusters of oriented Gabor patches were presented in the peripheral visual field. One of the clusters had a mean orientation that was tilted either left or right while the mean orientations of the other three clusters were roughly vertical. All four clusters were presented at the same time in the simultaneous condition whereas the clusters appeared in temporal subsets of two in the sequential condition. Performance was lower when the means of all four clusters had to be processed concurrently than when only two had to be processed in the same amount of time. The advantage for establishing fewer summaries at a given time indicates that the processing of mean orientation engages limited-capacity processes (Experiment 1). This limitation cannot be attributed to crowding, low target-distractor discriminability, or a limited-capacity comparison process (Experiments 2 and 3). In contrast to the limitations of establishing multiple summary representations, establishing a single summary representation unfolds without interference (Experiment 4). When interpreted in the context of recent work on the capacity of summary statistics, these findings encourage reevaluation of the view that early visual perception consists of summary statistic representations that unfold independently across multiple areas of the visual field. PMID:25810160

  11. Disease clusters, exact distributions of maxima, and P-values.

    PubMed

    Grimson, R C

    1993-10-01

    This paper presents combinatorial (exact) methods that are useful in the analysis of disease cluster data obtained from small environments, such as buildings and neighbourhoods. Maxwell-Boltzmann and Fermi-Dirac occupancy models are compared in terms of appropriateness of representation of disease incidence patterns (space and/or time) in these environments. The methods are illustrated by a statistical analysis of the incidence pattern of bone fractures in a setting wherein fracture clustering was alleged to be occurring. One of the methodological results derived in this paper is the exact distribution of the maximum cell frequency in occupancy models.
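
    For small settings, the exact Maxwell-Boltzmann distribution of the maximum cell frequency can be brute-forced by enumerating multinomial outcomes. The paper derives it analytically; the sketch below is only a check for toy sizes, assuming n cases fall independently and equiprobably into m cells.

```python
from fractions import Fraction
from math import factorial

def max_cell_distribution(n, m):
    """Exact distribution of the maximum cell frequency when n cases fall
    independently and equiprobably into m cells (Maxwell-Boltzmann model).
    Returns {max_count: probability} with exact Fraction probabilities."""
    def compositions(total, cells):
        if cells == 1:
            yield (total,)
            return
        for first in range(total + 1):
            for rest in compositions(total - first, cells - 1):
                yield (first,) + rest

    denom = Fraction(m) ** n
    dist = {}
    for comp in compositions(n, m):
        ways = factorial(n)
        for c in comp:
            ways //= factorial(c)  # multinomial coefficient
        dist[max(comp)] = dist.get(max(comp), Fraction(0)) + ways / denom
    return dist

# P-value of an observed maximum: P(max >= observed)
d = max_cell_distribution(4, 3)
p_value = sum(p for k, p in d.items() if k >= 3)
print(dict(d), p_value)  # P(max >= 3) = 1/3 for 4 cases in 3 cells
```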

  12. The CLASSY clustering algorithm: Description, evaluation, and comparison with the iterative self-organizing clustering system (ISOCLS). [used for LACIE data

    NASA Technical Reports Server (NTRS)

    Lennington, R. K.; Malek, H.

    1978-01-01

    A clustering method, CLASSY, was developed, which alternates maximum likelihood iteration with a procedure for splitting, combining, and eliminating the resulting statistics. The method maximizes the fit of a mixture of normal distributions to the observed first through fourth central moments of the data and produces an estimate of the proportions, means, and covariances in this mixture. The mathematical model that forms the basis for CLASSY and the actual operation of the algorithm are described. Data comparing the performances of CLASSY and ISOCLS on simulated and actual LACIE data are presented.

  13. Relative efficiency and sample size for cluster randomized trials with variable cluster sizes.

    PubMed

    You, Zhiying; Williams, O Dale; Aban, Inmaculada; Kabagambe, Edmond Kato; Tiwari, Hemant K; Cutter, Gary

    2011-02-01

    The statistical power of cluster randomized trials depends on two sample size components, the number of clusters per group and the number of individuals within clusters (cluster size). Variable cluster sizes are common and this variation alone may have a significant impact on study power. Previous approaches have taken this into account by either adjusting total sample size using a designated design effect or adjusting the number of clusters according to an assessment of the relative efficiency of unequal versus equal cluster sizes. This article defines a relative efficiency of unequal versus equal cluster sizes using noncentrality parameters, investigates properties of this measure, and proposes an approach for adjusting the required sample size accordingly. We focus on comparing two groups with normally distributed outcomes using a t-test, and use the noncentrality parameter to define the relative efficiency of unequal versus equal cluster sizes and show that statistical power depends only on this parameter for a given number of clusters. We calculate the sample size required for a trial with unequal cluster sizes to have the same power as one with equal cluster sizes. Relative efficiency based on the noncentrality parameter is straightforward to calculate and easy to interpret. It connects the required mean cluster size directly to the required sample size with equal cluster sizes. Consequently, our approach first determines the sample size requirements with equal cluster sizes for a pre-specified study power and then calculates the required mean cluster size while keeping the number of clusters unchanged. Our approach allows adjustment in mean cluster size alone or simultaneous adjustment in mean cluster size and number of clusters, and is a flexible alternative to and a useful complement to existing methods. Comparison indicated that we have defined a relative efficiency that is greater than the relative efficiency in the literature under some conditions. 
Our measure of relative efficiency might be less than the measure in the literature under some conditions, underestimating the relative efficiency. The relative efficiency of unequal versus equal cluster sizes defined using the noncentrality parameter suggests a sample size approach that is a flexible alternative and a useful complement to existing methods.
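
    The paper's noncentrality-parameter measure is not reproduced here, but the better-known adjustment it is compared against can be sketched: the design effect for variable cluster sizes from the general CRT literature, DE = 1 + ((1 + CV^2) * m_bar - 1) * ICC, where CV is the coefficient of variation of cluster sizes. The numbers below are illustrative only.

```python
def design_effect(mean_cluster_size, cv, icc):
    """Approximate design effect for a cluster randomized trial with variable
    cluster sizes: DE = 1 + ((1 + cv**2) * m_bar - 1) * icc. With cv = 0 this
    reduces to the classic equal-cluster-size design effect."""
    return 1 + ((1 + cv ** 2) * mean_cluster_size - 1) * icc

def inflated_sample_size(n_individual, mean_cluster_size, cv, icc):
    """Individually randomized sample size inflated by the design effect
    (round up to whole participants in practice)."""
    return n_individual * design_effect(mean_cluster_size, cv, icc)

# Equal clusters of 20 with ICC 0.05 give DE = 1.95; cluster-size variation
# (CV 0.5) pushes it to 2.2, i.e. ~13% more participants for the same power.
print(design_effect(20, 0.0, 0.05), design_effect(20, 0.5, 0.05))
```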

  14. Clustangles: An Open Library for Clustering Angular Data.

    PubMed

    Sargsyan, Karen; Hua, Yun Hao; Lim, Carmay

    2015-08-24

    Dihedral angles are good descriptors of the numerous conformations visited by large, flexible systems, but their analysis requires directional statistics. A single package including the various multivariate statistical methods for angular data that accounts for the distinct topology of such data does not exist. Here, we present a lightweight standalone, operating-system independent package called Clustangles to fill this gap. Clustangles will be useful in analyzing the ever-increasing number of structures in the Protein Data Bank and clustering the copious conformations from increasingly long molecular dynamics simulations.
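
    The need for directional statistics that Clustangles addresses is easy to see with the circular mean: angles must be averaged on the unit circle, not on the real line. A small illustrative sketch of the underlying idea, not the Clustangles API.

```python
import math

def circular_mean(angles_deg):
    """Mean direction of angles in degrees, via the resultant vector. Unlike
    the arithmetic mean, 350 and 10 average to 0, not 180 -- essential for
    dihedral angles, which wrap around."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360

def circular_variance(angles_deg):
    """1 minus the mean resultant length: 0 for identical angles, approaching
    1 for angles spread uniformly around the circle."""
    n = len(angles_deg)
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    return 1 - math.hypot(s, c)

print(circular_mean([350, 10]))         # near 0; the naive mean gives 180
print(circular_variance([90, 90, 90]))  # 0: no angular spread
```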

  15. Effect of spatial smoothing on t-maps: arguments for going back from t-maps to masked contrast images.

    PubMed

    Reimold, Matthias; Slifstein, Mark; Heinz, Andreas; Mueller-Schauenburg, Wolfgang; Bares, Roland

    2006-06-01

    Voxelwise statistical analysis has become popular in explorative functional brain mapping with fMRI or PET. Usually, results are presented as voxelwise levels of significance (t-maps), and for clusters that survive correction for multiple testing the coordinates of the maximum t-value are reported. Before calculating a voxelwise statistical test, spatial smoothing is required to achieve a reasonable statistical power. Little attention is being given to the fact that smoothing has a nonlinear effect on the voxel variances and thus the local characteristics of a t-map, which becomes most evident after smoothing over different types of tissue. We investigated the related artifacts, for example, white matter peaks whose position depend on the relative variance (variance over contrast) of the surrounding regions, and suggest improving spatial precision with 'masked contrast images': color-codes are attributed to the voxelwise contrast, and significant clusters (e.g., detected with statistical parametric mapping, SPM) are enlarged by including contiguous pixels with a contrast above the mean contrast in the original cluster, provided they satisfy P < 0.05. The potential benefit is demonstrated with simulations and data from a [11C]Carfentanil PET study. We conclude that spatial smoothing may lead to critical, sometimes-counterintuitive artifacts in t-maps, especially in subcortical brain regions. If significant clusters are detected, for example, with SPM, the suggested method is one way to improve spatial precision and may give the investigator a more direct sense of the underlying data. Its simplicity and the fact that no further assumptions are needed make it a useful complement for standard methods of statistical mapping.

  16. Analysis of basic clustering algorithms for numerical estimation of statistical averages in biomolecules.

    PubMed

    Anandakrishnan, Ramu; Onufriev, Alexey

    2008-03-01

    In statistical mechanics, the equilibrium properties of a physical system of particles can be calculated as the statistical average over accessible microstates of the system. In general, these calculations are computationally intractable since they involve summations over an exponentially large number of microstates. Clustering algorithms are one of the methods used to numerically approximate these sums. The most basic clustering algorithms first subdivide the system into a set of smaller subsets (clusters). Then, interactions between particles within each cluster are treated exactly, while all interactions between different clusters are ignored. These smaller clusters have far fewer microstates, making the summation over these microstates tractable. These algorithms have been previously used for biomolecular computations, but remain relatively unexplored in this context. Presented here is a theoretical analysis of the error and computational complexity for the two most basic clustering algorithms that were previously applied in the context of biomolecular electrostatics. We derive a tight, computationally inexpensive, error bound for the equilibrium state of a particle computed via these clustering algorithms. For some practical applications, it is the root mean square error, which can be significantly lower than the error bound, that may be more important. We show that there is a strong empirical relationship between error bound and root mean square error, suggesting that the error bound could be used as a computationally inexpensive metric for predicting the accuracy of clustering algorithms for practical applications. An example of error analysis for such an application, the computation of the average charge of ionizable amino acids in proteins, is given, demonstrating that the clustering algorithm can be accurate enough for practical purposes.

  17. Propensity score to detect baseline imbalance in cluster randomized trials: the role of the c-statistic.

    PubMed

    Leyrat, Clémence; Caille, Agnès; Foucher, Yohann; Giraudeau, Bruno

    2016-01-22

    Despite randomization, baseline imbalance and confounding bias may occur in cluster randomized trials (CRTs). Covariate imbalance may jeopardize the validity of statistical inferences if it occurs on prognostic factors. Thus, the diagnosis of such an imbalance is essential to adjust the statistical analysis if required. We developed a tool based on the c-statistic of the propensity score (PS) model to detect global baseline covariate imbalance in CRTs and assess the risk of confounding bias. We performed a simulation study to assess the performance of the proposed tool and applied this method to analyze the data from 2 published CRTs. The proposed method had good performance for large sample sizes (n = 500 per arm) and when the number of unbalanced covariates was not too small as compared with the total number of baseline covariates (≥40% of unbalanced covariates). We also provide a strategy for preselection of the covariates to be included in the PS model to enhance imbalance detection. The proposed tool could be useful in deciding whether covariate adjustment is required before performing statistical analyses of CRTs.
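
    The c-statistic of a propensity score model is its concordance probability: the chance that a randomly chosen treated unit receives a higher predicted score than a randomly chosen control. A minimal sketch of that computation alone; the PS model fitting and the paper's detection thresholds are not reproduced, and the scores below are hypothetical.

```python
def c_statistic(scores_treated, scores_control):
    """Concordance (c-statistic / AUC) between treated and control propensity
    scores, counting ties as one half. Values near 0.5 suggest the baseline
    covariates do not separate the arms (good balance); values well above 0.5
    flag imbalance that may warrant covariate adjustment."""
    pairs = concordant = ties = 0
    for t in scores_treated:
        for c in scores_control:
            pairs += 1
            if t > c:
                concordant += 1
            elif t == c:
                ties += 1
    return (concordant + 0.5 * ties) / pairs

print(c_statistic([0.4, 0.6], [0.4, 0.6]))  # identical score mixes: 0.5
print(c_statistic([0.9, 0.8], [0.1, 0.2]))  # fully separated arms: 1.0
```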

  18. Quantifying opening-mode fracture spatial organization in horizontal wellbore image logs, core and outcrop: Application to Upper Cretaceous Frontier Formation tight gas sandstones, USA

    NASA Astrophysics Data System (ADS)

    Li, J. Z.; Laubach, S. E.; Gale, J. F. W.; Marrett, R. A.

    2018-03-01

    The Upper Cretaceous Frontier Formation is a naturally fractured gas-producing sandstone in Wyoming. Regionally, both random patterns and patterns statistically more clustered than random exist in the same upper to lower shoreface depositional facies. East-west- and north-south-striking regional fractures sampled using image logs and cores from three horizontal wells exhibit clustered patterns, whereas data collected from east-west-striking fractures in outcrop have patterns that are indistinguishable from random. Image log data analyzed with the correlation count method shows clusters ∼35 m wide and spaced ∼50 to 90 m apart as well as clusters up to 12 m wide with periodic inter-cluster spacings. A hierarchy of cluster sizes exists; organization within clusters is likely fractal. These rocks have markedly different structural and burial histories, so regional differences in degree of clustering are unsurprising. Clustered patterns correspond to fractures having core quartz deposition contemporaneous with fracture opening, circumstances that some models suggest might affect spacing patterns by interfering with fracture growth. Our results show that quantifying and identifying patterns as statistically more or less clustered than random delineates differences in fracture patterns that are not otherwise apparent but that may influence gas and water production, and therefore may be economically important.

  19. Information Extraction from Large-Multi-Layer Social Networks

    DTIC Science & Technology

    2015-08-06

    mization [4]. Methods that fall into this category include spectral algorithms, modularity methods, and methods that rely on statistical inference...Snijders and Chris Baerveldt, “A multilevel network study of the effects of delinquent behavior on friendship evolution,” Journal of Mathematical Sociology...1970. [10] Ulrike Luxburg, “A tutorial on spectral clustering,” Statistics and Computing, vol. 17, no. 4, pp. 395–416, Dec. 2007. [11] R. A. Fisher, “On

  20. Case-control geographic clustering for residential histories accounting for risk factors and covariates

    PubMed Central

    2006-01-01

    Background Methods for analyzing space-time variation in risk in case-control studies typically ignore residential mobility. We develop an approach for analyzing case-control data for mobile individuals and apply it to study bladder cancer in 11 counties in southeastern Michigan. At this time data collection is incomplete and no inferences should be drawn – we analyze these data to demonstrate the novel methods. Global, local and focused clustering of residential histories for 219 cases and 437 controls is quantified using time-dependent nearest neighbor relationships. Business address histories for 268 industries that release known or suspected bladder cancer carcinogens are analyzed. A logistic model accounting for smoking, gender, age, race and education specifies the probability of being a case, and is incorporated into the cluster randomization procedures. Sensitivity of clustering to definition of the proximity metric is assessed for k = 1 to 75 nearest neighbors. Results Global clustering is partly explained by the covariates but remains statistically significant at 12 of the 14 levels of k considered. After accounting for the covariates, 26 local clusters are found in Lapeer, Ingham, Oakland and Jackson counties, with the clusters in Ingham and Oakland counties appearing in 1950 and persisting to the present. Statistically significant focused clusters are found about the business address histories of 22 industries located in Oakland (19 clusters), Ingham (2) and Jackson (1) counties. Clusters in central and southeastern Oakland County appear in the 1930s and persist to the present day. Conclusion These methods provide a systematic approach for evaluating a series of increasingly realistic alternative hypotheses regarding the sources of excess risk. 
So long as selection of cases and controls is population-based and not geographically biased, these tools can provide insights into geographic risk factors that were not specifically assessed in the case-control study design. PMID:16887016
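
    The time-dependent nearest-neighbour relationships used above extend a simpler static idea: count, over cases, how many of each case's k nearest neighbours are also cases (a Cuzick-Edwards-type statistic). A static sketch under that simplification; the coordinates below are hypothetical, and the residential-history, time-dependent version analyzed in the paper is considerably more elaborate.

```python
import math

def knn_case_statistic(coords, is_case, k):
    """Sum over cases of the number of cases among each case's k nearest
    neighbours. An excess over the value expected under random labelling
    suggests global clustering of cases; in practice significance is assessed
    by permuting the case/control labels, which is omitted here."""
    n = len(coords)
    total = 0
    for i in range(n):
        if not is_case[i]:
            continue
        nearest = sorted((math.dist(coords[i], coords[j]), j)
                         for j in range(n) if j != i)[:k]
        total += sum(is_case[j] for _, j in nearest)
    return total

coords = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = [1, 1, 1, 0, 0, 0]  # three clustered cases, three distant controls
print(knn_case_statistic(coords, labels, k=2))  # each case's 2 NN are cases: 6
```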

  1. A power comparison of generalized additive models and the spatial scan statistic in a case-control setting.

    PubMed

    Young, Robin L; Weinberg, Janice; Vieira, Verónica; Ozonoff, Al; Webster, Thomas F

    2010-07-19

    A common, important problem in spatial epidemiology is measuring and identifying variation in disease risk across a study region. In application of statistical methods, the problem has two parts. First, spatial variation in risk must be detected across the study region and, second, areas of increased or decreased risk must be correctly identified. The location of such areas may give clues to environmental sources of exposure and disease etiology. One statistical method applicable in spatial epidemiologic settings is a generalized additive model (GAM) which can be applied with a bivariate LOESS smoother to account for geographic location as a possible predictor of disease status. A natural hypothesis when applying this method is whether residential location of subjects is associated with the outcome, i.e. is the smoothing term necessary? Permutation tests are a reasonable hypothesis testing method and provide adequate power under a simple alternative hypothesis. These tests have yet to be compared to other spatial statistics. This research uses simulated point data generated under three alternative hypotheses to evaluate the properties of the permutation methods and compare them to the popular spatial scan statistic in a case-control setting. Case 1 was a single circular cluster centered in a circular study region. The spatial scan statistic had the highest power though the GAM method estimates did not fall far behind. Case 2 was a single point source located at the center of a circular cluster and Case 3 was a line source at the center of the horizontal axis of a square study region. Each had linearly decreasing log odds with distance from the point. The GAM methods outperformed the scan statistic in Cases 2 and 3. Comparing sensitivity, measured as the proportion of the exposure source correctly identified as high or low risk, the GAM methods outperformed the scan statistic in all three cases.
The GAM permutation testing methods provide a regression-based alternative to the spatial scan statistic. Across all hypotheses examined in this research, the GAM methods had competing or greater power estimates and sensitivities exceeding that of the spatial scan statistic.
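
    The permutation-testing idea can be illustrated with a crude stand-in for the GAM: a k-nearest-neighbour average replaces the bivariate LOESS smooth, and the test statistic is the largest deviation of the smoothed case proportion from the overall proportion. The choice of k, the statistic, and the data below are illustrative, not those of the paper.

```python
import numpy as np

def spatial_permutation_test(coords, is_case, k=10, n_perm=499, seed=0):
    """Permutation test for spatial variation in case proportion, using a
    k-NN local average as a rough stand-in for a LOESS risk surface."""
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]      # local window (self included)

    def stat(labels):
        local = labels[nn].mean(axis=1)    # smoothed case proportion
        return np.max(np.abs(local - labels.mean()))

    observed = stat(is_case.astype(float))
    perms = [stat(rng.permutation(is_case).astype(float)) for _ in range(n_perm)]
    p_value = (1 + sum(t >= observed for t in perms)) / (n_perm + 1)
    return observed, p_value
```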

  2. A power comparison of generalized additive models and the spatial scan statistic in a case-control setting

    PubMed Central

    2010-01-01

    Background A common, important problem in spatial epidemiology is measuring and identifying variation in disease risk across a study region. In application of statistical methods, the problem has two parts. First, spatial variation in risk must be detected across the study region and, second, areas of increased or decreased risk must be correctly identified. The location of such areas may give clues to environmental sources of exposure and disease etiology. One statistical method applicable in spatial epidemiologic settings is a generalized additive model (GAM) which can be applied with a bivariate LOESS smoother to account for geographic location as a possible predictor of disease status. A natural hypothesis when applying this method is whether residential location of subjects is associated with the outcome, i.e. is the smoothing term necessary? Permutation tests are a reasonable hypothesis testing method and provide adequate power under a simple alternative hypothesis. These tests have yet to be compared to other spatial statistics. Results This research uses simulated point data generated under three alternative hypotheses to evaluate the properties of the permutation methods and compare them to the popular spatial scan statistic in a case-control setting. Case 1 was a single circular cluster centered in a circular study region. The spatial scan statistic had the highest power though the GAM method estimates did not fall far behind. Case 2 was a single point source located at the center of a circular cluster and Case 3 was a line source at the center of the horizontal axis of a square study region. Each had linearly decreasing log odds with distance from the point. The GAM methods outperformed the scan statistic in Cases 2 and 3. Comparing sensitivity, measured as the proportion of the exposure source correctly identified as high or low risk, the GAM methods outperformed the scan statistic in all three cases.
Conclusions The GAM permutation testing methods provide a regression-based alternative to the spatial scan statistic. Across all hypotheses examined in this research, the GAM methods had competing or greater power estimates and sensitivities exceeding that of the spatial scan statistic. PMID:20642827

  3. Comparative analysis on the selection of number of clusters in community detection

    NASA Astrophysics Data System (ADS)

    Kawamoto, Tatsuro; Kabashima, Yoshiyuki

    2018-02-01

    We conduct a comparative analysis of various estimates of the number of clusters in community detection. An exhaustive comparison requires testing all possible combinations of frameworks, algorithms, and assessment criteria. In this paper we focus on the framework based on a stochastic block model, and investigate the performance of greedy algorithms, statistical inference, and spectral methods. For the assessment criteria, we consider modularity, the map equation, Bethe free energy, prediction errors, and isolated eigenvalues. The analysis makes apparent the tendencies of each assessment criterion and algorithm to overfit or underfit. In addition, we propose the alluvial diagram as a suitable tool for visualizing statistical inference results and for determining the number of clusters.

  4. Two worlds collide: Image analysis methods for quantifying structural variation in cluster molecular dynamics

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Steenbergen, K. G., E-mail: kgsteen@gmail.com; Gaston, N.

    2014-02-14

    Inspired by methods of remote sensing image analysis, we analyze structural variation in cluster molecular dynamics (MD) simulations through a unique application of the principal component analysis (PCA) and Pearson Correlation Coefficient (PCC). The PCA analysis characterizes the geometric shape of the cluster structure at each time step, yielding a detailed and quantitative measure of structural stability and variation at finite temperature. Our PCC analysis captures bond structure variation in MD, which can be used to both supplement the PCA analysis as well as compare bond patterns between different cluster sizes. Relying only on atomic position data, without requirement for a priori structural input, PCA and PCC can be used to analyze both classical and ab initio MD simulations for any cluster composition or electronic configuration. Taken together, these statistical tools represent powerful new techniques for quantitative structural characterization and isomer identification in cluster MD.

  5. Two worlds collide: image analysis methods for quantifying structural variation in cluster molecular dynamics.

    PubMed

    Steenbergen, K G; Gaston, N

    2014-02-14

    Inspired by methods of remote sensing image analysis, we analyze structural variation in cluster molecular dynamics (MD) simulations through a unique application of the principal component analysis (PCA) and Pearson Correlation Coefficient (PCC). The PCA analysis characterizes the geometric shape of the cluster structure at each time step, yielding a detailed and quantitative measure of structural stability and variation at finite temperature. Our PCC analysis captures bond structure variation in MD, which can be used to both supplement the PCA analysis as well as compare bond patterns between different cluster sizes. Relying only on atomic position data, without requirement for a priori structural input, PCA and PCC can be used to analyze both classical and ab initio MD simulations for any cluster composition or electronic configuration. Taken together, these statistical tools represent powerful new techniques for quantitative structural characterization and isomer identification in cluster MD.
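
    The per-frame PCA shape characterization can be sketched as the eigenvalues of the gyration (position-covariance) tensor of the atomic coordinates: near-equal principal moments indicate a compact, sphere-like cluster, while one dominant moment indicates elongation. This is a generic illustration of the idea, not the authors' exact pipeline.

```python
import numpy as np

def shape_descriptor(positions):
    """Principal moments of one MD frame: eigenvalues of the 3x3
    gyration tensor (descending), plus a simple asphericity measure
    (0 for a perfectly isotropic cluster, > 0 for elongated ones)."""
    centered = positions - positions.mean(axis=0)
    gyration = centered.T @ centered / len(positions)
    eigvals = np.linalg.eigvalsh(gyration)[::-1]   # sort descending
    asphericity = eigvals[0] - 0.5 * (eigvals[1] + eigvals[2])
    return eigvals, asphericity
```

    Tracking these moments over all time steps gives a quantitative signal of structural change at finite temperature.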

  6. Methods of editing cloud and atmospheric layer affected pixels from satellite data

    NASA Technical Reports Server (NTRS)

    Nixon, P. R. (Principal Investigator); Wiegand, C. L.; Richardson, A. J.; Johnson, M. P.

    1982-01-01

    Practical methods of computer screening cloud-contaminated pixels from data of various satellite systems are proposed. Examples are given of the location of clouds and representative landscape features in HCMM spectral space of reflectance (VIS) vs emission (IR). Methods of screening out cloud-affected HCMM data are discussed. The character of subvisible absorbing-emitting atmospheric layers (subvisible cirrus or SCi) in HCMM data is considered and radiosonde soundings are examined in relation to the presence of SCi. The statistical characteristics of multispectral meteorological satellite data in clear and SCi-affected areas are discussed. Examples in TIROS-N and NOAA-7 data from several states and Mexico are presented. The VIS-IR cluster screening method for removing clouds is applied to a 262,144-pixel HCMM scene from south Texas and northeast Mexico. The SCi that remain after cluster screening are screened out by applying a statistically determined IR limit.
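
    The two-stage screening logic might be sketched as simple boolean masks in VIS-IR space: bright-and-cold pixels are flagged as cloud, and of the remainder, anomalously cool pixels are flagged as SCi via an IR limit. The thresholds below are illustrative placeholders, not the statistically determined HCMM values.

```python
import numpy as np

def screen_pixels(vis, ir, vis_cloud=30.0, ir_cloud=280.0, ir_sci=290.0):
    """Stage 1: cloud = bright in VIS and cold in IR.
    Stage 2: of the remaining pixels, SCi = cooler than an IR limit.
    Thresholds are hypothetical; units assumed % reflectance and kelvin."""
    cloud = (vis > vis_cloud) & (ir < ir_cloud)
    sci = ~cloud & (ir < ir_sci)
    clear = ~cloud & ~sci
    return cloud, sci, clear
```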

  7. Examining the effectiveness of discriminant function analysis and cluster analysis in species identification of male field crickets based on their calling songs.

    PubMed

    Jaiswara, Ranjana; Nandi, Diptarup; Balakrishnan, Rohini

    2013-01-01

    Traditional taxonomy based on morphology has often failed in accurate species identification owing to the occurrence of cryptic species, which are reproductively isolated but morphologically identical. Molecular data have thus been used to complement morphology in species identification. The sexual advertisement calls in several groups of acoustically communicating animals are species-specific and can thus complement molecular data as non-invasive tools for identification. Several statistical tools and automated identifier algorithms have been used to investigate the efficiency of acoustic signals in species identification. Despite a plethora of such methods, there is a general lack of knowledge regarding the appropriate usage of these methods in specific taxa. In this study, we investigated the performance of two commonly used statistical methods, discriminant function analysis (DFA) and cluster analysis, in identification and classification based on acoustic signals of field cricket species belonging to the subfamily Gryllinae. Using a comparative approach we evaluated the optimal number of species and calling song characteristics for both the methods that lead to most accurate classification and identification. The accuracy of classification using DFA was high and was not affected by the number of taxa used. However, a constraint in using discriminant function analysis is the need for a priori classification of songs. Accuracy of classification using cluster analysis, which does not require a priori knowledge, was maximum for 6-7 taxa and decreased significantly when more than ten taxa were analysed together. We also investigated the efficacy of two novel derived acoustic features in improving the accuracy of identification. Our results show that DFA is a reliable statistical tool for species identification using acoustic signals. 
Our results also show that cluster analysis of acoustic signals in crickets works effectively for species classification and identification.
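
    The core of a two-class DFA is the Fisher discriminant, which can be sketched in a few lines of linear algebra. The three "song features" in the test (roughly: carrier frequency, syllable rate, chirp duration) are hypothetical stand-ins, not the characters measured in the study.

```python
import numpy as np

def fisher_lda_fit(X, y):
    """Two-class Fisher discriminant: w solves Sw @ w = (m1 - m0);
    classification projects onto w and thresholds at the class midpoint."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    Sw = np.cov(X[y == 0], rowvar=False) + np.cov(X[y == 1], rowvar=False)
    w = np.linalg.solve(Sw, m1 - m0)
    threshold = 0.5 * (m0 + m1) @ w
    return w, threshold

def fisher_lda_predict(X, w, threshold):
    return (X @ w > threshold).astype(int)
```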

  8. Relative risk estimates from spatial and space-time scan statistics: Are they biased?

    PubMed Central

    Prates, Marcos O.; Kulldorff, Martin; Assunção, Renato M.

    2014-01-01

    The purely spatial and space-time scan statistics have been successfully used by many scientists to detect and evaluate geographical disease clusters. Although the scan statistic has high power in correctly identifying a cluster, no study has considered the estimates of the cluster relative risk in the detected cluster. In this paper we evaluate whether there is any bias in these estimated relative risks. Intuitively, one may expect that the estimated relative risks have an upward bias, since the scan statistic cherry-picks high-rate areas to include in the cluster. We show that this intuition is correct for clusters with low statistical power, but with medium to high power the bias becomes negligible. The same behaviour is not observed for the prospective space-time scan statistic, where there is an increasingly conservative downward bias of the relative risk as the power to detect the cluster increases. PMID:24639031

  9. Disparity: scalable anomaly detection for clusters.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Desai, N.; Bradshaw, R.; Lusk, E.

    2008-01-01

    In this paper, we describe disparity, a tool that does parallel, scalable anomaly detection for clusters. Disparity uses basic statistical methods and scalable reduction operations to perform data reduction on client nodes and uses these results to locate node anomalies. We discuss the implementation of disparity and present results of its use on a SiCortex SC5832 system.
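
    Disparity's own implementation is not reproduced here, but the underlying idea (reduce per-node metrics to summaries, then flag nodes that deviate from their peers) might be sketched with robust z-scores; the 3.5 cutoff and the metric layout are illustrative assumptions.

```python
import numpy as np

def flag_anomalous_nodes(metrics, z_limit=3.5):
    """Per-metric robust z-score (median / MAD) across cluster nodes;
    a node is flagged if any of its metrics deviates strongly from the
    peer consensus. `metrics` has shape (n_nodes, n_metrics)."""
    metrics = np.asarray(metrics, dtype=float)
    med = np.median(metrics, axis=0)
    mad = np.median(np.abs(metrics - med), axis=0)
    mad = np.where(mad == 0, 1e-12, mad)     # avoid division by zero
    z = 0.6745 * (metrics - med) / mad       # 0.6745 ~ normal consistency
    return np.where(np.any(np.abs(z) > z_limit, axis=1))[0]
```

    In the real tool the reduction happens on the client nodes, so only small summaries, not raw metrics, traverse the network.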

  10. Improved Test Planning and Analysis Through the Use of Advanced Statistical Methods

    NASA Technical Reports Server (NTRS)

    Green, Lawrence L.; Maxwell, Katherine A.; Glass, David E.; Vaughn, Wallace L.; Barger, Weston; Cook, Mylan

    2016-01-01

    The goal of this work is, through computational simulations, to provide statistically-based evidence to convince the testing community that a distributed testing approach is superior to a clustered testing approach for most situations. For clustered testing, numerous, repeated test points are acquired at a limited number of test conditions. For distributed testing, only one or a few test points are requested at many different conditions. The statistical techniques of Analysis of Variance (ANOVA), Design of Experiments (DOE) and Response Surface Methods (RSM) are applied to enable distributed test planning, data analysis and test augmentation. The D-Optimal class of DOE is used to plan an optimally efficient single- and multi-factor test. The resulting simulated test data are analyzed via ANOVA and a parametric model is constructed using RSM. Finally, ANOVA can be used to plan a second round of testing to augment the existing data set with new data points. The use of these techniques is demonstrated through several illustrative examples. To date, many thousands of comparisons have been performed and the results strongly support the conclusion that the distributed testing approach outperforms the clustered testing approach.

  11. Clustering of Fast-Food Restaurants Around Schools: A Novel Application of Spatial Statistics to the Study of Food Environments

    PubMed Central

    Austin, S. Bryn; Melly, Steven J.; Sanchez, Brisa N.; Patel, Aarti; Buka, Stephen; Gortmaker, Steven L.

    2005-01-01

    Objectives. We examined the concentration of fast-food restaurants in areas proximal to schools to characterize school neighborhood food environments. Methods. We used geocoded databases of restaurant and school addresses to examine locational patterns of fast-food restaurants and kindergartens and primary and secondary schools in Chicago. We used the bivariate K function statistical method to quantify the degree of clustering (spatial dependence) of fast-food restaurants around school locations. Results. The median distance from any school in Chicago to the nearest fast-food restaurant was 0.52 km, a distance that an adult can walk in little more than 5 minutes, and 78% of schools had at least 1 fast-food restaurant within 800 m. Fast-food restaurants were statistically significantly clustered in areas within a short walking distance from schools, with an estimated 3 to 4 times as many fast-food restaurants within 1.5 km from schools than would be expected if the restaurants were distributed throughout the city in a way unrelated to school locations. Conclusions. Fast-food restaurants are concentrated within a short walking distance from schools, exposing children to poor-quality food environments in their school neighborhoods. PMID:16118369
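
    The descriptive summaries reported here (median nearest-restaurant distance and the share of schools with a restaurant within 800 m) are straightforward to compute from projected coordinates; note this sketch is those summaries only, not the bivariate K function itself.

```python
import numpy as np

def nearest_distance_stats(schools, restaurants, radius=800.0):
    """Distance (same units as the projected coordinates, here metres)
    from each school to its nearest restaurant, summarized as the median
    and the share of schools with a restaurant within `radius`."""
    d = np.linalg.norm(schools[:, None, :] - restaurants[None, :, :], axis=2)
    nearest = d.min(axis=1)
    return float(np.median(nearest)), float(np.mean(nearest <= radius))
```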

  12. An adaptive data-driven method for accurate prediction of remaining useful life of rolling bearings

    NASA Astrophysics Data System (ADS)

    Peng, Yanfeng; Cheng, Junsheng; Liu, Yanfei; Li, Xuejun; Peng, Zhihua

    2018-06-01

    A novel data-driven method based on Gaussian mixture model (GMM) and distance evaluation technique (DET) is proposed to predict the remaining useful life (RUL) of rolling bearings. The data sets are clustered by GMM to divide all data sets into several health states adaptively and reasonably. The number of clusters is determined by the minimum description length principle. Thus, either the health state of the data sets or the number of the states is obtained automatically. Meanwhile, the abnormal data sets can be recognized during the clustering process and removed from the training data sets. After obtaining the health states, appropriate features are selected by DET for increasing the classification and prediction accuracy. In the prediction process, each vibration signal is decomposed into several components by empirical mode decomposition. Some common statistical parameters of the components are calculated first and then the features are clustered using GMM to divide the data sets into several health states and remove the abnormal data sets. Thereafter, appropriate statistical parameters of the generated components are selected using DET. Finally, least squares support vector machine is utilized to predict the RUL of rolling bearings. Experimental results indicate that the proposed method reliably predicts the RUL of rolling bearings.
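
    The GMM-plus-minimum-description-length step for choosing the number of health states can be sketched in one dimension. The quantile initialisation, iteration count, and BIC-form MDL penalty below are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def gmm_loglik_1d(x, k, n_iter=100):
    """EM fit of a k-component 1-D Gaussian mixture with quantile-based
    initialisation; returns the fitted log-likelihood."""
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)   # E-step
        nk = resp.sum(axis=0) + 1e-12                   # M-step below
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        w = nk / len(x)
    return float(np.log(dens.sum(axis=1)).sum())

def select_k_mdl(x, k_max=3):
    """Pick the number of states by an MDL/BIC-style score:
    -logL + 0.5 * n_params * log(n), with 3k - 1 free parameters."""
    scores = [-gmm_loglik_1d(x, k) + 0.5 * (3 * k - 1) * np.log(len(x))
              for k in range(1, k_max + 1)]
    return int(np.argmin(scores)) + 1
```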

  13. Generation and optimization of superpixels as image processing kernels for Jones matrix optical coherence tomography

    PubMed Central

    Miyazawa, Arata; Hong, Young-Joo; Makita, Shuichi; Kasaragod, Deepa; Yasuno, Yoshiaki

    2017-01-01

    Jones matrix-based polarization sensitive optical coherence tomography (JM-OCT) simultaneously measures optical intensity, birefringence, degree of polarization uniformity, and OCT angiography. The statistics of the optical features in a local region, such as the local mean of the OCT intensity, are frequently used for image processing and the quantitative analysis of JM-OCT. Conventionally, local statistics have been computed with fixed-size rectangular kernels. However, this results in a trade-off between image sharpness and statistical accuracy. We introduce a superpixel method to JM-OCT for generating the flexible kernels of local statistics. A superpixel is a cluster of image pixels that is formed by the pixels’ spatial and signal value proximities. An algorithm for superpixel generation specialized for JM-OCT and its optimization methods are presented in this paper. The spatial proximity is in two-dimensional cross-sectional space and the signal values are the four optical features. Hence, the superpixel method is a six-dimensional clustering technique for JM-OCT pixels. The performance of the JM-OCT superpixels and its optimization methods are evaluated in detail using JM-OCT datasets of posterior eyes. The superpixels were found to well preserve tissue structures, such as layer structures, sclera, vessels, and retinal pigment epithelium. And hence, they are more suitable for local statistics kernels than conventional uniform rectangular kernels. PMID:29082073

  14. Intrinsic alignment in redMaPPer clusters – II. Radial alignment of satellites towards cluster centres

    DOE PAGES

    Huang, Hung-Jin; Mandelbaum, Rachel; Freeman, Peter E.; ...

    2017-11-23

    We study the orientations of satellite galaxies in redMaPPer clusters constructed from the Sloan Digital Sky Survey at 0.1 < z < 0.35 to determine whether there is any preferential tendency for satellites to point radially towards cluster centres. Here, we analyse the satellite alignment (SA) signal based on three shape measurement methods (re-Gaussianization, de Vaucouleurs, and isophotal shapes), which trace galaxy light profiles at different radii. The measured SA signal depends on these shape measurement methods. We detect the strongest SA signal in isophotal shapes, followed by de Vaucouleurs shapes. While no net SA signal is detected using re-Gaussianization shapes across the entire sample, the observed SA signal reaches a statistically significant level when limiting to a subsample of higher luminosity satellites. We further investigate the impact of noise, systematics, and real physical isophotal twisting effects in the comparison between the SA signal detected via different shape measurement methods. Unlike previous studies, which only consider the dependence of SA on a few parameters, here we explore a total of 17 galaxy and cluster properties, using a statistical model averaging technique to naturally account for parameter correlations and identify significant SA predictors. We find that the measured SA signal is strongest for satellites with the following characteristics: higher luminosity, smaller distance to the cluster centre, rounder in shape, higher bulge fraction, and distributed preferentially along the major axis directions of their centrals. Finally, we provide physical explanations for the identified dependences and discuss the connection to theories of SA.

  15. Detection of calcification clusters in digital breast tomosynthesis slices at different dose levels utilizing a SRSAR reconstruction and JAFROC

    NASA Astrophysics Data System (ADS)

    Timberg, P.; Dustler, M.; Petersson, H.; Tingberg, A.; Zackrisson, S.

    2015-03-01

    Purpose: To investigate detection performance for calcification clusters in reconstructed digital breast tomosynthesis (DBT) slices at different dose levels using a Super Resolution and Statistical Artifact Reduction (SRSAR) reconstruction method. Method: Simulated calcifications with an irregular profile (0.2 mm diameter) were combined to form clusters that were added to projection images (1-3 per abnormal image) acquired on a DBT system (Mammomat Inspiration, Siemens). The projection images were dose reduced by software to form 35 abnormal cases and 25 normal cases as if acquired at 100%, 75% and 50% dose levels (AGD of approximately 1.6 mGy for a 53 mm standard breast, measured according to EUREF v0.15). A standard FBP and a SRSAR reconstruction method (utilizing IRIS (iterative reconstruction filters) and outlier detection using Maximum-Intensity Projections and Average-Intensity Projections) were used to reconstruct single central slices to be used in a free-response task (60 images per observer and dose level). Six observers participated; their task was to detect the clusters and assign a confidence rating in randomly presented images from the whole image set (balanced by dose level). Each trial was separated by one week to reduce possible memory bias. The outcome was analyzed for statistical differences using Jackknifed Alternative Free-response Receiver Operating Characteristics. Results: The results indicate that it is possible to reduce the dose by 50% with SRSAR without jeopardizing cluster detection. Conclusions: The detection performance for clusters can be maintained at a lower dose level by using SRSAR reconstruction.

  16. Intrinsic alignment in redMaPPer clusters – II. Radial alignment of satellites towards cluster centres

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Huang, Hung-Jin; Mandelbaum, Rachel; Freeman, Peter E.

    We study the orientations of satellite galaxies in redMaPPer clusters constructed from the Sloan Digital Sky Survey at 0.1 < z < 0.35 to determine whether there is any preferential tendency for satellites to point radially towards cluster centres. Here, we analyse the satellite alignment (SA) signal based on three shape measurement methods (re-Gaussianization, de Vaucouleurs, and isophotal shapes), which trace galaxy light profiles at different radii. The measured SA signal depends on these shape measurement methods. We detect the strongest SA signal in isophotal shapes, followed by de Vaucouleurs shapes. While no net SA signal is detected using re-Gaussianization shapes across the entire sample, the observed SA signal reaches a statistically significant level when limiting to a subsample of higher luminosity satellites. We further investigate the impact of noise, systematics, and real physical isophotal twisting effects in the comparison between the SA signal detected via different shape measurement methods. Unlike previous studies, which only consider the dependence of SA on a few parameters, here we explore a total of 17 galaxy and cluster properties, using a statistical model averaging technique to naturally account for parameter correlations and identify significant SA predictors. We find that the measured SA signal is strongest for satellites with the following characteristics: higher luminosity, smaller distance to the cluster centre, rounder in shape, higher bulge fraction, and distributed preferentially along the major axis directions of their centrals. Finally, we provide physical explanations for the identified dependences and discuss the connection to theories of SA.

  17. Cluster Detection Tests in Spatial Epidemiology: A Global Indicator for Performance Assessment

    PubMed Central

    Guttmann, Aline; Li, Xinran; Feschet, Fabien; Gaudart, Jean; Demongeot, Jacques; Boire, Jean-Yves; Ouchchane, Lemlih

    2015-01-01

    In cluster detection of disease, the use of local cluster detection tests (CDTs) is common. These methods aim both at locating likely clusters and at testing for their statistical significance. New or improved CDTs are regularly proposed to epidemiologists and must be subjected to performance assessment. Because location accuracy has to be considered, performance assessment goes beyond the raw estimation of type I or II errors. As no consensus exists for performance evaluations, heterogeneous methods are used, and therefore studies are rarely comparable. A global indicator of performance, which assesses both spatial accuracy and usual power, would facilitate the exploration of CDTs' behaviour and help between-studies comparisons. The Tanimoto coefficient (TC) is a well-known measure of similarity that can assess location accuracy, but only for one detected cluster. In a simulation study, performance is measured over many tests. From the TC, we here propose two statistics, the averaged TC and the cumulated TC, as indicators able to provide a global overview of CDTs' performance for both usual power and location accuracy. We demonstrate the properties of these two indicators and the superiority of the cumulated TC in assessing performance. We used these indicators to conduct a systematic spatial assessment displayed through performance maps. PMID:26086911
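
    For a single simulation run, the Tanimoto coefficient compares the detected cluster with the true cluster as sets of area identifiers; averaging it over runs gives the kind of global indicator the paper builds on. A minimal sketch:

```python
def tanimoto(detected, true):
    """TC = |A intersect B| / |A union B| for a detected cluster A and
    the true cluster B, each given as a set of area identifiers.
    1.0 means perfect localisation, 0.0 means no overlap."""
    detected, true = set(detected), set(true)
    if not detected and not true:
        return 1.0          # nothing to detect, nothing detected
    return len(detected & true) / len(detected | true)
```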

  18. Clustering by reordering of similarity and Laplacian matrices: Application to galaxy clusters

    NASA Astrophysics Data System (ADS)

    Mahmoud, E.; Shoukry, A.; Takey, A.

    2018-04-01

    Similarity metrics, kernels and similarity-based algorithms have gained much attention due to their increasing applications in information retrieval, data mining, pattern recognition and machine learning. Similarity graphs are often adopted as the underlying representation of similarity matrices and are at the origin of known clustering algorithms such as spectral clustering. Similarity matrices offer the advantage of working in object-object (two-dimensional) space, where visualization of cluster similarities is available, instead of object-features (multi-dimensional) space. In this paper, sparse ɛ-similarity graphs are constructed and decomposed into strong components using appropriate methods such as the Dulmage-Mendelsohn permutation (DMperm) and/or Reverse Cuthill-McKee (RCM) algorithms. The obtained strong components correspond to groups (clusters) in the input (feature) space. The parameter ɛi is estimated locally, at each data point i, from a corresponding narrow range of the number of nearest neighbors. Although more advanced clustering techniques are available, our method has the advantages of simplicity, better complexity and direct visualization of the cluster similarities in a two-dimensional space. Also, no prior information about the number of clusters is needed. We conducted our experiments on two- and three-dimensional synthetic datasets of various sizes, as well as on a real astronomical dataset. The results are verified graphically and analyzed using gap statistics over a range of neighbors to verify the robustness of the algorithm and the stability of the results. Combining the proposed algorithm with gap statistics provides a promising tool for solving clustering problems. An astronomical application is conducted to confirm the existence of 45 galaxy clusters around the X-ray positions of galaxy clusters in the redshift range [0.1..0.8].
We re-estimate the photometric redshifts of the identified galaxy clusters and obtain acceptable values compared to published spectroscopic redshifts with a 0.029 standard deviation of their differences.
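
    The components that DMperm/RCM reordering exposes as diagonal blocks of the ɛ-similarity matrix are simply the connected components of the ɛ-graph. The sketch below recovers them by breadth-first search with a single global ɛ, rather than by matrix reordering with locally estimated ɛi, so it illustrates the clustering principle rather than the paper's algorithm.

```python
import numpy as np
from collections import deque

def epsilon_graph_clusters(X, eps):
    """Connect points within distance eps, then label each connected
    component; component labels are the cluster assignments."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    adj = (d <= eps) & ~np.eye(len(X), dtype=bool)
    labels = np.full(len(X), -1)
    current = 0
    for start in range(len(X)):
        if labels[start] != -1:
            continue
        queue = deque([start])      # breadth-first search from a seed
        labels[start] = current
        while queue:
            i = queue.popleft()
            for j in np.where(adj[i] & (labels == -1))[0]:
                labels[j] = current
                queue.append(j)
        current += 1
    return labels
```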

  19. A Cyber-Attack Detection Model Based on Multivariate Analyses

    NASA Astrophysics Data System (ADS)

    Sakai, Yuto; Rinsaka, Koichiro; Dohi, Tadashi

    In the present paper, we propose a novel cyber-attack detection model that applies two multivariate-analysis methods to the audit data observed on a host machine. The statistical techniques used here are the well-known Hayashi's quantification method IV and the cluster analysis method. We quantify the observed qualitative audit event sequence via quantification method IV, and collect similar audit event sequences into the same groups based on the cluster analysis. Simulation experiments show that our model can improve cyber-attack detection accuracy in some realistic cases where both normal and attack activities are intermingled.

  20. Functional region prediction with a set of appropriate homologous sequences-an index for sequence selection by integrating structure and sequence information with spatial statistics

    PubMed Central

    2012-01-01

    Background The detection of conserved residue clusters on a protein structure is one of the effective strategies for the prediction of functional protein regions. Various methods, such as Evolutionary Trace, have been developed based on this strategy. In such approaches, the conserved residues are identified through comparisons of homologous amino acid sequences. Therefore, the selection of homologous sequences is a critical step. It is empirically known that a certain degree of sequence divergence in the set of homologous sequences is required for the identification of conserved residues. However, the development of a method to select homologous sequences appropriate for the identification of conserved residues has not been sufficiently addressed. An objective and general method to select appropriate homologous sequences is desired for the efficient prediction of functional regions. Results We have developed a novel index to select the sequences appropriate for the identification of conserved residues, and implemented the index within our method to predict the functional regions of a protein. The implementation of the index improved the performance of the functional region prediction. The index represents the degree of conserved residue clustering on the tertiary structure of the protein. For this purpose, the structure and sequence information were integrated within the index by the application of spatial statistics. Spatial statistics is a field of statistics in which not only the attributes but also the geometrical coordinates of the data are considered simultaneously. Higher degrees of clustering generate larger index scores. We adopted the set of homologous sequences with the highest index score, under the assumption that the best prediction accuracy is obtained when the degree of clustering is the maximum. 
The set of sequences selected by the index led to higher functional region prediction performance than the sets of sequences selected by other sequence-based methods. Conclusions Appropriate homologous sequences are selected automatically and objectively by the index. Such sequence selection improved the performance of functional region prediction. As far as we know, this is the first approach in which spatial statistics have been applied to protein analyses. Such integration of structure and sequence information would be useful for other bioinformatics problems. PMID:22643026
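The index itself is not given in closed form in this record, but its spirit can be sketched with a simple join-count-style statistic: count spatially close conserved-residue pairs and compare against the average obtained by randomly relabelling which residues are conserved. The coordinates, the 8 Å radius, and the function name below are illustrative assumptions, not the authors' actual index:

```python
import itertools
import math
import random

def clustering_score(coords, conserved, radius=8.0, trials=200, seed=0):
    """Toy join-count-style index: observed close conserved-residue pairs
    divided by the average under random relabelling of conservation."""
    def close_pairs(flags):
        return sum(1 for i, j in itertools.combinations(range(len(coords)), 2)
                   if flags[i] and flags[j]
                   and math.dist(coords[i], coords[j]) <= radius)
    observed = close_pairs(conserved)
    rng = random.Random(seed)
    null_counts = []
    for _ in range(trials):
        shuffled = conserved[:]
        rng.shuffle(shuffled)  # random relabelling preserves the count of conserved residues
        null_counts.append(close_pairs(shuffled))
    return observed / (sum(null_counts) / len(null_counts))

# Invented C-alpha coordinates (angstroms); the conserved residues sit together.
coords = [(0, 0, 0), (3, 0, 0), (0, 3, 0), (30, 0, 0), (33, 0, 0), (30, 3, 0)]
conserved = [1, 1, 1, 0, 0, 0]
score = clustering_score(coords, conserved)
```

A score well above 1 indicates that conserved residues cluster on the structure more than chance would predict, which is the property the authors' index rewards when ranking sets of homologous sequences.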

  1. Statistical design and analysis plan for an impact evaluation of an HIV treatment and prevention intervention for female sex workers in Zimbabwe: a study protocol for a cluster randomised controlled trial.

    PubMed

    Hargreaves, James R; Fearon, Elizabeth; Davey, Calum; Phillips, Andrew; Cambiano, Valentina; Cowan, Frances M

    2016-01-05

    Pragmatic cluster-randomised trials should seek to make unbiased estimates of effect and be reported according to CONSORT principles, and the study population should be representative of the target population. This is challenging when conducting trials amongst 'hidden' populations without a sample frame. We describe a pair-matched cluster-randomised trial of a combination HIV-prevention intervention to reduce the proportion of female sex workers (FSW) with a detectable HIV viral load in Zimbabwe, recruiting via respondent driven sampling (RDS). We will cross-sectionally survey approximately 200 FSW at baseline and at endline to characterise each of 14 sites. RDS is a variant of chain referral sampling and has been adapted to approximate random sampling. Primary analysis will use the 'RDS-2' method to estimate cluster summaries and will adapt Hayes and Moulton's '2-step' method to adjust effect estimates for individual-level confounders and further adjust for cluster baseline prevalence. We will adapt CONSORT to accommodate RDS. In the absence of observable refusal rates, we will compare the recruitment process between matched pairs. We will need to investigate whether cluster-specific recruitment or the intervention itself affects the accuracy of the RDS estimation process, potentially causing differential biases. To do this, we will calculate RDS-diagnostic statistics for each cluster at each time point and compare these statistics within matched pairs and time points. Sensitivity analyses will assess the impact of potential biases arising from assumptions made by the RDS-2 estimation. We are not aware of any other completed pragmatic cluster RCTs that are recruiting participants using RDS. Our statistical design and analysis approach seeks to transparently document participant recruitment and allow an assessment of the representativeness of the study to the target population, a key aspect of pragmatic trials. 
The challenges we have faced in the design of this trial are likely to be shared in other contexts aiming to serve the needs of legally and/or socially marginalised populations for which no sampling frame exists and especially when the social networks of participants are both the target of intervention and the means of recruitment. The trial was registered at Pan African Clinical Trials Registry (PACTR201312000722390) on 9 December 2013.
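As a rough illustration of the kind of estimator the protocol names, the RDS-II (Volz-Heckathorn) estimator weights each respondent by the inverse of their reported network degree. The sketch below assumes that form with made-up data; it is not the trial's analysis code:

```python
def rds2_estimate(outcomes, degrees):
    """Inverse-degree-weighted prevalence estimate in the spirit of RDS-II
    (Volz-Heckathorn); respondents with larger personal networks, who are
    more likely to be recruited, receive smaller weights."""
    weights = [1.0 / d for d in degrees]
    return sum(w * y for w, y in zip(weights, outcomes)) / sum(weights)

# Hypothetical respondents: outcome 1 = detectable viral load; degree = network size.
outcomes = [1, 0, 1, 0, 0]
degrees = [2, 10, 4, 5, 20]
estimate = rds2_estimate(outcomes, degrees)
```

The unweighted prevalence here would be 2/5 = 0.4; down-weighting the high-degree respondents shifts the estimate, which is exactly the correction RDS sampling requires.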

  2. Clustering gene expression data based on predicted differential effects of GV interaction.

    PubMed

    Pan, Hai-Yan; Zhu, Jun; Han, Dan-Fu

    2005-02-01

Microarray has become a popular biotechnology in biological and medical research. However, systematic and stochastic variability in microarray data is expected and unavoidable, so the raw measurements carry inherent "noise" within microarray experiments. Currently, logarithmic ratios are usually analyzed directly by various clustering methods, which may introduce biased interpretations when identifying groups of genes or samples. In this paper, a statistical method based on mixed model approaches was proposed for microarray data cluster analysis. The underlying rationale of this method is to partition the observed total gene expression level into the various variations caused by different factors using an ANOVA model, and to predict the differential effects of GV (gene by variety) interaction using the adjusted unbiased prediction (AUP) method. The predicted GV interaction effects can then be used as the inputs of cluster analysis. We illustrated the application of our method with a gene expression dataset and elucidated the utility of our approach using an external validation.

  3. Finding Statistically Significant Communities in Networks

    PubMed Central

    Lancichinetti, Andrea; Radicchi, Filippo; Ramasco, José J.; Fortunato, Santo

    2011-01-01

Community structure is one of the main structural features of networks, revealing both their internal organization and the similarity of their elementary units. Despite the large variety of methods proposed to detect communities in graphs, there is a great need for multi-purpose techniques able to handle different types of datasets and the subtleties of community structure. In this paper we present OSLOM (Order Statistics Local Optimization Method), the first method capable of detecting clusters in networks while accounting for edge directions, edge weights, overlapping communities, hierarchies and community dynamics. It is based on the local optimization of a fitness function expressing the statistical significance of clusters with respect to random fluctuations, which is estimated with tools of Extreme and Order Statistics. OSLOM can be used alone or as a refinement procedure for partitions/covers delivered by other techniques. We have also implemented sequential algorithms combining OSLOM with other fast techniques, so that the community structure of very large networks can be uncovered. Our method performs comparably to the best existing algorithms on artificial benchmark graphs. Several applications to real networks are shown as well. OSLOM is implemented in freely available software (http://www.oslom.org), and we believe it will be a valuable tool in the analysis of networks. PMID:21559480

  4. Supervised group Lasso with applications to microarray data analysis

    PubMed Central

    Ma, Shuangge; Song, Xiao; Huang, Jian

    2007-01-01

    Background A tremendous amount of efforts have been devoted to identifying genes for diagnosis and prognosis of diseases using microarray gene expression data. It has been demonstrated that gene expression data have cluster structure, where the clusters consist of co-regulated genes which tend to have coordinated functions. However, most available statistical methods for gene selection do not take into consideration the cluster structure. Results We propose a supervised group Lasso approach that takes into account the cluster structure in gene expression data for gene selection and predictive model building. For gene expression data without biological cluster information, we first divide genes into clusters using the K-means approach and determine the optimal number of clusters using the Gap method. The supervised group Lasso consists of two steps. In the first step, we identify important genes within each cluster using the Lasso method. In the second step, we select important clusters using the group Lasso. Tuning parameters are determined using V-fold cross validation at both steps to allow for further flexibility. Prediction performance is evaluated using leave-one-out cross validation. We apply the proposed method to disease classification and survival analysis with microarray data. Conclusion We analyze four microarray data sets using the proposed approach: two cancer data sets with binary cancer occurrence as outcomes and two lymphoma data sets with survival outcomes. The results show that the proposed approach is capable of identifying a small number of influential gene clusters and important genes within those clusters, and has better prediction performance than existing methods. PMID:17316436
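The two-step structure described above (cluster the genes first, then select within and across clusters) can be illustrated with a minimal pure-Python K-means for the first step, plus a stand-in selection rule in place of the Lasso/group Lasso. All data and parameters are invented for illustration:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means: returns a cluster label for each point and the centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment: each point joins its nearest center (squared Euclidean).
        for i, p in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
        # Update: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Toy "gene expression profiles" with two obvious co-regulated groups.
genes = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (4.8, 5.0)]
labels, centers = kmeans(genes, k=2)

# Stand-in for the selection steps: keep the gene nearest its cluster center.
# (The actual method applies the Lasso within clusters, then the group Lasso.)
representatives = [min((i for i in range(len(genes)) if labels[i] == c),
                       key=lambda i: sum((a - b) ** 2 for a, b in zip(genes[i], centers[c])))
                   for c in range(2)]
```

In the real method, the number of clusters would be chosen with the Gap statistic and the tuning parameters by V-fold cross-validation, as the abstract describes.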

  5. Characterizing the Temporal and Spatial Distribution of Earthquake Swarms in the Puerto Rico - Virgin Island Block

    NASA Astrophysics Data System (ADS)

    Hernandez, F. J.; Lopez, A. M.; Vanacore, E. A.

    2017-12-01

Earthquake swarms and clusters in the north and northeast of the island of Puerto Rico in the northeastern Caribbean have been recorded by the Puerto Rico Seismic Network (PRSN) since it started operations in 1974. Although clusters in the Puerto Rico-Virgin Island (PRVI) block have been observed for over forty years, the nature of their enigmatic occurrence is still poorly understood. In this study, the entire seismic catalog of the PRSN, of approximately 31,000 seismic events, has been limited to a sub-set of 18,000 events located north of Puerto Rico in an effort to characterize and understand the underlying mechanism of these clusters. This research uses two de-clustering methods to identify cluster events in the PRVI block. The first method, known as Model Independent Stochastic Declustering (MISD), filters the catalog sub-set into cluster and background seismic events, while the second method applies a spatio-temporal algorithm to the catalog in order to link separate seismic events into clusters. After applying these two methods, identified clusters were classified as either earthquake swarms or seismic sequences. Each cluster was then analyzed for correlations with other clusters in its geographic region. Results from this research seek to: (1) unravel the earthquake clustering behavior through the use of different statistical methods and (2) better understand the mechanism behind this clustering of earthquakes. Preliminary results have allowed us to identify and classify 128 clusters, categorized into 11 distinctive regions based on their centers, and their spatio-temporal distribution has been used to determine intra- and interplate dynamics.
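The second de-clustering step described, linking events that are close in both space and time, can be sketched as a union-find pass over a toy catalogue. The thresholds and events below are invented, not the study's parameters:

```python
# Toy catalogue: (time_days, x_km, y_km); two well-separated groups of events.
events = [(0, 0.0, 0.0), (1, 1.0, 0.5), (2, 0.5, 1.0),
          (50, 40.0, 40.0), (51, 40.5, 39.5)]

MAX_DT, MAX_DIST = 5.0, 3.0  # illustrative linking thresholds

parent = list(range(len(events)))

def find(i):
    # Union-find root lookup with path halving.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

# Link every pair of events that is close in both time and space.
for i in range(len(events)):
    for j in range(i + 1, len(events)):
        (t1, x1, y1), (t2, x2, y2) = events[i], events[j]
        if abs(t1 - t2) <= MAX_DT and ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5 <= MAX_DIST:
            parent[find(i)] = find(j)

# Group event indices by their union-find root: each group is one cluster.
clusters = {}
for i in range(len(events)):
    clusters.setdefault(find(i), []).append(i)
```

Linking pairwise and taking connected components gives single-linkage behaviour in space-time, which is the usual shape of such catalogue-linking algorithms.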

  6. Transforming Graph Data for Statistical Relational Learning

    DTIC Science & Technology

    2012-10-01

Jordan, 2003), PLSA (Hofmann, 1999), ? Classification via RMN (Taskar et al., 2003) or SVM (Hasan, Chaoji, Salem, & Zaki, 2006) ? Hierarchical...dimensionality reduction methods such as Principal Component Analysis (PCA), Principal Factor Analysis (PFA), and...clustering algorithm. Journal of the Royal Statistical Society. Series C, Applied Statistics, 28, 100-108. Hasan, M. A., Chaoji, V., Salem, S., & Zaki, M

  7. Identifying sighting clusters of endangered taxa with historical records.

    PubMed

    Duffy, Karl J

    2011-04-01

The probability and time of extinction of taxa are often inferred from statistical analyses of historical records. Many of these analyses require the exclusion of multiple records within a unit of time (i.e., a month or a year). Nevertheless, spatially explicit, temporally aggregated data may be useful for identifying clusters of sightings (i.e., sighting clusters) in space and time. Identification of sighting clusters highlights changes in the historical recording of endangered taxa. I used two methods to identify sighting clusters in historical records: the Ederer-Myers-Mantel (EMM) test and the space-time permutation scan (STPS). I applied these methods to the spatially explicit sighting records of three species of orchids that are listed as endangered in the Republic of Ireland under the Wildlife Act (1976): Cephalanthera longifolia, Hammarbya paludosa, and Pseudorchis albida. Results with the EMM test were strongly affected by the choice of the time interval, and thus the number of temporal samples, used to examine the records. For example, sightings of P. albida clustered when the records were partitioned into 20-year temporal samples, but not when they were partitioned into 22-year temporal samples. Because the statistical power of EMM was low, it will not be useful when data are sparse. Nevertheless, the STPS identified regions that contained sighting clusters because it uses a flexible scanning window (defined by cylinders of varying size that move over the study area and evaluate the likelihood of clustering) to detect them, and it identified regions with high and regions with low rates of orchid sightings. The STPS analyses can be used to detect sighting clusters of endangered species that may be related to regions of extirpation and may assist in the categorization of threat status. ©2010 Society for Conservation Biology.

  8. Spatial distribution and cluster analysis of risky sexual behaviours and STDs reported by Chinese adults in Guangzhou, China: a representative population-based study

    PubMed Central

    Chen, Wen; Zhou, Fangjing; Hall, Brian J; Wang, Yu; Latkin, Carl; Ling, Li; Tucker, Joseph D

    2016-01-01

Objectives To assess associations between residential location, risky sexual behaviours and sexually transmitted diseases (STDs) among adults living in Guangzhou, China. Methods Data were obtained from 751 Chinese adults aged 18–59 years in Guangzhou, China, using stratified random sampling and spatial epidemiological methods. Face-to-face household interviews were conducted to collect self-report data on risky sexual behaviours and diagnosed STDs. Kulldorff's spatial scan statistic was implemented to detect the spatial distribution and clusters of risky sexual behaviours and STDs. The presence and location of statistically significant clusters were mapped in the study areas using ArcGIS software. Results The prevalence of self-reported risky sexual behaviours was between 5.1% and 50.0%. The self-reported lifetime prevalence of diagnosed STDs was 7.06%. Anal intercourse clustered in an area located along the border within the rural–urban continuum (p=0.001). High-rate clusters for using alcohol or other drugs before sex (p=0.008) and for migrants who had lived in Guangzhou <1 year (p=0.007) overlapped this cluster. Excess cases of unprotected sex (p=0.031) overlapped the cluster of college students (p<0.001). Five of nine (55.6%) students who had had sexual experience during the last 12 months were located in the cluster of unprotected sex. Conclusions Short-term migrants and college students reported more risky sexual behaviours. Programmes to increase safer sex within these communities to reduce the risk of STDs are warranted in Guangzhou. Spatial analysis identified geographical clusters of risky sexual behaviours, which is critical for optimising surveillance and targeting control measures for these locations in the future. PMID:26843400

  9. Unequal cluster sizes in stepped-wedge cluster randomised trials: a systematic review

    PubMed Central

    Morris, Tom; Gray, Laura

    2017-01-01

Objectives To investigate the extent to which cluster sizes vary in stepped-wedge cluster randomised trials (SW-CRT) and whether any variability is accounted for during the sample size calculation and analysis of these trials. Setting Any, not limited to healthcare settings. Participants Anyone taking part in an SW-CRT published up to March 2016. Primary and secondary outcome measures The primary outcome is the variability in cluster sizes, measured by the coefficient of variation (CV) in cluster size. Secondary outcomes include the difference between the cluster sizes assumed during the sample size calculation and those observed during the trial, any reported variability in cluster sizes and whether the methods of sample size calculation and methods of analysis accounted for any variability in cluster sizes. Results Of the 101 included SW-CRTs, 48% mentioned that the included clusters were known to vary in size, yet only 13% of these accounted for this during the calculation of the sample size. However, 69% of the trials did use a method of analysis appropriate for when clusters vary in size. Full trial reports were available for 53 trials. The CV was calculated for 23 of these: the median CV was 0.41 (IQR: 0.22–0.52). Actual cluster sizes could be compared with those assumed during the sample size calculation for 14 (26%) of the trial reports; the actual cluster sizes were between 29% and 480% of those assumed. Conclusions Cluster sizes often vary in SW-CRTs. Reporting of SW-CRTs also remains suboptimal. The effect of unequal cluster sizes on the statistical power of SW-CRTs needs further exploration, and methods appropriate to studies with unequal cluster sizes need to be employed. PMID:29146637
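The primary outcome, the coefficient of variation in cluster size, is straightforward to compute. A minimal sketch with invented sizes (using the population standard deviation; some reviews use the sample version instead):

```python
from statistics import mean, pstdev

def coefficient_of_variation(cluster_sizes):
    """CV = standard deviation of the cluster sizes divided by their mean."""
    return pstdev(cluster_sizes) / mean(cluster_sizes)

sizes = [20, 35, 50, 80, 120]  # hypothetical numbers of participants per cluster
cv = coefficient_of_variation(sizes)
```

A CV near the review's median of 0.41 indicates substantial size imbalance; equal clusters give a CV of exactly 0.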

  10. A multimembership catalogue for 1876 open clusters using UCAC4 data

    NASA Astrophysics Data System (ADS)

    Sampedro, L.; Dias, W. S.; Alfaro, E. J.; Monteiro, H.; Molino, A.

    2017-10-01

    The main objective of this work is to determine the cluster members of 1876 open clusters, using positions and proper motions of the astrometric fourth United States Naval Observatory (USNO) CCD Astrograph Catalog (UCAC4). For this purpose, we apply three different methods, all based on a Bayesian approach, but with different formulations: a purely parametric method, another completely non-parametric algorithm and a third, recently developed by Sampedro & Alfaro, using both formulations at different steps of the whole process. The first and second statistical moments of the members' phase-space subspace, obtained after applying the three methods, are compared for every cluster. Although, on average, the three methods yield similar results, there are also specific differences between them, as well as for some particular clusters. The comparison with other published catalogues shows good agreement. We have also estimated, for the first time, the mean proper motion for a sample of 18 clusters. The results are organized in a single catalogue formed by two main files, one with the most relevant information for each cluster, partially including that in UCAC4, and the other showing the individual membership probabilities for each star in the cluster area. The final catalogue, with an interface design that enables an easy interaction with the user, is available in electronic format at the Stellar Systems Group (SSG-IAA) web site (http://ssg.iaa.es/en/content/sampedro-cluster-catalog).

  11. Connectivity-based fixel enhancement: Whole-brain statistical analysis of diffusion MRI measures in the presence of crossing fibres

    PubMed Central

    Raffelt, David A.; Smith, Robert E.; Ridgway, Gerard R.; Tournier, J-Donald; Vaughan, David N.; Rose, Stephen; Henderson, Robert; Connelly, Alan

    2015-01-01

    In brain regions containing crossing fibre bundles, voxel-average diffusion MRI measures such as fractional anisotropy (FA) are difficult to interpret, and lack within-voxel single fibre population specificity. Recent work has focused on the development of more interpretable quantitative measures that can be associated with a specific fibre population within a voxel containing crossing fibres (herein we use fixel to refer to a specific fibre population within a single voxel). Unfortunately, traditional 3D methods for smoothing and cluster-based statistical inference cannot be used for voxel-based analysis of these measures, since the local neighbourhood for smoothing and cluster formation can be ambiguous when adjacent voxels may have different numbers of fixels, or ill-defined when they belong to different tracts. Here we introduce a novel statistical method to perform whole-brain fixel-based analysis called connectivity-based fixel enhancement (CFE). CFE uses probabilistic tractography to identify structurally connected fixels that are likely to share underlying anatomy and pathology. Probabilistic connectivity information is then used for tract-specific smoothing (prior to the statistical analysis) and enhancement of the statistical map (using a threshold-free cluster enhancement-like approach). To investigate the characteristics of the CFE method, we assessed sensitivity and specificity using a large number of combinations of CFE enhancement parameters and smoothing extents, using simulated pathology generated with a range of test-statistic signal-to-noise ratios in five different white matter regions (chosen to cover a broad range of fibre bundle features). The results suggest that CFE input parameters are relatively insensitive to the characteristics of the simulated pathology. We therefore recommend a single set of CFE parameters that should give near optimal results in future studies where the group effect is unknown. 
We then demonstrate the proposed method by comparing apparent fibre density between motor neurone disease (MND) patients with control subjects. The MND results illustrate the benefit of fixel-specific statistical inference in white matter regions that contain crossing fibres. PMID:26004503

  12. Grouping of Bulgarian wines according to grape variety by using statistical methods

    NASA Astrophysics Data System (ADS)

    Milev, M.; Nikolova, Kr.; Ivanova, Ir.; Minkova, St.; Evtimov, T.; Krustev, St.

    2017-12-01

68 different types of Bulgarian wines were studied with respect to 9 optical parameters: color parameters in the XYZ and CIE Lab color systems, lightness, hue angle, chroma, fluorescence intensity and emission wavelength. The main objective of this research is to use hierarchical cluster analysis to evaluate the similarity and the distance between the examined types of Bulgarian wines and to group them based on physical parameters. We found that the wines are grouped in clusters according to the degree of identity between them. There are two main clusters, each with two subclusters: the first contains the white wines and Sira, the second the red wines and rose. The results from the cluster analysis are presented graphically as a dendrogram. The other statistical technique used is factor analysis performed by the method of principal components (PCA). The aim is to reduce the large number of variables to a few factors by grouping the correlated variables into one factor and subdividing the noncorrelated variables into different factors. Moreover, the factor analysis made it possible to determine the parameters with the greatest influence on the distribution of samples into different clusters. In our study, after rotation of the factors with the Varimax method, the parameters were combined into two factors, which explain about 80% of the total variation. The first explains 61.49% and correlates with the color characteristics; the second explains 18.34% of the variation and correlates with the parameters connected with fluorescence spectroscopy.
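The dendrogram-style grouping can be sketched with a naive single-linkage agglomerative pass. The wine feature values below are invented, and a real analysis would standardise all nine parameters first:

```python
def single_linkage(points, target_k):
    """Naive agglomerative clustering: repeatedly merge the two closest
    clusters (single linkage: distance between clusters is the minimum
    pairwise point distance) until target_k clusters remain."""
    clusters = [[i] for i in range(len(points))]

    def dist(a, b):
        return min(sum((points[i][d] - points[j][d]) ** 2
                       for d in range(len(points[0]))) ** 0.5
                   for i in a for j in b)

    while len(clusters) > target_k:
        a, b = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda ab: dist(clusters[ab[0]], clusters[ab[1]]))
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    return clusters

# Hypothetical wine features (lightness, chroma); values are illustrative only.
wines = [(85, 5), (88, 6), (84, 7),     # "white-like"
         (30, 40), (28, 42), (32, 38)]  # "red-like"
groups = single_linkage(wines, target_k=2)
```

Stopping at two clusters mirrors the two main branches of the dendrogram the abstract reports; recording the merge order instead would reconstruct the full tree.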

  13. Joint Clustering and Component Analysis of Correspondenceless Point Sets: Application to Cardiac Statistical Modeling.

    PubMed

    Gooya, Ali; Lekadir, Karim; Alba, Xenia; Swift, Andrew J; Wild, Jim M; Frangi, Alejandro F

    2015-01-01

Construction of Statistical Shape Models (SSMs) from arbitrary point sets is a challenging problem due to significant shape variation and lack of explicit point correspondence across the training data set. In medical imaging, point sets can generally represent different shape classes that span healthy and pathological exemplars. In such cases, the constructed SSM may not generalize well, largely because the probability density function (pdf) of the point sets deviates from the underlying assumption of Gaussian statistics. To this end, we propose a generative model for unsupervised learning of the pdf of point sets as a mixture of distinctive classes. A Variational Bayesian (VB) method is proposed for making joint inferences on the labels of point sets and the principal modes of variation in each cluster. The method provides a flexible framework to handle point sets with no explicit point-to-point correspondences. We also show that by maximizing the marginalized likelihood of the model, the optimal number of clusters of point sets can be determined. We illustrate this work in the context of understanding the anatomical phenotype of the left and right ventricles of the heart. To this end, we use a database containing hearts of healthy subjects, patients with Pulmonary Hypertension (PH), and patients with Hypertrophic Cardiomyopathy (HCM). We demonstrate that our method can outperform traditional PCA in both generalization and specificity measures.

  14. Intraclass Correlation Coefficients for Obesity Indicators and Energy Balance-Related Behaviors among New York City Public Elementary Schools

    ERIC Educational Resources Information Center

    Gray, Heewon Lee; Burgermaster, Marissa; Tipton, Elizabeth; Contento, Isobel R.; Koch, Pamela A.; Di Noia, Jennifer

    2016-01-01

    Objective: Sample size and statistical power calculation should consider clustering effects when schools are the unit of randomization in intervention studies. The objective of the current study was to investigate how student outcomes are clustered within schools in an obesity prevention trial. Method: Baseline data from the Food, Health &…

  15. Fast clustering using adaptive density peak detection.

    PubMed

    Wang, Xiao-Feng; Xu, Yifan

    2017-12-01

Common limitations of clustering methods include slow algorithm convergence, instability with respect to the intrinsic parameters that must be pre-specified, and a lack of robustness to outliers. A recent clustering approach proposed a fast search algorithm for cluster centers based on their local densities. However, the selection of the key intrinsic parameters in the algorithm was not systematically investigated. It is relatively difficult to estimate the "optimal" parameters since the original definition of the local density in the algorithm is based on a truncated counting measure. In this paper, we propose a clustering procedure with adaptive density peak detection, where the local density is estimated through nonparametric multivariate kernel estimation. The model parameters can then be calculated from equations with statistical theoretical justification. We also develop an automatic cluster centroid selection method through maximizing an average silhouette index. The advantage and flexibility of the proposed method are demonstrated through simulation studies and the analysis of a few benchmark gene expression data sets. The method runs in a single step without any iteration and thus is fast and has great potential for big data analysis. A user-friendly R package, ADPclust, has been developed for public use.
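The density-peak idea can be sketched in a few lines: estimate each point's local density with a Gaussian kernel, compute its distance to the nearest denser point, and take the points maximising the product as cluster centres. The cutoff dc and the data are illustrative assumptions, and this is not the ADPclust implementation:

```python
import math

def density_peaks(points, dc=1.5, k=2):
    """Density-peak sketch: Gaussian-kernel local density (rho), distance to
    the nearest higher-density point (delta), centres maximise rho * delta."""
    n = len(points)
    d = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    # Smooth local density rather than the original truncated count.
    rho = [sum(math.exp(-(d[i][j] / dc) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    delta = []
    for i in range(n):
        higher = [d[i][j] for j in range(n) if rho[j] > rho[i]]
        # The globally densest point gets its maximum distance as delta.
        delta.append(min(higher) if higher else max(d[i]))
    return sorted(range(n), key=lambda i: rho[i] * delta[i], reverse=True)[:k]

pts = [(0, 0), (0.5, 0.5), (0.4, 0.1), (10, 10), (10.5, 9.8), (9.9, 10.2)]
centers = density_peaks(pts)
```

Cluster centres stand out because they are both dense and far from any denser point; ordinary points have a small delta, so their product stays low.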

  16. Reconstruction of a digital core containing clay minerals based on a clustering algorithm.

    PubMed

    He, Yanlong; Pu, Chunsheng; Jing, Cheng; Gu, Xiaoyu; Chen, Qingdong; Liu, Hongzhi; Khan, Nasir; Dong, Qiaoling

    2017-10-01

It is difficult to obtain core samples and the information needed for digital core reconstruction of mature sandstone reservoirs around the world, especially for unconsolidated sandstone reservoirs. Meanwhile, the reconstruction and division of clay minerals play a vital role in digital core reconstruction, since two-dimensional data-based reconstruction methods are the microstructure simulation methods most applicable to sandstone reservoirs. However, reconstructing the various clay minerals in digital cores remains challenging. In the present work, the content of clay minerals was considered on the basis of two-dimensional information about the reservoir. After application of the hybrid method, and compared with the model reconstructed by the process-based method, the output was a digital core containing clay clusters without labels for the clusters' number, size, and texture. The statistics and geometry of the reconstructed model were similar to those of the reference model. In addition, the Hoshen-Kopelman algorithm was used to label the various connected, unclassified clay clusters in the initial model, and the number and size of the clay clusters were recorded. The K-means clustering algorithm was then applied to divide the labeled, large connected clusters into smaller clusters on the basis of differences in the clusters' characteristics. According to the clay minerals' characteristics, such as type, texture, and distribution, the digital core containing clay minerals was reconstructed by means of the clustering algorithm and a judgment of the clay clusters' structure. The distributions and textures of the clay minerals in the digital core were reasonable. The clustering algorithm improved the digital core reconstruction and provides an alternative method for simulating different clay minerals in digital cores.
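The Hoshen-Kopelman step amounts to connected-component labelling with union-find. A 2D sketch on an invented grid (the paper works on 3D cores, where the same scan simply gains a third neighbour direction):

```python
def label_clusters(grid):
    """Hoshen-Kopelman-style labelling of 4-connected occupied cells."""
    rows, cols = len(grid), len(grid[0])
    labels, parent = {}, {}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if not grid[r][c]:
                continue
            # Only already-scanned neighbours (above and to the left) matter.
            neighbours = [labels[(rr, cc)] for rr, cc in ((r - 1, c), (r, c - 1))
                          if (rr, cc) in labels]
            if neighbours:
                root = find(neighbours[0])
                for nb in neighbours[1:]:
                    parent[find(nb)] = root  # merge touching clusters
                labels[(r, c)] = root
            else:
                parent[next_label] = next_label
                labels[(r, c)] = next_label
                next_label += 1
    # Resolve every provisional label to its final root.
    return {cell: find(lab) for cell, lab in labels.items()}

# 1 = clay voxel, 0 = pore/grain; this grid holds two separate clay clusters.
grid = [[1, 1, 0, 0],
        [0, 1, 0, 1],
        [0, 0, 0, 1]]
labelled = label_clusters(grid)
```

Counting the distinct roots gives the number of clay clusters, and the size of each label group gives the cluster sizes the paper records before the K-means subdivision step.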

  17. Advances in Significance Testing for Cluster Detection

    NASA Astrophysics Data System (ADS)

    Coleman, Deidra Andrea

Over the past two decades, much attention has been given to data-driven project goals such as the Human Genome Project and the development of syndromic surveillance systems. A major component of these types of projects is analyzing the abundance of data. Detecting clusters within the data can be beneficial as it can lead to the identification of specified sequences of DNA nucleotides that are related to important biological functions or the locations of epidemics such as disease outbreaks or bioterrorism attacks. Cluster detection techniques require efficient and accurate hypothesis testing procedures. In this dissertation, we improve upon the hypothesis testing procedures for cluster detection by enhancing distributional theory and providing an alternative method for spatial cluster detection using syndromic surveillance data. In Chapter 2, we provide an efficient method to compute the exact distribution of the number and coverage of h-clumps of a collection of words. This method involves defining a Markov chain using a minimal deterministic automaton to reduce the number of states needed for computation. We allow words of the collection to contain other words of the collection, making the method more general. We use our method to compute the distributions of the number and coverage of h-clumps in the Chi motif of H. influenzae. In Chapter 3, we provide an efficient algorithm to compute the exact distribution of multiple window discrete scan statistics for higher-order, multi-state Markovian sequences. This algorithm involves defining a Markov chain to efficiently keep track of probabilities needed to compute p-values of the statistic. We use our algorithm to identify cases where the available approximation does not perform well. We also use our algorithm to detect unusual clusters of made free throw shots by National Basketball Association players during the 2009-2010 regular season.
In Chapter 4, we give a procedure to detect outbreaks using syndromic surveillance data while controlling the Bayesian False Discovery Rate (BFDR). The procedure entails choosing an appropriate Bayesian model that captures the spatial dependency inherent in epidemiological data and considers all days of interest, selecting a test statistic based on a chosen measure that provides the magnitude of the maximal spatial cluster for each day, and identifying a cutoff value that controls the BFDR for rejecting the collective null hypothesis of no outbreak over a collection of days for a specified region. We use our procedure to analyze botulism-like syndrome data collected by the North Carolina Disease Event Tracking and Epidemiologic Collection Tool (NC DETECT).

  18. A Measurement of Gravitational Lensing of the Cosmic Microwave Background by Galaxy Clusters Using Data from the South Pole Telescope

    DOE PAGES

    Baxter, E. J.; Keisler, R.; Dodelson, S.; ...

    2015-06-22

Clusters of galaxies are expected to gravitationally lens the cosmic microwave background (CMB) and thereby generate a distinct signal in the CMB on arcminute scales. Measurements of this effect can be used to constrain the masses of galaxy clusters with CMB data alone. Here we present a measurement of lensing of the CMB by galaxy clusters using data from the South Pole Telescope (SPT). We also develop a maximum likelihood approach to extract the CMB cluster lensing signal and validate the method on mock data. We quantify the effects on our analysis of several potential sources of systematic error and find that they generally act to reduce the best-fit cluster mass. It is estimated that this bias to lower cluster mass is roughly 0.85σ in units of the statistical error bar, although this estimate should be viewed as an upper limit. Furthermore, we apply our maximum likelihood technique to 513 clusters selected via their Sunyaev–Zeldovich (SZ) signatures in SPT data, and rule out the null hypothesis of no lensing at 3.1σ. The lensing-derived mass estimate for the full cluster sample is consistent with that inferred from the SZ flux: M200,lens = 0.83 (+0.38/−0.37) M200,SZ (68% C.L., statistical error only).

  19. Use of multiple cluster analysis methods to explore the validity of a community outcomes concept map.

    PubMed

    Orsi, Rebecca

    2017-02-01

Concept mapping is now a commonly used technique for articulating and evaluating programmatic outcomes. However, research regarding the validity of knowledge and outcomes produced with concept mapping is sparse. The current study describes quantitative validity analyses using a concept mapping dataset. We sought to increase the validity of concept mapping evaluation results by running multiple cluster analysis methods and then using several metrics to choose from among the solutions. We present four different clustering methods based on analyses using the R statistical software package: partitioning around medoids (PAM), fuzzy analysis (FANNY), agglomerative nesting (AGNES) and divisive analysis (DIANA). We then used the Dunn and Davies-Bouldin indices to assist in choosing a valid cluster solution for a concept mapping outcomes evaluation. We conclude that the validity of the outcomes map is high, based on the analyses described. Finally, we discuss areas for further concept mapping methods research. Copyright © 2016 Elsevier Ltd. All rights reserved.
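Both validity indices mentioned can be computed directly with numpy; a brute-force sketch (our function names, suitable only for modestly sized concept-mapping data):

```python
import numpy as np

def dunn_index(X, labels):
    """Dunn index: smallest between-cluster distance divided by largest
    within-cluster diameter (higher = better-separated clusters)."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    labs = np.unique(labels)
    diam = max(d[np.ix_(labels == c, labels == c)].max() for c in labs)
    sep = min(d[np.ix_(labels == a, labels == b)].min()
              for i, a in enumerate(labs) for b in labs[i + 1:])
    return sep / diam

def davies_bouldin(X, labels):
    """Davies-Bouldin index: mean over clusters of the worst ratio of
    summed within-cluster scatter to centroid separation (lower = better)."""
    X, labels = np.asarray(X, float), np.asarray(labels)
    labs = np.unique(labels)
    cents = np.array([X[labels == c].mean(axis=0) for c in labs])
    scat = np.array([np.linalg.norm(X[labels == c] - cents[i], axis=1).mean()
                     for i, c in enumerate(labs)])
    worst = [max((scat[i] + scat[j]) / np.linalg.norm(cents[i] - cents[j])
                 for j in range(len(labs)) if j != i)
             for i in range(len(labs))]
    return float(np.mean(worst))
```

Higher Dunn and lower Davies-Bouldin values indicate a better-separated cluster solution, which is how such metrics help pick among the PAM/FANNY/AGNES/DIANA outputs.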

  20. Multiple imputation methods for bivariate outcomes in cluster randomised trials.

    PubMed

    DiazOrdaz, K; Kenward, M G; Gomes, M; Grieve, R

    2016-09-10

Missing observations are common in cluster randomised trials. The problem is exacerbated when modelling bivariate outcomes jointly, as the proportion of complete cases is often considerably smaller than the proportion having either of the outcomes fully observed. Approaches taken to handling such missing data include the following: complete case analysis, single-level multiple imputation that ignores the clustering, multiple imputation with a fixed effect for each cluster and multilevel multiple imputation. We contrasted the alternative approaches to handling missing data in a cost-effectiveness analysis that uses data from a cluster randomised trial to evaluate an exercise intervention for care home residents. We then conducted a simulation study to assess the performance of these approaches on bivariate continuous outcomes, in terms of confidence interval coverage and empirical bias in the estimated treatment effects. Missing-at-random clustered data scenarios were simulated following a full-factorial design. Across all the missing data mechanisms considered, the multiple imputation methods provided estimators with negligible bias, while complete case analysis resulted in biased treatment effect estimates in scenarios where the randomised treatment arm was associated with missingness. Confidence interval coverage was generally in excess of nominal levels (up to 99.8%) following fixed-effects multiple imputation and too low following single-level multiple imputation. Multilevel multiple imputation led to coverage levels of approximately 95% throughout. © 2016 The Authors. Statistics in Medicine Published by John Wiley & Sons Ltd.

  1. Statistical analysis and handling of missing data in cluster randomized trials: a systematic review.

    PubMed

    Fiero, Mallorie H; Huang, Shuang; Oren, Eyal; Bell, Melanie L

    2016-02-09

Cluster randomized trials (CRTs) randomize participants in groups rather than as individuals, and are key tools used to assess interventions in health research when treatment contamination is likely or when individual randomization is not feasible. Two potential major pitfalls exist regarding CRTs, namely handling missing data and not accounting for clustering in the primary analysis. The aim of this review was to evaluate approaches for handling missing data and statistical analysis with respect to the primary outcome in CRTs. We systematically searched for CRTs published between August 2013 and July 2014 using PubMed, Web of Science, and PsycINFO. For each trial, two independent reviewers assessed the extent of the missing data and the method(s) used for handling missing data in the primary and sensitivity analyses. We evaluated the primary analysis and determined whether it was at the cluster or individual level. Of the 86 included CRTs, 80 (93%) trials reported some missing outcome data. Of those reporting missing data, the median percentage of individuals with a missing outcome was 19% (range 0.5 to 90%). The most common way to handle missing data in the primary analysis was complete case analysis (44, 55%), whereas 18 (22%) used mixed models, six (8%) used single imputation, four (5%) used unweighted generalized estimating equations, and two (2%) used multiple imputation. Fourteen (16%) trials reported a sensitivity analysis for missing data, but most assumed the same missing data mechanism as in the primary analysis. Overall, 67 (78%) trials accounted for clustering in the primary analysis. High rates of missing outcome data are present in the majority of CRTs, yet handling missing data in practice remains suboptimal. Researchers and applied statisticians should use appropriate missing data methods that are valid under plausible assumptions, in order to increase statistical power in trials and reduce the possibility of bias.
Sensitivity analysis should be performed with weakened assumptions regarding the missing data mechanism, to explore the robustness of the results reported in the primary analysis.

  2. Weakly supervised image semantic segmentation based on clustering superpixels

    NASA Astrophysics Data System (ADS)

    Yan, Xiong; Liu, Xiaohua

    2018-04-01

In this paper, we propose an image semantic segmentation model that is trained from image-level labeled images. The proposed model starts with superpixel segmentation, and features of the superpixels are extracted by a trained CNN. We introduce a superpixel-based graph and apply a graph partition method to group correlated superpixels into clusters. To acquire inter-label correlations between the image-level labels in the dataset, we utilize both label co-occurrence statistics and visual contextual cues. Finally, we formulate the task of mapping appropriate image-level labels to the detected clusters as a convex minimization problem. Experimental results on the MSRC-21 and LabelMe datasets show that the proposed method performs better than most weakly supervised methods and is even comparable to fully supervised methods.

  3. Defining syndromes using cattle meat inspection data for syndromic surveillance purposes: a statistical approach with the 2005-2010 data from ten French slaughterhouses.

    PubMed

    Dupuy, Céline; Morignat, Eric; Maugey, Xavier; Vinard, Jean-Luc; Hendrikx, Pascal; Ducrot, Christian; Calavas, Didier; Gay, Emilie

    2013-04-30

The slaughterhouse is a central processing point for food animals and thus a source of both demographic data (age, breed, sex) and health-related data (reason for condemnation and condemned portions) that are not available through other sources. Using these data for syndromic surveillance is therefore appealing. However, many possible reasons for condemnation and condemned portions exist, making the definition of relevant syndromes challenging. The objective of this study was to determine a typology of cattle with at least one portion of the carcass condemned in order to define syndromes. Multiple factor analysis (MFA) in combination with clustering methods was performed using both health-related data and demographic data. Analyses were performed on 381,186 cattle with at least one portion of the carcass condemned among the 1,937,917 cattle slaughtered in ten French abattoirs. Results of the MFA and clustering methods led to 12 clusters considered as stable according to year of slaughter and slaughterhouse. One cluster was specific to a disease of public health importance (cysticercosis). Two clusters were linked to the slaughtering process (fecal contamination of heart or lungs and deterioration lesions). Two clusters, respectively characterized by chronic liver lesions and chronic peritonitis, could be linked to diseases of economic importance to farmers. Three clusters could be linked respectively to reticulo-pericarditis, fatty liver syndrome and farmer's lung syndrome, which are related to both diseases of economic importance to farmers and herd management issues. Three clusters, respectively characterized by arthritis, myopathy and Dark Firm Dry (DFD) meat, could notably be linked to animal welfare issues. Finally, one cluster, characterized by bronchopneumonia, could be linked to both animal health and herd management issues.
The statistical approach of combining multiple factor analysis with cluster analysis showed its relevance for the detection of syndromes using large and complex slaughterhouse data. The advantages of this statistical approach are to i) define groups of reasons for condemnation based on meat inspection data, ii) help group reasons for condemnation among a list of many possible reasons for which a consensus among experts could be difficult to reach, and iii) assign each animal to a single syndrome, which allows changes in trends of syndromes to be tracked in order to detect unusual patterns in known diseases and the emergence of new diseases.

  4. Photometric redshifts as a tool for studying the Coma cluster galaxy populations

    NASA Astrophysics Data System (ADS)

    Adami, C.; Ilbert, O.; Pelló, R.; Cuillandre, J. C.; Durret, F.; Mazure, A.; Picat, J. P.; Ulmer, M. P.

    2008-12-01

Aims: We apply photometric redshift techniques to an investigation of the Coma cluster galaxy luminosity function (GLF) at faint magnitudes, in particular in the u* band where basically no studies are presently available at these magnitudes. Methods: Cluster members were selected based on the probability distribution functions from photometric redshift calculations applied to deep u*, B, V, R, I images covering a region of almost 1 deg² (completeness limit R ~ 24). In the area covered only by the u* image, the GLF was also derived after a statistical background subtraction. Results: Global and local GLFs in the B, V, R, and I bands obtained with photometric redshift selection are consistent with our previous results based on a statistical background subtraction. The GLF in the u* band shows an increase in the faint end slope towards the outer regions of the cluster. The analysis of the multicolor type spatial distribution reveals that late type galaxies are distributed in clumps in the cluster outskirts, where X-ray substructures are also detected and where the GLF in the u* band is steeper. Conclusions: We can reproduce the GLFs computed with classical statistical subtraction methods by applying a photometric redshift technique. The u* GLF slope is steeper in the cluster outskirts, varying from α ~ -1 in the cluster center to α ~ -2 in the cluster periphery. The concentrations of faint late type galaxies in the cluster outskirts could explain these very steep slopes, assuming a short burst of star formation in these galaxies when entering the cluster. Based on observations obtained with MegaPrime/MegaCam, a joint project of CFHT and CEA/DAPNIA, at the Canada-France-Hawaii Telescope (CFHT) which is operated by the National Research Council (NRC) of Canada, the Institut National des Sciences de l'Univers of the Centre National de la Recherche Scientifique (CNRS) of France, and the University of Hawaii.
This work is also partly based on data products produced at TERAPIX and the Canadian Astronomy Data Centre as part of the Canada-France-Hawaii Telescope Legacy Survey, a collaborative project of NRC and CNRS. Also based on data from W. M. Keck Observatory which is operated as a scientific partnership between the California Institute of Technology, the University of California, and NASA. It was made possible by the generous financial support of the W. M. Keck Foundation.

  5. A flexibly shaped space-time scan statistic for disease outbreak detection and monitoring.

    PubMed

    Takahashi, Kunihiko; Kulldorff, Martin; Tango, Toshiro; Yih, Katherine

    2008-04-11

    Early detection of disease outbreaks enables public health officials to implement disease control and prevention measures at the earliest possible time. A time periodic geographical disease surveillance system based on a cylindrical space-time scan statistic has been used extensively for disease surveillance along with the SaTScan software. In the purely spatial setting, many different methods have been proposed to detect spatial disease clusters. In particular, some spatial scan statistics are aimed at detecting irregularly shaped clusters which may not be detected by the circular spatial scan statistic. Based on the flexible purely spatial scan statistic, we propose a flexibly shaped space-time scan statistic for early detection of disease outbreaks. The performance of the proposed space-time scan statistic is compared with that of the cylindrical scan statistic using benchmark data. In order to compare their performances, we have developed a space-time power distribution by extending the purely spatial bivariate power distribution. Daily syndromic surveillance data in Massachusetts, USA, are used to illustrate the proposed test statistic. The flexible space-time scan statistic is well suited for detecting and monitoring disease outbreaks in irregularly shaped areas.
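For orientation, the cylindrical statistic that this work generalizes can be sketched compactly: scan discs of regions crossed with trailing day windows, score each cylinder with Kulldorff's Poisson likelihood ratio, and keep the maximum (significance would then come from Monte Carlo replication, omitted here). Function names and the toy data layout are illustrative, not the authors' implementation:

```python
import numpy as np

def poisson_llr(c, e, C, E):
    """Kulldorff's Poisson log-likelihood ratio for a window with c observed
    and e expected cases, out of C and E in total; 0 unless in excess."""
    if e <= 0 or c <= e * C / E:
        return 0.0
    llr = c * np.log(c / e)
    if C - c > 0:
        llr += (C - c) * np.log((C - c) / (E - e))
    return llr

def space_time_scan(counts, expected, coords, max_radius, max_days):
    """Prospective cylindrical scan: counts/expected are (regions, days)
    arrays; cylinders are discs of regions times trailing day windows.
    Returns (max LLR, (centre region, window length in days))."""
    n, T = counts.shape
    C, E = counts.sum(), expected.sum()
    dist = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    best = (0.0, None)
    for i in range(n):
        inside = dist[i] <= max_radius
        for w in range(1, max_days + 1):
            llr = poisson_llr(counts[inside, T - w:].sum(),
                              expected[inside, T - w:].sum(), C, E)
            if llr > best[0]:
                best = (llr, (i, w))
    return best
```

The flexible statistic of the abstract replaces the disc by irregularly shaped connected sets of regions; the likelihood-ratio scoring is unchanged.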

  6. An Empirical Taxonomy of Hospital Governing Board Roles

    PubMed Central

    Lee, Shoou-Yih D; Alexander, Jeffrey A; Wang, Virginia; Margolin, Frances S; Combes, John R

    2008-01-01

Objective To develop a taxonomy of governing board roles in U.S. hospitals. Data Sources 2005 AHA Hospital Governance Survey, 2004 AHA Annual Survey of Hospitals, and Area Resource File. Study Design A governing board taxonomy was developed using cluster analysis. Results were validated and reviewed by industry experts. Differences in hospital and environmental characteristics across clusters were examined. Data Extraction Methods A total of 1,334 hospitals with complete information on the study variables were included in the analysis. Principal Findings Five distinct clusters of hospital governing boards were identified. Statistical tests showed that the five clusters had high internal reliability and high internal validity. Statistically significant differences in hospital and environmental conditions were found among clusters. Conclusions The developed taxonomy provides policy makers, health care executives, and researchers a useful way to describe and understand hospital governing board roles. The taxonomy may also facilitate valid and systematic assessment of governance performance. Further, the taxonomy could be used as a framework for governing boards themselves to identify areas for improvement and direction for change. PMID:18355260

  7. Clustering change patterns using Fourier transformation with time-course gene expression data.

    PubMed

    Kim, Jaehee

    2011-01-01

To understand the behavior of genes, it is important to explore how the patterns of gene expression change over time, because biologically related gene groups can share the same change patterns. In this study, the problem of finding similar change patterns is reduced to clustering on the derivative Fourier coefficients. This work is aimed at discovering gene groups with similar change patterns which share similar biological properties. We developed a statistical model using derivative Fourier coefficients to identify similar change patterns of gene expression, and used a model-based method to cluster the Fourier series estimates of the derivatives. We applied our model to cluster change patterns of yeast cell cycle microarray expression data with alpha-factor synchronization. Because the method clusters data through their probabilistic neighborhoods, the model-based clustering with our proposed model yielded biologically interpretable results. We expect that our proposed Fourier analysis with suitably chosen smoothing parameters could serve as a useful tool in classifying genes and interpreting possible biological change patterns.
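The central trick (differentiation acts as a simple reweighting of Fourier coefficients) can be sketched as below; the resulting feature matrix could then be fed to any model-based clustering such as a Gaussian mixture. The function name and harmonic count are our illustrative choices, not the paper's:

```python
import numpy as np

def derivative_fourier_features(curves, n_harmonics=3, period=1.0):
    """Fourier coefficients of each curve's derivative: if a curve is
    a0/2 + sum_k [a_k cos(2*pi*k*t/P) + b_k sin(2*pi*k*t/P)], its derivative
    has cosine coefficients (2*pi*k/P)*b_k and sine coefficients
    -(2*pi*k/P)*a_k, so differentiating is a reweighting of the rFFT output."""
    curves = np.asarray(curves, float)
    n = curves.shape[1]
    F = np.fft.rfft(curves, axis=1)
    k = np.arange(F.shape[1])
    w = 2 * np.pi * k / period
    a, b = 2 * F.real / n, -2 * F.imag / n   # cosine/sine coefficients
    da, db = w * b, -w * a                   # derivative coefficients
    return np.hstack([da[:, 1:n_harmonics + 1], db[:, 1:n_harmonics + 1]])
```

Genes whose expression curves change in the same way get nearby rows in this feature space, regardless of their absolute expression levels.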

  8. Familial clustering of overweight and obesity among schoolchildren in northern China.

    PubMed

    Li, Zengning; Luo, Bin; Du, Limei; Hu, Huanyu; Xie, Ying

    2014-01-01

We aimed to study the prevalence of overweight and obesity and to assess its familial clustering among schoolchildren in northern China. A cross-sectional study was conducted on 95,292 schoolchildren in northern China to investigate the prevalence of overweight and obesity. A group of overweight and obese children (n = 450) was selected using a cluster sampling method. Answers from a questionnaire on their and their families' nutrition and behaviors were recorded and analyzed statistically. The prevalence of overweight and obesity in schoolchildren was 27.4% and 13.2%, respectively. The prevalence of both overweight and obesity was significantly higher in boys than in girls. The prevalence of familial clustering of overweight and obesity was 75.3% and 20.3%, respectively. The prevalence of overweight in first-generation (parents) and second-generation (grandparents) relatives was 54.6% and 53.1%, respectively. There was a linear trend toward correlation between age and the rates of overweight and obesity. The association between familial clustering of obesity and family income was statistically significant. The prevalence of overweight and obesity was extremely high, especially among boys and their fathers. Evidence of familial clustering of overweight and obesity among schoolchildren and their parental family members in northern China is emerging.

  9. Statistical mechanics of high-density bond percolation

    NASA Astrophysics Data System (ADS)

    Timonin, P. N.

    2018-05-01

High-density (HD) percolation describes the percolation of specific κ-clusters, which are compact sets of sites each connected to at least κ nearest filled sites. It takes place in the classical patterns of independently distributed sites or bonds in which the ordinary percolation transition also exists. Hence, the study of the series of κ-type HD percolations amounts to a description of the structure of classical clusters, for which κ-clusters constitute κ-cores nested one inside another. Such data are needed for the description of a number of physical, biological, and information properties of complex systems on random lattices, graphs, and networks. They range from magnetic properties of semiconductor alloys to anomalies in supercooled water and clustering in biological and social networks. Here we present a statistical mechanics approach to study HD bond percolation on an arbitrary graph. It is shown that the generating function for the κ-cluster size distribution can be obtained from the partition function of a specific q-state Potts-Ising model in the q → 1 limit. Using this approach we find exact κ-cluster size distributions for the Bethe lattice and the Erdős-Rényi graph. The application of the method to Euclidean lattices is also discussed.

  10. Comparison of four statistical and machine learning methods for crash severity prediction.

    PubMed

    Iranitalab, Amirfarrokh; Khattak, Aemal

    2017-11-01

Crash severity prediction models enable different agencies to predict the severity of a reported crash with unknown severity or the severity of crashes that may be expected to occur sometime in the future. This paper had three main objectives: comparison of the performance of four statistical and machine learning methods, namely Multinomial Logit (MNL), Nearest Neighbor Classification (NNC), Support Vector Machines (SVM) and Random Forests (RF), in predicting traffic crash severity; developing a crash costs-based approach for comparison of crash severity prediction methods; and investigating the effects of data clustering methods, comprising K-means Clustering (KC) and Latent Class Clustering (LCC), on the performance of crash severity prediction models. The 2012-2015 reported crash data from Nebraska, United States were obtained and two-vehicle crashes were extracted as the analysis data. The dataset was split into training/estimation (2012-2014) and validation (2015) subsets. The four prediction methods were trained/estimated on the training/estimation dataset, and the correct prediction rates for each crash severity level, the overall correct prediction rate and a proposed crash costs-based accuracy measure were obtained for the validation dataset. The correct prediction rates and the proposed approach showed that NNC had the best prediction performance overall and for more severe crashes. RF and SVM ranked next, and MNL was the weakest method. Data clustering did not affect the prediction results of SVM, but KC improved the prediction performance of MNL, NNC and RF, while LCC caused improvement in MNL and RF but weakened the performance of NNC. The overall correct prediction rate gave almost exactly the opposite ranking to the proposed approach, showing that neglecting crash costs can lead to misjudgment in choosing the right prediction method. Copyright © 2017 Elsevier Ltd. All rights reserved.
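The crash-costs-based idea can be illustrated with a toy scoring function; the name, cost values and normalization below are ours, not the paper's, and serve only to show why cost weighting can reverse a raw-accuracy ranking:

```python
import numpy as np

def cost_based_score(y_true, y_pred, unit_costs):
    """Cost-weighted accuracy: penalize each prediction by the absolute
    difference between the monetary costs of the true and predicted
    severity levels, normalized so 1 = perfect and 0 = worst possible.
    Illustrative only; the paper's exact formula is not reproduced here."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    c = np.asarray(unit_costs, float)               # cost per severity level
    err = np.abs(c[y_true] - c[y_pred]).sum()
    worst = np.abs(c[y_true][:, None] - c[None, :]).max(axis=1).sum()
    return 1.0 - err / worst
```

Under such a metric, misclassifying a fatal crash as no-injury costs far more than the reverse of two adjacent minor categories, which is exactly the effect a raw error rate ignores.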

  11. A nonparametric spatial scan statistic for continuous data.

    PubMed

    Jung, Inkyung; Cho, Ho Jin

    2015-10-20

Spatial scan statistics are widely used for spatial cluster detection, and several parametric models exist. For continuous data, a normal-based scan statistic can be used. However, the performance of the model has not been fully evaluated for non-normal data. We propose a nonparametric spatial scan statistic based on the Wilcoxon rank-sum test statistic and compare the performance of the method with parametric models via a simulation study under various scenarios. The nonparametric method outperforms the normal-based scan statistic in terms of power and accuracy in almost all cases under consideration in the simulation study. The proposed nonparametric spatial scan statistic is therefore an excellent alternative to the normal model for continuous data and is especially useful for data following skewed or heavy-tailed distributions.
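A minimal version of such a statistic is easy to state: for each circular window, standardize the Wilcoxon rank sum of the values inside, take the maximum over windows, and calibrate by permuting values over locations. The sketch below (our naming; ties ignored for brevity) is illustrative, not the authors' implementation:

```python
import numpy as np

def rank_scan(coords, values, radii, n_sim=999, seed=0):
    """Nonparametric scan: for each centre/radius window, the Wilcoxon
    rank sum of the values inside, standardized under H0; significance
    of the maximum via Monte Carlo permutation of values over locations."""
    rng = np.random.default_rng(seed)
    coords = np.asarray(coords, float)
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    n = len(values)

    def max_stat(vals):
        r = np.argsort(np.argsort(vals)) + 1.0    # ranks (no ties assumed)
        best = 0.0
        for i in range(n):
            for rad in radii:
                inside = d[i] <= rad
                m = inside.sum()
                if m == 0 or m == n:
                    continue
                mu = m * (n + 1) / 2.0
                sd = np.sqrt(m * (n - m) * (n + 1) / 12.0)
                best = max(best, (r[inside].sum() - mu) / sd)  # high-value clusters
        return best

    obs = max_stat(np.asarray(values, float))
    sims = [max_stat(rng.permutation(values)) for _ in range(n_sim)]
    p = (1 + sum(s >= obs for s in sims)) / (n_sim + 1)
    return obs, p
```

Because only ranks enter the statistic, the test is invariant to monotone transformations of the data, which is what makes it attractive for skewed or heavy-tailed outcomes.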

  12. Voronoi distance based prospective space-time scans for point data sets: a dengue fever cluster analysis in a southeast Brazilian town

    PubMed Central

    2011-01-01

    Background The Prospective Space-Time scan statistic (PST) is widely used for the evaluation of space-time clusters of point event data. Usually a window of cylindrical shape is employed, with a circular or elliptical base in the space domain. Recently, the concept of Minimum Spanning Tree (MST) was applied to specify the set of potential clusters, through the Density-Equalizing Euclidean MST (DEEMST) method, for the detection of arbitrarily shaped clusters. The original map is cartogram transformed, such that the control points are spread uniformly. That method is quite effective, but the cartogram construction is computationally expensive and complicated. Results A fast method for the detection and inference of point data set space-time disease clusters is presented, the Voronoi Based Scan (VBScan). A Voronoi diagram is built for points representing population individuals (cases and controls). The number of Voronoi cells boundaries intercepted by the line segment joining two cases points defines the Voronoi distance between those points. That distance is used to approximate the density of the heterogeneous population and build the Voronoi distance MST linking the cases. The successive removal of edges from the Voronoi distance MST generates sub-trees which are the potential space-time clusters. Finally, those clusters are evaluated through the scan statistic. Monte Carlo replications of the original data are used to evaluate the significance of the clusters. An application for dengue fever in a small Brazilian city is presented. Conclusions The ability to promptly detect space-time clusters of disease outbreaks, when the number of individuals is large, was shown to be feasible, due to the reduced computational load of VBScan. Instead of changing the map, VBScan modifies the metric used to define the distance between cases, without requiring the cartogram construction. 
Numerical simulations showed that VBScan has higher power of detection, sensitivity and positive predictive value than the Elliptic PST. Furthermore, as VBScan also incorporates topological information from the point neighborhood structure, in addition to the usual geometric information, it is more robust than purely geometric methods such as the elliptic scan. Those advantages were illustrated in a real setting for dengue fever space-time clusters. PMID:21513556

  13. A two-stage method for microcalcification cluster segmentation in mammography by deformable models

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Arikidis, N.; Kazantzi, A.; Skiadopoulos, S.

Purpose: Segmentation of microcalcification (MC) clusters in x-ray mammography is a difficult task for radiologists. Accurate segmentation is a prerequisite for quantitative image analysis of MC clusters and subsequent feature extraction and classification in computer-aided diagnosis schemes. Methods: In this study, a two-stage semiautomated segmentation method of MC clusters is investigated. The first stage is targeted to accurate and time efficient segmentation of the majority of the particles of a MC cluster, by means of a level set method. The second stage is targeted to shape refinement of selected individual MCs, by means of an active contour model. Both methods are applied in the framework of a rich scale-space representation, provided by the wavelet transform at integer scales. Segmentation reliability of the proposed method in terms of inter- and intraobserver agreement was evaluated in a case sample of 80 MC clusters originating from the digital database for screening mammography, corresponding to 4 morphology types (punctate: 22, fine linear branching: 16, pleomorphic: 18, and amorphous: 24) of MC clusters, assessing radiologists' segmentations quantitatively by two distance metrics (Hausdorff distance, HDIST_cluster; average of minimum distance, AMINDIST_cluster) and the area overlap measure (AOM_cluster). The effect of the proposed segmentation method on MC cluster characterization accuracy was evaluated in a case sample of 162 pleomorphic MC clusters (72 malignant and 90 benign). Ten MC cluster features, targeted to capture morphologic properties of individual MCs in a cluster (area, major length, perimeter, compactness, and spread), were extracted, and a correlation-based feature selection method yielded a feature subset to feed into a support vector machine classifier.
Classification performance of the MC cluster features was estimated by means of the area under the receiver operating characteristic curve (A_z ± standard error) utilizing tenfold cross-validation methodology. A previously developed B-spline active rays segmentation method was also considered for comparison purposes. Results: Interobserver and intraobserver segmentation agreements (median and [25%, 75%] quartile range) were substantial with respect to the distance metrics HDIST_cluster (2.3 [1.8, 2.9] and 2.5 [2.1, 3.2] pixels) and AMINDIST_cluster (0.8 [0.6, 1.0] and 1.0 [0.8, 1.2] pixels), while moderate with respect to AOM_cluster (0.64 [0.55, 0.71] and 0.59 [0.52, 0.66]). The proposed segmentation method outperformed (0.80 ± 0.04) the B-spline active rays segmentation method (0.69 ± 0.04) statistically significantly (Mann-Whitney U-test, p < 0.05), supporting the merit of the proposed semiautomated method. Conclusions: Results indicate a reliable semiautomated segmentation method for MC clusters offered by deformable models, which could be utilized in MC cluster quantitative image analysis.

  14. Serial clustering of extratropical cyclones and relationship with NAO and jet intensity based on the IMILAST cyclone database

    NASA Astrophysics Data System (ADS)

    Ulbrich, Sven; Pinto, Joaquim G.; Economou, Theodoros; Stephenson, David B.; Karremann, Melanie K.; Shaffrey, Len C.

    2017-04-01

Cyclone families are a frequent synoptic weather feature in the Euro-Atlantic area, particularly during wintertime. Given appropriate large-scale conditions, such series (clusters) of storms may cause large socio-economic impacts and cumulative losses. Recent studies analyzing reanalysis data using single cyclone tracking methods have shown that serial clustering of cyclones occurs on both flanks and downstream regions of the North Atlantic storm track. Based on winter (DJF) cyclone counts from the IMILAST cyclone database, we explore the representation of serial clustering in the ERA-Interim period and its relationship with the NAO phase and jet intensity. With this aim, clustering is estimated by the dispersion of winter (DJF) cyclone passages for each grid point over the Euro-Atlantic area. Results indicate that clustering over the Eastern North Atlantic and Western Europe can be identified for all methods, although the exact location and the dispersion magnitude may vary. The relationship between clustering and (i) the NAO phase and (ii) jet intensity over the North Atlantic is statistically evaluated. Results show that both the NAO index and the jet intensity contribute strongly to clustering, even though some spread is found between methods. We conclude that the general features of clustering of extratropical cyclones over the North Atlantic and Western Europe are robust to the choice of tracking method. The same is true for the influence of the NAO and jet intensity on cyclone dispersion.
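Serial clustering of this kind is commonly quantified by the over-dispersion of seasonal cyclone counts relative to a Poisson process; a one-line sketch with our naming, following the usual convention in the storm-clustering literature:

```python
import numpy as np

def dispersion(counts):
    """Dispersion statistic psi = var/mean - 1 for seasonal cyclone counts:
    psi > 0 indicates over-dispersion (serial clustering), psi near 0 a
    Poisson-like process, and psi < 0 regularly spaced storms."""
    counts = np.asarray(counts, float)
    return counts.var(ddof=1) / counts.mean() - 1.0
```

Computed per grid point over the winter seasons, a map of this statistic is what reveals the clustering on the storm-track flanks described above.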

  15. Calibrating First-Order Strong Lensing Mass Estimates in Clusters of Galaxies

    NASA Astrophysics Data System (ADS)

Reed, Brendan; Remolian, Juan; Sharon, Keren; Li, Nan; SPT Clusters Collaboration

    2018-01-01

We investigate methods to reduce the statistical and systematic errors inherent in using the Einstein Radius as a first-order mass estimate in strong lensing galaxy clusters. By finding an empirical universal calibration function, we aim to enable a first-order mass estimate of large cluster data sets in a fraction of the time and effort of full-scale strong lensing mass modeling. We use 74 simulated clusters from Argonne National Laboratory in a lens redshift slice of [0.159, 0.667] with various source redshifts in the range of [1.23, 2.69]. From the simulated density maps, we calculate the exact mass enclosed within the Einstein Radius. We find that the mass inferred from the Einstein Radius alone produces an error width of ~39% with respect to the true mass. We explore an array of polynomial and exponential correction functions with dependence on cluster redshift and projected radii of the lensed images, aiming to reduce the statistical and systematic uncertainty. We find that the error on the mass inferred from the Einstein Radius can be reduced significantly by using a universal correction function. Our study has implications for current and future large galaxy cluster surveys aiming to measure cluster mass and the mass-concentration relation.
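For context, the first-order estimate being calibrated is the textbook projected mass enclosed within the Einstein radius of an axisymmetric lens (a standard relation, not taken from the abstract itself):

```latex
M(<\theta_E) = \Sigma_{\mathrm{cr}}\,\pi\,\bigl(D_l\,\theta_E\bigr)^2,
\qquad
\Sigma_{\mathrm{cr}} = \frac{c^2}{4\pi G}\,\frac{D_s}{D_l\,D_{ls}},
```

where D_l, D_s, and D_ls are the angular diameter distances to the lens, to the source, and from lens to source. The empirical correction functions described above, depending on lens redshift and image radii, are applied on top of this zeroth-order value.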

  16. Micro-heterogeneity versus clustering in binary mixtures of ethanol with water or alkanes.

    PubMed

    Požar, Martina; Lovrinčević, Bernarda; Zoranić, Larisa; Primorać, Tomislav; Sokolić, Franjo; Perera, Aurélien

    2016-08-24

Ethanol is a hydrogen bonding liquid. When mixed in small concentrations with water or alkanes, it forms aggregate structures reminiscent of, respectively, the direct and inverse micellar aggregates found in emulsions, albeit at much smaller sizes. At higher concentrations, micro-heterogeneous mixing with segregated domains is found. We examine how different statistical methods, namely correlation function analysis, structure factor analysis and cluster distribution analysis, can efficiently describe the morphological changes in these mixtures. In particular, we explain how the neat alcohol pre-peak of the structure factor evolves into the domain pre-peak under mixing conditions, and how this evolution differs depending on whether the co-solvent is water or an alkane. This study clearly establishes the heuristic superiority of the correlation function/structure factor analysis for studying micro-heterogeneity, since cluster distribution analysis is insensitive to domain segregation. Correlation functions detect the domains, with a clear structure factor pre-peak signature, while the cluster techniques detect the cluster hierarchy within domains. The main conclusion is that, in micro-segregated mixtures, the domain structure is a more fundamental statistical entity than the underlying cluster structures. These findings could help better understand comparatively the radiation scattering experiments, which are sensitive to domains, versus the spectroscopy-NMR experiments, which are sensitive to clusters.

  17. Descriptive Statistics and Cluster Analysis for Extreme Rainfall in Java Island

    NASA Astrophysics Data System (ADS)

    E Komalasari, K.; Pawitan, H.; Faqih, A.

    2017-03-01

    This study aims to describe the regional pattern of extreme rainfall based on maximum daily rainfall for the period 1983 to 2012 in Java Island. Descriptive statistical analysis was performed to obtain the central tendency, variation and distribution shape of the maximum precipitation data. Mean and median are used to measure the central tendency of the data, while the Inter Quartile Range (IQR) and standard deviation are used to measure its variation. In addition, skewness and kurtosis are used to characterize the shape of the distribution of the rainfall data. Cluster analysis using squared Euclidean distance and Ward's method is applied to perform regional grouping. Results of this study show that the mean of maximum daily rainfall in the Java region during the period 1983-2012 is around 80-181 mm, with a median between 75-160 mm and a standard deviation between 17 and 82. Cluster analysis produces four clusters and shows that the western area of Java tends to have higher annual maxima of daily rainfall than the northern area, with greater variability in the annual maximum values.
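
    The descriptive measures named above can be computed directly; a minimal sketch using Python's standard library, with purely illustrative rainfall values (not the Java data):

```python
import statistics as st

def describe(xs):
    """Central tendency, variation and distribution-shape summaries."""
    n, mean = len(xs), st.mean(xs)
    q1, _, q3 = st.quantiles(xs, n=4)          # quartiles; IQR = Q3 - Q1
    m2 = sum((x - mean) ** 2 for x in xs) / n  # central moments for shape
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return {"mean": mean, "median": st.median(xs), "sd": st.stdev(xs),
            "iqr": q3 - q1, "skew": m3 / m2 ** 1.5, "kurtosis": m4 / m2 ** 2 - 3.0}

# Illustrative annual-maximum daily rainfall values (mm), not the Java data.
rain = [81, 95, 102, 88, 150, 123, 97, 110, 160, 92]
print(describe(rain))
```

    A positive `skew` here indicates the long right tail typical of extreme-rainfall maxima; `kurtosis` is reported as excess kurtosis (0 for a normal distribution).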

  18. Complex networks as a unified framework for descriptive analysis and predictive modeling in climate

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Steinhaeuser, Karsten J K; Chawla, Nitesh; Ganguly, Auroop R

    The analysis of climate data has relied heavily on hypothesis-driven statistical methods, while projections of future climate are based primarily on physics-based computational models. However, in recent years a wealth of new datasets has become available. Therefore, we take a more data-centric approach and propose a unified framework for studying climate, with an aim towards characterizing observed phenomena as well as discovering new knowledge in the climate domain. Specifically, we posit that complex networks are well-suited for both descriptive analysis and predictive modeling tasks. We show that the structural properties of climate networks have useful interpretation within the domain. Further, we extract clusters from these networks and demonstrate their predictive power as climate indices. Our experimental results establish that the network clusters are statistically significantly better predictors than clusters derived using a more traditional clustering approach. Using complex networks as data representation thus enables the unique opportunity for descriptive and predictive modeling to inform each other.

  19. Pearson's chi-square test and rank correlation inferences for clustered data.

    PubMed

    Shih, Joanna H; Fay, Michael P

    2017-09-01

    Pearson's chi-square test has been widely used in testing for association between two categorical responses. Spearman rank correlation and Kendall's tau are often used for measuring and testing association between two continuous or ordered categorical responses. However, the established statistical properties of these tests are only valid when each pair of responses are independent, where each sampling unit has only one pair of responses. When each sampling unit consists of a cluster of paired responses, the assumption of independent pairs is violated. In this article, we apply the within-cluster resampling technique to U-statistics to form new tests and rank-based correlation estimators for possibly tied clustered data. We develop large sample properties of the new proposed tests and estimators and evaluate their performance by simulations. The proposed methods are applied to a data set collected from a PET/CT imaging study for illustration. Published 2017. This article is a U.S. Government work and is in the public domain in the USA.

  20. Determining the Number of Component Clusters in the Standard Multivariate Normal Mixture Model Using Model-Selection Criteria.

    DTIC Science & Technology

    1983-06-16

    has been advocated by Gnanadesikan and Wilk (1969), and others in the literature. This suggests that, if we use the formal significance test type...American Statistical Assoc., 62, 1159-1178. Gnanadesikan, R., and Wilk, M. B. (1969). Data Analytic Methods in Multivariate Statistical Analysis. In

  1. Spatial distribution and cluster analysis of retail drug shop characteristics and antimalarial behaviors as reported by private medicine retailers in western Kenya: informing future interventions.

    PubMed

    Rusk, Andria; Highfield, Linda; Wilkerson, J Michael; Harrell, Melissa; Obala, Andrew; Amick, Benjamin

    2016-02-19

    Efforts to improve malaria case management in sub-Saharan Africa have shifted focus to private antimalarial retailers to increase access to appropriate treatment. Demands to decrease intervention cost while increasing efficacy requires interventions tailored to geographic regions with demonstrated need. Cluster analysis presents an opportunity to meet this demand, but has not been applied to the retail sector or antimalarial retailer behaviors. This research conducted cluster analysis on medicine retailer behaviors in Kenya, to improve malaria case management and inform future interventions. Ninety-seven surveys were collected from medicine retailers working in the Webuye Health and Demographic Surveillance Site. Survey items included retailer training, education, antimalarial drug knowledge, recommending behavior, sales, and shop characteristics, and were analyzed using Kulldorff's spatial scan statistic. The Bernoulli purely spatial model for binomial data was used, comparing cases to controls. Statistical significance of found clusters was tested with a likelihood ratio test, using the null hypothesis of no clustering, and a p value based on 999 Monte Carlo simulations. The null hypothesis was rejected with p values of 0.05 or less. A statistically significant cluster of fewer than expected pharmacy-trained retailers was found (RR = .09, p = .001) when compared to the expected random distribution. Drug recommending behavior also yielded a statistically significant cluster, with fewer than expected retailers recommending the correct antimalarial medication to adults (RR = .018, p = .01), and fewer than expected shops selling that medication more often than outdated antimalarials when compared to random distribution (RR = 0.23, p = .007). All three of these clusters were co-located, overlapping in the northwest of the study area. Spatial clustering was found in the data. 
A concerning amount of correlation was found in one specific region in the study area where multiple behaviors converged in space, highlighting a prime target for interventions. These results also demonstrate the utility of applying geospatial methods in the study of medicine retailer behaviors, making the case for expanding this approach to other regions.
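
    Kulldorff's Bernoulli scan statistic, as used above, scores each candidate zone by a log-likelihood ratio and assesses the maximum by Monte Carlo randomisation. A toy 1-D sketch of that logic (contiguous windows standing in for circular map zones; all data synthetic, scanning for fewer-than-expected cases as in the study):

```python
import math, random

def bernoulli_llr(c, n, C, N, low_rate=True):
    """Kulldorff's Bernoulli log-likelihood ratio for a candidate zone with
    c cases among n units, out of C cases among N units in total."""
    p, q = c / n, (C - c) / (N - n)
    if (p >= q) if low_rate else (p <= q):
        return 0.0                       # zone rate not in the scanned direction
    t = lambda k, r: k * math.log(r) if k > 0 else 0.0   # 0*log(0) := 0
    alt = t(c, p) + t(n - c, 1 - p) + t(C - c, q) + t(N - n - C + c, 1 - q)
    null = t(C, C / N) + t(N - C, 1 - C / N)
    return alt - null

def max_llr(cases, low_rate=True):
    """Scan every contiguous window of a 1-D arrangement of units, a toy
    stand-in for the circular zones scanned on a real map."""
    N, C = len(cases), sum(cases)
    best = 0.0
    for i in range(N):
        for j in range(i + 1, N + 1):
            n = j - i
            if n == N:                   # the whole region is not a proper zone
                continue
            best = max(best, bernoulli_llr(sum(cases[i:j]), n, C, N, low_rate))
    return best

# Synthetic data: a run of shops where few retailers recommend correctly.
cases = [1] * 10 + [0] * 8 + [1] * 10    # 1 = correct-recommending shop
obs = max_llr(cases)

# Monte Carlo p-value: re-randomise the case labels over shop locations.
random.seed(1)
sims = 199
worse = sum(max_llr(random.sample(cases, len(cases))) >= obs for _ in range(sims))
p_value = (worse + 1) / (sims + 1)
print(round(obs, 2), p_value)
```

    The study's 999 replications follow the same scheme; with 199 replications here the smallest attainable p-value is 1/200 = 0.005.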

  2. Least squares regression methods for clustered ROC data with discrete covariates.

    PubMed

    Tang, Liansheng Larry; Zhang, Wei; Li, Qizhai; Ye, Xuan; Chan, Leighton

    2016-07-01

    The receiver operating characteristic (ROC) curve is a popular tool to evaluate and compare the accuracy of diagnostic tests to distinguish the diseased group from the nondiseased group when test results are continuous or ordinal. A complicated data setting occurs when multiple tests are measured on abnormal and normal locations from the same subject and the measurements are clustered within the subject. Although least squares regression methods can be used for the estimation of the ROC curve from correlated data, how to develop least squares methods to estimate the ROC curve from clustered data has not been studied. Also, the statistical properties of the least squares methods under the clustering setting are unknown. In this article, we develop least squares ROC methods to allow the baseline and link functions to differ, and more importantly, to accommodate clustered data with discrete covariates. The methods can generate smooth ROC curves that satisfy the inherent continuous property of the true underlying curve. The least squares methods are shown to be more efficient than the existing nonparametric ROC methods under appropriate model assumptions in simulation studies. We apply the methods to a real example in the detection of glaucomatous deterioration. We also derive the asymptotic properties of the proposed methods. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

  3. Impact of socioeconomic inequalities on geographic disparities in cancer incidence: comparison of methods for spatial disease mapping.

    PubMed

    Goungounga, Juste Aristide; Gaudart, Jean; Colonna, Marc; Giorgi, Roch

    2016-10-12

    The reliability of spatial statistics is often put into question because real spatial variations may not be found, especially in heterogeneous areas. Our objective was to compare empirically different cluster detection methods. We assessed their ability to find spatial clusters of cancer cases and evaluated the impact of the socioeconomic status (e.g., the Townsend index) on cancer incidence. Moran's I, the empirical Bayes index (EBI), and the Potthoff-Whittinghill test were used to investigate general clustering. The local cluster detection methods were: i) the spatial oblique decision tree (SpODT); ii) the spatial scan statistic of Kulldorff (SaTScan); and, iii) hierarchical Bayesian spatial modeling (HBSM) in a univariate and multivariate setting. These methods were used with and without introducing the Townsend index of socioeconomic deprivation known to be related to the distribution of cancer incidence. Incidence data stemmed from the Cancer Registry of Isère and were limited to prostate, lung, colon-rectum, and bladder cancers diagnosed between 1999 and 2007 in men only. The study found a spatial heterogeneity (p < 0.01) and an autocorrelation for prostate (EBI = 0.02; p = 0.001), lung (EBI = 0.01; p = 0.019) and bladder (EBI = 0.007; p = 0.05) cancers. After introduction of the Townsend index, SaTScan failed to find cancer clusters. This introduction changed the results obtained with the other methods. SpODT identified five spatial classes (p < 0.05): four in the Western and one in the Northern parts of the study area (standardized incidence ratios: 1.68, 1.39, 1.14, 1.12, and 1.16, respectively). In the univariate setting, the Bayesian smoothing method found the same clusters as the two other methods (RR >1.2). The multivariate HBSM found a spatial correlation between lung and bladder cancers (r = 0.6). 
In spatial analysis of cancer incidence, SpODT and HBSM may be used not only for cluster detection but also for searching for confounding or etiological factors in small areas. Moreover, the multivariate HBSM offers a flexible and meaningful modeling of spatial variations; it shows plausible previously unknown associations between various cancers.
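
    Moran's I, the general-clustering index used above, measures whether neighbouring areas have similar incidence rates: I = (n / S0) · Σᵢⱼ wᵢⱼ(xᵢ − x̄)(xⱼ − x̄) / Σᵢ(xᵢ − x̄)². A minimal sketch with a toy chain of areas and binary adjacency weights (all numbers illustrative):

```python
def morans_i(values, weights):
    """Moran's I: (n / S0) * sum_ij w_ij (x_i - xbar)(x_j - xbar)
    / sum_i (x_i - xbar)^2, where S0 = sum_ij w_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    s0 = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)

# Six areas in a chain; adjacent areas are neighbours (binary weights).
vals = [1.0, 1.2, 1.1, 3.0, 3.2, 2.9]    # incidence rates, clustered at one end
W = [[1 if abs(i - j) == 1 else 0 for j in range(6)] for i in range(6)]
print(round(morans_i(vals, W), 3))
```

    A value near +1 indicates strong positive spatial autocorrelation (high rates neighbouring high rates), near 0 indicates no spatial structure; any symmetric weight matrix can replace the binary adjacency used here.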

  4. A Statistical Model for Misreported Binary Outcomes in Clustered RCTs of Education Interventions

    ERIC Educational Resources Information Center

    Schochet, Peter Z.

    2013-01-01

    In randomized control trials (RCTs) of educational interventions, there is a growing literature on impact estimation methods to adjust for missing student outcome data using such methods as multiple imputation, the construction of nonresponse weights, casewise deletion, and maximum likelihood methods (see, for example, Allison, 2002; Graham, 2009;…

  5. Real-time Mainshock Forecast by Statistical Discrimination of Foreshock Clusters

    NASA Astrophysics Data System (ADS)

    Nomura, S.; Ogata, Y.

    2016-12-01

    Foreshock discrimination is one of the most effective ways to forecast large main shocks on short time scales. Though many large earthquakes are preceded by foreshocks, discriminating these from the far more numerous small earthquakes is difficult, and only a probabilistic evaluation based on their spatio-temporal features and magnitude evolution may be available. Logistic regression is the statistical learning method best suited to such binary pattern recognition problems, where estimates of the a-posteriori probability of class membership are required. Statistical learning methods can keep learning discriminating features from an updating catalog and give probabilistic forecasts in real time. We estimated a non-linear function of foreshock proportion using smooth spline bases and evaluated the probability of an event being a foreshock via the logit function. In this study, we identified foreshock candidates in the earthquake catalog of the Japan Meteorological Agency using single-link clustering and learned the spatial and temporal features of foreshocks by probability density ratio estimation. We use epicentral locations, time spans and differences in magnitude for learning and forecasting. Magnitudes of main shocks are also predicted by incorporating b-values into our method. We discuss the spatial pattern of foreshocks implied by the resulting classifier. We also implement a back test on this catalog to validate the predictive performance of the model.
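
    The core idea, logistic regression yielding an a-posteriori foreshock probability, can be sketched with plain gradient descent. The two features and all the numbers below are entirely synthetic stand-ins (not JMA catalog features), chosen only to show the mechanics:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by full-batch gradient descent on the log-loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            gw = [g + err * xj for g, xj in zip(gw, xi)]
            gb += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, gw)]
        b -= lr * gb / len(X)
    return w, b

def foreshock_prob(w, b, x):
    """A-posteriori probability that a cluster with features x is a foreshock
    sequence (i.e. will be followed by a larger main shock)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Entirely synthetic features per cluster: (magnitude trend, log time span).
random.seed(0)
X = [(random.gauss(0.8, 0.2), random.gauss(1.0, 0.3)) for _ in range(50)] \
  + [(random.gauss(-0.2, 0.2), random.gauss(2.0, 0.3)) for _ in range(50)]
y = [1] * 50 + [0] * 50   # 1 = cluster was followed by a mainshock
w, b = train_logistic(X, y)
print(round(foreshock_prob(w, b, (0.9, 1.0)), 2),
      round(foreshock_prob(w, b, (-0.3, 2.1)), 2))
```

    The paper's actual model replaces these raw linear terms with smooth spline bases of the spatio-temporal features; the logit link and probabilistic output are the same.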

  6. Percolation of the site random-cluster model by Monte Carlo method

    NASA Astrophysics Data System (ADS)

    Wang, Songsong; Zhang, Wanzhou; Ding, Chengxiang

    2015-08-01

    We propose a site random-cluster model by introducing an additional cluster weight in the partition function of the traditional site percolation. To simulate the model on a square lattice, we combine the color-assignation and the Swendsen-Wang methods to design a highly efficient cluster algorithm with weak critical slowing down. To verify whether or not it is consistent with the bond random-cluster model, we measure several quantities, such as the wrapping probability Re, the percolating cluster density P∞, and the magnetic susceptibility per site χp, as well as two exponents, the thermal exponent yt and the fractal dimension yh of the percolating cluster. We find that for different cluster weights q = 1.5, 2, 2.5, 3, 3.5, and 4, the numerical estimates of the exponents yt and yh are consistent with the theoretical values. The universalities of the site random-cluster model and the bond random-cluster model are completely identical. For larger values of q, we find obvious signatures of a first-order percolation transition in the histograms and the hysteresis loops of the percolating cluster density and the energy per site. Our results are helpful for the understanding of percolation in traditional statistical models.

  7. The cosmological analysis of X-ray cluster surveys - I. A new method for interpreting number counts

    NASA Astrophysics Data System (ADS)

    Clerc, N.; Pierre, M.; Pacaud, F.; Sadibekova, T.

    2012-07-01

    We present a new method aimed at simplifying the cosmological analysis of X-ray cluster surveys. It is based on purely instrumental observable quantities considered in a two-dimensional X-ray colour-magnitude diagram (hardness ratio versus count rate). The basic principle is that even in rather shallow surveys, substantial information on cluster redshift and temperature is present in the raw X-ray data and can be statistically extracted; in parallel, such diagrams can be readily predicted from an ab initio cosmological modelling. We illustrate the methodology for the case of a 100 deg² XMM survey having a sensitivity of ~10⁻¹⁴ erg s⁻¹ cm⁻² and fit, at the same time, the survey selection function, the cluster evolutionary scaling relations and the cosmology; our sole assumption - driven by the limited size of the sample considered in the case study - is that the local cluster scaling relations are known. We devote special attention to the realistic modelling of the count-rate measurement uncertainties and evaluate the potential of the method via a Fisher analysis. In the absence of individual cluster redshifts, the count rate and hardness ratio (CR-HR) method appears to be much more efficient than the traditional approach based on cluster counts (i.e. dn/dz, requiring redshifts). In the case where redshifts are available, our method performs similarly to the traditional mass function (dn/dM/dz) for the purely cosmological parameters, but better constrains the parameters defining the cluster scaling relations and their evolution. A further practical advantage of the CR-HR method is its simplicity: this fully top-down approach totally bypasses the tedious steps of deriving cluster masses from X-ray temperature measurements.

  8. Use of keyword hierarchies to interpret gene expression patterns.

    PubMed

    Masys, D R; Welsh, J B; Lynn Fink, J; Gribskov, M; Klacansky, I; Corbeil, J

    2001-04-01

    High-density microarray technology permits the quantitative and simultaneous monitoring of thousands of genes. The interpretation challenge is to extract relevant information from this large amount of data. A growing variety of statistical analysis approaches are available to identify clusters of genes that share common expression characteristics, but provide no information regarding the biological similarities of genes within clusters. The published literature provides a potential source of information to assist in interpretation of clustering results. We describe a data mining method that uses indexing terms ('keywords') from the published literature linked to specific genes to present a view of the conceptual similarity of genes within a cluster or group of interest. The method takes advantage of the hierarchical nature of Medical Subject Headings used to index citations in the MEDLINE database, and the registry numbers applied to enzymes.

  9. Advances in Statistical Methods for Substance Abuse Prevention Research

    PubMed Central

    MacKinnon, David P.; Lockwood, Chondra M.

    2010-01-01

    The paper describes advances in statistical methods for prevention research with a particular focus on substance abuse prevention. Standard analysis methods are extended to the typical research designs and characteristics of the data collected in prevention research. Prevention research often includes longitudinal measurement, clustering of data in units such as schools or clinics, missing data, and categorical as well as continuous outcome variables. Statistical methods to handle these features of prevention data are outlined. Developments in mediation, moderation, and implementation analysis allow for the extraction of more detailed information from a prevention study. Advancements in the interpretation of prevention research results include more widespread calculation of effect size and statistical power, the use of confidence intervals as well as hypothesis testing, detailed causal analysis of research findings, and meta-analysis. The increased availability of statistical software has contributed greatly to the use of new methods in prevention research. It is likely that the Internet will continue to stimulate the development and application of new methods. PMID:12940467

  10. A spatial scan statistic for survival data based on Weibull distribution.

    PubMed

    Bhatt, Vijaya; Tiwari, Neeraj

    2014-05-20

    The spatial scan statistic has been developed as a geographical cluster detection analysis tool for different types of data sets such as Bernoulli, Poisson, ordinal, normal and exponential. We propose a scan statistic for survival data based on Weibull distribution. It may also be used for other survival distributions, such as exponential, gamma, and log normal. The proposed method is applied on the survival data of tuberculosis patients for the years 2004-2005 in Nainital district of Uttarakhand, India. Simulation studies reveal that the proposed method performs well for different survival distribution functions. Copyright © 2013 John Wiley & Sons, Ltd.

  11. Impact of missing data imputation methods on gene expression clustering and classification.

    PubMed

    de Souto, Marcilio C P; Jaskowiak, Pablo A; Costa, Ivan G

    2015-02-26

    Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have been putting a great deal of effort into presenting systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes. We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods such as replacing the missing values by the mean or median values performed as well as more complex strategies. The datasets analyzed in this study are available at http://costalab.org/Imputation/.
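
    The mean/median baselines that performed competitively above are one-liners per gene; a minimal sketch on a toy expression matrix (values entirely illustrative):

```python
import statistics as st

def impute(matrix, method="mean"):
    """Column-wise (per-gene) imputation: replace each missing value (None)
    by the column mean or median, the simple baselines discussed above."""
    cols = []
    for col in zip(*matrix):
        observed = [v for v in col if v is not None]
        fill = st.mean(observed) if method == "mean" else st.median(observed)
        cols.append([fill if v is None else v for v in col])
    return [list(row) for row in zip(*cols)]

# Rows = samples, columns = genes; None marks a missing expression value.
expr = [[2.0, None, 1.5],
        [2.4, 0.9, None],
        [None, 1.1, 1.9]]
print(impute(expr))
print(impute(expr, "median"))
```

    More complex strategies (e.g. k-nearest-neighbour imputation) replace the column average with a weighted average over similar rows; the study's point is that for downstream clustering/classification the difference is often minor.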

  12. Binomial outcomes in dataset with some clusters of size two: can the dependence of twins be accounted for? A simulation study comparing the reliability of statistical methods based on a dataset of preterm infants.

    PubMed

    Sauzet, Odile; Peacock, Janet L

    2017-07-20

    The analysis of perinatal outcomes often involves datasets with some multiple births. These are datasets mostly formed of independent observations and a limited number of clusters of size two (twins) and maybe of size three or more. This non-independence needs to be accounted for in the statistical analysis. Using simulated data based on a dataset of preterm infants we have previously investigated the performance of several approaches to the analysis of continuous outcomes in the presence of some clusters of size two. Mixed models have been developed for binomial outcomes but very little is known about their reliability when only a limited number of small clusters are present. Using simulated data based on a dataset of preterm infants we investigated the performance of several approaches to the analysis of binomial outcomes in the presence of some clusters of size two. Logistic models, several methods of estimation for the logistic random intercept models and generalised estimating equations were compared. The presence of even a small percentage of twins means that a logistic regression model will underestimate all parameters but a logistic random intercept model fails to estimate the correlation between siblings if the percentage of twins is too small and will provide similar estimates to logistic regression. The method which seems to provide the best balance between estimation of the standard error and the parameter for any percentage of twins is the generalised estimating equations. This study has shown that the number of covariates or the level two variance do not necessarily affect the performance of the various methods used to analyse datasets containing twins but when the percentage of small clusters is too small, mixed models cannot capture the dependence between siblings.

  13. Examining the Effectiveness of Discriminant Function Analysis and Cluster Analysis in Species Identification of Male Field Crickets Based on Their Calling Songs

    PubMed Central

    Jaiswara, Ranjana; Nandi, Diptarup; Balakrishnan, Rohini

    2013-01-01

    Traditional taxonomy based on morphology has often failed in accurate species identification owing to the occurrence of cryptic species, which are reproductively isolated but morphologically identical. Molecular data have thus been used to complement morphology in species identification. The sexual advertisement calls in several groups of acoustically communicating animals are species-specific and can thus complement molecular data as non-invasive tools for identification. Several statistical tools and automated identifier algorithms have been used to investigate the efficiency of acoustic signals in species identification. Despite a plethora of such methods, there is a general lack of knowledge regarding the appropriate usage of these methods in specific taxa. In this study, we investigated the performance of two commonly used statistical methods, discriminant function analysis (DFA) and cluster analysis, in identification and classification based on acoustic signals of field cricket species belonging to the subfamily Gryllinae. Using a comparative approach we evaluated the optimal number of species and calling song characteristics for both the methods that lead to most accurate classification and identification. The accuracy of classification using DFA was high and was not affected by the number of taxa used. However, a constraint in using discriminant function analysis is the need for a priori classification of songs. Accuracy of classification using cluster analysis, which does not require a priori knowledge, was maximum for 6–7 taxa and decreased significantly when more than ten taxa were analysed together. We also investigated the efficacy of two novel derived acoustic features in improving the accuracy of identification. Our results show that DFA is a reliable statistical tool for species identification using acoustic signals. 
Our results also show that cluster analysis of acoustic signals in crickets works effectively for species classification and identification. PMID:24086666

  14. MO-DE-207B-03: Improved Cancer Classification Using Patient-Specific Biological Pathway Information Via Gene Expression Data

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Young, M; Craft, D

    Purpose: To develop an efficient, pathway-based classification system using network biology statistics to assist in patient-specific response predictions to radiation and drug therapies across multiple cancer types. Methods: We developed PICS (Pathway Informed Classification System), a novel two-step cancer classification algorithm. In PICS, a matrix m of mRNA expression values for a patient cohort is collapsed into a matrix p of biological pathways. The entries of p, which we term pathway scores, are obtained from either principal component analysis (PCA), normal tissue centroid (NTC), or gene expression deviation (GED). The pathway score matrix is clustered using both k-means and hierarchical clustering, and a clustering is judged by how well it groups patients into distinct survival classes. The most effective pathway scoring/clustering combination, per clustering p-value, thus generates various 'signatures' for conventional and functional cancer classification. Results: PICS successfully regularized large dimension gene data, separated normal and cancerous tissues, and clustered a large patient cohort spanning six cancer types. Furthermore, PICS clustered patient cohorts into distinct, statistically-significant survival groups. For a suboptimally-debulked ovarian cancer set, the pathway-classified Kaplan-Meier survival curve (p = .00127) showed significant improvement over that of a prior gene expression-classified study (p = .0179). For a pancreatic cancer set, the pathway-classified Kaplan-Meier survival curve (p = .00141) showed significant improvement over that of a prior gene expression-classified study (p = .04). Pathway-based classification confirmed biomarkers for the pyrimidine, WNT-signaling, glycerophosphoglycerol, beta-alanine, and pantothenic acid pathways for ovarian cancer. Despite its robust nature, PICS requires significantly less run time than current pathway scoring methods. 
    Conclusion: This work validates the PICS method to improve cancer classification using biological pathways. Patients are classified with greater specificity and physiological relevance as compared to current gene-specific approaches. Focus now moves to utilizing PICS for pan-cancer patient-specific treatment response prediction.

  15. Tsallis p⊥ distribution from statistical clusters

    NASA Astrophysics Data System (ADS)

    Bialas, A.

    2015-07-01

    It is shown that the transverse momentum distributions of particles emerging from the decay of statistical clusters, distributed according to a power law in their transverse energy, closely resemble those following from the Tsallis non-extensive statistical model. The experimental data are well reproduced with the cluster temperature T ≈ 160 MeV.

  16. Identifying and Assessing Interesting Subgroups in a Heterogeneous Population.

    PubMed

    Lee, Woojoo; Alexeyenko, Andrey; Pernemalm, Maria; Guegan, Justine; Dessen, Philippe; Lazar, Vladimir; Lehtiö, Janne; Pawitan, Yudi

    2015-01-01

    Biological heterogeneity is common in many diseases and it is often the reason for therapeutic failures. Thus, there is great interest in classifying a disease into subtypes that have clinical significance in terms of prognosis or therapy response. One of the most popular methods to uncover unrecognized subtypes is cluster analysis. However, classical clustering methods such as k-means clustering or hierarchical clustering are not guaranteed to produce clinically interesting subtypes. This could be because the main statistical variability--the basis of cluster generation--is dominated by genes not associated with the clinical phenotype of interest. Furthermore, a strong prognostic factor might be relevant for a certain subgroup but not for the whole population; thus an analysis of the whole sample may not reveal this prognostic factor. To address these problems we investigate methods to identify and assess clinically interesting subgroups in a heterogeneous population. The identification step uses a clustering algorithm and to assess significance we use a false discovery rate- (FDR-) based measure. Under the heterogeneity condition the standard FDR estimate is shown to overestimate the true FDR value, but this is remedied by an improved FDR estimation procedure. As illustrations, two real data examples from gene expression studies of lung cancer are provided.
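
    The k-means clustering named above (Lloyd's algorithm) is the usual first step in such subtype discovery; a self-contained sketch on synthetic 2-D "expression profiles" (real use would run on high-dimensional gene vectors, and, as the abstract warns, may need feature filtering first):

```python
import random

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # random points as initial centres
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                      # keep old centroid if cluster empties
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Two synthetic groups of "expression profiles" in 2-D.
random.seed(1)
pts = [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(30)] \
    + [(random.gauss(4, 0.5), random.gauss(4, 0.5)) for _ in range(30)]
cents, labels = kmeans(pts, 2)
print(sorted(round(c[0], 1) for c in cents))
```

    The abstract's caution applies directly here: k-means minimises within-cluster variance over all features, so if clinically irrelevant genes dominate the variance, the recovered clusters need not align with prognosis.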

  17. Minimum number of clusters and comparison of analysis methods for cross sectional stepped wedge cluster randomised trials with binary outcomes: A simulation study.

    PubMed

    Barker, Daniel; D'Este, Catherine; Campbell, Michael J; McElduff, Patrick

    2017-03-09

    Stepped wedge cluster randomised trials frequently involve a relatively small number of clusters. The most common frameworks used to analyse data from these types of trials are generalised estimating equations and generalised linear mixed models. A topic of much research into these methods has been their application to cluster randomised trial data and, in particular, the number of clusters required to make reasonable inferences about the intervention effect. However, for stepped wedge trials, which have been claimed by many researchers to have a statistical power advantage over the parallel cluster randomised trial, the minimum number of clusters required has not been investigated. We conducted a simulation study where we considered the most commonly used methods suggested in the literature to analyse cross-sectional stepped wedge cluster randomised trial data. We compared the per cent bias, the type I error rate and power of these methods in a stepped wedge trial setting with a binary outcome, where there are few clusters available and when the appropriate adjustment for a time trend is made, which by design may be confounding the intervention effect. We found that the generalised linear mixed modelling approach is the most consistent when few clusters are available. We also found that none of the common analysis methods for stepped wedge trials were both unbiased and maintained a 5% type I error rate when there were only three clusters. Of the commonly used analysis approaches, we recommend the generalised linear mixed model for small stepped wedge trials with binary outcomes. We also suggest that in a stepped wedge design with three steps, at least two clusters be randomised at each step, to ensure that the intervention effect estimator maintains the nominal 5% significance level and is also reasonably unbiased.

  18. Unequal cluster sizes in stepped-wedge cluster randomised trials: a systematic review.

    PubMed

    Kristunas, Caroline; Morris, Tom; Gray, Laura

    2017-11-15

    To investigate the extent to which cluster sizes vary in stepped-wedge cluster randomised trials (SW-CRT) and whether any variability is accounted for during the sample size calculation and analysis of these trials. Settings were not limited to healthcare; eligible participants were any taking part in an SW-CRT published up to March 2016. The primary outcome is the variability in cluster sizes, measured by the coefficient of variation (CV) in cluster size. Secondary outcomes include the difference between the cluster sizes assumed during the sample size calculation and those observed during the trial, any reported variability in cluster sizes, and whether the methods of sample size calculation and analysis accounted for any variability in cluster sizes. Of the 101 included SW-CRTs, 48% mentioned that the included clusters were known to vary in size, yet only 13% of these accounted for this during the calculation of the sample size. However, 69% of the trials did use a method of analysis appropriate for clusters that vary in size. Full trial reports were available for 53 trials. The CV was calculated for 23 of these: the median CV was 0.41 (IQR: 0.22-0.52). Actual cluster sizes could be compared with those assumed during the sample size calculation for 14 (26%) of the trial reports; the observed cluster sizes were between 29% and 480% of those assumed. Cluster sizes often vary in SW-CRTs, and reporting of SW-CRTs remains suboptimal. The effect of unequal cluster sizes on the statistical power of SW-CRTs needs further exploration, and methods appropriate to studies with unequal cluster sizes need to be employed. © Article author(s) (or their employer(s) unless otherwise stated in the text of the article) 2017. All rights reserved. No commercial use is permitted unless otherwise expressly granted.

  19. Weighted community detection and data clustering using message passing

    NASA Astrophysics Data System (ADS)

    Shi, Cheng; Liu, Yanchen; Zhang, Pan

    2018-03-01

    Grouping objects into clusters based on the similarities or weights between them is one of the most important problems in science and engineering. In this work, by extending message-passing and spectral algorithms proposed for the unweighted community detection problem, we develop a non-parametric method based on statistical physics: we map the problem to the Potts model at the critical temperature of the spin-glass transition and apply belief propagation to solve for the marginals of the corresponding Boltzmann distribution. Our algorithm is robust to over-fitting and gives a principled way to determine whether there are significant clusters in the data and how many clusters there are. We apply our method to different clustering tasks. In the community detection problem in weighted and directed networks, we show that our algorithm significantly outperforms existing algorithms. In the clustering problem, where the data were generated by mixture models in the sparse regime, we show that our method works all the way down to the theoretical limit of detectability and gives accuracy very close to that of the optimal Bayesian inference. In the semi-supervised clustering problem, our method needs only a few labels to work perfectly on classic datasets. Finally, we further develop Thouless-Anderson-Palmer equations which greatly reduce the computational complexity in dense networks but give almost the same performance as belief propagation.

  20. Lagged segmented Poincaré plot analysis for risk stratification in patients with dilated cardiomyopathy.

    PubMed

    Voss, Andreas; Fischer, Claudia; Schroeder, Rico; Figulla, Hans R; Goernig, Matthias

    2012-07-01

    The objectives of this study were to introduce a new type of heart-rate variability analysis improving risk stratification in patients with idiopathic dilated cardiomyopathy (DCM) and to provide additional information about impaired heart beat generation in these patients. Beat-to-beat intervals (BBI) of 30-min ECGs recorded from 91 DCM patients and 21 healthy subjects were analyzed applying the lagged segmented Poincaré plot analysis (LSPPA) method. LSPPA includes the Poincaré plot reconstruction with lags of 1-100, rotating the cloud of points, its normalized segmentation adapted to their standard deviations, and finally, a frequency-dependent clustering. The lags were combined into eight different clusters representing specific frequency bands within 0.012-1.153 Hz. Statistical differences between low- and high-risk DCM could be found within the clusters II-VIII (e.g., cluster IV: 0.033-0.038 Hz; p = 0.0002; sensitivity = 85.7 %; specificity = 71.4 %). The multivariate statistics led to a sensitivity of 92.9 %, specificity of 85.7 % and an area under the curve of 92.1 % discriminating these patient groups. We introduced the LSPPA method to investigate time correlations in BBI time series. We found that LSPPA contributes considerably to risk stratification in DCM and yields the highest discriminant power in the low and very low-frequency bands.

  1. Utilizing Hierarchical Clustering to improve Efficiency of Self-Organizing Feature Map to Identify Hydrological Homogeneous Regions

    NASA Astrophysics Data System (ADS)

    Farsadnia, Farhad; Ghahreman, Bijan

    2016-04-01

    Identifying hydrologically homogeneous groups is both fundamental and applied research in hydrology, and clustering methods are among the conventional ways to delineate such regions. Recently, the Self-Organizing feature Map (SOM) method has been applied in some studies; however, its main drawback is interpreting the output map, so the SOM is often used as input to other clustering algorithms. The aim of this study is to apply a two-level Self-Organizing feature map with Ward hierarchical clustering to determine hydrologically homogeneous regions in the North and Razavi Khorasan provinces. First, principal component analysis was used to reduce the dimension of the SOM input matrix; the SOM was then used to form a two-dimensional feature map. To determine homogeneous regions for flood frequency analysis, the SOM output nodes were used as input to the Ward method. In general, the regions identified by clustering algorithms are not statistically homogeneous, so they have to be adjusted to improve their homogeneity. After adjusting the regions using L-moment tests, five hydrologically homogeneous regions were identified. Finally, the best regional distribution function and associated parameters for the adjusted regions were selected by the L-moment approach. The results showed that the combination of self-organizing maps and Ward hierarchical clustering with principal components as input is more effective at delineating hydrologically homogeneous regions than the hierarchical method alone with principal components or standardized inputs.
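    The two-level pipeline (dimension reduction, then Ward hierarchical clustering) can be sketched without the intermediate SOM stage; here PCA is computed by SVD and Ward clustering comes from SciPy. The catchment-attribute matrix is invented for illustration, not taken from the study:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Invented catchment-attribute matrix: rows are sites, columns are features.
X = np.vstack([rng.normal(0, 1, (20, 5)),    # one hydrological regime
               rng.normal(6, 1, (20, 5))])   # a clearly distinct regime

# Level 1: PCA via SVD on the standardised matrix (dimension reduction).
Xs = (X - X.mean(0)) / X.std(0)
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
scores = Xs @ Vt[:2].T                       # keep two principal components

# Level 2: Ward hierarchical clustering on the reduced data.
Z = linkage(scores, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print(np.bincount(labels)[1:])               # sizes of the two regions
```

    In the study itself, the SOM nodes sit between these two levels, and the resulting regions are then tested and adjusted for homogeneity with L-moment statistics.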

  2. Comparison of tests for spatial heterogeneity on data with global clustering patterns and outliers

    PubMed Central

    Jackson, Monica C; Huang, Lan; Luo, Jun; Hachey, Mark; Feuer, Eric

    2009-01-01

    Background The ability to evaluate geographic heterogeneity of cancer incidence and mortality is important in cancer surveillance. Many statistical methods for evaluating global clustering and local cluster patterns have been developed and examined in many simulation studies. However, the performance of these methods on two extreme cases (global clustering evaluation and local anomaly (outlier) detection) has not been thoroughly investigated. Methods We compare methods for global clustering evaluation, including Tango's Index, Moran's I, and Oden's I*pop, and cluster detection methods, such as local Moran's I and the SaTScan elliptic version, on simulated count data that mimic global clustering patterns and outliers for cancer cases in the continental United States. We examine the power and precision of the selected methods in the purely spatial analysis. We illustrate Tango's MEET and the SaTScan elliptic version on 1987-2004 HIV and 1950-1969 lung cancer mortality data in the United States. Results For simulated data with outlier patterns, Tango's MEET, Moran's I and I*pop had powers less than 0.2, while SaTScan had powers around 0.97. For simulated data with global clustering patterns, Tango's MEET and I*pop (with 50% of the total population as the maximum search window) had powers close to 1. SaTScan had powers around 0.7-0.8 and Moran's I had powers around 0.2-0.3. In the real data example, Tango's MEET indicated the existence of global clustering patterns in both the HIV and lung cancer mortality data. SaTScan found a large cluster for HIV mortality rates, consistent with the finding from Tango's MEET. SaTScan also found clusters and outliers in the lung cancer mortality data. Conclusion The SaTScan elliptic version is more effective for outlier detection than the other methods evaluated in this article. Tango's MEET and Oden's I*pop perform best in global clustering scenarios among the selected methods. 
SaTScan should be applied with caution to data with global clustering patterns, since it may report an incorrect spatial pattern even when it has enough power to reject the null hypothesis of homogeneous relative risk. Tango's method should be used for global clustering evaluation instead of SaTScan. PMID:19822013
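    Among the statistics compared above, global Moran's I has a particularly compact form: a spatially weighted cross-product of deviations from the mean, normalised by the total variation. A minimal sketch, using an invented four-region contiguity matrix:

```python
import numpy as np

def morans_i(x, w):
    """Global Moran's I: (n / sum(w)) * sum_ij w_ij z_i z_j / sum_i z_i^2,
    where z are deviations from the mean and w is a spatial weight matrix."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    num = (z[:, None] * z[None, :] * w).sum()
    return (len(x) / w.sum()) * num / (z ** 2).sum()

# Four regions on a line, rook-contiguity weights (illustrative only).
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(morans_i([10, 9, 1, 2], w))   # clustered pattern: positive I
print(morans_i([10, 1, 10, 1], w))  # alternating pattern: I is -1 here
```

    Positive values indicate that similar values sit next to each other (global clustering); strongly negative values indicate an alternating, checkerboard-like pattern, which is consistent with Moran's I having low power against isolated outliers.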

  3. A log-Weibull spatial scan statistic for time to event data.

    PubMed

    Usman, Iram; Rosychuk, Rhonda J

    2018-06-13

    Spatial scan statistics have been used to identify geographic clusters of elevated numbers of cases of a condition, such as disease outbreaks. Paired with an appropriate distribution, these statistics can also identify geographic areas with either longer or shorter times to events. Other authors have proposed spatial scan statistics based on the exponential and Weibull distributions. We propose the log-Weibull as an alternative distribution for the spatial scan statistic for time-to-event data and compare and contrast the log-Weibull and Weibull distributions through simulation studies. Type I error under differential censoring, and power, were investigated through simulated data. The methods are also illustrated on time to specialist visit data for discharged patients presenting to emergency departments for atrial fibrillation and flutter in Alberta during 2010-2011. We found northern regions of Alberta had longer times to specialist visit than other areas. We proposed the spatial scan statistic for the log-Weibull distribution as a new approach for detecting spatial clusters for time-to-event data. The simulation studies suggest that the test performs well for log-Weibull data.

  4. Cancer Cluster Investigations: Review of the Past and Proposals for the Future

    PubMed Central

    Goodman, Michael; LaKind, Judy S.; Fagliano, Jerald A.; Lash, Timothy L.; Wiemels, Joseph L.; Winn, Deborah M.; Patel, Chirag; Van Eenwyk, Juliet; Kohler, Betsy A.; Schisterman, Enrique F.; Albert, Paul; Mattison, Donald R.

    2014-01-01

    Residential clusters of non-communicable diseases are a source of enduring public concern, and at times, controversy. Many clusters reported to public health agencies by concerned citizens are accompanied by expectations that investigations will uncover a cause of disease. While goals, methods and conclusions of cluster studies are debated in the scientific literature and popular press, investigations of reported residential clusters rarely provide definitive answers about disease etiology. Further, it is inherently difficult to study a cluster for diseases with complex etiology and long latency (e.g., most cancers). Regardless, cluster investigations remain an important function of local, state and federal public health agencies. Challenges limiting the ability of cluster investigations to uncover causes for disease include the need to consider long latency, low statistical power of most analyses, uncertain definitions of cluster boundaries and population of interest, and in- and out-migration. A multi-disciplinary Workshop was held to discuss innovative and/or under-explored approaches to investigate cancer clusters. Several potentially fruitful paths forward are described, including modern methods of reconstructing residential history, improved approaches to analyzing spatial data, improved utilization of electronic data sources, advances using biomarkers of carcinogenesis, novel concepts for grouping cases, investigations of infectious etiology of cancer, and “omics” approaches. PMID:24477211

  5. Verification of Bayesian Clustering in Travel Behaviour Research – First Step to Macroanalysis of Travel Behaviour

    NASA Astrophysics Data System (ADS)

    Satra, P.; Carsky, J.

    2018-04-01

    Our research looks at travel behaviour from a macroscopic view, taking one municipality as the basic unit. The travel behaviour of one municipality as a whole becomes a single data point in the study of travel behaviour over a larger area, perhaps a country. Data pre-processing is used to cluster the municipalities into groups that show similarities in their travel behaviour. Such groups can then be examined for the reasons behind their prevailing pattern of travel behaviour without any distortion caused by municipalities with a different pattern. This paper deals with the actual settings of the clustering process, which is based on Bayesian statistics, particularly the mixture model. Optimizing the setting parameters based on the correlation of pointer model parameters and the relative number of data in clusters is a helpful, though not fully reliable, method. Thus, a method for graphical representation of clusters needs to be developed in order to check their quality. Tuning the setting parameters in 2D has proven to be beneficial, because it allows visual control of the produced clusters. The clustering is best applied to separate groups of municipalities in which only identical transport modes compete.
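    The mixture-model idea can be illustrated with its maximum-likelihood cousin: EM for a one-dimensional two-component Gaussian mixture (the paper's Bayesian variant additionally places priors on the parameters). All data here are synthetic:

```python
import numpy as np

def gmm_em_1d(x, iters=200):
    """EM for a two-component 1-D Gaussian mixture; the means are
    initialised at the data extremes, which is robust for bimodal data."""
    mu = np.array([x.min(), x.max()])
    sigma = np.full(2, x.std())
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E step: responsibility of each component for each point.
        dens = pi * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        r = dens / dens.sum(1, keepdims=True)
        # M step: re-estimate weights, means and standard deviations.
        nk = r.sum(0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(0) / nk) + 1e-9
    return pi, mu, sigma

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(8, 1, 200)])
pi, mu, sigma = gmm_em_1d(x)
print(np.sort(mu))   # close to the true component means 0 and 8
```

    In the municipality setting, each component would correspond to one group of municipalities sharing a travel-behaviour pattern, and the responsibilities give the soft assignment of each municipality to a group.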

  6. On Ion Clusters in the Interstellar Gas

    NASA Technical Reports Server (NTRS)

    Donn, Bertram

    1960-01-01

    In a recent paper V.I. Krassovsky (1958) predicts the occurrence of clusters of large numbers of atoms and molecules around ions in the interstellar gas. He then proposes a number of physicochemical processes that would be considerably enhanced by the high particle density in such clusters. In particular, he suggests that absorption by negative ions formed in the clusters would account for the interstellar extinction without any necessity for the presence of grains. Because of the important consequences that ion clusters could have, it is necessary to examine their occurrence more fully. This note re-examines the formation of ion clusters in space and shows that even ion-molecule pairs are essentially non-existent. Ion clusters have been considered by Bloom and Margenau (1952) from the same point of view as that used by Krassovsky, whose basic reference (Joffe and Semenov 1933) unfortunately is not available. A different approach has been used by Eyring, Hirschfelder, and Taylor (1936) following the methods of chemical equilibrium. Both the references cited here enable one to conclude that clustering is negligible. Therefore, the treatment of Eyring et al. is more appropriate than the method of Bloom and Margenau, which depends on the statistical equilibrium of an atmosphere in a force field.

  7. Comparison of Salmonella enteritidis phage types isolated from layers and humans in Belgium in 2005.

    PubMed

    Welby, Sarah; Imberechts, Hein; Riocreux, Flavien; Bertrand, Sophie; Dierick, Katelijne; Wildemauwe, Christa; Hooyberghs, Jozef; Van der Stede, Yves

    2011-08-01

    The aim of this study was to investigate the available results for Belgium of the European Union coordinated monitoring program (2004/665 EC) on Salmonella in layers in 2005, as well as the results of the monthly outbreak reports of Salmonella Enteritidis in humans in 2005, to identify possible statistically significant trends in both populations. Separate descriptive statistics and univariate analyses were carried out, and parametric and/or non-parametric hypothesis tests were conducted. A time cluster analysis was performed for all Salmonella Enteritidis phage types (PTs) isolated. The proportions of each Salmonella Enteritidis PT in layers and in humans were compared and the monthly distribution of the most common PT, isolated in both populations, was evaluated. The time cluster analysis revealed significant clusters during the months of May and June for layers and May, July, August, and September for humans. PT21, the most frequently isolated PT in both populations in 2005, seemed to be responsible for these significant clusters. PT4 was the second most frequently isolated PT. No significant difference was found in the monthly trend of either PT in either population based on parametric and non-parametric methods. A similar monthly trend of PT distribution in humans and layers during the year 2005 was observed. The time cluster analysis and the statistical significance testing confirmed these results. Moreover, the time cluster analysis showed significant clusters during the summer, slightly delayed in time (humans after layers). These results suggest a common link between the prevalence of Salmonella Enteritidis in layers and the occurrence of the pathogen in humans. Phage typing was confirmed to be a useful tool for identifying temporal trends.

  8. Identifying clusters of active transportation using spatial scan statistics.

    PubMed

    Huang, Lan; Stinchcomb, David G; Pickle, Linda W; Dill, Jennifer; Berrigan, David

    2009-08-01

    There is an intense interest in the possibility that neighborhood characteristics influence active transportation such as walking or biking. The purpose of this paper is to illustrate how a spatial cluster identification method can evaluate the geographic variation of active transportation and identify neighborhoods with unusually high/low levels of active transportation. Self-reported walking/biking prevalence, demographic characteristics, street connectivity variables, and neighborhood socioeconomic data were collected from respondents to the 2001 California Health Interview Survey (CHIS; N=10,688) in Los Angeles County (LAC) and San Diego County (SDC). Spatial scan statistics were used to identify clusters of high or low prevalence (with and without age-adjustment) and the quantity of time spent walking and biking. The data, a subset from the 2001 CHIS, were analyzed in 2007-2008. Geographic clusters of significantly high or low prevalence of walking and biking were detected in LAC and SDC. Structural variables such as street connectivity and shorter block lengths are consistently associated with higher levels of active transportation, but associations between active transportation and socioeconomic variables at the individual and neighborhood levels are mixed. Only one cluster with less time spent walking and biking among walkers/bikers was detected in LAC, and this was of borderline significance. Age-adjustment affects the clustering pattern of walking/biking prevalence in LAC, but not in SDC. The use of spatial scan statistics to identify significant clustering of health behaviors such as active transportation adds to the more traditional regression analysis that examines associations between behavior and environmental factors by identifying specific geographic areas with unusual levels of the behavior independent of predefined administrative units.
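    The spatial scan statistic behind SaTScan scores each candidate window (here, a geographic circle or ellipse of survey respondents) by a likelihood ratio and takes the maximum over windows; significance then comes from Monte Carlo replication, which is not shown. A sketch of the Poisson window score, with invented counts:

```python
import math

def poisson_scan_llr(c, e_c, C):
    """Kulldorff-style log-likelihood ratio for one candidate window:
    c observed and e_c expected cases inside the window, C total cases.
    Only windows with elevated risk (c > e_c) receive a positive score."""
    if c <= e_c or c >= C:
        return 0.0
    return c * math.log(c / e_c) + (C - c) * math.log((C - c) / (C - e_c))

# With 100 total cases, a strong excess scores far higher than a mild one:
print(poisson_scan_llr(30, 10, 100))   # pronounced excess
print(poisson_scan_llr(12, 10, 100))   # mild excess
```

    The same scoring idea applies whether the "cases" are disease counts or, as here, reports of high or low walking and biking prevalence; a low-rate scan simply flips the inequality.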

  9. Multi-scale study of condensation in water jets using ellipsoidal-statistical Bhatnagar-Gross-Krook and molecular dynamics modeling

    NASA Astrophysics Data System (ADS)

    Li, Zheng; Borner, Arnaud; Levin, Deborah A.

    2014-06-01

    Homogeneous water condensation and ice formation in supersonic expansions to vacuum for stagnation pressures from 12 to 1000 mbar are studied using the particle-based Ellipsoidal-Statistical Bhatnagar-Gross-Krook (ES-BGK) method. We find that when condensation starts to occur, at a stagnation pressure of 96 mbar, the increase in the degree of condensation causes an increase in the rotational temperature due to the latent heat of vaporization. The simulated rotational temperature profiles along the plume expansion agree well with measurements, confirming the kinetic homogeneous condensation models and the method of simulation. Comparisons of the simulated gas and cluster number densities and cluster sizes for different stagnation pressures along the plume centerline were made, and it was found that the cluster size increases linearly with stagnation pressure, consistent with classical nucleation theory. The sensitivity of our results to the cluster nucleation model and to latent heat values based on bulk water, specific cluster size, or bulk ice is examined. In particular, the ES-BGK simulations are found to be too coarse-grained to provide information on the phase or structure of the clusters formed. For this reason, molecular dynamics simulations of water condensation in a one-dimensional free expansion are performed to simulate the conditions in the core of a plume. We find that the internal structure of the clusters formed depends on the stagnation temperature. A larger cluster of average size 21 was tracked down the expansion, and a calculation of its average internal temperature, together with a comparison of its radial distribution functions (RDFs) with values measured for solid amorphous ice clusters, leads us to conclude that this cluster is in a solid-like rather than liquid form. 
In another molecular-dynamics simulation at a much lower stagnation temperature, a larger cluster of size 324 and internal temperature 200 K was extracted from an expansion plume and equilibrated to determine its RDF and self-diffusion coefficient. The value of the latter shows that this cluster is formed in a supercooled liquid state rather than in an amorphous solid state.

  10. Mutation Clusters from Cancer Exome.

    PubMed

    Kakushadze, Zura; Yu, Willie

    2017-08-15

    We apply our statistically deterministic machine learning/clustering algorithm *K-means (recently developed in https://ssrn.com/abstract=2908286) to 10,656 published exome samples for 32 cancer types. A majority of cancer types exhibit a mutation clustering structure. Our results are in-sample stable. They are also out-of-sample stable when applied to 1389 published genome samples across 14 cancer types. In contrast, we find in- and out-of-sample instabilities in cancer signatures extracted from exome samples via nonnegative matrix factorization (NMF), a computationally-costly and non-deterministic method. Extracting stable mutation structures from exome data could have important implications for speed and cost, which are critical for early-stage cancer diagnostics, such as novel blood-test methods currently in development.
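    The *K-means variant itself is described in the linked preprint; what follows is a plain Lloyd's k-means sketch with a deterministic farthest-point initialisation (my choice for reproducibility, not the paper's construction), to illustrate the kind of clustering being applied to mutation profiles:

```python
import numpy as np

def kmeans(X, k, iters=100):
    """Lloyd's k-means with deterministic farthest-point seeding, so the
    result is reproducible (the paper's *K-means removes non-determinism
    by a different construction)."""
    centers = [X[0]]
    for _ in range(k - 1):                       # farthest-point seeding
        d = ((X[:, None] - np.array(centers)[None]) ** 2).sum(-1).min(1)
        centers.append(X[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        d = ((X[:, None] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        new = np.array([X[labels == j].mean(0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):            # converged
            break
        centers = new
    return labels, centers

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(10, 1, (30, 2))])
labels, centers = kmeans(X, 2)
print(np.bincount(labels))   # the two synthetic blobs are recovered
```

    In the paper the rows of X would be per-sample mutation counts rather than synthetic 2-D points, and stability is judged by rerunning on held-out genome samples.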

  11. Mutation Clusters from Cancer Exome

    PubMed Central

    Kakushadze, Zura; Yu, Willie

    2017-01-01

    We apply our statistically deterministic machine learning/clustering algorithm *K-means (recently developed in https://ssrn.com/abstract=2908286) to 10,656 published exome samples for 32 cancer types. A majority of cancer types exhibit a mutation clustering structure. Our results are in-sample stable. They are also out-of-sample stable when applied to 1389 published genome samples across 14 cancer types. In contrast, we find in- and out-of-sample instabilities in cancer signatures extracted from exome samples via nonnegative matrix factorization (NMF), a computationally-costly and non-deterministic method. Extracting stable mutation structures from exome data could have important implications for speed and cost, which are critical for early-stage cancer diagnostics, such as novel blood-test methods currently in development. PMID:28809811

  12. Detecting Statistically Significant Communities of Triangle Motifs in Undirected Networks

    DTIC Science & Technology

    2016-04-26

    Final report covering 15 Oct 2014 to 14 Jan 2015. The work extends Perry et al. [6] by developing a statistical framework that supports the detection of triangle motif-based clusters in complex networks, motivating, a priori, the need for triangle motif-based clustering, and by developing an algorithm for clustering undirected networks based on the triangle configuration.

  13. A system for learning statistical motion patterns.

    PubMed

    Hu, Weiming; Xiao, Xuejuan; Fu, Zhouyu; Xie, Dan; Tan, Tieniu; Maybank, Steve

    2006-09-01

    Analysis of motion patterns is an effective approach for anomaly detection and behavior prediction. Current approaches for the analysis of motion patterns depend on known scenes, where objects move in predefined ways. It is highly desirable to automatically construct object motion patterns which reflect the knowledge of the scene. In this paper, we present a system for automatically learning motion patterns for anomaly detection and behavior prediction based on a proposed algorithm for robustly tracking multiple objects. In the tracking algorithm, foreground pixels are clustered using a fast accurate fuzzy K-means algorithm. Growing and prediction of the cluster centroids of foreground pixels ensure that each cluster centroid is associated with a moving object in the scene. In the algorithm for learning motion patterns, trajectories are clustered hierarchically using spatial and temporal information and then each motion pattern is represented with a chain of Gaussian distributions. Based on the learned statistical motion patterns, statistical methods are used to detect anomalies and predict behaviors. Our system is tested using image sequences acquired, respectively, from a crowded real traffic scene and a model traffic scene. Experimental results show the robustness of the tracking algorithm, the efficiency of the algorithm for learning motion patterns, and the encouraging performance of algorithms for anomaly detection and behavior prediction.
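    The fuzzy K-means step (soft memberships instead of hard assignments, so borderline foreground pixels can be shared between clusters) follows the standard fuzzy c-means updates; the paper's fast variant adds optimisations not sketched here, and the data below are synthetic:

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, iters=60, seed=0):
    """Standard fuzzy c-means: alternate weighted-centroid updates with
    the membership update u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(1, keepdims=True)       # memberships sum to 1 per point
    for _ in range(iters):
        W = U ** m                     # fuzzified weights
        centers = (W.T @ X) / W.sum(0)[:, None]
        d = np.sqrt(((X[:, None] - centers[None]) ** 2).sum(-1)) + 1e-12
        p = 2.0 / (m - 1.0)
        U = 1.0 / (d ** p * (d ** -p).sum(1, keepdims=True))
    return U, centers

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(8, 1, (40, 2))])
U, centers = fuzzy_cmeans(X, 2)
print(U.sum(1)[:3])   # each row of U sums to 1
```

    The fuzziness exponent m controls how soft the assignments are; m near 1 approaches ordinary k-means, while larger m spreads membership across clusters.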

  14. Quantification and statistical significance analysis of group separation in NMR-based metabonomics studies

    PubMed Central

    Goodpaster, Aaron M.; Kennedy, Michael A.

    2015-01-01

    Currently, no standard metrics are used to quantify cluster separation in PCA or PLS-DA scores plots for metabonomics studies or to determine if cluster separation is statistically significant. Lack of such measures makes it virtually impossible to compare independent or inter-laboratory studies and can lead to confusion in the metabonomics literature when authors putatively identify metabolites distinguishing classes of samples based on visual and qualitative inspection of scores plots that exhibit marginal separation. While previous papers have addressed quantification of cluster separation in PCA scores plots, none have advocated routine use of a quantitative measure of separation that is supported by a standard and rigorous assessment of whether or not the cluster separation is statistically significant. Here quantification and statistical significance of separation of group centroids in PCA and PLS-DA scores plots are considered. The Mahalanobis distance is used to quantify the distance between group centroids, and the two-sample Hotelling's T2 test is computed for the data, related to an F-statistic, and then an F-test is applied to determine if the cluster separation is statistically significant. We demonstrate the value of this approach using four datasets containing various degrees of separation, ranging from groups that had no apparent visual cluster separation to groups that had no visual cluster overlap. Widespread adoption of such concrete metrics to quantify and evaluate the statistical significance of PCA and PLS-DA cluster separation would help standardize reporting of metabonomics data. PMID:26246647
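    The procedure described above has a short closed form: the Mahalanobis-style centroid difference scales into Hotelling's two-sample T^2, which scales into an F statistic. A sketch with synthetic two-group scores, assuming SciPy is available for the F distribution:

```python
import numpy as np
from scipy import stats

def hotelling_two_sample(X1, X2):
    """Two-sample Hotelling's T^2 on p-dimensional scores, converted to
    an F statistic with (p, n1 + n2 - p - 1) degrees of freedom."""
    n1, n2, p = len(X1), len(X2), X1.shape[1]
    diff = X1.mean(0) - X2.mean(0)
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)  # pooled covariance
    t2 = (n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(S, diff)
    f = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return t2, f, stats.f.sf(f, p, n1 + n2 - p - 1)

rng = np.random.default_rng(4)
groupA = rng.normal(0, 1, (20, 3))    # e.g. PCA scores of class A
groupB = rng.normal(2, 1, (20, 3))    # visibly shifted class B
t2, f, pval = hotelling_two_sample(groupA, groupB)
print(pval)                           # well-separated centroids: tiny p
```

    In a metabonomics setting the columns would be the retained PCA or PLS-DA scores, and the p-value gives the objective statement of cluster separation that visual inspection of scores plots lacks.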

  15. The Performance of Methods to Test Upper-Level Mediation in the Presence of Nonnormal Data

    ERIC Educational Resources Information Center

    Pituch, Keenan A.; Stapleton, Laura M.

    2008-01-01

    A Monte Carlo study compared the statistical performance of standard and robust multilevel mediation analysis methods to test indirect effects for a cluster randomized experimental design under various departures from normality. The performance of these methods was examined for an upper-level mediation process, where the indirect effect is a fixed…

  16. Spatio-temporal cluster detection of chickenpox in Valencia, Spain in the period 2008-2012.

    PubMed

    Iftimi, Adina; Martínez-Ruiz, Francisco; Míguez Santiyán, Ana; Montes, Francisco

    2015-05-18

    Chickenpox is a highly contagious airborne disease caused by Varicella zoster, which affects nearly all non-immune children worldwide with an annual incidence estimated at 80-90 million cases. To analyze the spatio-temporal pattern of chickenpox incidence in the city of Valencia, Spain, two complementary statistical approaches were used. First, we evaluated the existence of clusters and spatio-temporal interaction; second, we used this information to find the locations of the spatio-temporal clusters via the space-time permutation model. The first method detects any aggregation in our data but does not provide the spatial and temporal information. The second method gives the locations, areas and time-frames of the spatio-temporal clusters. An overall decreasing time trend, a pronounced 12-monthly periodicity and two complementary periods were observed. Several areas with high incidence, surrounding the center of the city, were identified. The existence of aggregation in time and space was observed, and a number of spatio-temporal clusters were located.

  17. Robust statistical methods for hit selection in RNA interference high-throughput screening experiments.

    PubMed

    Zhang, Xiaohua Douglas; Yang, Xiting Cindy; Chung, Namjin; Gates, Adam; Stec, Erica; Kunapuli, Priya; Holder, Dan J; Ferrer, Marc; Espeseth, Amy S

    2006-04-01

    RNA interference (RNAi) high-throughput screening (HTS) experiments carried out using large (>5000 short interfering [si]RNA) libraries generate a huge amount of data. In order to use these data to identify the most effective siRNAs tested, it is critical to adopt and develop appropriate statistical methods. To address the questions in hit selection of RNAi HTS, we proposed a quartile-based method which is robust to outliers, true hits and nonsymmetrical data. We compared it with the more traditional tests, mean +/- k standard deviation (SD) and median +/- k median absolute deviation (MAD). The results suggested that the quartile-based method selected more hits than mean +/- k SD under the same preset error rate. The number of hits selected by median +/- k MAD was close to that by the quartile-based method. Further analysis suggested that the quartile-based method had the greatest power in detecting true hits, especially weak or moderate true hits. Our investigation also suggested that platewise analysis (determining effective siRNAs on a plate-by-plate basis) can adjust for systematic errors in different plates, while an experimentwise analysis, in which effective siRNAs are identified in an analysis of the entire experiment, cannot. However, experimentwise analysis may detect a cluster of true positive hits placed together in one or several plates, while platewise analysis may not. To display hit selection results, we designed a specific figure called a plate-well series plot. We thus suggest the following strategy for hit selection in RNAi HTS experiments. First, choose the quartile-based method, or median +/- k MAD, for identifying effective siRNAs. Second, perform the chosen method experimentwise on transformed/normalized data, such as percentage inhibition, to check the possibility of hit clusters. 
If a cluster of selected hits are observed, repeat the analysis based on untransformed data to determine whether the cluster is due to an artifact in the data. If no clusters of hits are observed, select hits by performing platewise analysis on transformed data. Third, adopt the plate-well series plot to visualize both the data and the hit selection results, as well as to check for artifacts.
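    The three decision rules compared above can be sketched in a few lines. This is an illustrative one-sided version (assuming strong inhibition corresponds to low well values); the constants k and c and the planted-hit demo data are assumptions, not the paper's exact settings.

```python
import numpy as np

def select_hits(values, method="quartile", k=3.0, c=2.0):
    """Flag candidate hits on one plate (assumed: strong inhibition = low values)."""
    x = np.asarray(values, dtype=float)
    if method == "mean_sd":                       # mean - k*SD
        cut = x.mean() - k * x.std(ddof=1)
    elif method == "median_mad":                  # median - k*MAD (robust location/scale)
        mad = np.median(np.abs(x - np.median(x)))
        cut = np.median(x) - k * 1.4826 * mad     # 1.4826: Gaussian consistency factor
    elif method == "quartile":                    # Q1 - c*IQR, robust to outliers
        q1, q3 = np.percentile(x, [25, 75])
        cut = q1 - c * (q3 - q1)
    else:
        raise ValueError(method)
    return x < cut

# Synthetic plate: 300 inactive wells plus two strong planted hits
rng = np.random.default_rng(0)
plate = np.concatenate([rng.normal(0.0, 1.0, 300), [-8.0, -7.5]])
hits = select_hits(plate)
```

    Applied platewise, the same function implicitly adjusts for per-plate systematic shifts; applied experimentwise on normalized data, it can expose the hit clusters discussed in the suggested strategy.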

  18. Implementation of novel statistical procedures and other advanced approaches to improve analysis of CASA data.

    PubMed

    Ramón, M; Martínez-Pastor, F

    2018-04-23

    Computer-aided sperm analysis (CASA) produces a wealth of data that is frequently ignored. The use of multiparametric statistical methods can help explore these datasets, unveiling the subpopulation structure of sperm samples. In this review we analyse the significance of the internal heterogeneity of sperm samples and its relevance. We also provide a brief description of the statistical tools used for extracting sperm subpopulations from the datasets, namely unsupervised clustering (with non-hierarchical, hierarchical and two-step methods) and the most advanced supervised methods, based on machine learning. The former methods have allowed exploration of subpopulation patterns in many species, whereas the latter offer further possibilities, especially considering functional studies and the practical use of subpopulation analysis. We also consider novel approaches, such as the use of geometric morphometrics or imaging flow cytometry. Finally, although applying clustering analyses to the data provided by CASA systems yields valuable information on sperm samples, there are several caveats. Protocols for capturing and analysing motility or morphometry should be standardised and adapted to each experiment, and the algorithms should be open in order to allow comparison of results between laboratories. Moreover, we must be aware of new technology that could change the paradigm for studying sperm motility and morphology.

  19. Comparison of tests for spatial heterogeneity on data with global clustering patterns and outliers.

    PubMed

    Jackson, Monica C; Huang, Lan; Luo, Jun; Hachey, Mark; Feuer, Eric

    2009-10-12

    The ability to evaluate geographic heterogeneity of cancer incidence and mortality is important in cancer surveillance. Many statistical methods for evaluating global clustering and local cluster patterns are developed and have been examined by many simulation studies. However, the performance of these methods on two extreme cases (global clustering evaluation and local anomaly (outlier) detection) has not been thoroughly investigated. We compare methods for global clustering evaluation including Tango's Index, Moran's I, and Oden's I*(pop); and cluster detection methods such as local Moran's I and SaTScan elliptic version on simulated count data that mimic global clustering patterns and outliers for cancer cases in the continental United States. We examine the power and precision of the selected methods in the purely spatial analysis. We illustrate Tango's MEET and SaTScan elliptic version on 1987-2004 HIV and 1950-1969 lung cancer mortality data in the United States. For simulated data with outlier patterns, Tango's MEET, Moran's I and I*(pop) had powers less than 0.2, and SaTScan had powers around 0.97. For simulated data with global clustering patterns, Tango's MEET and I*(pop) (with 50% of total population as the maximum search window) had powers close to 1. SaTScan had powers around 0.7-0.8 and Moran's I had powers around 0.2-0.3. In the real data example, Tango's MEET indicated the existence of global clustering patterns in both the HIV and lung cancer mortality data. SaTScan found a large cluster for HIV mortality rates, which is consistent with the finding from Tango's MEET. SaTScan also found clusters and outliers in the lung cancer mortality data. SaTScan elliptic version is more efficient for outlier detection compared with the other methods evaluated in this article. Tango's MEET and Oden's I*(pop) perform best in global clustering scenarios among the selected methods.
SaTScan should be used with caution on data with global clustering patterns, since SaTScan may reveal an incorrect spatial pattern even though it has enough power to reject a null hypothesis of homogeneous relative risk. Tango's method should be used for global clustering evaluation instead of SaTScan.
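    For reference, global Moran's I (one of the statistics compared above) is computed directly from a value vector and a spatial weight matrix. The one-dimensional rook-neighbour lattice below is a toy example for checking the implementation, not the county-level geography used in the study.

```python
import numpy as np

def morans_i(x, w):
    """Global Moran's I: (n / sum(w)) * sum_ij w_ij z_i z_j / sum_i z_i^2,
    with z the mean-centred values and w a zero-diagonal spatial weight matrix."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    return len(x) / w.sum() * (w * np.outer(z, z)).sum() / (z ** 2).sum()

# Toy 1-D lattice with rook neighbours: a monotone gradient is strongly
# spatially autocorrelated, so I should be clearly positive.
n = 10
w = np.zeros((n, n))
for i in range(n - 1):
    w[i, i + 1] = w[i + 1, i] = 1.0
i_gradient = morans_i(np.arange(n, dtype=float), w)
```

    Values near +1 indicate global clustering of similar values, values near 0 spatial randomness, which is why Moran's I has little power against the purely local outlier patterns described above.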

  20. A Granular Self-Organizing Map for Clustering and Gene Selection in Microarray Data.

    PubMed

    Ray, Shubhra Sankar; Ganivada, Avatharam; Pal, Sankar K

    2016-09-01

    A new granular self-organizing map (GSOM) is developed by integrating the concept of a fuzzy rough set with the SOM. While training the GSOM, the weights of a winning neuron and the neighborhood neurons are updated through a modified learning procedure. The neighborhood is newly defined using the fuzzy rough sets. The clusters (granules) evolved by the GSOM are presented to a decision table as its decision classes. Based on the decision table, a method of gene selection is developed. The effectiveness of the GSOM is shown in both clustering samples and developing an unsupervised fuzzy rough feature selection (UFRFS) method for gene selection in microarray data. While the superior results of the GSOM, as compared with the related clustering methods, are provided in terms of β-index, DB-index, Dunn-index, and fuzzy rough entropy, the genes selected by the UFRFS are not only better in terms of classification accuracy and a feature evaluation index, but also statistically more significant than the related unsupervised methods. The C-codes of the GSOM and UFRFS are available online at http://avatharamg.webs.com/software-code.

  1. Using Single Free Sorting and Multivariate Exploratory Methods to Design a New Coffee Taster's Flavor Wheel

    PubMed Central

    Sage, Emma; Velez, Martin; Guinard, Jean‐Xavier

    2016-01-01

    Abstract The original Coffee Taster's Flavor Wheel was developed by the Specialty Coffee Assn. of America over 20 y ago, and needed an innovative revision. This study used a novel application of traditional sensory and statistical methods in order to reorganize the new coffee Sensory Lexicon developed by World Coffee Research and Kansas State Univ. into scientifically valid clusters and levels to prepare a new, updated flavor wheel. Seventy‐two experts participated in a modified online rapid free sorting activity (no tasting) to sort flavor attributes of the lexicon. The data from all participants were compiled and agglomerative hierarchical clustering was used to determine the clusters and levels of the flavor attributes, while multidimensional scaling was used to determine the positioning of the clusters around the Coffee Taster's Flavor Wheel. This resulted in a new flavor wheel for the coffee industry. PMID:27861864
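    The sorting-to-wheel pipeline above can be sketched end to end: co-sorting counts become a dissimilarity matrix, average-linkage clustering defines the wheel tiers, and classical (Torgerson) multidimensional scaling gives the 2-D placement. The six-attribute sorting data below are hypothetical, and the specific linkage and MDS variants are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Hypothetical free-sorting data: each row is one taster's grouping of six
# flavour attributes (group ids are arbitrary within a row)
sorts = np.array([
    [0, 0, 1, 1, 2, 2],
    [0, 0, 1, 1, 1, 2],
    [0, 0, 0, 1, 2, 2],
])
n = sorts.shape[1]
co = np.zeros((n, n))
for row in sorts:
    co += np.equal.outer(row, row)          # co-sorting counts
dist = 1.0 - co / len(sorts)                # dissimilarity = 1 - co-sorting rate
np.fill_diagonal(dist, 0.0)

# Agglomerative (average-linkage) clustering into wheel tiers
Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=3, criterion="maxclust")

# Classical (Torgerson) MDS for a 2-D layout around the wheel
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (dist ** 2) @ J
vals, vecs = np.linalg.eigh(B)
coords = vecs[:, -2:] * np.sqrt(np.maximum(vals[-2:], 0.0))
```

    Cutting the dendrogram at several heights, rather than one, is what produces the nested tiers of a flavor wheel.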

  2. A Data Analytics Approach to Discovering Unique Microstructural Configurations Susceptible to Fatigue

    NASA Astrophysics Data System (ADS)

    Jha, S. K.; Brockman, R. A.; Hoffman, R. M.; Sinha, V.; Pilchak, A. L.; Porter, W. J.; Buchanan, D. J.; Larsen, J. M.; John, R.

    2018-05-01

    Principal component analysis and fuzzy c-means clustering algorithms were applied to slip-induced strain and geometric metric data in an attempt to discover unique microstructural configurations and their frequencies of occurrence in statistically representative instantiations of a titanium alloy microstructure. Grain-averaged fatigue indicator parameters were calculated for the same instantiation. The fatigue indicator parameters strongly correlated with the spatial location of the microstructural configurations in the principal components space. The fuzzy c-means clustering method identified clusters of data that varied in terms of their average fatigue indicator parameters. Furthermore, the number of points in each cluster was inversely correlated to the average fatigue indicator parameter. This analysis demonstrates that data-driven methods have significant potential for providing unbiased determination of unique microstructural configurations and their frequencies of occurrence in a given volume from the point of view of strain localization and fatigue crack initiation.
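    A minimal numpy sketch of the pipeline described above: PCA via SVD followed by fuzzy c-means in the leading components. The synthetic two-blob data stand in for the slip-strain and geometric metrics; the cluster count, fuzzifier m, and iteration budget are assumptions.

```python
import numpy as np

def fuzzy_cmeans(X, c=2, m=2.0, iters=100, seed=0):
    """Minimal fuzzy c-means: returns (centers, membership matrix u of shape (c, n))."""
    rng = np.random.default_rng(seed)
    u = rng.random((c, len(X)))
    u /= u.sum(axis=0)
    for _ in range(iters):
        um = u ** m
        centers = um @ X / um.sum(axis=1, keepdims=True)          # weighted means
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-12
        u = d ** (-2.0 / (m - 1.0))                               # membership update
        u /= u.sum(axis=0)
    return centers, u

# Synthetic stand-in for grain-level metrics: two "configurations" in 5-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (40, 5)), rng.normal(2.0, 0.3, (40, 5))])

# PCA via SVD on centred data, then fuzzy clustering in the leading 2 PCs
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T
centers, u = fuzzy_cmeans(scores, c=2)
hard = u.argmax(axis=0)
```

    In the study's setting, per-cluster averages of the fatigue indicator parameter and per-cluster point counts would then be compared, as in the inverse frequency-severity correlation reported above.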

  3. A Context-sensitive Approach to Anonymizing Spatial Surveillance Data: Impact on Outbreak Detection

    PubMed Central

    Cassa, Christopher A.; Grannis, Shaun J.; Overhage, J. Marc; Mandl, Kenneth D.

    2006-01-01

    Objective: The use of spatially based methods and algorithms in epidemiology and surveillance presents privacy challenges for researchers and public health agencies. We describe a novel method for anonymizing individuals in public health data sets by transposing their spatial locations through a process informed by the underlying population density. Further, we measure the impact of the skew on detection of spatial clustering as measured by a spatial scanning statistic. Design: Cases were emergency department (ED) visits for respiratory illness. Baseline ED visit data were injected with artificially created clusters ranging in magnitude, shape, and location. The geocoded locations were then transformed using a de-identification algorithm that accounts for the local underlying population density. Measurements: A total of 12,600 separate weeks of case data with artificially created clusters were combined with control data and the impact on detection of spatial clustering identified by a spatial scan statistic was measured. Results: The anonymization algorithm produced an expected skew of cases that resulted in high values of data set k-anonymity. De-identification that moves points an average distance of 0.25 km lowers the spatial cluster detection sensitivity by less than 4% and lowers the detection specificity less than 1%. Conclusion: A population-density–based Gaussian spatial blurring markedly decreases the ability to identify individuals in a data set while only slightly decreasing the performance of a standardly used outbreak detection tool. These findings suggest new approaches to anonymizing data for spatial epidemiology and surveillance. PMID:16357353
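    The core idea of density-informed blurring can be sketched as below: each case is displaced by Gaussian noise whose scale shrinks with local population density, so urban cases move less than rural ones. The k/sqrt(density) scaling rule here is an assumption for illustration, not the authors' exact algorithm.

```python
import numpy as np

def blur_locations(points, density, k=0.5, seed=0):
    """Displace each (x, y) case location with Gaussian noise whose standard
    deviation is k / sqrt(local population density) -- an assumed scaling,
    chosen so that dense areas (easier to hide in) get smaller displacements."""
    rng = np.random.default_rng(seed)
    sigma = k / np.sqrt(np.asarray(density, dtype=float))
    noise = rng.normal(0.0, 1.0, np.shape(points)) * sigma[:, None]
    return np.asarray(points, dtype=float) + noise

# Demo: cases in a dense area (density 400) should move less than in a sparse one (density 4)
pts = np.zeros((1000, 2))
dens = np.r_[np.full(500, 400.0), np.full(500, 4.0)]
disp = np.linalg.norm(blur_locations(pts, dens) - pts, axis=1)
```

    The evaluation in the paper then re-runs a spatial scan statistic on the blurred coordinates and measures the loss in cluster detection sensitivity and specificity.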

  4. Statistical detection of geographic clusters of resistant Escherichia coli in a regional network with WHONET and SaTScan

    PubMed Central

    Park, Rachel; O'Brien, Thomas F.; Huang, Susan S.; Baker, Meghan A.; Yokoe, Deborah S.; Kulldorff, Martin; Barrett, Craig; Swift, Jamie; Stelling, John

    2016-01-01

    Objectives While antimicrobial resistance threatens the prevention, treatment, and control of infectious diseases, systematic analysis of routine microbiology laboratory test results worldwide can provide early alerts of new threats and promote a timely response. This study explores statistical algorithms for recognizing geographic clustering of multi-resistant microbes within a healthcare network and monitoring the dissemination of new strains over time. Methods Escherichia coli antimicrobial susceptibility data from a three-year period stored in WHONET were analyzed across ten facilities in a healthcare network utilizing SaTScan's spatial multinomial model with two models for defining geographic proximity. We explored geographic clustering of multi-resistance phenotypes within the network and changes in clustering over time. Results Geographic clusters identified using the latitude/longitude and the non-parametric facility-grouping models were similar, while the latter offers greater flexibility and generalizability. Iterative application of the clustering algorithms suggested the possible recognition of the initial appearance of invasive E. coli ST131 in the clinical database of a single hospital and subsequent dissemination to others. Conclusion Systematic analysis of routine antimicrobial resistance susceptibility test results supports the recognition of geographic clustering of microbial phenotypic subpopulations with WHONET and SaTScan, and iterative application of these algorithms can detect the initial appearance in and dissemination across a region, prompting early investigation, response, and containment measures. PMID:27530311

  5. Topology in two dimensions. II - The Abell and ACO cluster catalogues

    NASA Astrophysics Data System (ADS)

    Plionis, Manolis; Valdarnini, Riccardo; Coles, Peter

    1992-09-01

    We apply a method for quantifying the topology of projected galaxy clustering to the Abell and ACO catalogues of rich clusters. We use numerical simulations to quantify the statistical bias involved in using high peaks to define the large-scale structure, and we use the results obtained to correct our observational determinations for this known selection effect and also for possible errors introduced by boundary effects. We find that the Abell cluster sample is consistent with clusters being identified with high peaks of a Gaussian random field, but that the ACO shows a slight meatball shift away from the Gaussian behavior over and above that expected purely from the high-peak selection. The most conservative explanation of this effect is that it is caused by some artefact of the procedure used to select the clusters in the two samples.

  6. Cluster stability in the analysis of mass cytometry data.

    PubMed

    Melchiotti, Rossella; Gracio, Filipe; Kordasti, Shahram; Todd, Alan K; de Rinaldis, Emanuele

    2017-01-01

    Manual gating has been traditionally applied to cytometry data sets to identify cells based on protein expression. The advent of mass cytometry allows for a higher number of proteins to be simultaneously measured on cells, therefore providing a means to define cell clusters in a high dimensional expression space. This enhancement, whilst opening unprecedented opportunities for single cell-level analyses, makes the incremental replacement of manual gating with automated clustering a compelling need. To this aim many methods have been implemented and their successful applications demonstrated in different settings. However, the reproducibility of automatically generated clusters is proving challenging and an analytical framework to distinguish spurious clusters from more stable entities, and presumably more biologically relevant ones, is still missing. One way to estimate cell clusters' stability is the evaluation of their consistent re-occurrence within- and between-algorithms, a metric that is commonly used to evaluate results from gene expression. Herein we report the usage and importance of cluster stability evaluations, when applied to results generated from three popular clustering algorithms - SPADE, FLOCK and PhenoGraph - run on four different data sets. These algorithms were shown to generate clusters with various degrees of statistical stability, many of them being unstable. By comparing the results of automated clustering with manually gated populations, we illustrate how information on cluster stability can assist towards a more rigorous and informed interpretation of clustering results. We also explore the relationships between statistical stability and other properties such as clusters' compactness and isolation, demonstrating that whilst cluster stability is linked to other properties it cannot be reliably predicted by any of them. 
Our study proposes the introduction of cluster stability as a necessary checkpoint for cluster interpretation and contributes to the construction of a more systematic and standardized analytical framework for the assessment of cytometry clustering results. © 2016 International Society for Advancement of Cytometry.
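    One common way to quantify the stability discussed above is bootstrap resampling with a best-Jaccard match between reference and resampled clusters (in the spirit of Hennig's clusterwise stability; the paper's exact within- and between-algorithm metric may differ). The Ward clustering and two-blob demo data are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def clusters_of(X, k):
    """Cluster rows of X into k groups (Ward linkage) and return index sets."""
    labels = fcluster(linkage(X, method="ward"), t=k, criterion="maxclust")
    return [set(np.where(labels == c)[0]) for c in np.unique(labels)]

def bootstrap_stability(X, k=2, n_boot=10, seed=0):
    """Per-cluster stability: mean best Jaccard overlap between each reference
    cluster and the clusters recovered on bootstrap resamples. Note that the
    bootstrap caps Jaccard well below 1 (each resample omits ~37% of points),
    so even perfectly stable clusters score roughly 0.6-0.7 in this variant."""
    rng = np.random.default_rng(seed)
    ref = clusters_of(X, k)
    scores = np.zeros(len(ref))
    for _ in range(n_boot):
        idx = rng.choice(len(X), size=len(X), replace=True)
        boot = [set(idx[list(c)]) for c in clusters_of(X[idx], k)]   # map to original ids
        for i, r in enumerate(ref):
            scores[i] += max(len(r & b) / len(r | b) for b in boot)
    return scores / n_boot

# Two well-separated blobs: both clusters should look stable
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.1, (30, 2)), rng.normal(5.0, 0.1, (30, 2))])
scores = bootstrap_stability(X)
```

    Unstable clusters, by contrast, fragment differently on each resample and score near zero, which is the checkpoint behaviour the abstract advocates testing before biological interpretation.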

  7. Source clustering in the Hi-GAL survey determined using a minimum spanning tree method

    NASA Astrophysics Data System (ADS)

    Beuret, M.; Billot, N.; Cambrésy, L.; Eden, D. J.; Elia, D.; Molinari, S.; Pezzuto, S.; Schisano, E.

    2017-01-01

    Aims: We investigate the clustering of the far-infrared sources from the Herschel infrared Galactic Plane Survey (Hi-GAL) in the Galactic longitude range of -71 to 67 deg. These clumps, and their spatial distribution, are an imprint of the original conditions within a molecular cloud. This will produce a catalogue of over-densities. Methods: The minimum spanning tree (MST) method was used to identify the over-densities in two dimensions. The catalogue was further refined by folding in heliocentric distances, resulting in more reliable over-densities, which are cluster candidates. Results: We found 1633 over-densities with more than ten members. Of these, 496 are defined as cluster candidates because of the reliability of the distances, with a further 1137 potential cluster candidates. The spatial distributions of the cluster candidates are different in the first and fourth quadrants, with all clusters following the spiral structure of the Milky Way. The cluster candidates are fractal. The clump mass functions of the clustered and isolated clumps are statistically indistinguishable from each other and are consistent with Kroupa's initial mass function. Hi-GAL is a key-project of the Herschel Space Observatory survey (Pilbratt et al. 2010) and uses the PACS (Poglitsch et al. 2010) and SPIRE (Griffin et al. 2010) cameras in parallel mode. The catalogues of cluster candidates and potential clusters are only available at the CDS via anonymous ftp to http://cdsarc.u-strasbg.fr (http://130.79.128.5) or via http://cdsarc.u-strasbg.fr/viz-bin/qcat?J/A+A/597/A114
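    The generic MST over-density method can be sketched as follows: build the minimum spanning tree of the source positions, cut edges longer than a critical length, and keep fragments with more than ten members (the paper's membership threshold). The cut length and demo field below are illustrative assumptions, not Hi-GAL's calibrated values.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_overdensities(points, cut, min_members=10):
    """MST over-density finder: drop MST edges longer than `cut` and return
    the index arrays of the surviving fragments with >= min_members points."""
    d = squareform(pdist(points))
    mst = minimum_spanning_tree(d).toarray()
    mst[mst > cut] = 0.0                             # sever long edges
    _, labels = connected_components(mst != 0, directed=False)
    sizes = np.bincount(labels)
    return [np.where(labels == c)[0] for c in np.where(sizes >= min_members)[0]]

# Demo field: two tight 15-source clumps plus a sparse row of isolated sources
rng = np.random.default_rng(0)
field = np.vstack([rng.normal(0.0, 0.05, (15, 2)),
                   rng.normal(3.0, 0.05, (15, 2)),
                   np.column_stack([np.arange(10.0), np.full(10, 10.0)])])
groups = mst_overdensities(field, cut=0.5)
```

    Folding in heliocentric distances, as the paper does, then filters these 2-D over-densities down to genuine cluster candidates.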

  8. Cluster Free Energies from Simple Simulations of Small Numbers of Aggregants: Nucleation of Liquid MTBE from Vapor and Aqueous Phases.

    PubMed

    Patel, Lara A; Kindt, James T

    2017-03-14

    We introduce a global fitting analysis method to obtain free energies of association of noncovalent molecular clusters using equilibrated cluster size distributions from unbiased constant-temperature molecular dynamics (MD) simulations. Because the systems simulated are small enough that the law of mass action does not describe the aggregation statistics, the method relies on iteratively determining a set of cluster free energies that, using appropriately weighted sums over all possible partitions of N monomers into clusters, produces the best-fit size distribution. The quality of these fits can be used as an objective measure of self-consistency to optimize the cutoff distance that determines how clusters are defined. To showcase the method, we have simulated a united-atom model of methyl tert-butyl ether (MTBE) in the vapor phase and in explicit water solution over a range of system sizes (up to 95 MTBE in the vapor phase and 60 MTBE in the aqueous phase) and concentrations at 273 K. The resulting size-dependent cluster free energy functions follow a form derived from classical nucleation theory (CNT) quite well over the full range of cluster sizes, although deviations are more pronounced for small cluster sizes. The CNT fit to cluster free energies yielded surface tensions that were in both cases lower than those for the simulated planar interfaces. We use a simple model to derive a condition for minimizing non-ideal effects on cluster size distributions and show that the cutoff distance that yields the best global fit is consistent with this condition.

  9. Focusing cosmic telescopes: systematics of strong lens modeling

    NASA Astrophysics Data System (ADS)

    Johnson, Traci Lin; Sharon, Keren

    2018-01-01

    The use of strong gravitational lensing by galaxy clusters has become a popular method for studying the high redshift universe. While diverse in computational methods, lens modeling techniques have established means for determining statistical errors on cluster masses and magnifications. However, the systematic errors have yet to be quantified, arising from the number of constraints, availability of spectroscopic redshifts, and various types of image configurations. I will be presenting my dissertation work on quantifying systematic errors in parametric strong lensing techniques. I have participated in the Hubble Frontier Fields lens model comparison project, using simulated clusters to compare the accuracy of various modeling techniques. I have extended this project to understanding how changing the quantity of constraints affects the mass and magnification. I will also present my recent work extending these studies to clusters in the Outer Rim Simulation. These clusters are typical of the clusters found in wide-field surveys, in mass and lensing cross-section. These clusters have fewer constraints than the HFF clusters and thus, are more susceptible to systematic errors. With the wealth of strong lensing clusters discovered in surveys such as SDSS, SPT, DES, and in the future, LSST, this work will be influential in guiding the lens modeling efforts and follow-up spectroscopic campaigns.

  10. Modeling of correlated data with informative cluster sizes: An evaluation of joint modeling and within-cluster resampling approaches.

    PubMed

    Zhang, Bo; Liu, Wei; Zhang, Zhiwei; Qu, Yanping; Chen, Zhen; Albert, Paul S

    2017-08-01

    Joint modeling and within-cluster resampling are two approaches that are used for analyzing correlated data with informative cluster sizes. Motivated by a developmental toxicity study, we examined the performances and validity of these two approaches in testing covariate effects in generalized linear mixed-effects models. We show that the joint modeling approach is robust to the misspecification of cluster size models in terms of Type I and Type II errors when the corresponding covariates are not included in the random effects structure; otherwise, statistical tests may be affected. We also evaluate the performance of the within-cluster resampling procedure and thoroughly investigate its validity in modeling correlated data with informative cluster sizes. We show that within-cluster resampling is a valid alternative to joint modeling for cluster-specific covariates, but it is invalid for time-dependent covariates. The two methods are applied to a developmental toxicity study that investigated the effect of exposure to diethylene glycol dimethyl ether.

  11. A hierarchical clustering methodology for the estimation of toxicity.

    PubMed

    Martin, Todd M; Harten, Paul; Venkatapathy, Raghuraman; Das, Shashikala; Young, Douglas M

    2008-01-01

    A quantitative structure-activity relationship (QSAR) methodology based on hierarchical clustering was developed to predict toxicological endpoints. This methodology utilizes Ward's method to divide a training set into a series of structurally similar clusters. The structural similarity is defined in terms of 2-D physicochemical descriptors (such as connectivity and E-state indices). A genetic algorithm-based technique is used to generate statistically valid QSAR models for each cluster (using the pool of descriptors described above). The toxicity for a given query compound is estimated using the weighted average of the predictions from the closest cluster from each step in the hierarchical clustering assuming that the compound is within the domain of applicability of the cluster. The hierarchical clustering methodology was tested using a Tetrahymena pyriformis acute toxicity data set containing 644 chemicals in the training set and with two prediction sets containing 339 and 110 chemicals. The results from the hierarchical clustering methodology were compared to the results from several different QSAR methodologies.
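    A single-level version of the cluster-then-model idea can be sketched as below: Ward's method partitions the training set, one regression model is fitted per cluster, and a query is predicted by its nearest cluster's model. The descriptors, toxicity values, and plain least-squares fit are assumptions for illustration; the paper's genetic-algorithm descriptor selection and multi-level weighted averaging are omitted.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy stand-in for 2-D structural descriptors (rows = training chemicals)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 4)), rng.normal(5.0, 1.0, (30, 4))])
y = X @ np.array([0.5, -0.2, 0.1, 0.3]) + rng.normal(0.0, 0.05, 60)  # "toxicity"

# Ward's method splits the training set into structurally similar clusters
labels = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")

def fit_cluster_models(X, y, labels):
    """One least-squares model (with intercept) per structural cluster."""
    models = {}
    for c in np.unique(labels):
        m = labels == c
        A = np.column_stack([X[m], np.ones(m.sum())])
        models[c] = np.linalg.lstsq(A, y[m], rcond=None)[0]
    return models

def predict(x, X, labels, models):
    """Assign the query to the cluster with the nearest centroid, then apply
    that cluster's model (a crude stand-in for the domain-of-applicability check)."""
    cents = {c: X[labels == c].mean(axis=0) for c in models}
    c = min(cents, key=lambda c: np.linalg.norm(x - cents[c]))
    coef = models[c]
    return float(x @ coef[:-1] + coef[-1])

models = fit_cluster_models(X, y, labels)
```

    The full methodology repeats this at every level of the hierarchy and averages the nearest-cluster predictions, which is what lets structurally local models dominate the estimate.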

  12. Defining syndromes using cattle meat inspection data for syndromic surveillance purposes: a statistical approach with the 2005–2010 data from ten French slaughterhouses

    PubMed Central

    2013-01-01

    Background The slaughterhouse is a central processing point for food animals and thus a source of both demographic data (age, breed, sex) and health-related data (reason for condemnation and condemned portions) that are not available through other sources. Using these data for syndromic surveillance is therefore tempting. However many possible reasons for condemnation and condemned portions exist, making the definition of relevant syndromes challenging. The objective of this study was to determine a typology of cattle with at least one portion of the carcass condemned in order to define syndromes. Multiple factor analysis (MFA) in combination with clustering methods was performed using both health-related data and demographic data. Results Analyses were performed on 381,186 cattle with at least one portion of the carcass condemned among the 1,937,917 cattle slaughtered in ten French abattoirs. Results of the MFA and clustering methods led to 12 clusters considered as stable according to year of slaughter and slaughterhouse. One cluster was specific to a disease of public health importance (cysticercosis). Two clusters were linked to the slaughtering process (fecal contamination of heart or lungs and deterioration lesions). Two clusters respectively characterized by chronic liver lesions and chronic peritonitis could be linked to diseases of economic importance to farmers. Three clusters could be linked respectively to reticulo-pericarditis, fatty liver syndrome and farmer’s lung syndrome, which are related to both diseases of economic importance to farmers and herd management issues. Three clusters respectively characterized by arthritis, myopathy and Dark Firm Dry (DFD) meat could notably be linked to animal welfare issues. Finally, one cluster, characterized by bronchopneumonia, could be linked to both animal health and herd management issues. 
Conclusion The statistical approach of combining multiple factor analysis with cluster analysis showed its relevance for the detection of syndromes using available large and complex slaughterhouse data. The advantages of this statistical approach are to i) define groups of reasons for condemnation based on meat inspection data, ii) help grouping reasons for condemnation among a list of various possible reasons for condemnation for which a consensus among experts could be difficult to reach, iii) assign each animal to a single syndrome, which allows monitoring changes in syndrome trends in order to detect unusual patterns in known diseases and the emergence of new diseases. PMID:23628140

  13. Prediction of CpG-island function: CpG clustering vs. sliding-window methods

    PubMed Central

    2010-01-01

    Background Unmethylated stretches of CpG dinucleotides (CpG islands) are an outstanding property of mammalian genomes. Conventionally, these regions are detected by sliding window approaches using %G + C, CpG observed/expected ratio and length thresholds as main parameters. Recently, clustering methods directly detect clusters of CpG dinucleotides as a statistical property of the genome sequence. Results We compare sliding-window to clustering (i.e. CpGcluster) predictions by applying new ways to detect putative functionality of CpG islands. Analyzing the co-localization with several genomic regions as a function of window size vs. statistical significance (p-value), CpGcluster shows a higher overlap with promoter regions and highly conserved elements, at the same time showing less overlap with Alu retrotransposons. The major difference in the prediction was found for short islands (CpG islets), often exclusively predicted by CpGcluster. Many of these islets seem to be functional, as they are unmethylated, highly conserved and/or located within the promoter region. Finally, we show that window-based islands can spuriously overlap several, differentially regulated promoters as well as different methylation domains, which might indicate a wrong merge of several CpG islands into a single, very long island. The shorter CpGcluster islands seem to be much more specific when concerning the overlap with alternative transcription start sites or the detection of homogenous methylation domains. Conclusions The main difference between sliding-window approaches and clustering methods is the length of the predicted islands. Short islands, often differentially methylated, are almost exclusively predicted by CpGcluster. This suggests that CpGcluster may be the algorithm of choice to explore the function of these short, but putatively functional CpG islands. PMID:20500903
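    The conventional sliding-window criteria mentioned in the Background can be sketched directly. The %G+C and observed/expected thresholds below follow the classic Gardiner-Garden & Frommer style; the window and step sizes are illustrative assumptions.

```python
def cpg_windows(seq, win=200, step=50, gc_min=0.5, oe_min=0.6):
    """Flag windows satisfying the classic sliding-window CpG-island criteria:
    %G+C >= gc_min and CpG observed/expected >= oe_min, where
    obs/exp = count(CG) * win / (count(C) * count(G))."""
    hits = []
    for i in range(0, len(seq) - win + 1, step):
        w = seq[i:i + win]
        g, c = w.count("G"), w.count("C")
        cpg = w.count("CG")                      # CpG dinucleotides cannot overlap
        gc = (g + c) / win
        oe = cpg * win / (c * g) if c and g else 0.0
        if gc >= gc_min and oe >= oe_min:
            hits.append((i, i + win))
    return hits
```

    A distance-based detector such as CpGcluster instead models the spacing between consecutive CpGs, with no fixed window length, which is why it can recover the short islets that window methods miss.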

  14. Combining evidence, biomedical literature and statistical dependence: new insights for functional annotation of gene sets

    PubMed Central

    Aubry, Marc; Monnier, Annabelle; Chicault, Celine; de Tayrac, Marie; Galibert, Marie-Dominique; Burgun, Anita; Mosser, Jean

    2006-01-01

    Background Large-scale genomic studies based on transcriptome technologies provide clusters of genes that need to be functionally annotated. The Gene Ontology (GO) implements a controlled vocabulary organised into three hierarchies: cellular components, molecular functions and biological processes. This terminology allows a coherent and consistent description of the knowledge about gene functions. The GO terms related to genes come primarily from semi-automatic annotations made by trained biologists (annotation based on evidence) or text-mining of the published scientific literature (literature profiling). Results We report an original functional annotation method based on a combination of evidence and literature that overcomes the weaknesses and the limitations of each approach. It relies on the Gene Ontology Annotation database (GOA Human) and the PubGene biomedical literature index. We support these annotations with statistically associated GO terms and retrieve associative relations across the three GO hierarchies to emphasise the major pathways involved by a gene cluster. Both annotation methods and associative relations were quantitatively evaluated with a reference set of 7397 genes and a multi-cluster study of 14 clusters. We also validated the biological appropriateness of our hybrid method with the annotation of a single gene (cdc2) and that of a down-regulated cluster of 37 genes identified by a transcriptome study of an in vitro enterocyte differentiation model (CaCo-2 cells). Conclusion The combination of both approaches is more informative than either separate approach: literature mining can enrich an annotation based only on evidence. Text-mining of the literature can also find valuable associated MEDLINE references that confirm the relevance of the annotation. Eventually, GO terms networks can be built with associative relations in order to highlight cooperative and competitive pathways and their connected molecular functions. PMID:16674810

  15. Regional Patterns and Spatial Clusters of Nonstationarities in Annual Peak Instantaneous Streamflow

    NASA Astrophysics Data System (ADS)

    White, K. D.; Baker, B.; Mueller, C.; Villarini, G.; Foley, P.; Friedman, D.

    2017-12-01

    Information about hydrologic changes resulting from changes in climate, land use, and land cover is a necessity for the planning and design of water resources infrastructure. The United States Army Corps of Engineers (USACE) evaluated and selected 12 methods to detect abrupt and slowly varying nonstationarities in records of maximum peak annual flows. They released a publicly available tool [1] in 2016 and a guidance document in 2017 to support identification of nonstationarities in a reproducible manner using a robust statistical framework. This statistical framework has now been applied to streamflow records across the continental United States to explore the presence of regional patterns and spatial clusters of nonstationarities in peak annual flow. Incorporating this geographic dimension into the detection of nonstationarities provides valuable insight for the process of attribution of these significant changes. This poster summarizes the methods used and provides the results of the regional analysis. [1] Available here - http://www.corpsclimate.us/ptcih.cfm
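    As one concrete example of the kind of slowly-varying nonstationarity test described above, the Mann-Kendall trend test is sketched below (normal approximation, no tie correction). It is a standard choice for annual peak-flow records, though it is not claimed here to be among the 12 USACE methods verbatim.

```python
import numpy as np
from math import erf, sqrt

def mann_kendall(x):
    """Mann-Kendall trend test: returns the statistic S (sum of pairwise
    signs) and a two-sided p-value from the normal approximation with
    continuity correction; ties are ignored in this minimal sketch."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = sum(np.sign(x[j] - x[i]) for i in range(n) for j in range(i + 1, n))
    var = n * (n - 1) * (2 * n + 5) / 18.0
    z = (s - np.sign(s)) / sqrt(var) if s != 0 else 0.0
    p = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(z) / sqrt(2.0))))
    return float(s), p
```

    Mapping per-gauge p-values from tests like this one is what allows the regional patterns and spatial clusters of nonstationarities to emerge.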

  16. Population Genomics and the Statistical Values of Race: An Interdisciplinary Perspective on the Biological Classification of Human Populations and Implications for Clinical Genetic Epidemiological Research

    PubMed Central

    Maglo, Koffi N.; Mersha, Tesfaye B.; Martin, Lisa J.

    2016-01-01

    The biological status and biomedical significance of the concept of race as applied to humans continue to be contentious issues despite the use of advanced statistical and clustering methods to determine continental ancestry. It is thus imperative for researchers to understand the limitations as well as potential uses of the concept of race in biology and biomedicine. This paper deals with the theoretical assumptions behind cluster analysis in human population genomics. Adopting an interdisciplinary approach, it demonstrates that the hypothesis that attributes the clustering of human populations to “frictional” effects of landform barriers at continental boundaries is empirically incoherent. It then contrasts the scientific status of the “cluster” and “cline” constructs in human population genomics, and shows how cluster may be instrumentally produced. It also shows how statistical values of race vindicate Darwin's argument that race is evolutionarily meaningless. Finally, the paper explains why, due to spatiotemporal parameters, evolutionary forces, and socio-cultural factors influencing population structure, continental ancestry may be pragmatically relevant to global and public health genomics. Overall, this work demonstrates that, from a biological systematic and evolutionary taxonomical perspective, human races/continental groups or clusters have no natural meaning or objective biological reality. In fact, the utility of racial categorizations in research and in clinics can be explained by spatiotemporal parameters, socio-cultural factors, and evolutionary forces affecting disease causation and treatment response. PMID:26925096

  17. An X-ray method for detecting substructure in galaxy clusters - Application to Perseus, A2256, Centaurus, Coma, and Sersic 40/6

    NASA Technical Reports Server (NTRS)

    Mohr, Joseph J.; Fabricant, Daniel G.; Geller, Margaret J.

    1993-01-01

    We use the moments of the X-ray surface brightness distribution to constrain the dynamical state of a galaxy cluster. Using X-ray observations from the Einstein Observatory IPC, we measure the first moment FM, the ellipsoidal orientation angle, and the axial ratio at a sequence of radii in the cluster. We argue that a significant variation in the image centroid FM as a function of radius is evidence for a nonequilibrium feature in the intracluster medium (ICM) density distribution. In simple terms, centroid shifts indicate that the center of mass of the ICM varies with radius. This variation is a tracer of continuing dynamical evolution. For each cluster, we evaluate the significance of variations in the centroid of the IPC image by computing the same statistics on an ensemble of simulated cluster images. In producing these simulated images we include X-ray point source emission, telescope vignetting, Poisson noise, and characteristics of the IPC. Application of this new method to five Abell clusters reveals that the core of each one has significant substructure. In addition, we find significant variations in the orientation angle and the axial ratio for several of the clusters.
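The centroid-shift idea can be sketched in a few lines: compute the flux-weighted centroid inside apertures of growing radius and watch for drift. This is a toy version of the moment analysis; the IPC instrument modelling and Monte Carlo significance machinery of the paper are omitted:

```python
import numpy as np

def centroid_shifts(image, center, radii):
    """Flux-weighted centroid (row, col) of `image` inside circular
    apertures of increasing radius around a fixed starting `center`.
    A centroid that drifts with radius hints at substructure."""
    rows, cols = np.indices(image.shape)
    r = np.hypot(rows - center[0], cols - center[1])
    out = []
    for rad in radii:
        m = r <= rad
        w = image[m]
        out.append(((rows[m] * w).sum() / w.sum(),
                    (cols[m] * w).sum() / w.sum()))
    return out

# toy surface-brightness map: one symmetric Gaussian blob at (50, 50)
y, x = np.indices((101, 101))
img = np.exp(-((y - 50) ** 2 + (x - 50) ** 2) / (2 * 10.0 ** 2))
print(centroid_shifts(img, (50, 50), [10, 20, 40]))
```

For the symmetric blob the centroid stays put at every radius; adding a secondary blob off-centre makes the large-aperture centroid drift toward it, which is exactly the signature the method flags.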

  18. Cluster analysis and its application to healthcare claims data: a study of end-stage renal disease patients who initiated hemodialysis.

    PubMed

    Liao, Minlei; Li, Yunfeng; Kianifard, Farid; Obi, Engels; Arcona, Stephen

    2016-03-02

    Cluster analysis (CA) is a frequently used applied statistical technique that helps to reveal hidden structures and "clusters" found in large data sets. However, this method has not been widely used in large healthcare claims databases, where the distribution of expenditure data is commonly severely skewed. The purpose of this study was to identify cost change patterns of patients with end-stage renal disease (ESRD) who initiated hemodialysis (HD) by applying different clustering methods. A retrospective, cross-sectional, observational study was conducted using the Truven Health MarketScan® Research Databases. Patients aged ≥18 years with ≥2 ESRD diagnoses who initiated HD between 2008 and 2010 were included. The K-means CA method and hierarchical CA with various linkage methods were applied to all-cause costs within the baseline (12 months pre-HD) and follow-up (12 months post-HD) periods to identify clusters. Demographic, clinical, and cost information was extracted from both periods, and then examined by cluster. A total of 18,380 patients were identified. Meaningful all-cause cost clusters were generated using K-means CA and hierarchical CA with either flexible beta or Ward's methods. Based on cluster sample sizes and change of cost patterns, the K-means CA method and 4 clusters were selected: Cluster 1: Average to High (n = 113); Cluster 2: Very High to High (n = 89); Cluster 3: Average to Average (n = 16,624); and Cluster 4: Increasing Costs, High at Both Points (n = 1554). Median cost changes in the 12-month pre-HD and post-HD periods increased from $185,070 to $884,605 for Cluster 1 (Average to High), decreased from $910,930 to $157,997 for Cluster 2 (Very High to High), were relatively stable and remained low from $15,168 to $13,026 for Cluster 3 (Average to Average), and increased from $57,909 to $193,140 for Cluster 4 (Increasing Costs, High at Both Points).
Relatively stable costs after starting HD were associated with more stable comorbidity index scores across the pre- and post-HD periods, while increasing costs were associated with more sharply increasing comorbidity scores. The K-means CA method appeared to be the most appropriate for healthcare claims data with highly skewed cost information when taking into account both the change of cost patterns and the sample size of the smallest cluster.
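The cost-clustering step can be sketched with scikit-learn's KMeans. The synthetic log-normal costs and the log transform below are illustrative assumptions, since the study's exact preprocessing is not described in the abstract:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# synthetic pre/post-HD annual costs: log-normal, i.e. heavily right-skewed
pre = rng.lognormal(mean=9.5, sigma=1.0, size=500)
post = pre * rng.lognormal(mean=0.0, sigma=0.8, size=500)
X = np.log(np.column_stack([pre, post]))   # log scale tames the skew

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
for c in range(4):
    members = km.labels_ == c
    print(f"cluster {c}: n={members.sum():4d}, "
          f"median pre=${np.median(pre[members]):,.0f}, "
          f"median post=${np.median(post[members]):,.0f}")
```

Each resulting cluster can then be summarized by its pre/post median costs, mirroring the "Average to High" style labels used in the study.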

  19. Descriptive Epidemiology of Typhoid Fever during an Epidemic in Harare, Zimbabwe, 2012

    PubMed Central

    Polonsky, Jonathan A.; Martínez-Pino, Isabel; Nackers, Fabienne; Chonzi, Prosper; Manangazira, Portia; Van Herp, Michel; Maes, Peter; Porten, Klaudia; Luquero, Francisco J.

    2014-01-01

    Background Typhoid fever remains a significant public health problem in developing countries. In October 2011, a typhoid fever epidemic was declared in Harare, Zimbabwe - the fourth enteric infection epidemic since 2008. To orient control activities, we described the epidemiology and spatiotemporal clustering of the epidemic in Dzivaresekwa and Kuwadzana, the two most affected suburbs of Harare. Methods A typhoid fever case-patient register was analysed to describe the epidemic. To explore clustering, we constructed a dataset comprising GPS coordinates of case-patient residences and randomly sampled residential locations (spatial controls). The scale and significance of clustering was explored with Ripley K functions. Cluster locations were determined by a random labelling technique and confirmed using Kulldorff's spatial scan statistic. Principal Findings We analysed data from 2570 confirmed and suspected case-patients, and found significant spatiotemporal clustering of typhoid fever in two non-overlapping areas, which appeared to be linked to environmental sources. Peak relative risk was more than six times greater than in areas lying outside the cluster ranges. Clusters were identified in similar geographical ranges by both random labelling and Kulldorff's spatial scan statistic. The spatial scale at which typhoid fever clustered was highly localised, with significant clustering at distances up to 4.5 km and peak levels at approximately 3.5 km. The epicentre of infection transmission shifted from one cluster to the other during the course of the epidemic. Conclusions This study demonstrated highly localised clustering of typhoid fever during an epidemic in an urban African setting, and highlights the importance of spatiotemporal analysis for making timely decisions about targeting prevention and control activities and reinforcing treatment during epidemics.
This approach should be integrated into existing surveillance systems to facilitate early detection of epidemics and identify their spatial range. PMID:25486292
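The Ripley K function used above can be estimated naively (no edge correction) as the area-scaled mean number of neighbours within distance t; under complete spatial randomness K(t) ≈ πt², so an excess indicates clustering. A minimal sketch:

```python
import numpy as np

def ripley_k(points, radii, area):
    """Naive Ripley K estimate (no edge correction):
    K(t) = area * (# ordered pairs closer than t) / (n * (n - 1)).
    Under complete spatial randomness, K(t) is close to pi * t**2."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    diff = pts[:, None, :] - pts[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(d, np.inf)          # exclude self-pairs
    return np.array([area * (d <= t).sum() / (n * (n - 1)) for t in radii])

rng = np.random.default_rng(1)
uniform_pts = rng.random((500, 2))                          # CSR in unit square
clustered_pts = 0.5 + 0.02 * rng.standard_normal((500, 2))  # one tight clump
k_uni = ripley_k(uniform_pts, [0.1], 1.0)[0]
k_clu = ripley_k(clustered_pts, [0.1], 1.0)[0]
print(k_uni, k_clu)
```

In practice (as in the study) the observed K curve is compared against envelopes from random relabelling of cases and controls, and edge-corrected estimators are preferred near the study-area boundary.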

  20. Phylogenetic investigation of a statewide HIV-1 epidemic reveals ongoing and active transmission networks among men who have sex with men

    PubMed Central

    Chan, Philip A.; Hogan, Joseph W.; Huang, Austin; DeLong, Allison; Salemi, Marco; Mayer, Kenneth H.; Kantor, Rami

    2015-01-01

    Background Molecular epidemiologic evaluation of HIV-1 transmission networks can elucidate behavioral components of transmission that can be targets for intervention. Methods We combined phylogenetic and statistical approaches using pol sequences from patients diagnosed 2004-2011 at a large HIV center in Rhode Island, following 75% of the state’s HIV population. Phylogenetic trees were constructed using maximum likelihood and putative transmission clusters were evaluated using latent class analyses (LCA) to determine association of cluster size with underlying demographic/behavioral characteristics. A logistic growth model was used to assess intra-cluster dynamics over time and predict “active” clusters that were more likely to harbor undiagnosed infections. Results Of 1,166 HIV-1 subtype B sequences, 31% were distributed among 114 statistically-supported, monophyletic clusters (range: 2-15 sequences/cluster). Sequences from men who have sex with men (MSM) formed 52% of clusters. LCA demonstrated that sequences from recently diagnosed (2008-2011) MSM with primary HIV infection (PHI) and other sexually transmitted infections (STIs) were more likely to form larger clusters (Odds Ratio 1.62-11.25, p<0.01). MSM in clusters were more likely to have anonymous partners and meet partners at sex clubs and pornographic stores. Four large clusters with 38 sequences (100% male, 89% MSM) had a high-probability of harboring undiagnosed infections and included younger MSM with PHI and STIs. Conclusions In this first large-scale molecular epidemiologic investigation of HIV-1 transmission in New England, sexual networks among recently diagnosed MSM with PHI and concomitant STIs contributed to ongoing transmission. Characterization of transmission dynamics revealed actively growing clusters which may be targets for intervention. PMID:26258569

  1. Distribution-based fuzzy clustering of electrical resistivity tomography images for interface detection

    NASA Astrophysics Data System (ADS)

    Ward, W. O. C.; Wilkinson, P. B.; Chambers, J. E.; Oxby, L. S.; Bai, L.

    2014-04-01

    A novel method for the effective identification of bedrock subsurface elevation from electrical resistivity tomography images is described. Identifying subsurface boundaries in the topographic data can be difficult due to smoothness constraints used in inversion, so a statistical population-based approach is used that extends previous work in calculating isoresistivity surfaces. The analysis framework involves a procedure for guiding a clustering approach based on the fuzzy c-means algorithm. An approximation of resistivity distributions, found using kernel density estimation, was utilized as a means of guiding the cluster centroids used to classify data. A fuzzy method was chosen over hard clustering due to uncertainty in hard edges in the topography data, and a measure of clustering uncertainty was identified based on the reciprocal of cluster membership. The algorithm was validated using a direct comparison of known observed bedrock depths at two 3-D survey sites, using real-time GPS information of exposed bedrock by quarrying on one site, and borehole logs at the other. Results show similarly accurate detection as a leading isosurface estimation method, and the proposed algorithm requires significantly less user input and prior site knowledge. Furthermore, the method is effectively dimension-independent and will scale to data of increased spatial dimensions without a significant effect on the runtime. A discussion of the results from automated versus supervised analysis is also presented.
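A minimal fuzzy c-means sketch illustrating the membership degrees on which such an uncertainty measure rests; the KDE-guided centroid seeding described in the abstract is omitted, and the 1-D "resistivity" data are synthetic:

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, iters=100, seed=0):
    """Minimal fuzzy c-means. Returns (centroids, U) where U[i, k] is
    the membership of sample i in cluster k (rows sum to 1); a low
    maximum membership flags points whose assignment is uncertain."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(iters):
        W = U ** m
        centroids = (W.T @ X) / W.sum(axis=0)[:, None]
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1))      # standard FCM membership update
        U = inv / inv.sum(axis=1, keepdims=True)
    return centroids, U

# two well-separated 1-D "resistivity" populations
X = np.vstack([np.zeros((50, 1)), 10.0 * np.ones((50, 1))])
X = X + 0.1 * np.random.default_rng(2).standard_normal(X.shape)
centroids, U = fuzzy_cmeans(X, 2)
print(np.sort(centroids.ravel()).round(2))
```

Points near a boundary between the two populations would receive memberships close to 0.5/0.5, which is where a reciprocal-of-membership uncertainty measure peaks.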

  2. Data-driven inference for the spatial scan statistic.

    PubMed

    Almeida, Alexandre C L; Duarte, Anderson R; Duczmal, Luiz H; Oliveira, Fernando L P; Takahashi, Ricardo H C

    2011-08-02

    Kulldorff's spatial scan statistic for aggregated area maps searches for clusters of cases without specifying their size (number of areas) or geographic location in advance. Their statistical significance is tested while adjusting for the multiple testing inherent in such a procedure. However, as is shown in this work, this adjustment is not done in an even manner for all possible cluster sizes. A modification is proposed to the usual inference test of the spatial scan statistic, incorporating additional information about the size of the most likely cluster found. A new interpretation of the results of the spatial scan statistic is proposed, posing a modified inference question: what is the probability that the null hypothesis is rejected for the original observed cases map with a most likely cluster of size k, taking into account only those most likely clusters of size k found under the null hypothesis for comparison? This question is especially important when the p-value computed by the usual inference process is near the alpha significance level, regarding the correctness of the decision based on this inference. A practical procedure is provided to make more accurate inferences about the most likely cluster found by the spatial scan statistic.
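The modified inference can be sketched with a 1-D toy scan: find the most likely cluster (MLC) and its size k, then compare its statistic only against null replicates whose MLC also has size k. The standardized window excess below is a simplified stand-in for Kulldorff's Poisson likelihood ratio:

```python
import numpy as np

def most_likely_cluster(counts, expected, max_len):
    """Scan all contiguous windows up to max_len areas; return the
    (statistic, size) of the best window. The standardized excess
    (obs - exp) / sqrt(exp) replaces the full likelihood ratio here."""
    best, best_k = -np.inf, 0
    n = len(counts)
    for k in range(1, max_len + 1):
        for i in range(n - k + 1):
            obs = counts[i:i + k].sum()
            exp = expected[i:i + k].sum()
            stat = (obs - exp) / np.sqrt(exp)
            if stat > best:
                best, best_k = stat, k
    return best, best_k

def conditional_p(counts, expected, max_len=10, sims=499, seed=0):
    """Monte Carlo p-value conditioned on the observed MLC size k:
    compare only against null replicates whose MLC also has size k."""
    rng = np.random.default_rng(seed)
    obs_stat, obs_k = most_likely_cluster(counts, expected, max_len)
    same_k = []
    for _ in range(sims):
        s, k = most_likely_cluster(rng.poisson(expected), expected, max_len)
        if k == obs_k:
            same_k.append(s)
    if not same_k:
        return 1.0
    return (sum(s >= obs_stat for s in same_k) + 1) / (len(same_k) + 1)

expected = np.full(30, 5.0)      # 30 areas, 5 expected cases each
counts = np.full(30, 5)
counts[10:13] = 20               # planted cluster spanning 3 areas
stat, k = most_likely_cluster(counts, expected, 10)
pval = conditional_p(counts, expected)
print(stat, k, pval)
```

The conditioning step is the paper's key idea: the null reference distribution is restricted to replicates producing an MLC of the same size as the observed one.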

  3. Analyzing Protein Clusters on the Plasma Membrane: Application of Spatial Statistical Analysis Methods on Super-Resolution Microscopy Images.

    PubMed

    Paparelli, Laura; Corthout, Nikky; Pavie, Benjamin; Annaert, Wim; Munck, Sebastian

    2016-01-01

    The spatial distribution of proteins within the cell affects their capability to interact with other molecules and directly influences cellular processes and signaling. At the plasma membrane, multiple factors drive protein compartmentalization into specialized functional domains, leading to the formation of clusters in which intermolecule interactions are facilitated. Therefore, quantifying protein distributions is a necessity for understanding their regulation and function. The recent advent of super-resolution microscopy has opened up the possibility of imaging protein distributions at the nanometer scale. In parallel, new spatial analysis methods have been developed to quantify distribution patterns in super-resolution images. In this chapter, we provide an overview of super-resolution microscopy and summarize the factors influencing protein arrangements on the plasma membrane. Finally, we highlight methods for analyzing clusterization of plasma membrane proteins, including examples of their applications.

  4. Statistical Analysis of Large Scale Structure by the Discrete Wavelet Transform

    NASA Astrophysics Data System (ADS)

    Pando, Jesus

    1997-10-01

    The discrete wavelet transform (DWT) is developed as a general statistical tool for the study of large scale structures (LSS) in astrophysics. The DWT is used in all aspects of structure identification including cluster analysis, spectrum and two-point correlation studies, scale-scale correlation analysis and to measure deviations from Gaussian behavior. The techniques developed are demonstrated on 'academic' signals, on simulated models of the Lyman-α (Ly-α) forests, and on observational data of the Ly-α forests. This technique can detect clustering in the Ly-α clouds where traditional techniques such as the two-point correlation function have failed. The position and strength of these clusters in both real and simulated data is determined and it is shown that clusters exist on scales as large as at least 20 h^-1 Mpc at significance levels of 2-4 σ. Furthermore, it is found that the strength distribution of the clusters can be used to distinguish between real data and simulated samples even where other traditional methods have failed to detect differences. Second, a method for measuring the power spectrum of a density field using the DWT is developed. All common features determined by the usual Fourier power spectrum can be calculated by the DWT. These features, such as the index of a power law or typical scales, can be detected even when the samples are geometrically complex, the samples are incomplete, or the mean density on larger scales is not known (the infrared uncertainty). Using this method the spectra of Ly-α forests in both simulated and real samples are calculated. Third, a method for measuring hierarchical clustering is introduced. Because hierarchical evolution is characterized by a set of rules of how larger dark matter halos are formed by the merging of smaller halos, scale-scale correlations of the density field should be one of the most sensitive quantities in determining the merging history.
We show that these correlations can be completely determined by the correlations between discrete wavelet coefficients on adjacent scales and at nearly the same spatial position, C_{j,j+1}. Scale-scale correlations on two samples of the QSO Ly-α forests absorption spectra are computed. Lastly, higher order statistics are developed to detect deviations from Gaussian behavior. These higher order statistics are necessary to fully characterize the Ly-α forests because the usual 2nd order statistics, such as the two-point correlation function or power spectrum, give inconclusive results. It is shown how this technique takes advantage of the locality of the DWT to circumvent the central limit theorem. A non-Gaussian spectrum is defined and this spectrum reveals not only the magnitude, but the scales of non-Gaussianity. When applied to simulated and observational samples of the Ly-α clouds, it is found that different popular models of structure formation have different spectra while two independent observational data sets have the same spectra. Moreover, the non-Gaussian spectra of real data sets are significantly different from the spectra of various possible random samples. (Abstract shortened by UMI.)
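A toy version of the scale-scale correlation idea: compute Haar detail coefficients, then correlate squared coefficients on adjacent scales at matching spatial positions. The exact normalization of the C_{j,j+1} statistic in the thesis may differ from this sketch:

```python
import numpy as np

def haar_details(signal):
    """Haar DWT detail coefficients, finest scale first.
    The signal length must be a power of two."""
    x = np.asarray(signal, dtype=float)
    details = []
    while len(x) > 1:
        details.append((x[0::2] - x[1::2]) / np.sqrt(2))
        x = (x[0::2] + x[1::2]) / np.sqrt(2)
    return details

def scale_scale_corr(signal):
    """Correlate squared detail coefficients on adjacent scales, pairing
    each coarse ('parent') coefficient with the mean of its two fine
    ('child') coefficients at the same spatial position."""
    details = haar_details(signal)
    corrs = []
    for fine, coarse in zip(details[:-1], details[1:]):
        if len(coarse) < 2:
            break                      # corrcoef needs >= 2 samples
        child = (fine ** 2).reshape(-1, 2).mean(axis=1)
        corrs.append(np.corrcoef(coarse ** 2, child)[0, 1])
    return corrs

rng = np.random.default_rng(3)
noise = rng.standard_normal(1024)      # Gaussian noise: no scale-scale memory
print([round(c, 3) for c in scale_scale_corr(noise)])
```

For Gaussian noise these correlations hover near zero; a hierarchical (merging-driven) density field would instead show positive parent-child correlations, which is what makes the statistic sensitive to merging history.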

  5. Consistency of Cluster Analysis for Cognitive Diagnosis: The Reduced Reparameterized Unified Model and the General Diagnostic Model.

    PubMed

    Chiu, Chia-Yi; Köhn, Hans-Friedrich

    2016-09-01

    The asymptotic classification theory of cognitive diagnosis (ACTCD) provided the theoretical foundation for using clustering methods that do not rely on a parametric statistical model for assigning examinees to proficiency classes. Like general diagnostic classification models, clustering methods can be useful in situations where the true diagnostic classification model (DCM) underlying the data is unknown and possibly misspecified, or the items of a test conform to a mix of multiple DCMs. Clustering methods can also be an option when fitting advanced and complex DCMs encounters computational difficulties. These can range from the use of excessive CPU times to plain computational infeasibility. However, the propositions of the ACTCD have only been proven for the Deterministic Input Noisy Output "AND" gate (DINA) model and the Deterministic Input Noisy Output "OR" gate (DINO) model. For other DCMs, there does not exist a theoretical justification to use clustering for assigning examinees to proficiency classes. But if clustering is to be used legitimately, then the ACTCD must cover a larger number of DCMs than just the DINA model and the DINO model. Thus, the purpose of this article is to prove the theoretical propositions of the ACTCD for two other important DCMs, the Reduced Reparameterized Unified Model and the General Diagnostic Model.

  6. Defining functioning levels in patients with schizophrenia: A combination of a novel clustering method and brain SPECT analysis.

    PubMed

    Catherine, Faget-Agius; Aurélie, Vincenti; Eric, Guedj; Pierre, Michel; Raphaëlle, Richieri; Marine, Alessandrini; Pascal, Auquier; Christophe, Lançon; Laurent, Boyer

    2017-12-30

    This study aims to define functioning levels of patients with schizophrenia by using a method of interpretable clustering based on a specific functioning scale, the Functional Remission Of General Schizophrenia (FROGS) scale, and to test their validity regarding clinical and neuroimaging characterization. In this observational study, patients with schizophrenia have been classified using a hierarchical top-down method called clustering using unsupervised binary trees (CUBT). Socio-demographic, clinical, and neuroimaging SPECT perfusion data were compared between the different clusters to ensure their clinical relevance. A total of 242 patients were analyzed. A four-group functioning level structure has been identified: 54 are classified as "minimal", 81 as "low", 64 as "moderate", and 43 as "high". The clustering shows satisfactory statistical properties, including reproducibility and discriminancy. The 4 clusters consistently differentiate patients. "High" functioning level patients reported significantly lower scores on the PANSS and the CDSS, and higher scores on the GAF, the MARS and the S-QoL 18, than the other groups. Functioning levels were significantly associated with cerebral perfusion of two relevant areas: the left inferior parietal cortex and the anterior cingulate. Our study provides relevant functioning levels in schizophrenia, and may enhance the use of functioning scales. Copyright © 2017 Elsevier B.V. All rights reserved.

  7. Identifying and Assessing Interesting Subgroups in a Heterogeneous Population

    PubMed Central

    Lee, Woojoo; Alexeyenko, Andrey; Pernemalm, Maria; Guegan, Justine; Dessen, Philippe; Lazar, Vladimir; Lehtiö, Janne; Pawitan, Yudi

    2015-01-01

    Biological heterogeneity is common in many diseases and it is often the reason for therapeutic failures. Thus, there is great interest in classifying a disease into subtypes that have clinical significance in terms of prognosis or therapy response. One of the most popular methods to uncover unrecognized subtypes is cluster analysis. However, classical clustering methods such as k-means clustering or hierarchical clustering are not guaranteed to produce clinically interesting subtypes. This could be because the main statistical variability—the basis of cluster generation—is dominated by genes not associated with the clinical phenotype of interest. Furthermore, a strong prognostic factor might be relevant for a certain subgroup but not for the whole population; thus an analysis of the whole sample may not reveal this prognostic factor. To address these problems we investigate methods to identify and assess clinically interesting subgroups in a heterogeneous population. The identification step uses a clustering algorithm and to assess significance we use a false discovery rate- (FDR-) based measure. Under the heterogeneity condition the standard FDR estimate is shown to overestimate the true FDR value, but this is remedied by an improved FDR estimation procedure. As illustrations, two real data examples from gene expression studies of lung cancer are provided. PMID:26339613
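For reference, the standard FDR machinery such a study builds on is the Benjamini-Hochberg step-up procedure; the paper's point is that the plain FDR estimate is biased under heterogeneity, so the following is only the textbook baseline, not the improved estimator:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: boolean mask of p-values
    rejected while controlling the false discovery rate at level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        kmax = np.max(np.nonzero(below)[0])   # largest i with p_(i) <= i*alpha/m
        reject[order[:kmax + 1]] = True
    return reject

# a 15-test example: only the smallest p-values survive the step-up rule
pvals = np.array([0.0001, 0.0004, 0.0019, 0.0095, 0.0201, 0.0278, 0.0298,
                  0.0344, 0.0459, 0.3240, 0.4262, 0.5719, 0.6528, 0.7590, 1.0])
print(benjamini_hochberg(pvals).sum())   # → 4
```

Under heterogeneity, applying such a procedure within an identified subgroup rather than to the whole sample changes the null distribution, which is exactly the bias the paper's corrected FDR estimate addresses.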

  8. Inference of Markovian properties of molecular sequences from NGS data and applications to comparative genomics.

    PubMed

    Ren, Jie; Song, Kai; Deng, Minghua; Reinert, Gesine; Cannon, Charles H; Sun, Fengzhu

    2016-04-01

    Next-generation sequencing (NGS) technologies generate large amounts of short read data for many different organisms. The fact that NGS reads are generally short makes it challenging to assemble the reads and reconstruct the original genome sequence. For clustering genomes using such NGS data, word-count based alignment-free sequence comparison is a promising approach, but for this approach, the underlying expected word counts are essential. A plausible model for this underlying distribution of word counts is given through modeling the DNA sequence as a Markov chain (MC). For single long sequences, efficient statistics are available to estimate the order of MCs and the transition probability matrix for the sequences. As NGS data do not provide a single long sequence, inference methods on Markovian properties of sequences based on single long sequences cannot be directly used for NGS short read data. Here we derive a normal approximation for such word counts. We also show that the traditional Chi-square statistic has an approximate gamma distribution, using the Lander-Waterman model for physical mapping. We propose several methods to estimate the order of the MC based on NGS reads and evaluate those using simulations. We illustrate the applications of our results by clustering genomic sequences of several vertebrate and tree species based on NGS reads using alignment-free sequence dissimilarity measures. We find that the estimated order of the MC has a considerable effect on the clustering results, and that the clustering results that use an MC of the estimated order give a plausible clustering of the species. Our implementation of the statistics developed here is available as the R package 'NGS.MC' at http://www-rcf.usc.edu/∼fsun/Programs/NGS-MC/NGS-MC.html. Contact: fsun@usc.edu. Supplementary data are available at Bioinformatics online. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com.

  9. Distinguishing Functional DNA Words; A Method for Measuring Clustering Levels

    NASA Astrophysics Data System (ADS)

    Moghaddasi, Hanieh; Khalifeh, Khosrow; Darooneh, Amir Hossein

    2017-01-01

    Functional DNA sub-sequences and genome elements are spatially clustered through the genome just as keywords are in literary texts. Therefore, some of the methods for ranking words in texts can also be used to compare different DNA sub-sequences. In analogy with literary texts, here we claim that the distribution of distances between successive sub-sequences (words) is q-exponential, the distribution function of non-extensive statistical mechanics. Thus the q-parameter can be used as a measure of word clustering levels. Here, we analyzed the distribution of distances between consecutive occurrences of the 16 possible dinucleotides in human chromosomes to obtain their corresponding q-parameters. We found that CG, a biologically important two-letter word concerning its methylation, has the highest clustering level. This finding shows the predictive ability of the method in biology. We also proposed that chromosome 18, with the largest value of the q-parameter for promoters of genes, is more sensitive to dietary and lifestyle factors. We extended our study to compare the genomes of some selected organisms and concluded that the clustering level of CGs increases in higher evolutionary organisms compared to lower ones.
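Extracting the inter-occurrence distances is straightforward; a full q-exponential fit is beyond a short sketch, so the coefficient of variation of the gaps serves here as a cruder clustering proxy (close to 1 for Poisson-like placement, larger for bursty words). The sequences below are synthetic:

```python
import re
import numpy as np

def gap_distribution(seq, word):
    """Distances between successive (possibly overlapping) occurrences
    of `word` in `seq`."""
    starts = [m.start() for m in re.finditer(f"(?={re.escape(word)})", seq)]
    return np.diff(starts)

def clustering_score(seq, word):
    """Coefficient of variation of the gaps: ~1 for Poisson-like
    placement, larger when occurrences are bursty/clustered. (The paper
    instead fits a q-exponential to this gap distribution.)"""
    gaps = gap_distribution(seq, word)
    return gaps.std() / gaps.mean()

rng = np.random.default_rng(5)
random_seq = "".join(rng.choice(list("ACGT"), size=20000))
bursty_seq = ("CG" * 50 + "A" * 2000) * 10   # CG occurs in tight bursts
print(clustering_score(random_seq, "CG"), clustering_score(bursty_seq, "CG"))
```

In the paper's terms, the bursty sequence would yield a gap distribution with a much heavier tail, hence a larger fitted q-parameter, than the random sequence.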

  10. Galaxy Cluster Mass Reconstruction Project – III. The impact of dynamical substructure on cluster mass estimates

    DOE PAGES

    Old, L.; Wojtak, R.; Pearce, F. R.; ...

    2017-12-20

    With the advent of wide-field cosmological surveys, we are approaching samples of hundreds of thousands of galaxy clusters. While such large numbers will help reduce statistical uncertainties, the control of systematics in cluster masses is crucial. Here we examine the effects of an important source of systematic uncertainty in galaxy-based cluster mass estimation techniques: the presence of significant dynamical substructure. Dynamical substructure manifests as dynamically distinct subgroups in phase-space, indicating an 'unrelaxed' state. This issue affects around a quarter of clusters in a generally selected sample. We employ a set of mock clusters whose masses have been measured homogeneously with commonly used galaxy-based mass estimation techniques (kinematic, richness, caustic, radial methods). We use these to study how the relation between observationally estimated and true cluster mass depends on the presence of substructure, as identified by various popular diagnostics. We find that the scatter for an ensemble of clusters does not increase dramatically for clusters with dynamical substructure. However, we find a systematic bias for all methods, such that clusters with significant substructure have higher measured masses than their relaxed counterparts. This bias depends on cluster mass: the most massive clusters are largely unaffected by the presence of significant substructure, but masses are significantly overestimated for lower mass clusters, by ~10 percent at 10^14 and ≳20 percent for ≲10^13.5 solar masses. The use of cluster samples with different levels of substructure can therefore bias certain cosmological parameters up to a level comparable to the typical uncertainties in current cosmological studies.

  11. Galaxy Cluster Mass Reconstruction Project – III. The impact of dynamical substructure on cluster mass estimates

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Old, L.; Wojtak, R.; Pearce, F. R.

    With the advent of wide-field cosmological surveys, we are approaching samples of hundreds of thousands of galaxy clusters. While such large numbers will help reduce statistical uncertainties, the control of systematics in cluster masses is crucial. Here we examine the effects of an important source of systematic uncertainty in galaxy-based cluster mass estimation techniques: the presence of significant dynamical substructure. Dynamical substructure manifests as dynamically distinct subgroups in phase-space, indicating an 'unrelaxed' state. This issue affects around a quarter of clusters in a generally selected sample. We employ a set of mock clusters whose masses have been measured homogeneously with commonly used galaxy-based mass estimation techniques (kinematic, richness, caustic, radial methods). We use these to study how the relation between observationally estimated and true cluster mass depends on the presence of substructure, as identified by various popular diagnostics. We find that the scatter for an ensemble of clusters does not increase dramatically for clusters with dynamical substructure. However, we find a systematic bias for all methods, such that clusters with significant substructure have higher measured masses than their relaxed counterparts. This bias depends on cluster mass: the most massive clusters are largely unaffected by the presence of significant substructure, but masses are significantly overestimated for lower mass clusters, by ~10 percent at 10^14 and ≳20 percent for ≲10^13.5 solar masses. The use of cluster samples with different levels of substructure can therefore bias certain cosmological parameters up to a level comparable to the typical uncertainties in current cosmological studies.

  12. Targeting regional pediatric congenital hearing loss using a spatial scan statistic.

    PubMed

    Bush, Matthew L; Christian, Warren Jay; Bianchi, Kristin; Lester, Cathy; Schoenberg, Nancy

    2015-01-01

    Congenital hearing loss is a common problem, and timely identification and intervention are paramount for language development. Patients from rural regions may have many barriers to timely diagnosis and intervention. The purpose of this study was to examine the spatial and hospital-based distribution of failed infant hearing screening testing and pediatric congenital hearing loss throughout Kentucky. Data on live births and audiological reporting of infant hearing loss results in Kentucky from 2009 to 2011 were analyzed. The authors used spatial scan statistics to identify high-rate clusters of failed newborn screening tests and permanent congenital hearing loss (PCHL), based on the total number of live births per county. The authors conducted further analyses on PCHL and failed newborn hearing screening tests, based on birth hospital data and method of screening. The authors observed four statistically significant (p < 0.05) high-rate clusters with failed newborn hearing screenings in Kentucky, including two in the Appalachian region. Hospitals using two-stage otoacoustic emission testing demonstrated higher rates of failed screening (p = 0.009) than those using two-stage automated auditory brainstem response testing. A significant cluster of high rate of PCHL was observed in Western Kentucky. Five of the 54 birthing hospitals were found to have higher relative risk of PCHL, and two of those hospitals are located in a very rural region of Western Kentucky within the cluster. This spatial analysis in children in Kentucky has identified specific regions throughout the state with high rates of congenital hearing loss and failed newborn hearing screening tests. Further investigation regarding causative factors is warranted. This method of analysis can be useful in the setting of hearing health disparities to focus efforts on regions facing high incidence of congenital hearing loss.

  13. Hyperparameterization of soil moisture statistical models for North America with Ensemble Learning Models (Elm)

    NASA Astrophysics Data System (ADS)

    Steinberg, P. D.; Brener, G.; Duffy, D.; Nearing, G. S.; Pelissier, C.

    2017-12-01

    Hyperparameterization of statistical models, i.e. automated model scoring and selection via techniques such as evolutionary algorithms, grid searches, and randomized searches, can improve forecast model skill by reducing errors associated with model parameterization, model structure, and statistical properties of training data. Ensemble Learning Models (Elm), and the related Earthio package, provide a flexible interface for automating the selection of parameters and model structure for machine learning models common in climate science and land cover classification, offering convenient tools for loading NetCDF, HDF, Grib, or GeoTiff files, decomposition methods like PCA and manifold learning, and parallel training and prediction with unsupervised and supervised classification, clustering, and regression estimators. Continuum Analytics is using Elm to experiment with statistical soil moisture forecasting based on meteorological forcing data from NASA's North American Land Data Assimilation System (NLDAS). There, Elm uses the NSGA-2 multiobjective optimization algorithm to optimize the statistical preprocessing of forcing data and improve goodness-of-fit for statistical models (i.e. feature engineering). This presentation will discuss Elm and its components, including dask (distributed task scheduling), xarray (data structures for n-dimensional arrays), and scikit-learn (statistical preprocessing, clustering, classification, regression), and it will show how NSGA-2 is being used to automate selection of soil moisture forecast statistical models for North America.
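
    The NSGA-2 pipeline above is specific to Elm, but the underlying idea of automated hyperparameter search can be sketched generically. The toy below is plain NumPy, not Elm's API; the ridge model, synthetic "forcing" data, and all variable names are illustrative assumptions. It runs a randomized search over a single regularization hyperparameter and keeps the value with the lowest validation error:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "forcing" data: predict soil moisture y from two meteorological features.
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

X_tr, X_va, y_tr, y_va = X[:150], X[150:], y[:150], y[150:]

def ridge_fit(X, y, alpha):
    # Closed-form ridge regression: w = (X^T X + alpha I)^-1 X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def val_mse(alpha):
    # Score one hyperparameter candidate by held-out mean squared error.
    w = ridge_fit(X_tr, y_tr, alpha)
    return float(np.mean((X_va @ w - y_va) ** 2))

# Randomized search: sample candidates log-uniformly, keep the best scorer.
candidates = 10 ** rng.uniform(-4, 2, size=30)
best_alpha = min(candidates, key=val_mse)
```

    Elm's NSGA-2 search differs in that it optimizes several objectives at once; this sketch shows only the single-objective core of "score many candidate configurations, keep the best."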

  14. 3D variational brain tumor segmentation using Dirichlet priors on a clustered feature set.

    PubMed

    Popuri, Karteek; Cobzas, Dana; Murtha, Albert; Jägersand, Martin

    2012-07-01

    Brain tumor segmentation is a required step before any radiation treatment or surgery. When performed manually, segmentation is time consuming and prone to human errors. Therefore, there have been significant efforts to automate the process. But, automatic tumor segmentation from MRI data is a particularly challenging task. Tumors have a large diversity in shape and appearance with intensities overlapping the normal brain tissues. In addition, an expanding tumor can also deflect and deform nearby tissue. In our work, we propose an automatic brain tumor segmentation method that addresses these last two difficult problems. We use the available MRI modalities (T1, T1c, T2) and their texture characteristics to construct a multidimensional feature set. Then, we extract clusters which provide a compact representation of the essential information in these features. The main idea in this work is to incorporate these clustered features into the 3D variational segmentation framework. In contrast to previous variational approaches, we propose a segmentation method that evolves the contour in a supervised fashion. The segmentation boundary is driven by the learned region statistics in the cluster space. We incorporate prior knowledge about the normal brain tissue appearance during the estimation of these region statistics. In particular, we use a Dirichlet prior that discourages the clusters from the normal brain region to be in the tumor region. This leads to a better disambiguation of the tumor from brain tissue. We evaluated the performance of our automatic segmentation method on 15 real MRI scans of brain tumor patients, with tumors that are inhomogeneous in appearance, small in size and in proximity to the major structures in the brain. Validation with the expert segmentation labels yielded encouraging results: Jaccard (58%), Precision (81%), Recall (67%), Hausdorff distance (24 mm). 
Using priors on the brain/tumor appearance, our proposed automatic 3D variational segmentation method was able to better disambiguate the tumor from the surrounding tissue.

  15. A review on the multivariate statistical methods for dimensional reduction studies

    NASA Astrophysics Data System (ADS)

    Aik, Lim Eng; Kiang, Lam Chee; Mohamed, Zulkifley Bin; Hong, Tan Wei

    2017-05-01

    In this study we review multivariate statistical methods for dimensionality reduction developed by various researchers. Reducing dimensionality is valuable both to accelerate algorithm training and, in many cases, to improve the final classification/clustering accuracy. Noisy or even flawed input data frequently leads to poor algorithm performance. Removing uninformative or misleading data components can help an algorithm discover more general grouping regions and rules and, overall, achieve better performance on new data sets.
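
    As a minimal concrete example of the dimensionality reduction discussed above (an invented toy, not taken from the reviewed work), the sketch below performs PCA via the singular value decomposition and shows that a noisy 5-dimensional data set with one dominant latent direction compresses to a couple of components:

```python
import numpy as np

rng = np.random.default_rng(1)

# 100 samples in 5 dimensions where most variance lies along one direction.
latent = rng.normal(size=(100, 1))
X = latent @ rng.normal(size=(1, 5)) + 0.05 * rng.normal(size=(100, 5))

# PCA by SVD of the mean-centred data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)        # variance ratio per component
Z = Xc @ Vt[:2].T                      # data reduced to 2 dimensions
```

    Here the first component captures nearly all of the variance, so downstream clustering can safely operate on `Z` instead of the original five columns.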

  16. Prediction of operon-like gene clusters in the Arabidopsis thaliana genome based on co-expression analysis of neighboring genes.

    PubMed

    Wada, Masayoshi; Takahashi, Hiroki; Altaf-Ul-Amin, Md; Nakamura, Kensuke; Hirai, Masami Y; Ohta, Daisaku; Kanaya, Shigehiko

    2012-07-15

    Operon-like arrangements of genes occur in eukaryotes ranging from yeasts and filamentous fungi to nematodes, plants, and mammals. In plants, several examples of operon-like gene clusters involved in metabolic pathways have recently been characterized, e.g. the cyclic hydroxamic acid pathways in maize, the avenacin biosynthesis gene clusters in oat, the thalianol pathway in Arabidopsis thaliana, and the diterpenoid momilactone cluster in rice. Such operon-like gene clusters are defined by their co-regulation or their neighboring positions within the immediate vicinity of a chromosomal region. A comprehensive analysis of the expression of neighboring genes is therefore a crucial step toward revealing the complete set of operon-like gene clusters within a genome. Genome-wide prediction of operon-like gene clusters should contribute to functional annotation efforts and provide novel insight into the evolutionary acquisition of certain biological functions as well. We predicted co-expressed gene clusters by comparing the Pearson correlation coefficient of neighboring genes and randomly selected gene pairs, based on a statistical method that takes the false discovery rate (FDR) into consideration, for 1469 microarray gene expression datasets of A. thaliana. We estimated that A. thaliana contains 100 operon-like gene clusters in total. We predicted 34 statistically significant gene clusters consisting of 3 to 22 genes each, based on a stringent FDR threshold of 0.1. Functional relationships among genes in individual clusters were estimated by sequence similarity and functional annotation of genes. Duplicated gene pairs (determined based on BLAST with a cutoff of E < 10^-5) are included in 27 clusters. Five clusters are associated with metabolism, containing P450 genes restricted to the Brassica family and predicted to be involved in secondary metabolism.
Operon-like clusters tend to include genes encoding bio-machinery associated with ribosomes, the ubiquitin/proteasome system, secondary metabolic pathways, lipid and fatty-acid metabolism, and the lipid transfer system. Copyright © 2012 Elsevier B.V. All rights reserved.
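
    The FDR-controlling step described above is commonly implemented with the Benjamini-Hochberg procedure. The sketch below is a generic illustration, not the authors' code, and the toy p-values are invented; it marks the p-values that survive a given FDR threshold:

```python
import numpy as np

def benjamini_hochberg(pvals, fdr=0.1):
    """Return a boolean mask of the p-values significant at the given FDR."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    m = len(p)
    # BH step-up thresholds: fdr * k / m for the k-th smallest p-value.
    thresh = fdr * np.arange(1, m + 1) / m
    passed = p[order] <= thresh
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    mask = np.zeros(m, dtype=bool)
    mask[order[:k]] = True        # all p-values up to the largest passing rank
    return mask

# Toy p-values: a few strongly co-expressed neighbour pairs among random pairs.
pvals = [1e-6, 2e-5, 0.001, 0.2, 0.4, 0.6, 0.8, 0.9]
sig = benjamini_hochberg(pvals, fdr=0.1)
```

    With these toy inputs, the three smallest p-values are declared significant at FDR 0.1, mirroring the stringent-threshold selection of 34 clusters in the study.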

  17. A space-time scan statistic for detecting emerging outbreaks.

    PubMed

    Tango, Toshiro; Takahashi, Kunihiko; Kohriyama, Kazuaki

    2011-03-01

    As a major analytical method for outbreak detection, Kulldorff's space-time scan statistic (2001, Journal of the Royal Statistical Society, Series A 164, 61-72) has been implemented in many syndromic surveillance systems. However, since it is based on circular windows in space, it has difficulty correctly detecting actual noncircular clusters. Takahashi et al. (2008, International Journal of Health Geographics 7, 14) proposed a flexible space-time scan statistic with the capability of detecting noncircular areas. It seems to us, however, that the detection of the most likely cluster defined in these space-time scan statistics is not the same as the detection of localized emerging disease outbreaks, because the former compares the observed number of cases with the conditional expected number of cases. In this article, we propose a new space-time scan statistic which compares the observed number of cases with the unconditional expected number of cases, takes time-to-time variation of the Poisson mean into account, and implements an outbreak model to capture localized emerging disease outbreaks in a more timely and accurate manner. The proposed models are illustrated with data from weekly surveillance of the number of absentees in primary schools in Kitakyushu-shi, Japan, 2006. © 2010, The International Biometric Society.

  18. Measurement of surface roughness changes of unpolished and polished enamel following erosion

    PubMed Central

    Austin, Rupert S.; Parkinson, Charles R.; Hasan, Adam; Bartlett, David W.

    2017-01-01

    Objectives To determine if Sa roughness data from measuring one central location of unpolished and polished enamel were representative of the overall surfaces before and after erosion. Methods Twenty human enamel sections (4x4 mm) were embedded in bis-acryl composite and randomised to either a native or polishing enamel preparation protocol. Enamel samples were subjected to an acid challenge (15 minutes in 100 mL orange juice, pH 3.2, titratable acidity 41.3 mmol OH/L, 62.5 rpm agitation, repeated for three cycles). Median (IQR) surface roughness [Sa] was measured at baseline and after erosion from both a centralised cluster and four peripheral clusters. Within each cluster, five smaller areas (0.04 mm2) provided the Sa roughness data. Results For both unpolished and polished enamel samples there were no significant differences between measuring one central cluster or four peripheral clusters, before and after erosion. For unpolished enamel the single central cluster had a median (IQR) Sa roughness of 1.45 (2.58) μm and the four peripheral clusters had a median (IQR) of 1.32 (4.86) μm before erosion; after erosion there were statistically significant reductions to 0.38 (0.35) μm and 0.34 (0.49) μm respectively (p<0.0001). Polished enamel had a median (IQR) Sa roughness of 0.04 (0.17) μm for the single central cluster and 0.05 (0.15) μm for the four peripheral clusters, which statistically significantly increased after erosion to 0.27 (0.08) μm for both (p<0.0001). Conclusion Measuring one central cluster of unpolished and polished enamel was representative of the overall enamel surface roughness, before and after erosion. PMID:28771562

  19. Utility of K-Means clustering algorithm in differentiating apparent diffusion coefficient values between benign and malignant neck pathologies

    PubMed Central

    Srinivasan, A.; Galbán, C.J.; Johnson, T.D.; Chenevert, T.L.; Ross, B.D.; Mukherji, S.K.

    2014-01-01

    Purpose The objective of our study was to analyze the differences between apparent diffusion coefficient (ADC) partitions (created using the K-Means algorithm) in benign and malignant neck lesions and to evaluate their benefit in distinguishing these entities. Material and methods MRI studies of 10 proven benign and 10 proven malignant neck pathologies were post-processed on a PC using in-house software developed in MATLAB (The MathWorks, Inc., Natick, MA). Lesions were manually contoured by two neuroradiologists, with the ADC values within each lesion clustered into two partitions (low ADC, ADCL; high ADC, ADCH) and three partitions (ADCL; intermediate ADC, ADCI; ADCH) using the K-Means clustering algorithm. An unpaired two-tailed Student’s t-test was performed for all metrics to determine statistical differences in the means between the benign and malignant pathologies. Results A statistically significant difference between the mean ADCL clusters in benign and malignant pathologies was seen in the 3-cluster models of both readers (p=0.03 and 0.022, respectively) and the 2-cluster model of reader 2 (p=0.04), with the other metrics (ADCH, ADCI, whole-lesion mean ADC) not revealing any significant differences. Receiver operating characteristic curves demonstrated the quantitative difference in mean ADCH and ADCL in both the 2- and 3-cluster models to be predictive of malignancy (2 clusters: p=0.008, area under curve=0.850; 3 clusters: p=0.01, area under curve=0.825). Conclusion The K-Means clustering algorithm that generates partitions of large datasets may provide a better characterization of neck pathologies and may be of additional benefit in distinguishing benign and malignant neck pathologies compared to whole-lesion mean ADC alone. PMID:20007723
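
    The partitioning step can be illustrated with a hand-rolled one-dimensional K-means (Lloyd's algorithm) on synthetic ADC values. This is a generic sketch, not the authors' MATLAB software, and the toy low- and high-ADC sub-populations are assumptions:

```python
import numpy as np

def kmeans_1d(values, k, iters=50, seed=0):
    """Plain Lloyd's algorithm on scalar values; returns (centroids, labels)."""
    x = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = rng.choice(x, size=k, replace=False)   # random initial centres
    for _ in range(iters):
        # Assign each value to its nearest centroid, then recompute centroids.
        labels = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean()
    return centroids, labels

rng = np.random.default_rng(2)
# Toy ADC values (x 10^-3 mm^2/s): a low-ADC and a high-ADC sub-population.
adc = np.concatenate([rng.normal(0.8, 0.05, 50), rng.normal(1.6, 0.10, 50)])
centroids, labels = kmeans_1d(adc, k=2)
low, high = np.sort(centroids)
```

    In the study, an unpaired t-test was then applied to the per-lesion partition means across patients; here the two recovered centroids simply land near the simulated low- and high-ADC means.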

  20. A spatial cluster analysis of tractor overturns in Kentucky from 1960 to 2002

    USGS Publications Warehouse

    Saman, D.M.; Cole, H.P.; Odoi, A.; Myers, M.L.; Carey, D.I.; Westneat, S.C.

    2012-01-01

    Background: Agricultural tractor overturns without rollover protective structures are the leading cause of farm fatalities in the United States. To our knowledge, no studies have incorporated the spatial scan statistic in identifying high-risk areas for tractor overturns. The aim of this study was to determine whether tractor overturns cluster in certain parts of Kentucky and identify factors associated with tractor overturns. Methods: A spatial statistical analysis using Kulldorff's spatial scan statistic was performed to identify county clusters at greatest risk for tractor overturns. A regression analysis was then performed to identify factors associated with tractor overturns. Results: The spatial analysis revealed a cluster of higher than expected tractor overturns in four counties in northern Kentucky (RR = 2.55) and 10 counties in eastern Kentucky (RR = 1.97). Higher rates of tractor overturns were associated with steeper average percent slope of pasture land by county (p = 0.0002) and a greater percent of total tractors with less than 40 horsepower by county (p<0.0001). Conclusions: This study reveals that geographic hotspots of tractor overturns exist in Kentucky and identifies factors associated with overturns. This study provides policymakers a guide to targeted county-level interventions (e.g., roll-over protective structures promotion interventions) with the intention of reducing tractor overturns in the highest risk counties in Kentucky. © 2012 Saman et al.
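
    Kulldorff's spatial scan statistic evaluates candidate windows with a Poisson likelihood ratio. The sketch below uses toy county counts and a deliberately tiny window set (single counties and adjacent pairs), not the study's data; it computes the log-likelihood ratio for each window and picks the most likely cluster:

```python
import math

# Toy data: overturn counts and tractor populations for six counties.
cases = [12, 3, 2, 4, 2, 1]
pop = [1000, 1000, 1000, 1000, 1000, 1000]
C, P = sum(cases), sum(pop)

def llr(window):
    """Kulldorff Poisson log-likelihood ratio for a set of county indices
    (0 for windows whose rate is not elevated)."""
    c = sum(cases[i] for i in window)
    e = C * sum(pop[i] for i in window) / P   # expected cases in the window
    if c <= e:
        return 0.0
    return c * math.log(c / e) + (C - c) * math.log((C - c) / (C - e))

# Candidate windows: each county alone, plus each adjacent pair.
windows = [(i,) for i in range(6)] + [(i, i + 1) for i in range(5)]
best = max(windows, key=llr)
```

    In practice the windows are circles of varying radius scanned over the map and significance comes from Monte Carlo replication of the maximum LLR; this sketch shows only the likelihood-ratio core.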

  1. Is It Feasible to Identify Natural Clusters of TSC-Associated Neuropsychiatric Disorders (TAND)?

    PubMed

    Leclezio, Loren; Gardner-Lubbe, Sugnet; de Vries, Petrus J

    2018-04-01

    Tuberous sclerosis complex (TSC) is a genetic disorder with multisystem involvement. The lifetime prevalence of TSC-Associated Neuropsychiatric Disorders (TAND) is in the region of 90% in an apparently unique, individual pattern. This "uniqueness" poses significant challenges for diagnosis, psycho-education, and intervention planning. To date, no studies have explored whether there may be natural clusters of TAND. The purpose of this feasibility study was (1) to investigate the practicability of identifying natural TAND clusters, and (2) to identify appropriate multivariate data analysis techniques for larger-scale studies. TAND Checklist data were collected from 56 individuals with a clinical diagnosis of TSC (n = 20 from South Africa; n = 36 from Australia). Using R, the open-source statistical platform, mean squared contingency coefficients were calculated to produce a correlation matrix, and various cluster analyses and exploratory factor analysis were examined. Ward's method rendered six TAND clusters with good face validity and significant convergence with a six-factor exploratory factor analysis solution. The "bottom-up" data-driven strategies identified a "scholastic" cluster of TAND manifestations, an "autism spectrum disorder-like" cluster, a "dysregulated behavior" cluster, a "neuropsychological" cluster, a "hyperactive/impulsive" cluster, and a "mixed/mood" cluster. These feasibility results suggest that a combination of cluster analysis and exploratory factor analysis methods may be able to identify clinically meaningful natural TAND clusters. Findings require replication and expansion in larger datasets, and could include quantification of cluster or factor scores at an individual level. Copyright © 2018 Elsevier Inc. All rights reserved.
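
    The clustering step can be sketched with SciPy's implementation of Ward's method. The binary checklist-style matrix below is a deliberately clean invented toy, not the TAND data: six "items" scored over eight "individuals", with items 0-2 co-occurring in one set of people and items 3-5 in the complementary set:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Rows = checklist items, columns = individuals (1 = manifestation present).
items = np.array([
    [1, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 1, 1, 1],
], dtype=float)

# Agglomerate item profiles with Ward's method, then cut into two clusters.
Z = linkage(items, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
```

    The study additionally cross-checked the clusters against an exploratory factor analysis; this sketch covers only the Ward's-method step.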

  2. Application of statistical mechanical methods to the modeling of social networks

    NASA Astrophysics Data System (ADS)

    Strathman, Anthony Robert

    With the recent availability of large-scale social data sets, social networks have become open to quantitative analysis via the methods of statistical physics. We examine the statistical properties of a real large-scale social network, generated from cellular phone call-trace logs. We find this network, like many other social networks, to be assortative (r = 0.31) and clustered (i.e., strongly transitive, C = 0.21). We measure fluctuation scaling to identify the presence of internal structure in the network and find that structural inhomogeneity effectively disappears at the scale of a few hundred nodes, though there is no sharp cutoff. We introduce an agent-based model of social behavior, designed to model the formation and dissolution of social ties. The model is a modified Metropolis algorithm containing agents operating under the basic sociological constraints of reciprocity, communication need and transitivity. The model introduces the concept of a social temperature. We go on to show that this simple model reproduces the global statistical network features (incl. assortativity, connected fraction, mean degree, clustering, and mean shortest path length) of the real network data and undergoes two phase transitions, one being from a "gas" to a "liquid" state and the second from a liquid to a glassy state as a function of this social temperature.
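
    The two network statistics quoted above, degree assortativity r and transitivity (clustering) C, can be computed directly from an adjacency matrix. The sketch below does so for a small invented graph, not the call-trace network:

```python
import numpy as np

# Small undirected toy "social" graph as an edge list.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5), (1, 3)]
n = 6
A = np.zeros((n, n), dtype=int)
for i, j in edges:
    A[i, j] = A[j, i] = 1

deg = A.sum(axis=1)

# Transitivity C = 3 * (number of triangles) / (number of connected triples).
triangles = np.trace(A @ A @ A) / 6
triples = sum(d * (d - 1) / 2 for d in deg)
C = 3 * triangles / triples

# Degree assortativity r: Pearson correlation of endpoint degrees over edges
# (each edge counted in both directions to keep the measure symmetric).
ends = [(deg[i], deg[j]) for i, j in edges] + [(deg[j], deg[i]) for i, j in edges]
x, y = zip(*ends)
r = float(np.corrcoef(x, y)[0, 1])
```

    This toy graph happens to be disassortative (r < 0); the real call-trace network, as reported above, is assortative.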

  3. Spatial cluster analysis of nanoscopically mapped serotonin receptors for classification of fixed brain tissue

    NASA Astrophysics Data System (ADS)

    Sams, Michael; Silye, Rene; Göhring, Janett; Muresan, Leila; Schilcher, Kurt; Jacak, Jaroslaw

    2014-01-01

    We present a cluster spatial analysis method using nanoscopic dSTORM images to determine changes in protein cluster distributions within brain tissue. Such methods are suitable to investigate human brain tissue and will help to achieve a deeper understanding of brain disease along with aiding drug development. Human brain tissue samples are usually treated postmortem via standard fixation protocols, which are established in clinical laboratories. Therefore, our localization microscopy-based method was adapted to characterize protein density and protein cluster localization in samples fixed using different protocols followed by common fluorescent immunohistochemistry techniques. The localization microscopy allows nanoscopic mapping of serotonin 5-HT1A receptor groups within a two-dimensional image of a brain tissue slice. These nanoscopically mapped proteins can be confined to clusters by applying the proposed statistical spatial analysis. Selected features of such clusters were subsequently used to characterize and classify the tissue. Samples were obtained from different types of patients, fixed with different preparation methods, and finally stored in a human tissue bank. To verify the proposed method, samples of a cryopreserved healthy brain have been compared with epitope-retrieved and paraffin-fixed tissues. Furthermore, samples of healthy brain tissues were compared with data obtained from patients suffering from mental illnesses (e.g., major depressive disorder). Our work demonstrates the applicability of localization microscopy and image analysis methods for comparison and classification of human brain tissues at a nanoscopic level. Furthermore, the presented workflow marks a unique technological advance in the characterization of protein distributions in brain tissue sections.

  4. Geographic clusters in underimmunization and vaccine refusal.

    PubMed

    Lieu, Tracy A; Ray, G Thomas; Klein, Nicola P; Chung, Cindy; Kulldorff, Martin

    2015-02-01

    Parental refusal and delay of childhood vaccines has increased in recent years and is believed to cluster in some communities. Such clusters could pose public health risks and barriers to achieving immunization quality benchmarks. Our aims were to (1) describe geographic clusters of underimmunization and vaccine refusal, (2) compare clusters of underimmunization with different vaccines, and (3) evaluate whether vaccine refusal clusters may pose barriers to achieving high immunization rates. We analyzed electronic health records among children born between 2000 and 2011 with membership in Kaiser Permanente Northern California. The study population included 154,424 children in 13 counties with continuous membership from birth to 36 months of age. We used spatial scan statistics to identify clusters of underimmunization (having missed 1 or more vaccines by 36 months of age) and vaccine refusal (based on International Classification of Diseases, Ninth Revision, Clinical Modification codes). We identified 5 statistically significant clusters of underimmunization among children who turned 36 months old during 2010-2012. The underimmunization rate within clusters ranged from 18% to 23%, and the rate outside them was 11%. Children in the most statistically significant cluster had 1.58 (P < .001) times the rate of underimmunization as others. Underimmunization with measles, mumps, rubella vaccine and varicella vaccines clustered in similar geographic areas. Vaccine refusal also clustered, with rates of 5.5% to 13.5% within clusters, compared with 2.6% outside them. Underimmunization and vaccine refusal cluster geographically. Spatial scan statistics may be a useful tool to identify locations with challenges to achieving high immunization rates, which deserve focused intervention. Copyright © 2015 by the American Academy of Pediatrics.

  5. Regional and Temporal Variation in Methamphetamine-Related Incidents: Applications of Spatial and Temporal Scan Statistics

    PubMed Central

    Sudakin, Daniel L.

    2009-01-01

    Introduction This investigation utilized spatial scan statistics, geographic information systems and multiple data sources to assess spatial clustering of statewide methamphetamine-related incidents. Temporal and spatial associations with regulatory interventions to reduce access to precursor chemicals (pseudoephedrine) were also explored. Methods Four statewide data sources were utilized including regional poison control center statistics, fatality incidents, methamphetamine laboratory seizures, and hazardous substance releases involving methamphetamine laboratories. Spatial clustering of methamphetamine incidents was assessed using SaTScan™. SaTScan™ was also utilized to assess space-time clustering of methamphetamine laboratory incidents, in relation to the enactment of regulations to reduce access to pseudoephedrine. Results Five counties with a significantly higher relative risk of methamphetamine-related incidents were identified. The county identified as the most likely cluster had a significantly elevated relative risk of methamphetamine laboratories (RR=11.5), hazardous substance releases (RR=8.3), and fatalities relating to methamphetamine (RR=1.4). A significant increase in relative risk of methamphetamine laboratory incidents was apparent in this same geographic area (RR=20.7) during the time period when regulations were enacted in 2004 and 2005, restricting access to pseudoephedrine. Subsequent to the enactment of these regulations, a significantly lower rate of incidents (RR 0.111, p=0.0001) was observed over a large geographic area of the state, including regions that previously had significantly higher rates. Conclusions Spatial and temporal scan statistics can be effectively applied to multiple data sources to assess regional variation in methamphetamine-related incidents, and explore the impact of preventive regulatory interventions. PMID:19225949

  6. Bayesian statistics as a new tool for spectral analysis - I. Application for the determination of basic parameters of massive stars

    NASA Astrophysics Data System (ADS)

    Mugnes, J.-M.; Robert, C.

    2015-11-01

    Spectral analysis is a powerful tool to investigate stellar properties and it has been widely used for decades now. However, the methods considered to perform this kind of analysis are mostly based on iteration among a few diagnostic lines to determine the stellar parameters. While these methods are often simple and fast, they can lead to errors and large uncertainties due to the required assumptions. Here, we present a method based on Bayesian statistics to find simultaneously the best combination of effective temperature, surface gravity, projected rotational velocity, and microturbulence velocity, using all the available spectral lines. Different tests are discussed to demonstrate the strength of our method, which we apply to 54 mid-resolution spectra of field and cluster B stars obtained at the Observatoire du Mont-Mégantic. We compare our results with those found in the literature. Differences are seen which are well explained by the different methods used. We conclude that the B-star microturbulence velocities are often underestimated. We also confirm the trend that B stars in clusters are on average faster rotators than field B stars.
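
    The core of the Bayesian approach, combining the likelihoods of all available lines simultaneously rather than iterating over a few diagnostics, can be sketched with a toy one-parameter problem. The linear line-response model, the sensitivities, and the noise level below are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy problem: infer a single stellar parameter theta from several noisy
# "line" measurements, each with a known linear response line_i = a_i * theta.
a = np.array([1.0, 0.5, 2.0, 1.5])            # per-line sensitivities
theta_true, sigma = 3.0, 0.2
obs = a * theta_true + rng.normal(0, sigma, size=a.size)

# Grid posterior with a flat prior: multiply the Gaussian likelihoods of all
# lines at once (sum of log-likelihoods), then normalize.
grid = np.linspace(0, 6, 601)
loglik = -0.5 * np.sum((obs[None, :] - grid[:, None] * a[None, :]) ** 2,
                       axis=1) / sigma**2
post = np.exp(loglik - loglik.max())
post /= post.sum()
theta_map = grid[np.argmax(post)]
```

    With four independent parameters (effective temperature, gravity, rotation, microturbulence) the grid becomes four-dimensional, but the principle of a joint posterior over all lines is the same.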

  7. Ages of LMC star clusters using ASAD2

    NASA Astrophysics Data System (ADS)

    Asa'd, Randa S.; Vazdekis, Alexandre; Zeinelabdin, Sami

    2016-04-01

    We use ASAD2, the new version of ASAD (Analyzer of Spectra for Age Determination), to obtain the age and reddening of 27 Large Magellanic Cloud (LMC) clusters from full fitting of integrated spectra using different statistical methods [χ2 and Kolmogorov-Smirnov (KS) test] and a set of stellar population models including GALAXEV and MILES. We show that our results are in good agreement with the colour-magnitude diagram (CMD) ages for both models, and that metallicity does not affect the age determination for the full spectrum fitting method, regardless of the model used, for ages with log (age/year) < 9. We discuss the results obtained by the two statistical methods for both GALAXEV and MILES versus three factors: age, signal-to-noise ratio and resolution (full width at half maximum). The predicted reddening values when using the χ2 minimization method are within the range found in the literature for resolved clusters (i.e. <0.35); however, the KS test can predict higher E(B - V) values. The sharp spectral transitions originating at ages around the supergiant contribution, on either side of the AGB peak around log (age/year) = 9.0 and log (age/year) = 7.8, limit our ability to provide values in agreement with the CMD estimates, and as a result the reddening determination is not accurate. We provide the detailed results of four clusters spanning a wide range of ages. ASAD2 is a user-friendly program available for download on the Web and can be immediately used at http://randaasad.wordpress.com/asad-package/.

  8. Clustering of fast-food restaurants around schools: a novel application of spatial statistics to the study of food environments.

    PubMed

    Austin, S Bryn; Melly, Steven J; Sanchez, Brisa N; Patel, Aarti; Buka, Stephen; Gortmaker, Steven L

    2005-09-01

    We examined the concentration of fast food restaurants in areas proximal to schools to characterize school neighborhood food environments. We used geocoded databases of restaurant and school addresses to examine locational patterns of fast-food restaurants and kindergartens and primary and secondary schools in Chicago. We used the bivariate K function statistical method to quantify the degree of clustering (spatial dependence) of fast-food restaurants around school locations. The median distance from any school in Chicago to the nearest fast-food restaurant was 0.52 km, a distance that an adult can walk in little more than 5 minutes, and 78% of schools had at least 1 fast-food restaurant within 800 m. Fast-food restaurants were statistically significantly clustered in areas within a short walking distance from schools, with an estimated 3 to 4 times as many fast-food restaurants within 1.5 km from schools than would be expected if the restaurants were distributed throughout the city in a way unrelated to school locations. Fast-food restaurants are concentrated within a short walking distance from schools, exposing children to poor-quality food environments in their school neighborhoods.
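
    A simplified version of the bivariate K function (ignoring the edge corrections used in the formal estimator) counts restaurants within distance r of each school and compares the result with the value expected under complete spatial randomness. The point patterns below are invented to mimic restaurants clustering near schools, not the Chicago data:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy patterns on a 10 x 10 km square: 30 "schools", one restaurant placed
# near each school, plus 60 background restaurants scattered uniformly.
schools = rng.uniform(0, 10, size=(30, 2))
near = schools + rng.normal(0, 0.2, size=schools.shape)
background = rng.uniform(0, 10, size=(60, 2))
restaurants = np.vstack([near, background])

def bivariate_k(pts_a, pts_b, r, area=100.0):
    """Naive bivariate K: mean number of b-points within r of an a-point,
    divided by the intensity of the b-pattern (no edge correction)."""
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=2)
    lam_b = len(pts_b) / area
    return np.mean(np.sum(d < r, axis=1)) / lam_b

k_obs = bivariate_k(schools, restaurants, r=0.5)
k_csr = np.pi * 0.5**2   # expected K at r = 0.5 if patterns are unrelated
```

    k_obs well above pi * r^2 indicates clustering of restaurants around schools at that distance, the same qualitative signal the study reports within 1.5 km.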

  9. Quasi-Likelihood Techniques in a Logistic Regression Equation for Identifying Simulium damnosum s.l. Larval Habitats Intra-cluster Covariates in Togo.

    PubMed

    Jacob, Benjamin G; Novak, Robert J; Toe, Laurent; Sanfo, Moussa S; Afriyie, Abena N; Ibrahim, Mohammed A; Griffith, Daniel A; Unnasch, Thomas R

    2012-01-01

    The standard methods for regression analyses of clustered riverine larval habitat data of Simulium damnosum s.l., a major black-fly vector of onchocerciasis, postulate models relating observational ecological-sampled parameter estimators to prolific habitats without accounting for residual intra-cluster error correlation effects. Generally, this correlation comes from two sources: (1) the design of the random effects and their assumed covariance from the multiple levels within the regression model; and (2) the correlation structure of the residuals. Unfortunately, inconspicuous errors in residual intra-cluster correlation estimates can overstate precision in forecasted S. damnosum s.l. riverine larval habitat explanatory attributes regardless of how they are treated (e.g., independent, autoregressive, Toeplitz, etc.). In this research, the geographical locations of multiple riverine-based S. damnosum s.l. larval ecosystem habitats sampled from 2 pre-established epidemiological sites in Togo were identified and recorded from July 2009 to June 2010. Initially, the data were aggregated in PROC GENMOD. An agglomerative hierarchical residual cluster-based analysis was then performed. The sampled clustered study site data were then analyzed for statistical correlations using Monthly Biting Rates (MBR). Euclidean distance measurements and terrain-related geomorphological statistics were then generated in ArcGIS. A digital overlay was then performed, also in ArcGIS, using the georeferenced ground coordinates of high- and low-density clusters stratified by Annual Biting Rates (ABR). These data were overlain onto multitemporal sub-meter pixel resolution satellite data (i.e., QuickBird 0.61 m wavebands). Orthogonal spatial filter eigenvectors were then generated in SAS/GIS.
Univariate and non-linear regression-based models (i.e., Logistic, Poisson and Negative Binomial) were also employed to determine probability distributions and to identify statistically significant parameter estimators from the sampled data. Thereafter, Durbin-Watson test statistics were used to test the null hypothesis that the regression residuals were not autocorrelated against the alternative that the residuals followed an autoregressive process in AUTOREG. Bayesian uncertainty matrices were also constructed employing normal priors for each of the sampled estimators in PROC MCMC. The residuals revealed both spatially structured and unstructured error effects in the high and low ABR-stratified clusters. The analyses also revealed that the estimators, levels of turbidity and presence of rocks, were statistically significant for the high-ABR-stratified clusters, while the estimators distance between habitats and floating vegetation were important for the low-ABR-stratified cluster. Varying and constant coefficient regression models, ABR-stratified GIS-generated clusters, sub-meter resolution satellite imagery, a robust residual intra-cluster diagnostic test, MBR-based histograms, eigendecomposition spatial filter algorithms and Bayesian matrices can enable accurate autoregressive estimation of latent uncertainty effects and other residual error probabilities (i.e., heteroskedasticity) for testing correlations between georeferenced S. damnosum s.l. riverine larval habitat estimators. The asymptotic distribution of the resulting residual-adjusted intra-cluster predictor error autocovariate coefficients can thereafter be established, while estimates of the asymptotic variance can lead to the construction of approximate confidence intervals for accurately targeting productive S. damnosum s.l. habitats based on spatiotemporal field-sampled count data.

  10. Analysis of ligand-protein exchange by Clustering of Ligand Diffusion Coefficient Pairs (CoLD-CoP)

    NASA Astrophysics Data System (ADS)

    Snyder, David A.; Chantova, Mihaela; Chaudhry, Saadia

    2015-06-01

    NMR spectroscopy is a powerful tool for describing protein structures and protein activity in pharmaceutical and biochemical development. This study describes a method to identify weakly binding ligands in biological systems by hierarchical clustering of diffusion coefficients derived from multidimensional data obtained on a 400 MHz Bruker NMR spectrometer. Comparison of DOSY spectra of ligands from the chemical library in the presence and absence of target proteins reveals changes in the translational diffusion rates of small molecules upon interaction with macromolecules. For weak binders, such as compounds found in fragment libraries, changes in diffusion rates upon macromolecular binding are on the order of the precision of DOSY diffusion measurements, and identifying such subtle shifts in diffusion requires careful statistical analysis. The "CoLD-CoP" (Clustering of Ligand Diffusion Coefficient Pairs) method presented here uses SAHN clustering to identify protein binders in a chemical library or even a not fully characterized metabolite mixture. We show how DOSY NMR and the CoLD-CoP method complement each other in identifying the most suitable binding candidates for lysozyme and wheat germ acid phosphatase.
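The general SAHN (sequential agglomerative hierarchical non-overlapping) idea behind CoLD-CoP can be sketched as follows. This is a hedged illustration, not the paper's implementation: the diffusion-coefficient pairs, the single-linkage criterion and the distance threshold are all assumptions chosen to show how binders (whose diffusion slows in the presence of protein) separate from non-binders.

```python
import math

def single_linkage(points, threshold):
    """Agglomerative clustering: repeatedly merge the two closest clusters
    until the smallest inter-cluster distance exceeds the threshold."""
    clusters = [[p] for p in points]

    def dist(a, b):  # single-linkage: closest pair of members
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        if dist(clusters[i], clusters[j]) > threshold:
            break
        clusters[i] += clusters.pop(j)
    return clusters

# (D_free, D_bound) pairs in arbitrary units: two non-binders whose
# diffusion is unchanged by the protein, two weak binders whose D drops.
ligands = [(5.0, 4.9), (5.1, 5.0), (5.0, 3.1), (4.9, 3.0)]
groups = single_linkage(ligands, threshold=1.0)
print(len(groups))  # binders and non-binders fall into separate clusters
```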

  11. DICON: interactive visual analysis of multidimensional clusters.

    PubMed

    Cao, Nan; Gotz, David; Sun, Jimeng; Qu, Huamin

    2011-12-01

    Clustering as a fundamental data analysis technique has been widely used in many analytic applications. However, it is often difficult for users to understand and evaluate multidimensional clustering results, especially the quality of clusters and their semantics. For large and complex data, high-level statistical information about the clusters is often needed for users to evaluate cluster quality while a detailed display of multidimensional attributes of the data is necessary to understand the meaning of clusters. In this paper, we introduce DICON, an icon-based cluster visualization that embeds statistical information into a multi-attribute display to facilitate cluster interpretation, evaluation, and comparison. We design a treemap-like icon to represent a multidimensional cluster, and the quality of the cluster can be conveniently evaluated with the embedded statistical information. We further develop a novel layout algorithm which can generate similar icons for similar clusters, making comparisons of clusters easier. User interaction and clutter reduction are integrated into the system to help users more effectively analyze and refine clustering results for large datasets. We demonstrate the power of DICON through a user study and a case study in the healthcare domain. Our evaluation shows the benefits of the technique, especially in support of complex multidimensional cluster analysis. © 2011 IEEE

  12. LoCuSS: THE MASS DENSITY PROFILE OF MASSIVE GALAXY CLUSTERS AT z = 0.2

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Okabe, Nobuhiro; Umetsu, Keiichi; Smith, Graham P.

    We present a stacked weak-lensing analysis of an approximately mass-selected sample of 50 galaxy clusters at 0.15 < z < 0.3, based on observations with Suprime-Cam on the Subaru Telescope. We develop a new method for selecting lensed background galaxies from which we estimate that our sample of red background galaxies suffers just 1% contamination. We detect the stacked tangential shear signal from the full sample of 50 clusters, based on this red sample of background galaxies, at a total signal-to-noise ratio of 32.7. The Navarro-Frenk-White model is an excellent fit to the data, yielding sub-10% statistical precision on mass and concentration: M_vir = 7.19 (+0.53/-0.50) × 10^14 h^-1 M_sun and c_vir = 5.41 (+0.49/-0.45), with c_200 = 4.22 (+0.40/-0.36). Tests of a range of possible systematic errors, including shear calibration and stacking-related issues, indicate that they are subdominant to the statistical errors. The concentration parameter obtained from stacking our approximately mass-selected cluster sample is broadly in line with theoretical predictions. Moreover, the uncertainty on our measurement is comparable with the differences between the different predictions in the literature. Overall, our results highlight the potential for stacked weak-lensing methods to probe the mean mass density profile of cluster-scale dark matter halos with upcoming surveys, including Hyper Suprime-Cam, the Dark Energy Survey, and KiDS.

  13. Multimorbidity and health-related quality of life (HRQoL) in a nationally representative population sample: implications of count versus cluster method for defining multimorbidity on HRQoL.

    PubMed

    Wang, Lili; Palmer, Andrew J; Cocker, Fiona; Sanderson, Kristy

    2017-01-09

    No universally accepted definition of multimorbidity (MM) exists, and the implications of different definitions have not been explored. This study examined the performance of the count and cluster definitions of multimorbidity with respect to the sociodemographic profile and health-related quality of life (HRQoL) in a general population. Data were derived from the nationally representative 2007 Australian National Survey of Mental Health and Wellbeing (n = 8841). HRQoL scores were measured using the Assessment of Quality of Life (AQoL-4D) instrument. The simple count (2+ and 3+ conditions) and hierarchical cluster methods were used to define/identify clusters of multimorbidity. Linear regression was used to assess the associations between HRQoL and multimorbidity as defined by the different methods. Multimorbidity defined using the count method had a prevalence of 26% (MM2+) and 10.1% (MM3+). Statistically significant clusters identified through hierarchical cluster analysis included heart or circulatory conditions (CVD)/arthritis (cluster-1, 9%) and major depressive disorder (MDD)/anxiety (cluster-2, 4%). A sensitivity analysis supported the stability of the clusters obtained from hierarchical clustering. The sociodemographic profiles were similar between MM2+, MM3+ and cluster-1, but differed from cluster-2. HRQoL was negatively associated with MM2+ (β: -0.18, SE: -0.01, p < 0.001), MM3+ (β: -0.23, SE: -0.02, p < 0.001), cluster-1 (β: -0.10, SE: 0.01, p < 0.001) and cluster-2 (β: -0.36, SE: 0.01, p < 0.001). Our findings confirm an inverse relationship between multimorbidity and HRQoL in the Australian population, and this head-to-head comparison indicates that the hierarchical clustering approach is valid when the outcome of interest is HRQoL. Moreover, a simple count fails to identify whether specific conditions of interest are driving poorer HRQoL. Researchers should exercise caution when selecting a definition of multimorbidity because it may significantly influence study outcomes.
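The "simple count" definition contrasted in the study reduces to a threshold on the number of chronic conditions per person. A minimal sketch, with illustrative condition names and a toy sample rather than the survey data:

```python
# MM2+ / MM3+ count definition of multimorbidity: a person is multimorbid
# under MMk+ if they report at least k chronic conditions.

def multimorbid(conditions, k):
    return len(conditions) >= k

# Toy sample of three respondents (condition labels are illustrative).
people = [
    {"arthritis", "cvd"},            # MM2+ but not MM3+
    {"mdd", "anxiety", "diabetes"},  # MM2+ and MM3+
    {"asthma"},                      # neither
]
mm2 = sum(multimorbid(p, 2) for p in people)
mm3 = sum(multimorbid(p, 3) for p in people)
print(mm2, mm3)  # MM2+ captures more people than the stricter MM3+
```

Note how the count alone says nothing about *which* conditions co-occur, which is exactly the information the cluster definition adds.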

  14. A method for determining the radius of an open cluster from stellar proper motions

    NASA Astrophysics Data System (ADS)

    Sánchez, Néstor; Alfaro, Emilio J.; López-Martínez, Fátima

    2018-04-01

    We propose a method for calculating the radius of an open cluster in an objective way from an astrometric catalogue containing, at least, positions and proper motions. It uses the minimum spanning tree in proper motion space to discriminate cluster stars from field stars, and it quantifies the strength of the cluster-field separation by means of a statistical parameter defined for the first time in this paper. This is done for a range of different sampling radii, from which the cluster radius is obtained as the size at which the best cluster-field separation is achieved. The novelty of this strategy is that the cluster radius is obtained independently of how its stars are spatially distributed. We test the reliability and robustness of the method with both simulated and real data from a well-studied open cluster (NGC 188), and apply it to UCAC4 data for five other open clusters with different catalogued radius values. NGC 188, NGC 1647, NGC 6603, and Ruprecht 155 yielded unambiguous radius values of 15.2 ± 1.8, 29.4 ± 3.4, 4.2 ± 1.7, and 7.0 ± 0.3 arcmin, respectively. ASCC 19 and Collinder 471 showed more than one possible solution, but it is not possible to know whether this is due to the involved uncertainties or to the presence of complex patterns in their proper motion distributions, something that could be inherent to the physical object or due to the way in which the catalogue was sampled.
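The core ingredient, a minimum spanning tree built in proper-motion space, can be sketched in a few lines. This is an illustration of the idea rather than the authors' code; the proper motions below are invented, and the separation statistic itself (defined in the paper) is not reproduced here.

```python
import math

def mst_edges(points):
    """Prim's algorithm; returns the list of MST edge lengths."""
    in_tree = {0}
    edges = []
    while len(in_tree) < len(points):
        d, j = min((math.dist(points[i], points[j]), j)
                   for i in in_tree for j in range(len(points))
                   if j not in in_tree)
        edges.append(d)
        in_tree.add(j)
    return edges

# Proper motions (mas/yr): cluster members share nearly the same proper
# motion and form a tight clump; field stars are scattered.
cluster = [(10.0, -5.0), (10.1, -5.1), (9.9, -4.9), (10.05, -5.05)]
field = [(2.0, 3.0), (-6.0, 1.0), (4.0, -9.0)]
edges = sorted(mst_edges(cluster + field))
# Short edges connect cluster members; the long edges that attach field
# stars are what makes the cluster-field separation measurable.
print(edges[0] < 0.3, edges[-1] > 3.0)
```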

  15. Clustering Genes of Common Evolutionary History

    PubMed Central

    Gori, Kevin; Suchan, Tomasz; Alvarez, Nadir; Goldman, Nick; Dessimoz, Christophe

    2016-01-01

    Phylogenetic inference can potentially result in a more accurate tree using data from multiple loci. However, if the loci are incongruent—due to events such as incomplete lineage sorting or horizontal gene transfer—it can be misleading to infer a single tree. To address this, many previous contributions have taken a mechanistic approach, by modeling specific processes. Alternatively, one can cluster loci without assuming how these incongruencies might arise. Such “process-agnostic” approaches typically infer a tree for each locus and cluster these. There are, however, many possible combinations of tree distance and clustering methods; their comparative performance in the context of tree incongruence is largely unknown. Furthermore, because standard model selection criteria such as AIC cannot be applied to problems with a variable number of topologies, the issue of inferring the optimal number of clusters is poorly understood. Here, we perform a large-scale simulation study of phylogenetic distances and clustering methods to infer loci of common evolutionary history. We observe that the best-performing combinations are distances accounting for branch lengths followed by spectral clustering or Ward’s method. We also introduce two statistical tests to infer the optimal number of clusters and show that they strongly outperform the silhouette criterion, a general-purpose heuristic. We illustrate the usefulness of the approach by 1) identifying errors in a previous phylogenetic analysis of yeast species and 2) identifying topological incongruence among newly sequenced loci of the globeflower fly genus Chiastocheta. We release treeCl, a new program to cluster genes of common evolutionary history (http://git.io/treeCl). PMID:26893301

  16. Multi-particle correlations in transverse momenta from statistical clusters

    NASA Astrophysics Data System (ADS)

    Bialas, Andrzej; Bzdak, Adam

    2016-09-01

    We evaluate n-particle (n = 2, 3, 4, 5) transverse momentum correlations for pions and kaons following from the decay of statistical clusters. These correlation functions could provide strong constraints on the possible existence of thermal clusters in the process of particle production.

  17. Retrospective space-time cluster analysis of whooping cough, re-emergence in Barcelona, Spain, 2000-2011.

    PubMed

    Solano, Rubén; Gómez-Barroso, Diana; Simón, Fernando; Lafuente, Sarah; Simón, Pere; Rius, Cristina; Gorrindo, Pilar; Toledo, Diana; Caylà, Joan A

    2014-05-01

    A retrospective space-time study of whooping cough cases reported to the Public Health Agency of Barcelona, Spain between 2000 and 2011 is presented. It is based on 633 individual whooping cough cases and the 2006 population census from the Spanish National Statistics Institute, stratified by age and sex at the census tract level. Cluster identification was attempted using the space-time scan statistic, assuming a Poisson distribution and restricting the temporal extent to 7 days and the spatial distance to 500 m. Statistical calculations were performed with Stata 11 and SaTScan, and mapping was performed with ArcGIS 10.0. Only clusters showing statistical significance (P < 0.05) were mapped. The most likely cluster identified included five census tracts located in three neighbourhoods in central Barcelona during the week from 17 to 23 August 2011. This cluster included five cases compared with an expected count of 0.0021 (relative risk = 2436, P < 0.001). In addition, 11 secondary significant space-time clusters were detected, occurring at different times and locations. Spatial statistics are felt to be useful in complementing epidemiological surveillance systems by visualizing excesses in the number of cases in space and time, thus increasing the possibility of identifying outbreaks not reported by the surveillance system.
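The score at the heart of the Poisson space-time scan statistic is a likelihood ratio comparing observed and expected counts inside a candidate cylinder against the rest of the study region. A hedged sketch (Kulldorff's Poisson formulation, not the SaTScan source code); the numbers echo the scale reported above but significance would still require Monte Carlo permutation, which is omitted here.

```python
import math

def poisson_llr(c, expected, total):
    """Log likelihood ratio for a candidate space-time cluster with
    observed count c and expected count `expected`, out of `total` cases.
    Zero when the zone is not elevated (c <= expected)."""
    if c <= expected:
        return 0.0
    inside = c * math.log(c / expected)
    outside = (total - c) * math.log((total - c) / (total - expected))
    return inside + outside

# Five observed cases where only ~0.0021 were expected, out of 633 total:
llr = poisson_llr(5, 0.0021, 633)
print(round(llr, 1))  # a very large score for such an extreme excess
```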

  18. Identifying irregularly shaped crime hot-spots using a multiobjective evolutionary algorithm

    NASA Astrophysics Data System (ADS)

    Wu, Xiaolan; Grubesic, Tony H.

    2010-12-01

    Spatial cluster detection techniques are widely used in criminology, geography, epidemiology, and other fields. In particular, spatial scan statistics are popular and efficient techniques for detecting areas of elevated crime or disease events. The majority of spatial scan approaches attempt to delineate geographic zones by evaluating the significance of clusters using likelihood ratio statistics tested with the Poisson distribution. While this can be effective, many scan statistics give preference to circular clusters, diminishing their ability to identify elongated and/or irregular shaped clusters. Although adjusting the shape of the scan window can mitigate some of these problems, both the significance of irregular clusters and their spatial structure must be accounted for in a meaningful way. This paper utilizes a multiobjective evolutionary algorithm to find clusters with maximum significance while quantitatively tracking their geographic structure. Crime data for the city of Cincinnati are utilized to demonstrate the advantages of the new approach and highlight its benefits versus more traditional scan statistics.

  19. Application of the Linux cluster for exhaustive window haplotype analysis using the FBAT and Unphased programs.

    PubMed

    Mishima, Hiroyuki; Lidral, Andrew C; Ni, Jun

    2008-05-28

    Genetic association studies have been used to map disease-causing genes. A newly introduced statistical method, called exhaustive haplotype association study, analyzes genetic information consisting of different numbers and combinations of DNA sequence variations along a chromosome. Such studies involve a large number of statistical calculations and subsequently high computing power. It is possible to develop parallel algorithms and codes to perform the calculations on a high performance computing (HPC) system. However, most existing commonly-used statistic packages for genetic studies are non-parallel versions. Alternatively, one may use the cutting-edge technology of grid computing and its packages to conduct non-parallel genetic statistical packages on a centralized HPC system or distributed computing systems. In this paper, we report the utilization of a queuing scheduler built on the Grid Engine and run on a Rocks Linux cluster for our genetic statistical studies. Analysis of both consecutive and combinational window haplotypes was conducted by the FBAT (Laird et al., 2000) and Unphased (Dudbridge, 2003) programs. The dataset consisted of 26 loci from 277 extended families (1484 persons). Using the Rocks Linux cluster with 22 compute-nodes, FBAT jobs performed about 14.4-15.9 times faster, while Unphased jobs performed 1.1-18.6 times faster compared to the accumulated computation duration. Execution of exhaustive haplotype analysis using non-parallel software packages on a Linux-based system is an effective and efficient approach in terms of cost and performance.

  1. Statistical uncertainty of extreme wind storms over Europe derived from a probabilistic clustering technique

    NASA Astrophysics Data System (ADS)

    Walz, Michael; Leckebusch, Gregor C.

    2016-04-01

    Extratropical wind storms pose one of the most dangerous and loss-intensive natural hazards for Europe. However, with only about 50 years of high-quality observational data, it is difficult to assess the statistical uncertainty of these sparse events on the basis of observations alone. Over the last decade, seasonal ensemble forecasts have become indispensable in quantifying the uncertainty of weather prediction on seasonal timescales. In this study seasonal forecasts are used in a climatological context: by making use of the up to 51 ensemble members, a broad and physically consistent statistical base can be created. This base can then be used to assess the statistical uncertainty of extreme wind storm occurrence more accurately. In order to determine the statistical uncertainty of storms with different paths of progression, a probabilistic clustering approach using regression mixture models is used to objectively assign storm tracks (based either on core pressure or on extreme wind speeds) to different clusters. The advantage of this technique is that the entire lifetime of a storm is considered by the clustering algorithm. Quadratic curves are found to describe the storm tracks most accurately. Three main clusters (diagonal, horizontal or vertical progression of the storm track) can be identified, each of which has its own particular features. Basic storm features such as average velocity and duration are calculated and compared for each cluster. The main benefit of this clustering technique, however, is to evaluate whether the clusters show different degrees of uncertainty, e.g. more (less) spread for tracks approaching Europe horizontally (diagonally). This statistical uncertainty is compared across different seasonal forecast products.

  2. Towards Accurate Modelling of Galaxy Clustering on Small Scales: Testing the Standard ΛCDM + Halo Model

    NASA Astrophysics Data System (ADS)

    Sinha, Manodeep; Berlind, Andreas A.; McBride, Cameron K.; Scoccimarro, Roman; Piscionere, Jennifer A.; Wibking, Benjamin D.

    2018-04-01

    Interpreting the small-scale clustering of galaxies with halo models can elucidate the connection between galaxies and dark matter halos. Unfortunately, the modelling is typically not sufficiently accurate for ruling out models statistically. It is thus difficult to use the information encoded in small scales to test cosmological models or probe subtle features of the galaxy-halo connection. In this paper, we attempt to push halo modelling into the "accurate" regime with a fully numerical mock-based methodology and careful treatment of statistical and systematic errors. With our forward-modelling approach, we can incorporate clustering statistics beyond the traditional two-point statistics. We use this modelling methodology to test the standard ΛCDM + halo model against the clustering of SDSS DR7 galaxies. Specifically, we use the projected correlation function, group multiplicity function and galaxy number density as constraints. We find that while the model fits each statistic separately, it struggles to fit them simultaneously. Adding group statistics leads to a more stringent test of the model and significantly tighter constraints on model parameters. We explore the impact of varying the adopted halo definition and cosmological model and find that changing the cosmology makes a significant difference. The most successful model we tried (Planck cosmology with Mvir halos) matches the clustering of low luminosity galaxies, but exhibits a 2.3σ tension with the clustering of luminous galaxies, thus providing evidence that the "standard" halo model needs to be extended. This work opens the door to adding interesting freedom to the halo model and including additional clustering statistics as constraints.

  3. A geographic analysis of individual and environmental risk factors for hypospadias births

    PubMed Central

    Winston, Jennifer J; Meyer, Robert E; Emch, Michael E

    2014-01-01

    Background Hypospadias is a relatively common birth defect affecting the male urinary tract. We explored the etiology of hypospadias by examining its spatial distribution in North Carolina and the spatial clustering of residuals from individual and environmental risk factors. Methods We used data collected by the North Carolina Birth Defects Monitoring Program from 2003 to 2005 to estimate local Moran's I statistics to identify geographic clustering of overall and severe hypospadias, using 995 overall cases and 16,013 controls. We conducted logistic regression and computed local Moran's I statistics on standardized residuals to assess the contribution of individual variables (maternal age, maternal race/ethnicity, maternal education, smoking, parity, and diabetes) and environmental variables (block group land cover) to this clustering. Results Local Moran's I statistics indicated significant clustering of overall and severe hypospadias in eastern central North Carolina. Spatial clustering of hypospadias persisted when controlling for individual factors, but diminished somewhat when controlling for environmental factors. In adjusted models, maternal residence in a block group with more than 5% crop cover was associated with overall hypospadias (OR = 1.22; 95% CI = 1.04-1.43); that is, living in a block group with greater than 5% crop cover was associated with a 22% increase in the odds of having a baby with hypospadias. Land cover was not associated with severe hypospadias. Conclusions This study illustrates the potential contribution of mapping in generating hypotheses about disease etiology. Results suggest that environmental factors, including proximity to agriculture, may play some role in the spatial distribution of hypospadias. PMID:25196538
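The local Moran's I statistic used here flags a site whose value and whose neighbours' values deviate from the overall mean in the same direction. An illustrative sketch with binary contiguity weights and made-up area rates (not the study's data or its significance testing):

```python
# Local Moran's I for site i: I_i = (z_i / m2) * sum_j w_ij * z_j,
# where z are deviations from the mean and m2 = sum(z^2) / n.
# Large positive I_i marks a high-high or low-low spatial cluster.

def local_morans_i(values, neighbors):
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    m2 = sum(zi * zi for zi in z) / n
    return [z[i] / m2 * sum(z[j] for j in neighbors[i]) for i in range(n)]

# Six toy areas: 0-2 form a high-rate pocket, 3-5 a low-rate pocket.
rates = [9.0, 8.5, 9.5, 1.0, 1.5, 0.5]
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1],
             3: [4, 5], 4: [3, 5], 5: [3, 4]}
li = local_morans_i(rates, neighbors)
print(all(v > 0 for v in li))  # both pockets register positive local clustering
```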

  4. Use of Spatial Epidemiology and Hot Spot Analysis to Target Women Eligible for Prenatal Women, Infants, and Children Services

    PubMed Central

    Krawczyk, Christopher; Gradziel, Pat; Geraghty, Estella M.

    2014-01-01

    Objectives. We used a geographic information system and cluster analyses to determine locations in need of enhanced Special Supplemental Nutrition Program for Women, Infants, and Children (WIC) Program services. Methods. We linked documented births in the 2010 California Birth Statistical Master File with the 2010 data from the WIC Integrated Statewide Information System. Analyses focused on the density of pregnant women who were eligible for but not receiving WIC services in California’s 7049 census tracts. We used incremental spatial autocorrelation and hot spot analyses to identify clusters of WIC-eligible nonparticipants. Results. We detected clusters of census tracts with higher-than-expected densities, compared with the state mean density of WIC-eligible nonparticipants, in 21 of 58 (36.2%) California counties (P < .05). In subsequent county-level analyses, we located neighborhood-level clusters of higher-than-expected densities of eligible nonparticipants in Sacramento, San Francisco, Fresno, and Los Angeles Counties (P < .05). Conclusions. Hot spot analyses provided a rigorous and objective approach to determine the locations of statistically significant clusters of WIC-eligible nonparticipants. Results helped inform WIC program and funding decisions, including the opening of new WIC centers, and offered a novel approach for targeting public health services. PMID:24354821

  5. Spatial clustering of metal and metalloid mixtures in unregulated water sources on the Navajo Nation - Arizona, New Mexico, and Utah, USA.

    PubMed

    Hoover, Joseph H; Coker, Eric; Barney, Yolanda; Shuey, Chris; Lewis, Johnnye

    2018-08-15

    Contaminant mixtures are identified regularly in public and private drinking water supplies throughout the United States; however, the complex and often correlated nature of mixtures makes identification of relevant combinations challenging. This study employed a Bayesian clustering method to identify subgroups of water sources with similar metal and metalloid profiles. Additionally, a spatial scan statistic assessed spatial clustering of these subgroups and a human health metric was applied to investigate potential for human toxicity. These methods were applied to a dataset comprised of metal and metalloid measurements from unregulated water sources located on the Navajo Nation, in the southwest United States. Results indicated distinct subgroups of water sources with similar contaminant profiles and that some of these subgroups were spatially clustered. Several profiles had metal and metalloid concentrations that may have potential for human toxicity including arsenic, uranium, lead, manganese, and selenium. This approach may be useful for identifying mixtures in water sources, spatially evaluating the clusters, and help inform toxicological research investigating mixtures. Copyright © 2018 The Authors. Published by Elsevier B.V. All rights reserved.

  6. Using data mining to segment healthcare markets from patients' preference perspectives.

    PubMed

    Liu, Sandra S; Chen, Jie

    2009-01-01

    This paper aims to provide an example of how to use data mining techniques to identify patient segments regarding preferences for healthcare attributes and their demographic characteristics. Data were derived from a number of individuals who received in-patient care at a health network in 2006. Data mining and conventional hierarchical clustering with average linkage and Pearson correlation procedures are employed and compared to show how each procedure best determines segmentation variables. Data mining tools identified three differentiable segments by means of cluster analysis. These three clusters have significantly different demographic profiles. The study reveals, when compared with traditional statistical methods, that data mining provides an efficient and effective tool for market segmentation. When there are numerous cluster variables involved, researchers and practitioners need to incorporate factor analysis for reducing variables to clearly and meaningfully understand clusters. Interests and applications in data mining are increasing in many businesses. However, this technology is seldom applied to healthcare customer experience management. The paper shows that efficient and effective application of data mining methods can aid the understanding of patient healthcare preferences.

  7. Analysis of the effects of the global financial crisis on the Turkish economy, using hierarchical methods

    NASA Astrophysics Data System (ADS)

    Kantar, Ersin; Keskin, Mustafa; Deviren, Bayram

    2012-04-01

    We have analyzed the topology of 50 important Turkish companies for the period 2006-2010 using the concept of hierarchical methods (the minimal spanning tree (MST) and hierarchical tree (HT)). We investigated the statistical reliability of links between companies in the MST by using the bootstrap technique. We also used the average linkage cluster analysis (ALCA) technique to observe the cluster structures much better. The MST and HT are known as useful tools to perceive and detect global structure, taxonomy, and hierarchy in financial data. We obtained four clusters of companies according to their proximity. We also observed that the Banks and Holdings cluster always forms in the centre of the MSTs for the periods 2006-2007, 2008, and 2009-2010. The clusters match nicely with their common production activities or their strong interrelationship. The effects of the Automobile sector increased after the global financial crisis due to the temporary incentives provided by the Turkish government. We find that Turkish companies were not very affected by the global financial crisis.

  8. Measuring Health Information Dissemination and Identifying Target Interest Communities on Twitter: Methods Development and Case Study of the @SafetyMD Network.

    PubMed

    Kandadai, Venk; Yang, Haodong; Jiang, Ling; Yang, Christopher C; Fleisher, Linda; Winston, Flaura Koplin

    2016-05-05

    Little is known about the ability of individual stakeholder groups to achieve health information dissemination goals through Twitter. This study aimed to develop and apply methods for the systematic evaluation and optimization of health information dissemination by stakeholders through Twitter. Tweet content from 1790 followers of @SafetyMD (July-November 2012) was examined. User emphasis, a new indicator of Twitter information dissemination, was defined and applied to retweets across two levels of retweeters originating from @SafetyMD. User interest clusters were identified based on principal component analysis (PCA) and hierarchical cluster analysis (HCA) of a random sample of 170 followers. User emphasis of keywords remained across levels but decreased by 9.5 percentage points. PCA and HCA identified 12 statistically unique clusters of followers within the @SafetyMD Twitter network. This study is one of the first to develop methods for use by stakeholders to evaluate and optimize their use of Twitter to disseminate health information. Our new methods provide preliminary evidence that individual stakeholders can evaluate the effectiveness of health information dissemination and create content-specific clusters for more specific targeted messaging.

  9. Exploring the individual patterns of spiritual well-being in people newly diagnosed with advanced cancer: a cluster analysis.

    PubMed

    Bai, Mei; Dixon, Jane; Williams, Anna-Leila; Jeon, Sangchoon; Lazenby, Mark; McCorkle, Ruth

    2016-11-01

    Research shows that spiritual well-being correlates positively with quality of life (QOL) for people with cancer, whereas contradictory findings are frequently reported with respect to the differentiated associations between the dimensions of spiritual well-being, namely peace, meaning and faith, and QOL. This study aimed to examine individual patterns of spiritual well-being among patients newly diagnosed with advanced cancer. Cluster analysis was based on the 12 items of the Functional Assessment of Chronic Illness Therapy-Spiritual Well-Being Scale at Time 1. A combination of hierarchical and k-means (non-hierarchical) clustering methods was employed to jointly determine the number of clusters. Self-rated health, depressive symptoms, peace, meaning and faith, and overall QOL were compared at Time 1 and Time 2. Hierarchical and k-means clustering methods both suggested four clusters. Comparison of the four clusters supported statistically significant and clinically meaningful differences in QOL outcomes among clusters while revealing contrasting relations of faith with QOL. Cluster 1, Cluster 3, and Cluster 4 represented high, medium, and low levels of overall QOL, respectively, with correspondingly high, medium, and low levels of peace, meaning, and faith. Cluster 2 was distinguished from the other clusters by its medium levels of overall QOL, peace, and meaning and its low level of faith. This study provides empirical support for individual differences in response to a newly diagnosed cancer and brings into focus conceptual and methodological challenges associated with the measurement of spiritual well-being, which may partly contribute to the attenuated relation between faith and QOL.
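The combined strategy of using hierarchical clustering to suggest the number of clusters and k-means to refine the assignments can be sketched in simplified form. This is a hedged, one-dimensional toy (invented scores standing in for the 12 scale items), with the hierarchical step reduced to its 1-D essence: large jumps in the sorted-gap structure mark cluster boundaries.

```python
def choose_k_by_gap(xs):
    """1-D stand-in for the hierarchical step: big gaps between sorted
    points mark cluster boundaries; k = 1 + number of gaps well above
    the median gap (the 4x factor is an illustrative threshold)."""
    s = sorted(xs)
    gaps = [b - a for a, b in zip(s, s[1:])]
    med = sorted(gaps)[len(gaps) // 2]
    return 1 + sum(g > 4 * med for g in gaps)

def kmeans_1d(xs, k, iters=20):
    """Lloyd's algorithm in 1-D, refining the assignments for a fixed k."""
    centers = sorted(xs)[:: max(1, len(xs) // k)][:k]
    for _ in range(iters):
        groups = [[] for _ in centers]
        for x in xs:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            groups[nearest].append(x)
        centers = [sum(g) / len(g) for g in groups if g]
    return sorted(centers)

scores = [1.0, 1.2, 0.9, 5.0, 5.2, 4.8, 9.1, 8.9, 9.0]
k = choose_k_by_gap(scores)      # hierarchical view suggests k
print(k, kmeans_1d(scores, k))   # k-means refines the cluster centers
```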

  10. Lagrangian analysis by clustering. An example in the Nordic Seas.

    NASA Astrophysics Data System (ADS)

    Koszalka, Inga; Lacasce, Joseph H.

    2010-05-01

    We propose a new method for obtaining average velocities and eddy diffusivities from Lagrangian data. Rather than grouping the drifter-derived velocities in uniform geographical bins, as is commonly done, we group a specified number of nearest-neighbor velocities. This is done via a clustering algorithm operating on the instantaneous positions of the drifters. Thus it is the data distribution itself which determines the positions of the averages and the areal extent of the clusters. A major advantage is that because the number of members is essentially the same for all clusters, the statistical accuracy is more uniform than with geographical bins. We illustrate the technique using synthetic data from a stochastic model, employing a realistic mean flow. The latter is an accurate representation of the surface currents in the Nordic Seas and is strongly inhomogeneous in space. We use the clustering algorithm to extract the mean velocities and diffusivities (both of which are known from the stochastic model). We also compare the results to those obtained with fixed geographical bins. Clustering is more successful at capturing spatial variability of the mean flow and also improves convergence in the eddy diffusivity estimates. We discuss both the future prospects and shortcomings of the new method.
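    A minimal sketch of the cluster-averaging idea (grouping nearest-neighbor drifter observations rather than fixed geographical bins, so each average has roughly the same number of members) might look as follows; the k-means grouping, the synthetic mean flow, and all parameter choices are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n = 3000
pos = rng.uniform(0, 10, size=(n, 2))                    # drifter positions (synthetic)
u = np.sin(pos[:, 1]) + 0.1 * rng.standard_normal(n)     # zonal velocity: mean flow + eddy noise
v = 0.1 * rng.standard_normal(n)                         # meridional velocity: eddy noise only

# Group velocities by nearest-neighbor position clusters instead of fixed bins,
# choosing n_clusters so each cluster holds roughly the same number of members.
target_members = 100
k = n // target_members
km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(pos)

# Cluster-averaged velocities: mean-flow estimates at data-determined locations
centers = km.cluster_centers_
u_mean = np.array([u[km.labels_ == c].mean() for c in range(k)])
v_mean = np.array([v[km.labels_ == c].mean() for c in range(k)])
sizes = np.bincount(km.labels_, minlength=k)
```

Because cluster sizes are roughly equal, the standard error of each averaged velocity is more uniform than with geographical bins, where sparsely sampled bins give noisy estimates.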

  11. Model-based Clustering of Categorical Time Series with Multinomial Logit Classification

    NASA Astrophysics Data System (ADS)

    Frühwirth-Schnatter, Sylvia; Pamminger, Christoph; Winter-Ebmer, Rudolf; Weber, Andrea

    2010-09-01

    A common problem in many areas of applied statistics is to identify groups of similar time series in a panel of time series. However, distance-based clustering methods cannot easily be extended to time series data, where an appropriate distance measure is rather difficult to define, particularly for discrete-valued time series. Markov chain clustering, proposed by Pamminger and Frühwirth-Schnatter [6], is an approach for clustering discrete-valued time series obtained by observing a categorical variable with several states. This model-based clustering method is based on finite mixtures of first-order time-homogeneous Markov chain models. In order to further explain group membership, we present an extension to the approach of Pamminger and Frühwirth-Schnatter [6] by formulating a probabilistic model for the latent group indicators within the Bayesian classification rule using a multinomial logit model. The parameters are estimated for a fixed number of clusters within a Bayesian framework using a Markov chain Monte Carlo (MCMC) sampling scheme, a full Gibbs-type sampler that involves only draws from standard distributions. Finally, an application to a panel of Austrian wage mobility data is presented, which leads to an interesting segmentation of the Austrian labour market.
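    As a much-simplified, non-Bayesian stand-in for Markov chain clustering, one can estimate a first-order transition matrix for each series and cluster the flattened matrices; the finite-mixture and MCMC machinery of the paper is replaced here by plain k-means purely for illustration, and all data are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
S, T = 3, 200  # number of states, time points per series

def simulate_chain(P, T, rng):
    """Simulate a first-order Markov chain with transition matrix P."""
    x = [0]
    for _ in range(T - 1):
        x.append(rng.choice(len(P), p=P[x[-1]]))
    return np.array(x)

# Two groups of chains with different transition dynamics
P_a = np.array([[.8, .1, .1], [.1, .8, .1], [.1, .1, .8]])  # "sticky" regime
P_b = np.full((S, S), 1 / 3)                                # "mobile" regime
chains = [simulate_chain(P_a, T, rng) for _ in range(20)] + \
         [simulate_chain(P_b, T, rng) for _ in range(20)]

def transition_matrix(x, S):
    """Row-normalized empirical transition counts."""
    C = np.zeros((S, S))
    for a, b in zip(x[:-1], x[1:]):
        C[a, b] += 1
    return C / np.maximum(C.sum(axis=1, keepdims=True), 1)

feats = np.array([transition_matrix(x, S).ravel() for x in chains])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(feats)
```

The model-based mixture approach of the paper additionally yields posterior group-membership probabilities and lets covariates enter through the multinomial logit prior, which this sketch does not attempt.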

  12. Spatio-Temporal Analysis of Smear-Positive Tuberculosis in the Sidama Zone, Southern Ethiopia

    PubMed Central

    Dangisso, Mesay Hailu; Datiko, Daniel Gemechu; Lindtjørn, Bernt

    2015-01-01

    Background Tuberculosis (TB) is a disease of public health concern, with a varying distribution across settings depending on socio-economic status, HIV burden, and the availability and performance of the health system. Ethiopia is a country with a high burden of TB, with regional variations in TB case notification rates (CNRs). However, TB program reports are often compiled and reported at higher administrative units that do not show the burden at lower units, so there is limited information about the spatial distribution of the disease. We therefore aimed to assess the spatial distribution and presence of spatio-temporal clustering of the disease in different geographic settings over 10 years in the Sidama Zone in southern Ethiopia. Methods A retrospective space-time and spatial analysis were carried out at the kebele level (the lowest administrative unit within a district) to identify spatial and space-time clusters of smear-positive pulmonary TB (PTB). Spatial scan statistics, Global Moran's I, and Getis-Ord (Gi*) statistics were used to analyze the spatial distribution and clusters of the disease across settings. Results A total of 22,545 smear-positive PTB cases notified over 10 years were used for spatial analysis. In a purely spatial analysis, we identified the most likely cluster of smear-positive PTB in 192 kebeles in eight districts (RR = 2, p < 0.001), with 12,155 observed and 8,668 expected cases. The Gi* statistic also identified clusters in the same areas, and the spatial clusters showed stability in most areas in each year of the study period. The space-time analysis also detected the most likely cluster in 193 kebeles in the same eight districts (RR = 1.92, p < 0.001), with 7,584 observed and 4,738 expected cases in 2003-2012. Conclusion The study found variations in CNRs and significant spatio-temporal clusters of smear-positive PTB in the Sidama Zone.
The findings can be used to guide TB control programs to devise effective TB control strategies for the geographic areas characterized by the highest CNRs. Further studies are required to understand the factors associated with clustering based on individual level locations and investigation of cases. PMID:26030162
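    Of the statistics named in the Methods, Global Moran's I is simple enough to compute directly: I = (N / sum(W)) * sum_ij w_ij z_i z_j / sum_i z_i^2, where z are the mean-centred rates and W is a spatial-weights matrix. The sketch below uses a toy grid with a planted hot zone, not the Sidama data; the rook-contiguity weights are an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(3)
# Toy grid of notification rates with a "hot" block in one corner
n = 10
rates = rng.normal(100, 10, size=(n, n))
rates[:4, :4] += 60  # planted spatial cluster of high rates

# Rook-contiguity weights: cells sharing an edge are neighbors
coords = [(i, j) for i in range(n) for j in range(n)]
N = len(coords)
W = np.zeros((N, N))
for a, (i, j) in enumerate(coords):
    for b, (k, l) in enumerate(coords):
        if abs(i - k) + abs(j - l) == 1:
            W[a, b] = 1

x = rates.ravel()
z = x - x.mean()
I = (N / W.sum()) * (z @ W @ z) / (z @ z)  # Global Moran's I
```

Values of I near zero indicate spatial randomness; the planted block drives I well above zero here. Significance is usually judged against a permutation distribution, which this sketch omits.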

  13. Robust continuous clustering

    PubMed Central

    Shah, Sohil Atul

    2017-01-01

    Clustering is a fundamental procedure in the analysis of scientific data. It is used ubiquitously across the sciences. Despite decades of research, existing clustering algorithms have limited effectiveness in high dimensions and often require tuning parameters for different domains and datasets. We present a clustering algorithm that achieves high accuracy across multiple domains and scales efficiently to high dimensions and large datasets. The presented algorithm optimizes a smooth continuous objective, which is based on robust statistics and allows heavily mixed clusters to be untangled. The continuous nature of the objective also allows clustering to be integrated as a module in end-to-end feature learning pipelines. We demonstrate this by extending the algorithm to perform joint clustering and dimensionality reduction by efficiently optimizing a continuous global objective. The presented approach is evaluated on large datasets of faces, hand-written digits, objects, newswire articles, sensor readings from the Space Shuttle, and protein expression levels. Our method achieves high accuracy across all datasets, outperforming the best prior algorithm by a factor of 3 in average rank. PMID:28851838

  14. Weak lensing calibration of mass bias in the REFLEX+BCS X-ray galaxy cluster catalogue

    NASA Astrophysics Data System (ADS)

    Simet, Melanie; Battaglia, Nicholas; Mandelbaum, Rachel; Seljak, Uroš

    2017-04-01

    The use of large, X-ray-selected galaxy cluster catalogues for cosmological analyses requires a thorough understanding of the X-ray mass estimates. Weak gravitational lensing is an ideal method to shed light on such issues, owing to its insensitivity to the cluster dynamical state. We perform a weak lensing calibration of 166 galaxy clusters from the REFLEX and BCS cluster catalogue and compare our results to the X-ray masses based on scaled luminosities from that catalogue. To interpret the weak lensing signal in terms of cluster masses, we compare the lensing signal to simple theoretical Navarro-Frenk-White models and to simulated cluster lensing profiles, including complications such as cluster substructure, projected large-scale structure and Eddington bias. We find evidence of underestimation in the X-ray masses, as expected, with an X-ray-to-lensing mass ratio of 0.75 ± 0.07 (stat.) ± 0.05 (sys.) for our best-fitting model. The biases in cosmological parameters in a typical cluster abundance measurement that ignores this mass bias will typically exceed the statistical errors.

  15. Cluster-randomized Studies in Educational Research: Principles and Methodological Aspects

    PubMed Central

    Dreyhaupt, Jens; Mayer, Benjamin; Keis, Oliver; Öchsner, Wolfgang; Muche, Rainer

    2017-01-01

    An increasing number of studies are being performed in educational research to evaluate new teaching methods and approaches. These studies could be performed more efficiently and deliver more convincing results if they more strictly applied and complied with recognized standards of scientific studies. Such an approach could substantially increase the quality in particular of prospective, two-arm (intervention) studies that aim to compare two different teaching methods. A key standard in such studies is randomization, which can minimize systematic bias in study findings; such bias may result if the two study arms are not structurally equivalent. If possible, educational research studies should also achieve this standard, although this is not yet generally the case. Some difficulties and concerns exist, particularly regarding organizational and methodological aspects. An important point to consider in educational research studies is that usually individuals cannot be randomized, because of the teaching situation, and instead whole groups have to be randomized (so-called “cluster randomization”). Compared with studies with individual randomization, studies with cluster randomization normally require (significantly) larger sample sizes and more complex methods for calculating sample size. Furthermore, cluster-randomized studies require more complex methods for statistical analysis. The consequence of the above is that a competent expert with respective special knowledge needs to be involved in all phases of cluster-randomized studies. Studies to evaluate new teaching methods need to make greater use of randomization in order to achieve scientifically convincing results. Therefore, in this article we describe the general principles of cluster randomization and how to implement these principles, and we also outline practical aspects of using cluster randomization in prospective, two-arm comparative educational research studies. PMID:28584874

  17. Geographical Clusters of Rape in the United States: 2000-2012

    PubMed Central

    Amin, Raid; Nabors, Nicole S.; Nelson, Arlene M.; Saqlain, Murshid; Kulldorff, Martin

    2016-01-01

    Background While rape is a very serious crime and public health problem, no spatial mapping has been attempted for rape on the national scale. This paper addresses the three research questions: (1) Are reported rape cases randomly distributed across the USA, after being adjusted for population density and age, or are there geographical clusters of reported rape cases? (2) Are the geographical clusters of reported rapes still present after adjusting for differences in poverty levels? (3) Are there geographical clusters where the proportion of reported rape cases that lead to an arrest is exceptionally low or exceptionally high? Methods We studied the geographical variation of reported rape events (2003-2012) and rape arrests (2000-2012) in the 48 contiguous states of the USA. The disease Surveillance software SaTScan™ with its spatial scan statistic is used to evaluate the spatial variation in rapes. The spatial scan statistic has been widely used as a geographical surveillance tool for diseases, and we used it to identify geographical areas with clusters of reported rape and clusters of arrest rates for rape. Results The spatial scan statistic was used to identify geographical areas with exceptionally high rates of reported rape. The analyses were adjusted for age, and in secondary analyses, for both age and poverty level. We also identified geographical areas with either a low or a high proportion of reported rapes leading to an arrest. Conclusions We have identified geographical areas with exceptionally high (low) rates of reported rape. 
The geographical problem areas identified are prime candidates for more intensive preventive counseling and criminal prosecution efforts by public health, social service, and law enforcement agencies. Geographical clusters of high rates of reported rape are also in need of expanded preventive measures, such as changing societal attitudes toward rape crimes, in addition to having the criminal justice system play an even larger role in preventing rape. PMID:28078318

  18. Using Single Free Sorting and Multivariate Exploratory Methods to Design a New Coffee Taster's Flavor Wheel.

    PubMed

    Spencer, Molly; Sage, Emma; Velez, Martin; Guinard, Jean-Xavier

    2016-12-01

    The original Coffee Taster's Flavor Wheel was developed by the Specialty Coffee Assn. of America over 20 y ago and needed an innovative revision. This study used a novel application of traditional sensory and statistical methods to reorganize the new coffee Sensory Lexicon, developed by World Coffee Research and Kansas State Univ., into scientifically valid clusters and levels and to prepare a new, updated flavor wheel. Seventy-two experts participated in a modified online rapid free sorting activity (no tasting) to sort flavor attributes of the lexicon. The data from all participants were compiled, and agglomerative hierarchical clustering was used to determine the clusters and levels of the flavor attributes, while multidimensional scaling was used to determine the positioning of the clusters around the Coffee Taster's Flavor Wheel. This resulted in a new flavor wheel for the coffee industry. © 2016 The Authors. Journal of Food Science published by Wiley Periodicals, Inc. on behalf of Institute of Food Technologists.
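    The sorting-task pipeline (co-occurrence counts from free sorts, hierarchical clustering of the resulting distances, multidimensional scaling for positioning) can be sketched as below. The toy attribute list and sorts are invented for illustration and are not the study's lexicon or data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.manifold import MDS

attributes = ["berry", "cherry", "chocolate", "nutty", "smoky", "rubber"]
# Each participant's free sort: groups of attribute indices (toy data, 4 participants)
sorts = [
    [[0, 1], [2, 3], [4, 5]],
    [[0, 1], [2, 3], [4], [5]],
    [[0, 1, 2], [3], [4, 5]],
    [[0, 1], [2], [3], [4, 5]],
]

# Co-occurrence matrix: how often two attributes were sorted into the same group
m = len(attributes)
co = np.zeros((m, m))
for sort in sorts:
    for group in sort:
        for a in group:
            for b in group:
                co[a, b] += 1

dist = len(sorts) - co  # attributes sorted together often are "close"
np.fill_diagonal(dist, 0)

# Hierarchical clustering on the condensed distance vector
Z = linkage(dist[np.triu_indices(m, 1)], method="average")
clusters = fcluster(Z, t=3, criterion="maxclust")

# MDS places attributes in 2-D, suggesting their arrangement around a wheel
xy = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(dist)
```

With real data, cutting the dendrogram at several heights yields the nested "levels" of a flavor wheel, and the angular order of the MDS coordinates suggests the arrangement of clusters around it.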

  19. Parallel and Scalable Clustering and Classification for Big Data in Geosciences

    NASA Astrophysics Data System (ADS)

    Riedel, M.

    2015-12-01

    Machine learning, data mining, and statistical computing are common techniques used to perform analysis in the earth sciences. This contribution will focus on two concrete and widely used data analytics methods suitable for analysing 'big data' in the context of geoscience use cases: clustering and classification. From the broad class of available clustering methods we focus on the density-based spatial clustering of applications with noise (DBSCAN) algorithm, which enables the identification of outliers or interesting anomalies. A new open source parallel and scalable DBSCAN implementation will be discussed in the light of a scientific use case that detects water mixing events in the Koljoefjords. The second technique we cover is classification, with a focus on the support vector machine (SVM) algorithm, one of the best out-of-the-box classification algorithms. A parallel and scalable SVM implementation will be discussed in the light of a scientific use case in the field of remote sensing with 52 different classes of land cover types.
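    DBSCAN's ability to flag outliers (label -1) while finding dense clusters is easy to demonstrate with scikit-learn's serial implementation; the parallel implementation discussed in the contribution is not reproduced here, so this is only a conceptual sketch on synthetic data with illustrative `eps` and `min_samples` settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(4)
# Two dense "events" plus sparse background noise (stand-in for sensor readings)
a = rng.normal([0, 0], 0.2, size=(100, 2))
b = rng.normal([3, 3], 0.2, size=(100, 2))
noise = rng.uniform(-2, 5, size=(20, 2))
X = np.vstack([a, b, noise])

# eps: neighborhood radius; min_samples: density threshold for a core point
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_outliers = int((db.labels_ == -1).sum())
```

Unlike k-means, DBSCAN needs no preset number of clusters and leaves low-density points unassigned, which is exactly the property that makes it useful for anomaly detection in time series of sensor readings.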

  20. MODEL-FREE MULTI-PROBE LENSING RECONSTRUCTION OF CLUSTER MASS PROFILES

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Umetsu, Keiichi

    2013-05-20

    Lens magnification by galaxy clusters induces characteristic spatial variations in the number counts of background sources, amplifying their observed fluxes and expanding the area of sky, the net effect of which, known as magnification bias, depends on the intrinsic faint-end slope of the source luminosity function. The bias is strongly negative for red galaxies, dominated by the geometric area distortion, whereas it is mildly positive for blue galaxies, enhancing the blue counts toward the cluster center. We generalize the Bayesian approach of Umetsu et al. for reconstructing projected cluster mass profiles, by incorporating multiple populations of background sources for magnification-bias measurements and combining them with complementary lens-distortion measurements, effectively breaking the mass-sheet degeneracy and improving the statistical precision of cluster mass measurements. The approach can be further extended to include strong-lensing projected mass estimates, thus allowing for non-parametric absolute mass determinations in both the weak and strong regimes. We apply this method to our recent CLASH lensing measurements of MACS J1206.2-0847, and demonstrate how combining multi-probe lensing constraints can improve the reconstruction of cluster mass profiles. This method will also be useful for a stacked lensing analysis, combining all lensing-related effects in the cluster regime, for a definitive determination of the averaged mass profile.

  1. Detecting subtle hydrochemical anomalies with multivariate statistics: an example from homogeneous groundwaters in the Great Artesian Basin, Australia

    NASA Astrophysics Data System (ADS)

    O'Shea, Bethany; Jankowski, Jerzy

    2006-12-01

    The major ion composition of Great Artesian Basin groundwater in the lower Namoi River valley is relatively homogeneous in chemical composition. Traditional graphical techniques have been combined with multivariate statistical methods to determine whether subtle differences in the chemical composition of these waters can be delineated. Hierarchical cluster analysis and principal components analysis were successful in delineating minor variations within the groundwaters of the study area that were not visually identified in the graphical techniques applied. Hydrochemical interpretation allowed geochemical processes to be identified in each statistically defined water type and illustrated how these groundwaters differ from one another. Three main geochemical processes were identified in the groundwaters: ion exchange, precipitation, and mixing between waters from different sources. Both statistical methods delineated an anomalous sample suspected of being influenced by magmatic CO2 input. The use of statistical methods to complement traditional graphical techniques for waters appearing homogeneous is emphasized for all investigations of this type.

  2. ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition.

    PubMed

    Koslicki, David; Chatterjee, Saikat; Shahrivar, Damon; Walker, Alan W; Francis, Suzanna C; Fraser, Louise J; Vehkaperä, Mikko; Lan, Yueheng; Corander, Jukka

    2015-01-01

    Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging. There has been a recent surge of interest in using compressed sensing inspired and convex-optimization based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order k-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing approach where we use a standard K-means clustering algorithm that partitions a large set of reads into subsets with reasonable computational cost to provide several vectors of first order statistics instead of only single statistical summarization in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity. An open source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.
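    The ARK pre-processing step, partitioning reads by K-means on their k-mer frequency vectors and summarizing each partition by its mean vector, can be sketched as follows. The reference implementations are in Julia and Matlab (links above); this Python version with toy reads is an illustrative assumption, and the downstream community-composition estimation step is omitted.

```python
import numpy as np
from itertools import product
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
K = 3  # k-mer order
kmers = ["".join(p) for p in product("ACGT", repeat=K)]
index = {km: i for i, km in enumerate(kmers)}

def kmer_freq(read):
    """Normalized k-mer frequency vector of a read."""
    v = np.zeros(len(kmers))
    for i in range(len(read) - K + 1):
        v[index[read[i:i + K]]] += 1
    return v / v.sum()

def make_read(p, length=80):
    """Toy read with base composition p over A, C, G, T."""
    return "".join(rng.choice(list("ACGT"), p=p, size=length))

# Toy reads from two "taxa" with different base composition
reads = [make_read([.4, .1, .1, .4]) for _ in range(50)] + \
        [make_read([.1, .4, .4, .1]) for _ in range(50)]
F = np.array([kmer_freq(r) for r in reads])

# ARK-style pre-processing: partition the reads, then summarize each partition
# by its mean k-mer frequency vector (several summaries instead of one)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(F)
summaries = np.array([F[km.labels_ == c].mean(axis=0) for c in range(4)])
```

Each row of `summaries` would then be matched against the taxonomically structured reference database, and the per-cluster estimates combined (weighted by cluster size) into the final composition estimate for the sample.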

  3. University and student segmentation: multilevel latent-class analysis of students' attitudes towards research methods and statistics.

    PubMed

    Mutz, Rüdiger; Daniel, Hans-Dieter

    2013-06-01

    It is often claimed that psychology students' attitudes towards research methods and statistics affect course enrollment, persistence, achievement, and course climate. However, inter-institutional variability has been widely neglected in research on students' attitudes towards research methods and statistics, although it is important for didactic purposes (heterogeneity of the student population). The paper presents a scale based on findings of the social psychology of attitudes (a polar and emotion-based concept) in conjunction with a method for capturing beginning university students' attitudes towards research methods and statistics and identifying the proportion of students having positive attitudes at the institutional level. The study was based on a re-analysis of a nationwide survey in Germany in August 2000 of all psychology students who enrolled in fall 1999/2000 (N = 1,490) at N = 44 universities. Using multilevel latent-class analysis (MLLCA), the aim was to group students into different student attitude types and at the same time to obtain university segments based on the incidences of the different student attitude types. Four student latent clusters were found that can be ranked on a bipolar attitude dimension. Membership in a cluster was predicted by age, grade point average (GPA) on the school-leaving exam, and personality traits. In addition, two university segments were found: universities with an average proportion of students with positive attitudes and universities with a high proportion of students with positive attitudes (excellent segment). As psychology students make up a very heterogeneous group, the use of multiple learning activities as opposed to the classical lecture course is required. © 2011 The British Psychological Society.

  4. Applying spatial analysis tools in public health: an example using SaTScan to detect geographic targets for colorectal cancer screening interventions.

    PubMed

    Sherman, Recinda L; Henry, Kevin A; Tannenbaum, Stacey L; Feaster, Daniel J; Kobetz, Erin; Lee, David J

    2014-03-20

    Epidemiologists are gradually incorporating spatial analysis into health-related research as geocoded cases of disease become widely available and health-focused geospatial computer applications are developed. One health-focused application of spatial analysis is cluster detection. Using cluster detection to identify geographic areas with high-risk populations and then screening those populations for disease can improve cancer control. SaTScan is a free cluster-detection software application used by epidemiologists around the world to describe spatial clusters of infectious and chronic disease, as well as disease vectors and risk factors. The objectives of this article are to describe how spatial analysis can be used in cancer control to detect geographic areas in need of colorectal cancer screening intervention, identify issues commonly encountered by SaTScan users, detail how to select the appropriate methods for using SaTScan, and explain how method selection can affect results. As an example, we used various methods to detect areas in Florida where the population is at high risk for late-stage diagnosis of colorectal cancer. We found that much of our analysis was underpowered and that no single method detected all clusters of statistical or public health significance. However, all methods detected 1 area as high risk; this area is potentially a priority area for a screening intervention. Cluster detection can be incorporated into routine public health operations, but the challenge is to identify areas in which the burden of disease can be alleviated through public health intervention. Reliance on SaTScan's default settings does not always produce pertinent results.
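    SaTScan itself is a standalone application, but the core idea behind its purely spatial Poisson model, scanning circular zones and scoring each with Kulldorff's log-likelihood ratio, can be sketched directly. The toy grid data, centre and radius choices below are invented, and the Monte Carlo significance testing that SaTScan adds on top is omitted.

```python
import numpy as np

rng = np.random.default_rng(6)
# Toy study region: population and observed cases per cell, hot zone in one corner
n = 8
pop = np.full((n, n), 1000)
risk = np.full((n, n), 0.01)
risk[:3, :3] = 0.03  # elevated-risk zone
cases = rng.poisson(pop * risk)

coords = np.array([(i, j) for i in range(n) for j in range(n)], float)
c = cases.ravel().astype(float)
p = pop.ravel().astype(float)
C, P = c.sum(), p.sum()

def llr(c_in, e_in):
    """Kulldorff Poisson log-likelihood ratio for a candidate zone (high-rate only)."""
    c_out, e_out = C - c_in, C - e_in
    if c_in <= e_in:
        return 0.0
    return c_in * np.log(c_in / e_in) + c_out * np.log(c_out / e_out)

# Scan circles centred on every cell over a range of radii
best = (0.0, None)
for centre in coords:
    d = np.linalg.norm(coords - centre, axis=1)
    for r in (1.0, 1.5, 2.0, 2.5):
        inside = d <= r
        c_in = c[inside].sum()
        e_in = C * p[inside].sum() / P  # expected count under the null
        score = llr(c_in, e_in)
        if score > best[0]:
            best = (score, centre)

best_score, best_centre = best
```

In SaTScan the same scan is repeated on many Monte Carlo replicates of the null to convert the maximum LLR into a p-value, and settings such as the maximum window size correspond to the radius grid used here, which is why the defaults can materially change which clusters are reported.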

  5. Identification of Reliable Components in Multivariate Curve Resolution-Alternating Least Squares (MCR-ALS): a Data-Driven Approach across Metabolic Processes.

    PubMed

    Motegi, Hiromi; Tsuboi, Yuuri; Saga, Ayako; Kagami, Tomoko; Inoue, Maki; Toki, Hideaki; Minowa, Osamu; Noda, Tetsuo; Kikuchi, Jun

    2015-11-04

    There is an increasing need to use multivariate statistical methods for understanding biological functions, identifying the mechanisms of diseases, and exploring biomarkers. In addition to classical analyses such as hierarchical cluster analysis, principal component analysis, and partial least squares discriminant analysis, various multivariate strategies, including independent component analysis, non-negative matrix factorization, and multivariate curve resolution, have recently been proposed. However, determining the number of components is problematic. Despite the proposal of several different methods, no satisfactory approach has yet been reported. To resolve this problem, we implemented a new idea: classifying a component as "reliable" or "unreliable" based on the reproducibility of its appearance, regardless of the number of components in the calculation. Using the clustering method for classification, we applied this idea to multivariate curve resolution-alternating least squares (MCR-ALS). Comparisons between conventional and modified methods applied to proton nuclear magnetic resonance ((1)H-NMR) spectral datasets derived from known standard mixtures and biological mixtures (urine and feces of mice) revealed that more plausible results are obtained by the modified method. In particular, clusters containing little information were detected with reliability. This strategy, named "cluster-aided MCR-ALS," will facilitate the attainment of more reliable results in the metabolomics datasets.

  6. Quantifying the impact of fixed effects modeling of clusters in multiple imputation for cluster randomized trials

    PubMed Central

    Andridge, Rebecca R.

    2011-01-01

    In cluster randomized trials (CRTs), identifiable clusters rather than individuals are randomized to study groups. Resulting data often consist of a small number of clusters with correlated observations within a treatment group. Missing data often present a problem in the analysis of such trials, and multiple imputation (MI) has been used to create complete data sets, enabling subsequent analysis with well-established analysis methods for CRTs. We discuss strategies for accounting for clustering when multiply imputing a missing continuous outcome, focusing on estimation of the variance of group means as used in an adjusted t-test or ANOVA. These analysis procedures are congenial to (can be derived from) a mixed effects imputation model; however, this imputation procedure is not yet available in commercial statistical software. An alternative approach that is readily available and has been used in recent studies is to include fixed effects for cluster, but the impact of using this convenient method has not been studied. We show that under this imputation model the MI variance estimator is positively biased and that smaller ICCs lead to larger overestimation of the MI variance. Analytical expressions for the bias of the variance estimator are derived in the case of data missing completely at random (MCAR), and cases in which data are missing at random (MAR) are illustrated through simulation. Finally, various imputation methods are applied to data from the Detroit Middle School Asthma Project, a recent school-based CRT, and differences in inference are compared. PMID:21259309

  7. Review of Statistical Learning Methods in Integrated Omics Studies (An Integrated Information Science).

    PubMed

    Zeng, Irene Sui Lan; Lumley, Thomas

    2018-01-01

    Integrated omics is becoming a new channel for investigating the complex molecular system in modern biological science and sets a foundation for systematic learning for precision medicine. The statistical/machine learning methods that have emerged in the past decade for integrated omics are not only innovative but also multidisciplinary, with integrated knowledge in biology, medicine, statistics, machine learning, and artificial intelligence. Here, we review the nontrivial classes of learning methods from the statistical aspects and streamline these learning methods within the statistical learning framework. The intriguing findings from the review are that the methods used are generalizable to other disciplines with complex systematic structure, and that integrated omics is part of an integrated information science which has collated and integrated different types of information for inferences and decision making. We review the statistical learning methods of exploratory and supervised learning from 42 publications. We also discuss the strengths and limitations of the extended principal component analysis, cluster analysis, network analysis, and regression methods. Statistical techniques such as penalization for sparsity induction when there are fewer observations than features, and the use of Bayesian approaches when there is prior knowledge to be integrated, are also included in the commentary. For completeness of the review, a table of currently available software and packages from 23 publications for omics is summarized in the appendix.

  8. Marginal regression approach for additive hazards models with clustered current status data.

    PubMed

    Su, Pei-Fang; Chi, Yunchan

    2014-01-15

    Current status data arise naturally from tumorigenicity experiments, epidemiology studies, biomedicine, econometrics, and demographic and sociological studies. Moreover, clustered current status data may occur with animals from the same litter in tumorigenicity experiments or with subjects from the same family in epidemiology studies. Because the only information extracted from current status data is whether the survival times are before or after the monitoring or censoring times, the nonparametric maximum likelihood estimator of the survival function converges at a rate of n^(1/3) to a complicated limiting distribution. Hence, semiparametric regression models such as the additive hazards model have been extended for independent current status data to derive test statistics, whose distributions converge at a rate of n^(1/2), for testing the regression parameters. However, a straightforward application of these statistical methods to clustered current status data is not appropriate because intracluster correlation needs to be taken into account. Therefore, this paper proposes two estimating functions for estimating the parameters in the additive hazards model for clustered current status data. The comparative results from simulation studies are presented, and the application of the proposed estimating functions to one real data set is illustrated. Copyright © 2013 John Wiley & Sons, Ltd.

  9. A Poisson nonnegative matrix factorization method with parameter subspace clustering constraint for endmember extraction in hyperspectral imagery

    NASA Astrophysics Data System (ADS)

    Sun, Weiwei; Ma, Jun; Yang, Gang; Du, Bo; Zhang, Liangpei

    2017-06-01

    A new Bayesian method named Poisson Nonnegative Matrix Factorization with Parameter Subspace Clustering Constraint (PNMF-PSCC) has been presented to extract endmembers from Hyperspectral Imagery (HSI). First, the method integrates the linear spectral mixture model with the Bayesian framework and formulates endmember extraction as a Bayesian inference problem. Second, the Parameter Subspace Clustering Constraint (PSCC) is incorporated into the statistical program to consider the clustering of all pixels in the parameter subspace. The PSCC enlarges differences among ground objects and helps find endmembers with smaller spectrum divergences. Meanwhile, the PNMF-PSCC method utilizes the Poisson distribution as the prior knowledge of spectral signals to better explain the quantum nature of light in the imaging spectrometer. Third, the optimization problem of PNMF-PSCC is formulated as maximizing the joint density via the Maximum A Posteriori (MAP) estimator. The program is finally solved by iteratively optimizing two sub-problems via the Alternating Direction Method of Multipliers (ADMM) framework and the FURTHESTSUM initialization scheme. Five state-of-the-art methods are implemented for comparison with the performance of PNMF-PSCC on both synthetic and real HSI datasets. Experimental results show that PNMF-PSCC outperforms all five methods in Spectral Angle Distance (SAD) and Root-Mean-Square Error (RMSE), and in particular it identifies good endmembers for ground objects with smaller spectrum divergences.

  10. An empirical comparison of methods for analyzing correlated data from a discrete choice survey to elicit patient preference for colorectal cancer screening

    PubMed Central

    2012-01-01

    Background A discrete choice experiment (DCE) is a preference survey which asks participants to make a choice among product portfolios comparing the key product characteristics by performing several choice tasks. Analyzing DCE data needs to account for within-participant correlation because choices from the same participant are likely to be similar. In this study, we empirically compared some commonly-used statistical methods for analyzing DCE data while accounting for within-participant correlation based on a survey of patient preference for colorectal cancer (CRC) screening tests conducted in Hamilton, Ontario, Canada in 2002. Methods A two-stage DCE design was used to investigate the impact of six attributes on participants' preferences for CRC screening test and willingness to undertake the test. We compared six models for clustered binary outcomes (logistic and probit regressions using cluster-robust standard error (SE), random-effects and generalized estimating equation approaches) and three models for clustered nominal outcomes (multinomial logistic and probit regressions with cluster-robust SE and random-effects multinomial logistic model). We also fitted a bivariate probit model with cluster-robust SE treating the choices from two stages as two correlated binary outcomes. The rank of relative importance between attributes and the estimates of β coefficient within attributes were used to assess the model robustness. Results In total 468 participants with each completing 10 choices were analyzed. Similar results were reported for the rank of relative importance and β coefficients across models for stage-one data on evaluating participants' preferences for the test. The six attributes ranked from high to low as follows: cost, specificity, process, sensitivity, preparation and pain. However, the results differed across models for stage-two data on evaluating participants' willingness to undertake the tests. 
Little within-patient correlation (ICC ≈ 0) was found in stage-one data, but substantial within-patient correlation existed (ICC = 0.659) in stage-two data. Conclusions When a small clustering effect was present in the DCE data, results remained robust across statistical models. However, results varied when a larger clustering effect was present. Therefore, it is important to assess the robustness of the estimates via sensitivity analysis using different models for analyzing clustered data from DCE studies. PMID:22348526
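
    ICC values of the kind quoted above can be estimated with the standard one-way ANOVA estimator; the sketch below uses synthetic responses in place of the survey data, and the study's own estimator may differ.

```python
import numpy as np

def anova_icc(groups):
    """One-way ANOVA estimator of the intraclass correlation coefficient.

    groups: list of 1-D arrays, one array of observations per cluster.
    """
    k = len(groups)
    n = np.array([len(g) for g in groups])
    N = n.sum()
    grand = np.concatenate(groups).mean()
    msb = sum(ni * (g.mean() - grand) ** 2 for ni, g in zip(n, groups)) / (k - 1)
    msw = sum(((g - g.mean()) ** 2).sum() for g in groups) / (N - k)
    n0 = (N - (n ** 2).sum() / N) / (k - 1)  # effective average cluster size
    return (msb - msw) / (msb + (n0 - 1) * msw)

rng = np.random.default_rng(1)
# 30 clusters of 10 responses sharing a strong cluster effect
clustered = [rng.normal(mu, 0.5, 10) for mu in rng.normal(0.0, 1.0, 30)]
icc_hat = anova_icc(clustered)
print(f"ICC = {icc_hat:.2f}")  # true ICC here is 1.0 / (1.0 + 0.25) = 0.8
```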

  11. Classifying Higher Education Institutions in Korea: A Performance-Based Approach

    ERIC Educational Resources Information Center

    Shin, Jung Cheol

    2009-01-01

    The purpose of this study was to classify higher education institutions according to institutional performance rather than predetermined benchmarks. Institutional performance was defined as research performance and classified using Hierarchical Cluster Analysis, a statistical method that classifies objects according to specified classification…
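
    A hierarchical cluster analysis of this kind can be sketched with SciPy; the performance indicators and group structure below are invented for illustration, not the study's actual data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)

# Hypothetical research-performance profile per institution:
# [publications per faculty, citations per paper, external grant income]
research_univs = rng.normal([8.0, 12.0, 5.0], 1.0, size=(10, 3))
teaching_univs = rng.normal([2.0, 4.0, 1.0], 1.0, size=(10, 3))
X = np.vstack([research_univs, teaching_univs])

# Ward linkage groups institutions by similarity of performance profiles,
# with no predetermined benchmarks
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```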

  12. Agro-ecoregionalization of Iowa using multivariate geographical clustering

    Treesearch

    Carol L. Williams; William W. Hargrove; Matt Leibman; David E. James

    2008-01-01

    Agro-ecoregionalization is categorization of landscapes for use in crop suitability analysis, strategic agroeconomic development, risk analysis, and other purposes. Past agro-ecoregionalizations have been subjective, expert opinion driven, crop specific, and unsuitable for statistical extrapolation. Use of quantitative analytical methods provides an opportunity for...

  13. The Optical Gravitational Lensing Experiment

    NASA Technical Reports Server (NTRS)

    Udalski, A.; Szymanski, M.; Kaluzny, J.; Kubiak, M.; Mateo, Mario

    1992-01-01

    The technical features are described of the Optical Gravitational Lensing Experiment, which aims to detect a statistically significant number of microlensing events toward the Galactic bulge. Clusters of galaxies observed during the 1992 season are listed and discussed and the reduction methods are described. Future plans are addressed.

  14. Towards Development of Clustering Applications for Large-Scale Comparative Genotyping and Kinship Analysis Using Y-Short Tandem Repeats.

    PubMed

    Seman, Ali; Sapawi, Azizian Mohd; Salleh, Mohd Zaki

    2015-06-01

    Y-chromosome short tandem repeats (Y-STRs) are genetic markers with practical applications in human identification. However, where mass identification is required (e.g., in the aftermath of disasters with significant fatalities), the efficiency of the process could be improved with new statistical approaches. Clustering applications are relatively new tools for large-scale comparative genotyping, and the k-Approximate Modal Haplotype (k-AMH), an efficient algorithm for clustering large-scale Y-STR data, represents a promising method for developing these tools. In this study we improved the k-AMH and produced three new algorithms: the Nk-AMH I (including a new initial cluster center selection), the Nk-AMH II (including a new dominant weighting value), and the Nk-AMH III (combining I and II). The Nk-AMH III was the superior algorithm, with mean clustering accuracy that increased in four out of six datasets and remained at 100% in the other two. Additionally, the Nk-AMH III achieved a 2% higher overall mean clustering accuracy score than the k-AMH, as well as optimal accuracy for all datasets (0.84-1.00). With inclusion of the two new methods, the Nk-AMH III produced an optimal solution for clustering Y-STR data; thus, the algorithm has potential for further development towards fully automatic clustering of any large-scale genotypic data.

  15. Towards accurate modelling of galaxy clustering on small scales: testing the standard ΛCDM + halo model

    NASA Astrophysics Data System (ADS)

    Sinha, Manodeep; Berlind, Andreas A.; McBride, Cameron K.; Scoccimarro, Roman; Piscionere, Jennifer A.; Wibking, Benjamin D.

    2018-07-01

    Interpreting the small-scale clustering of galaxies with halo models can elucidate the connection between galaxies and dark matter haloes. Unfortunately, the modelling is typically not sufficiently accurate for ruling out models statistically. It is thus difficult to use the information encoded in small scales to test cosmological models or probe subtle features of the galaxy-halo connection. In this paper, we attempt to push halo modelling into the `accurate' regime with a fully numerical mock-based methodology and careful treatment of statistical and systematic errors. With our forward-modelling approach, we can incorporate clustering statistics beyond the traditional two-point statistics. We use this modelling methodology to test the standard Λ cold dark matter (ΛCDM) + halo model against the clustering of Sloan Digital Sky Survey (SDSS) seventh data release (DR7) galaxies. Specifically, we use the projected correlation function, group multiplicity function, and galaxy number density as constraints. We find that while the model fits each statistic separately, it struggles to fit them simultaneously. Adding group statistics leads to a more stringent test of the model and significantly tighter constraints on model parameters. We explore the impact of varying the adopted halo definition and cosmological model and find that changing the cosmology makes a significant difference. The most successful model we tried (Planck cosmology with Mvir haloes) matches the clustering of low-luminosity galaxies, but exhibits a 2.3σ tension with the clustering of luminous galaxies, thus providing evidence that the `standard' halo model needs to be extended. This work opens the door to adding interesting freedom to the halo model and including additional clustering statistics as constraints.

  16. Thermodynamics and proton activities of protic ionic liquids with quantum cluster equilibrium theory

    NASA Astrophysics Data System (ADS)

    Ingenmey, Johannes; von Domaros, Michael; Perlt, Eva; Verevkin, Sergey P.; Kirchner, Barbara

    2018-05-01

    We applied the binary Quantum Cluster Equilibrium (bQCE) method to a number of alkylammonium-based protic ionic liquids in order to predict boiling points, vaporization enthalpies, and proton activities. The theory combines statistical thermodynamics of van-der-Waals-type clusters with ab initio quantum chemistry and yields the partition functions (and associated thermodynamic potentials) of binary mixtures over a wide range of thermodynamic phase points. Unlike conventional cluster approaches that are limited to the prediction of thermodynamic properties, dissociation reactions can be effortlessly included into the bQCE formalism, giving access to ionicities, as well. The method is open to quantum chemical methods at any level of theory, but combination with low-cost composite density functional theory methods and the proposed systematic approach to generate cluster sets provides a computationally inexpensive and mostly parameter-free way to predict such properties at good-to-excellent accuracy. Boiling points can be predicted within an accuracy of 50 K, reaching excellent accuracy for ethylammonium nitrate. Vaporization enthalpies are predicted within an accuracy of 20 kJ mol⁻¹ and can be systematically interpreted on a molecular level. We present the first theoretical approach to predict proton activities in protic ionic liquids, with results fitting well into the experimentally observed correlation. Furthermore, enthalpies of vaporization were measured experimentally for some alkylammonium nitrates and an excellent linear correlation with vaporization enthalpies of their respective parent amines is observed.

  17. The Wilcoxon signed rank test for paired comparisons of clustered data.

    PubMed

    Rosner, Bernard; Glynn, Robert J; Lee, Mei-Ling T

    2006-03-01

    The Wilcoxon signed rank test is a frequently used nonparametric test for paired data (e.g., consisting of pre- and posttreatment measurements) based on independent units of analysis. This test cannot be used for paired comparisons arising from clustered data (e.g., if paired comparisons are available for each of two eyes of an individual). To incorporate clustering, a generalization of the randomization test formulation for the signed rank test is proposed, where the unit of randomization is at the cluster level (e.g., person), while the individual paired units of analysis are at the subunit within cluster level (e.g., eye within person). An adjusted variance estimate of the signed rank test statistic is then derived, which can be used for either balanced (same number of subunits per cluster) or unbalanced (different number of subunits per cluster) data, with an exchangeable correlation structure, with or without tied values. The resulting test statistic is shown to be asymptotically normal as the number of clusters becomes large, if the cluster size is bounded. Simulation studies are performed based on simulating correlated ranked data from a signed log-normal distribution. These studies indicate appropriate type I error for data sets with ≥20 clusters and a superior power profile compared with either the ordinary signed rank test based on the average cluster difference score or the multivariate signed rank test of Puri and Sen. Finally, the methods are illustrated with two data sets: (i) an ophthalmologic data set involving a comparison of electroretinogram (ERG) data in retinitis pigmentosa (RP) patients before and after undergoing an experimental surgical procedure, and (ii) a nutritional data set based on a randomized prospective study of nutritional supplements in RP patients where vitamin E intake outside of study capsules is compared before and after randomization to monitor compliance with nutritional protocols.
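
    The simple comparator the abstract mentions, an ordinary signed rank test on the average cluster difference score, is easy to sketch (simulated data; this is not the adjusted-variance statistic the paper derives).

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(7)

# Paired pre/post measurements for both eyes (subunits) of each person (cluster)
n_people, n_eyes = 25, 2
person_effect = rng.normal(0, 1, (n_people, 1))          # induces within-person correlation
pre = person_effect + rng.normal(0, 1, (n_people, n_eyes))
post = pre + 0.8 + rng.normal(0, 1, (n_people, n_eyes))  # true shift of 0.8

# Ordinary signed rank test on the average within-cluster difference score:
# one difference per person, restoring independence across test units
cluster_diff = (post - pre).mean(axis=1)
stat, p = wilcoxon(cluster_diff)
print(f"W = {stat}, p = {p:.4f}")
```

    Averaging within clusters is valid but discards subunit-level information, which is why the paper's adjusted-variance test achieves better power.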

  18. Multivariate Statistical Analysis of Cigarette Design Feature Influence on ISO TNCO Yields.

    PubMed

    Agnew-Heard, Kimberly A; Lancaster, Vicki A; Bravo, Roberto; Watson, Clifford; Walters, Matthew J; Holman, Matthew R

    2016-06-20

    The aim of this study is to explore how differences in cigarette physical design parameters influence tar, nicotine, and carbon monoxide (TNCO) yields in mainstream smoke (MSS) using the International Organization of Standardization (ISO) smoking regimen. Standardized smoking methods were used to evaluate 50 U.S. domestic brand cigarettes and a reference cigarette representing a range of TNCO yields in MSS collected from linear smoking machines using a nonintense smoking regimen. Multivariate statistical methods were used to form clusters of cigarettes based on their ISO TNCO yields and then to explore the relationship between the ISO generated TNCO yields and the nine cigarette physical design parameters between and within each cluster simultaneously. The ISO generated TNCO yields in MSS are 1.1-17.0 mg tar/cigarette, 0.1-2.2 mg nicotine/cigarette, and 1.6-17.3 mg CO/cigarette. Cluster analysis divided the 51 cigarettes into five discrete clusters based on their ISO TNCO yields. No one physical parameter dominated across all clusters. Predicting ISO machine generated TNCO yields based on these nine physical design parameters is complex due to the correlation among and between the nine physical design parameters and TNCO yields. From these analyses, it is estimated that approximately 20% of the variability in the ISO generated TNCO yields comes from other parameters (e.g., filter material, filter type, inclusion of expanded or reconstituted tobacco, and tobacco blend composition, along with differences in tobacco leaf origin and stalk positions and added ingredients). A future article will examine the influence of these physical design parameters on TNCO yields under a Canadian Intense (CI) smoking regimen. Together, these papers will provide a more robust picture of the design features that contribute to TNCO exposure across the range of real world smoking patterns.

  19. Applications of Stochastic Analyses for Collaborative Learning and Cognitive Assessment

    DTIC Science & Technology

    2007-04-01

    models (Visser, Maartje, Raijmakers, & Molenaar, 2002). The second part of this paper illustrates two applications of the methods described in the...clustering three-way data sets. Computational Statistics and Data Analysis, 51 (11), 5368–5376. Visser, I., Maartje, E., Raijmakers, E. J., & Molenaar

  20. DAFi: A directed recursive data filtering and clustering approach for improving and interpreting data clustering identification of cell populations from polychromatic flow cytometry data.

    PubMed

    Lee, Alexandra J; Chang, Ivan; Burel, Julie G; Lindestam Arlehamn, Cecilia S; Mandava, Aishwarya; Weiskopf, Daniela; Peters, Bjoern; Sette, Alessandro; Scheuermann, Richard H; Qian, Yu

    2018-04-17

    Computational methods for identification of cell populations from polychromatic flow cytometry data are changing the paradigm of cytometry bioinformatics. Data clustering is the most common computational approach to unsupervised identification of cell populations from multidimensional cytometry data. However, interpretation of the identified data clusters is labor-intensive. Certain types of user-defined cell populations are also difficult to identify by fully automated data clustering analysis. Both are roadblocks to a cytometry lab adopting the data clustering approach for routine cell population identification. We found that combining recursive data filtering and clustering with constraints converted from the user's manual gating strategy can effectively address these two issues. We named this new approach DAFi: Directed Automated Filtering and Identification of cell populations. The design of DAFi preserves the data-driven characteristics of unsupervised clustering for identifying novel cell subsets, but also makes the results interpretable to experimental scientists by mapping and merging the multidimensional data clusters into the user-defined two-dimensional gating hierarchy. The recursive data filtering process in DAFi helped identify small data clusters which are otherwise difficult to resolve by a single run of the data clustering method due to the statistical interference of the irrelevant major clusters. Our experimental results showed that the proportions of the cell populations identified by DAFi, while consistent with those from expert centralized manual gating, have smaller technical variances across samples than those from individual manual gating analysis and from nonrecursive data clustering analysis. Compared with manual gating segregation, DAFi-identified cell populations avoided abrupt cut-offs at the boundaries.
DAFi has been implemented to be used with multiple data clustering methods including K-means, FLOCK, FlowSOM, and the ClusterR package. For cell population identification, DAFi supports multiple options including clustering, bisecting, slope-based gating, and reversed filtering to meet various autogating needs from different scientific use cases. © 2018 International Society for Advancement of Cytometry.
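
    The filter-then-recluster idea can be illustrated with a toy two-step analysis. The gate coordinates, populations, and the minimal k-means below are all invented for illustration; they are not DAFi's implementation.

```python
import numpy as np

rng = np.random.default_rng(3)

def kmeans(X, k, iters=50):
    """Minimal k-means for illustration (DAFi supports K-means, FLOCK, etc.)."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return labels, centers

# Synthetic 2-D "cytometry" events: a large irrelevant population plus two
# small populations sitting inside a user-defined gate
major = rng.normal([0.0, 0.0], 1.0, (2000, 2))
small_a = rng.normal([6.0, 6.0], 0.3, (100, 2))
small_b = rng.normal([8.0, 6.0], 0.3, (100, 2))
X = np.vstack([major, small_a, small_b])

# Step 1: filter events through the (hypothetical) user-defined gate,
# removing the statistical interference of the irrelevant major population
gated = X[(X[:, 0] > 4.0) & (X[:, 1] > 4.0)]

# Step 2: cluster only the filtered events; the two small populations are
# now resolvable, whereas a single global run tends to merge them
labels, centers = kmeans(gated, 2)
print(np.round(centers[np.argsort(centers[:, 0])], 1))
```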

  1. Molecular Subtyping to Detect Human Listeriosis Clusters

    PubMed Central

    Sauders, Brian D.; Fortes, Esther D.; Morse, Dale L.; Dumas, Nellie; Kiehlbauch, Julia A.; Schukken, Ynte; Hibbs, Jonathan R.

    2003-01-01

    We analyzed the diversity (Simpson’s Index, D) and distribution of Listeria monocytogenes in human listeriosis cases in New York State (excluding New York City) from November 1996 to June 2000 by using automated ribotyping and pulsed-field gel electrophoresis (PFGE). We applied a scan statistic (p<0.05) to detect listeriosis clusters caused by a specific Listeria monocytogenes subtype. Of 131 human isolates, 34 ribotypes (D=0.923) and 74 PFGE types (D=0.975) were found. Nine clusters (31% of cases) were identified by ribotype or PFGE; five clusters (18% of cases) were identified by both methods. Two of the nine clusters identified (13% of cases) corresponded to investigated multistate listeriosis outbreaks. While most human listeriosis cases are considered sporadic, highly discriminatory molecular subtyping approaches thus indicated that 13% to 31% of cases reported in New York State may represent single-source clusters. Listeriosis control and reduction efforts should include broad-based subtyping of human isolates and consider that a large number of cases may represent outbreaks. PMID:12781006
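
    Simpson's Index of Diversity (D), used above to compare the discriminatory power of ribotyping and PFGE, is straightforward to compute; the subtype frequency distribution below is invented for illustration, not the study's actual counts.

```python
def simpson_diversity(counts):
    """Simpson's Index of Diversity, D = 1 - sum n_i(n_i-1) / (N(N-1)).

    D is the probability that two isolates drawn without replacement
    belong to different subtypes; higher D = more discriminatory typing.
    """
    N = sum(counts)
    return 1 - sum(n * (n - 1) for n in counts) / (N * (N - 1))

# Hypothetical subtype frequency distribution over 131 isolates
counts = [20, 15, 12, 10] + [8] * 5 + [4] * 6 + [1] * 10
print(f"D = {simpson_diversity(counts):.3f}")
```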

  2. [Spatial and temporal clustering characteristics of typhoid and paratyphoid fever and its change pattern in 3 provinces in southwestern China, 2001-2012].

    PubMed

    Wang, L X; Yang, B; Yan, M Y; Tang, Y Q; Liu, Z C; Wang, R Q; Li, S; Ma, L; Kan, B

    2017-11-10

    Objective: To analyze the spatial and temporal clustering characteristics of typhoid and paratyphoid fever and its change pattern in Yunnan, Guizhou and Guangxi provinces in southwestern China in recent years. Methods: The incidence data of typhoid and paratyphoid fever cases at county level in the 3 provinces during 2001-2012 were collected from the China Information System for Disease Control and Prevention and analyzed by the methods of descriptive epidemiology and geographic informatics. A map showing the spatial and temporal clustering characteristics of typhoid and paratyphoid fever cases in the three provinces was drawn. SaTScan scan statistics were used to identify the typhoid and paratyphoid fever clustering areas of the three provinces in each year from 2001 to 2012. Results: During the study period, the reported cases of typhoid and paratyphoid fever declined year by year. The reported incidence decreased from 30.15 per 100 000 in 2001 to 10.83 per 100 000 in 2006 (annual incidence 21.12 per 100 000), while during 2007-2012 the incidence became stable, ranging from 4.75 per 100 000 to 6.83 per 100 000 (annual incidence 5.73 per 100 000). The seasonal variation of the incidence was consistent across the three provinces, with the majority of cases occurring in summer and autumn. The spatial and temporal clustering of typhoid and paratyphoid fever was demonstrated by the incidence map. Most high-incidence counties were located in a zonal area extending from Yuxi of Yunnan to Guiyang of Guizhou, but were concentrated in Guilin in Guangxi. Temporal and spatial scan statistics identified a positional shift of the class I clustering area from Guizhou to Yunnan. The class I clustering area was located around the central and western areas (Zunyi and Anshun) of Guizhou during 2001-2003, and moved to the central area of Yunnan during 2004-2012.
Conclusion: Spatial and temporal clustering of typhoid and paratyphoid fever existed in the endemic areas of southwestern China, and the clustering area covered a zone connecting the central areas of Guizhou and Yunnan. From 2004 to 2012, the most important clustering area shifted from Guizhou to Yunnan. Findings from this study provide evidence for identifying key areas for typhoid and paratyphoid fever control and prevention and for allocating health resources.

  3. A benchmark for statistical microarray data analysis that preserves actual biological and technical variance.

    PubMed

    De Hertogh, Benoît; De Meulder, Bertrand; Berger, Fabrice; Pierre, Michael; Bareke, Eric; Gaigneaux, Anthoula; Depiereux, Eric

    2010-01-11

    Recent reanalysis of spike-in datasets underscored the need for new and more accurate benchmark datasets for statistical microarray analysis. We present here a fresh method using biologically-relevant data to evaluate the performance of statistical methods. Our novel method ranks the probesets from a dataset composed of publicly-available biological microarray data and extracts subset matrices with precise information/noise ratios. Our method can be used to determine the capability of different methods to better estimate variance for a given number of replicates. The mean-variance and mean-fold change relationships of the matrices revealed a closer approximation of biological reality. Performance analysis refined the results from benchmarks published previously. We show that the Shrinkage t test (close to Limma) was the best of the methods tested, except when two replicates were examined, where the Regularized t test and the Window t test performed slightly better. The R scripts used for the analysis are available at http://urbm-cluster.urbm.fundp.ac.be/~bdemeulder/.
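
    The shrinkage/regularized t idea, damping the denominator with an offset so that genes whose variance is accidentally underestimated with few replicates do not dominate, can be sketched as follows. The synthetic spike-in data and the median-based choice of offset are illustrative simplifications; the benchmarked methods choose the offset differently.

```python
import numpy as np

rng = np.random.default_rng(5)

def regularized_t(a, b, s0=0.0):
    """Two-sample t statistic with a stabilizing offset s0 in the denominator."""
    na, nb = a.shape[1], b.shape[1]
    sp2 = ((na - 1) * a.var(1, ddof=1) + (nb - 1) * b.var(1, ddof=1)) / (na + nb - 2)
    se = np.sqrt(sp2 * (1 / na + 1 / nb))
    return (a.mean(1) - b.mean(1)) / (se + s0)

# 1000 "probesets", 3 replicates per condition, first 50 truly differential
a = rng.normal(8.0, 0.5, (1000, 3))
b = rng.normal(8.0, 0.5, (1000, 3))
b[:50] += 1.5  # spiked-in shift

t_plain = regularized_t(a, b)                                # ordinary t
t_reg = regularized_t(a, b, s0=np.median(np.sqrt(a.var(1, ddof=1))))
recovered = (np.argsort(-np.abs(t_reg))[:50] < 50).mean()
print(f"fraction of true positives in top 50: {recovered:.2f}")
```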

  4. Light clusters and pasta phases in warm and dense nuclear matter

    NASA Astrophysics Data System (ADS)

    Avancini, Sidney S.; Ferreira, Márcio; Pais, Helena; Providência, Constança; Röpke, Gerd

    2017-04-01

    The pasta phases are calculated for warm stellar matter in a framework of relativistic mean-field models, including the possibility of light cluster formation. Results from three different semiclassical approaches are compared with a quantum statistical calculation. Light clusters are considered as point-like particles, and their abundances are determined from the minimization of the free energy. The couplings of the light clusters to mesons are determined from experimental chemical equilibrium constants and many-body quantum statistical calculations. The effect of these light clusters on the chemical potentials is also discussed. It is shown that, by including heavy clusters, light clusters are present up to larger nucleonic densities, although with smaller mass fractions.

  5. Probing the dynamical and X-ray mass proxies of the cluster of galaxies Abell S1101

    NASA Astrophysics Data System (ADS)

    Rabitz, Andreas; Zhang, Yu-Ying; Schwope, Axel; Verdugo, Miguel; Reiprich, Thomas H.; Klein, Matthias

    2017-01-01

    Context. The galaxy cluster Abell S1101 (S1101 hereafter) deviates significantly from the X-ray luminosity versus velocity dispersion relation (L-σ) of galaxy clusters in our previous study. Given the reliable X-ray luminosity measurement combining XMM-Newton and ROSAT, this deviation is most likely caused by a bias in the velocity dispersion due to interlopers and low member statistics in the previous sample of member galaxies, which was based solely on 20 galaxy redshifts drawn from the literature. Aims: We intend to increase the galaxy member statistics to perform precision measurements of the velocity dispersion and dynamical mass of S1101. We aim for a detailed substructure and dynamical state characterization of this cluster, and a comparison of mass estimates derived from (I) the velocity dispersion (Mvir), (II) the caustic mass computation (Mcaustic), and (III) mass proxies from X-ray observations and the Sunyaev-Zel'dovich (SZ) effect. Methods: We carried out new optical spectroscopic observations of the galaxies in this cluster field with VIMOS, obtaining a sample of 60 member galaxies for S1101. We revised the cluster redshift and velocity dispersion measurements based on this sample and also applied the Dressler-Shectman substructure test. Results: The completeness of cluster members within r200 was significantly improved for this cluster. Tests for dynamical substructure do not show evidence of major disturbances or merging activities in S1101. We find good agreement between the dynamical cluster mass measurements and X-ray mass estimates, which confirms the relaxed state of the cluster displayed in the 2D substructure test. The SZ mass proxy is slightly higher than the other estimates. The updated measurement of σ erased the deviation of S1101 in the L-σ relation. We also noticed a background structure in the cluster field of S1101. This structure is a galaxy group that is very close to the cluster S1101 in projection but at almost twice its redshift.
However, the mass of this structure is too low to significantly bias the observed bolometric X-ray luminosity of S1101. Hence, we conclude that the deviation of S1101 in the L-σ relation in our previous study can be explained by low member statistics and galaxy interlopers, which are known to introduce biases in the estimated velocity dispersion. We have made use of VLT/VIMOS observations taken with the ESO Telescope at the Paranal Observatory under programme 087.A-0096.
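
    The velocity-dispersion measurement at the heart of this analysis can be sketched from member redshifts. The cluster redshift and dispersion below are illustrative numbers, not the measured values for S1101, and real analyses typically use robust (e.g. biweight) estimators with interloper rejection.

```python
import numpy as np

C_KMS = 299792.458  # speed of light, km/s

def velocity_dispersion(z, z_cluster):
    """Line-of-sight velocity dispersion from member redshifts.

    Peculiar velocity: v = c (z - z_cl) / (1 + z_cl); sigma is the
    standard deviation of the peculiar velocities.
    """
    v = C_KMS * (np.asarray(z) - z_cluster) / (1 + z_cluster)
    return np.std(v, ddof=1)

rng = np.random.default_rng(11)
# Hypothetical sample of 60 member redshifts around an illustrative z and sigma
z_cl, sigma_true = 0.058, 850.0  # km/s; illustrative values only
z = z_cl + (1 + z_cl) * rng.normal(0, sigma_true / C_KMS, 60)
sigma_hat = velocity_dispersion(z, z_cl)
print(f"sigma = {sigma_hat:.0f} km/s")
```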

  6. Triadic split-merge sampler

    NASA Astrophysics Data System (ADS)

    van Rossum, Anne C.; Lin, Hai Xiang; Dubbeldam, Johan; van der Herik, H. Jaap

    2018-04-01

    In machine vision, typical heuristic methods to extract parameterized objects from raw data points are the Hough transform and RANSAC. Bayesian models carry the promise of optimally extracting such parameterized objects given a correct definition of the model and of the type of noise at hand. One category of solvers for Bayesian models is Markov chain Monte Carlo (MCMC) methods. Naive implementations of MCMC methods suffer from slow convergence in machine vision due to the complexity of the parameter space. To address this, blocked Gibbs and split-merge samplers have been developed that assign multiple data points to clusters at once. In this paper we introduce a new split-merge sampler, the triadic split-merge sampler, that performs steps between two and three randomly chosen clusters. This has two advantages. First, it reduces the asymmetry between the split and merge steps. Second, it is able to propose a new cluster composed of data points from two different clusters. Both advantages speed up convergence, which we demonstrate on a line extraction problem. We show that the triadic split-merge sampler outperforms the conventional split-merge sampler. Although this new MCMC sampler is demonstrated in a machine vision context, its applications extend to the very general domain of statistical inference.

  7. Analysis of ligand-protein exchange by Clustering of Ligand Diffusion Coefficient Pairs (CoLD-CoP).

    PubMed

    Snyder, David A; Chantova, Mihaela; Chaudhry, Saadia

    2015-06-01

    NMR spectroscopy is a powerful tool for describing protein structures and protein activity in pharmaceutical and biochemical development. This study describes a method to identify weakly binding ligands in biological systems by hierarchical clustering of diffusion coefficients from multidimensional data obtained with a 400 MHz Bruker NMR spectrometer. Comparison of DOSY spectra of ligands from a chemical library in the presence and absence of target proteins reveals changes in the translational diffusion rates of small molecules upon interaction with macromolecules. For weak binders such as compounds found in fragment libraries, changes in diffusion rates upon macromolecular binding are on the order of the precision of DOSY diffusion measurements, and identifying such subtle shifts in diffusion requires careful statistical analysis. The "CoLD-CoP" (Clustering of Ligand Diffusion Coefficient Pairs) method presented here uses SAHN clustering to identify protein binders in a chemical library or even a not fully characterized metabolite mixture. We show how DOSY NMR and the "CoLD-CoP" method complement each other in identifying the most suitable candidates for lysozyme and wheat germ acid phosphatase. Copyright © 2015 Elsevier Inc. All rights reserved.
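
    The clustering step can be sketched with SciPy's agglomerative (SAHN) routines applied to (D without protein, D with protein) pairs; the diffusion coefficients below are invented for illustration, not measured values.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)

# Hypothetical (D_free, D_with_protein) pairs, in units of 1e-10 m^2/s.
# Non-binders diffuse the same with or without protein; weak binders slow down.
non_binders = np.column_stack([rng.normal(6.0, 0.2, 15),
                               rng.normal(6.0, 0.2, 15)])
binders = np.column_stack([rng.normal(6.0, 0.2, 5),
                           rng.normal(4.5, 0.2, 5)])
pairs = np.vstack([non_binders, binders])

# SAHN (sequential agglomerative hierarchical non-overlapping) clustering
Z = linkage(pairs, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
binder_cluster = labels[-1]
print(f"flagged as binders: {np.sum(labels == binder_cluster)} compounds")
```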

  8. Statistical Clustering and the Contents of the Infant Vocabulary

    ERIC Educational Resources Information Center

    Swingley, Daniel

    2005-01-01

    Infants parse speech into word-sized units according to biases that develop in the first year. One bias, present before the age of 7 months, is to cluster syllables that tend to co-occur. The present computational research demonstrates that this statistical clustering bias could lead to the extraction of speech sequences that are actual words,…

  9. `Inter-Arrival Time' Inspired Algorithm and its Application in Clustering and Molecular Phylogeny

    NASA Astrophysics Data System (ADS)

    Kolekar, Pandurang S.; Kale, Mohan M.; Kulkarni-Kale, Urmila

    2010-10-01

    Bioinformatics, being a multidisciplinary field, involves applications of various methods from allied areas of science for data mining using computational approaches. Clustering and molecular phylogeny are key areas in bioinformatics that help in the study of the classification and evolution of organisms. Molecular phylogeny algorithms can be divided into distance-based and character-based methods. Most of these methods depend on a pre-alignment of sequences and become computationally intensive as the data size increases, and hence demand alternative, efficient approaches. The 'inter-arrival time distribution' (IATD) is a popular concept in the theory of stochastic system modeling, but its potential in molecular data analysis has not been fully explored. The present study reports an application of IATDs in bioinformatics for clustering and molecular phylogeny. The proposed method computes the IATDs of nucleotides in genomic sequences. A distance function based on statistical parameters of the IATDs is proposed, and the distance matrix thus obtained is used for clustering and molecular phylogeny. The method is applied to a dataset of 3' non-coding region (NCR) sequences of Dengue virus type 3 (DENV-3), subtype III, reported in 2008. The phylogram thus obtained revealed the geographical distribution of DENV-3 isolates. Sri Lankan DENV-3 isolates were further observed to cluster in two sub-clades corresponding to pre- and post-Dengue-hemorrhagic-fever emergence groups. These results are consistent with those reported earlier, which were obtained using pre-aligned sequence data as input. These findings encourage applications of the IATD-based method in molecular phylogenetic analysis in particular and data mining in general.
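    The alignment-free idea can be sketched as below: for each nucleotide, collect the gaps between its successive occurrences (its inter-arrival times), summarize each IATD by a couple of statistics, and compare sequences through distances between these summary vectors. The choice of mean and standard deviation as the summarizing parameters, and the Euclidean distance over them, are illustrative assumptions rather than the authors' exact distance function.

```python
import math
from statistics import mean, pstdev

def iatd_features(seq):
    """Mean and population std of the inter-arrival times (gaps between
    successive occurrences) of each nucleotide: an 8-dimensional summary."""
    feats = []
    for base in "ACGT":
        pos = [i for i, b in enumerate(seq) if b == base]
        gaps = [b - a for a, b in zip(pos, pos[1:])] or [0.0]
        feats += [mean(gaps), pstdev(gaps) if len(gaps) > 1 else 0.0]
    return feats

def iatd_distance(s1, s2):
    """Alignment-free distance between two sequences via their IATD summaries."""
    return math.dist(iatd_features(s1), iatd_features(s2))

# Toy sequences: evenly interleaved bases vs. blocked runs of each base.
a = "ACGTACGTACGTACGT"
b = "AAAACCCCGGGGTTTT"
d_ab = iatd_distance(a, b)
```

Pairwise distances computed this way fill the distance matrix that feeds standard clustering or tree-building, with no alignment step.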

  10. A new method to unveil embedded stellar clusters

    NASA Astrophysics Data System (ADS)

    Lombardi, Marco; Lada, Charles J.; Alves, João

    2017-11-01

    In this paper we present a novel method to identify and characterize stellar clusters deeply embedded in a dark molecular cloud. The method is based on measuring stellar surface density in wide-field infrared images using star counting techniques. It takes advantage of the differing H-band luminosity functions (HLFs) of field stars and young stellar populations and is able to statistically associate each star in an image with either the background stellar population or a young stellar population projected on or near the cloud. Moreover, the technique corrects for the effects of differential extinction toward each individual star. We have tested this method against simulations as well as observations. In particular, we have applied the method to 2MASS point sources observed in the Orion A and B complexes, and the results obtained compare very well with those from deep Spitzer and Chandra observations, where the presence of infrared excess or X-ray emission directly determines membership status for every star. Additionally, our method also identifies unobscured clusters, and a low-resolution version of the Orion stellar surface density map clearly shows the relatively unobscured and diffuse OB 1a and 1b sub-groups and provides useful insights into their spatial distribution.

  11. Modeling of carbon dioxide condensation in the high pressure flows using the statistical BGK approach

    NASA Astrophysics Data System (ADS)

    Kumar, Rakesh; Li, Zheng; Levin, Deborah A.

    2011-05-01

    In this work, we propose a new heat accommodation model to simulate freely expanding homogeneous condensation flows of gaseous carbon dioxide using a new approach, the statistical Bhatnagar-Gross-Krook (BGK) method. The motivation for the present work comes from the earlier work of Li et al. [J. Phys. Chem. 114, 5276 (2010)], in which condensation models were proposed and used in the direct simulation Monte Carlo (DSMC) method to simulate the flow of carbon dioxide from supersonic expansions of small nozzles into near-vacuum conditions. Simulations conducted for stagnation pressures of one and three bar were compared with the measurements of gas and cluster number densities, cluster size, and carbon dioxide rotational temperature obtained by Ramos et al. [Phys. Rev. A 72, 3204 (2005)]. Due to the high computational cost of the DSMC method, comparison between simulations and data could only be performed for these stagnation pressures, with good agreement obtained beyond the condensation onset point, in the farfield. As the stagnation pressure increases, the degree of condensation also increases; therefore, to improve the modeling of condensation onset, one must be able to simulate higher stagnation pressures. In simulations of an expanding flow of argon through a nozzle, Kumar et al. [AIAA J. 48, 1531 (2010)] found that the statistical BGK method provides the same accuracy as the DSMC method at half the computational cost. In this work, the statistical BGK method was modified to account for internal degrees of freedom in multi-species polyatomic gases. With the computational approach in hand, we developed and tested a new heat accommodation model for a polyatomic system to properly account for the heat release of condensation. We then developed condensation models in the framework of the statistical BGK method. Simulations were found to agree well with experiment for all stagnation pressure cases (1-5 bar), validating the accuracy of the BGK-based condensation model in capturing the physics of condensation.

  12. Manual hierarchical clustering of regional geochemical data using a Bayesian finite mixture model

    USGS Publications Warehouse

    Ellefsen, Karl J.; Smith, David

    2016-01-01

    Interpretation of regional scale, multivariate geochemical data is aided by a statistical technique called “clustering.” We investigate a particular clustering procedure by applying it to geochemical data collected in the State of Colorado, United States of America. The clustering procedure partitions the field samples for the entire survey area into two clusters. The field samples in each cluster are partitioned again to create two subclusters, and so on. This manual procedure generates a hierarchy of clusters, and the different levels of the hierarchy show geochemical and geological processes occurring at different spatial scales. Although there are many different clustering methods, we use Bayesian finite mixture modeling with two probability distributions, which yields two clusters. The model parameters are estimated with Hamiltonian Monte Carlo sampling of the posterior probability density function, which usually has multiple modes. Each mode has its own set of model parameters; each set is checked to ensure that it is consistent both with the data and with independent geologic knowledge. The set of model parameters that is most consistent with the independent geologic knowledge is selected for detailed interpretation and partitioning of the field samples.
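    The recursive two-way partitioning described above can be sketched schematically. For brevity this sketch substitutes a plain two-means split for the paper's two-component Bayesian finite mixture fit with Hamiltonian Monte Carlo, and uses invented one-dimensional "concentration" values; only the divide-in-two-and-recurse structure is the point.

```python
def two_means(values, iters=20):
    """Partition 1-D values into two groups (a simplified stand-in for the
    two-component Bayesian mixture fit used in the paper)."""
    c = [min(values), max(values)]  # initial cluster centers
    groups = [values, []]
    for _ in range(iters):
        groups = [[], []]
        for v in values:
            # Index 1 when v is closer to c[1], else 0.
            groups[abs(v - c[0]) > abs(v - c[1])].append(v)
        c = [sum(g) / len(g) if g else c[k] for k, g in enumerate(groups)]
    return groups

def hierarchy(values, depth):
    """Recursively split each cluster in two, building the cluster hierarchy."""
    if depth == 0 or len(values) < 2:
        return values
    left, right = two_means(values)
    return [hierarchy(left, depth - 1), hierarchy(right, depth - 1)]

# Hypothetical geochemical concentrations with two scales of structure.
data = [1.0, 1.1, 2.0, 2.1, 9.0, 9.1, 12.0, 12.1]
tree = hierarchy(data, depth=2)
```

The top level of the tree separates the coarse structure; the second level resolves the finer structure within each branch, mirroring how different hierarchy levels expose processes at different spatial scales.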

  13. RRW: repeated random walks on genome-scale protein networks for local cluster discovery

    PubMed Central

    Macropol, Kathy; Can, Tolga; Singh, Ambuj K

    2009-01-01

    Background We propose an efficient and biologically sensitive algorithm based on repeated random walks (RRW) for discovering functional modules, e.g., complexes and pathways, within large-scale protein networks. Compared to existing cluster identification techniques, RRW implicitly makes use of network topology, edge weights, and long range interactions between proteins. Results We apply the proposed technique on a functional network of yeast genes and accurately identify statistically significant clusters of proteins. We validate the biological significance of the results using known complexes in the MIPS complex catalogue database and well-characterized biological processes. We find that 90% of the created clusters have the majority of their catalogued proteins belonging to the same MIPS complex, and about 80% have the majority of their proteins involved in the same biological process. We compare our method to various other clustering techniques, such as the Markov Clustering Algorithm (MCL), and find a significant improvement in the RRW clusters' precision and accuracy values. Conclusion RRW, which is a technique that exploits the topology of the network, is more precise and robust in finding local clusters. In addition, it has the added flexibility of being able to find multi-functional proteins by allowing overlapping clusters. PMID:19740439
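    The core primitive behind RRW, a random walk with restart, can be sketched on a toy network. The restart probability and the graph below are invented, and RRW proper repeats such walks from growing seed clusters with edge weights; this sketch only shows how the stationary vector scores network proximity to a start protein.

```python
def random_walk_with_restart(adj, start, restart=0.3, iters=100):
    """Power-iterate p <- (1 - r) * W p + r * e_start, where W is the
    column-normalized adjacency matrix; the result scores each node's
    proximity to `start` through all network paths."""
    n = len(adj)
    deg = [sum(adj[j][i] for j in range(n)) for i in range(n)]  # column sums
    p = [0.0] * n
    p[start] = 1.0
    for _ in range(iters):
        q = [(1 - restart) * sum(adj[i][j] * p[j] / deg[j]
                                 for j in range(n) if deg[j])
             for i in range(n)]
        q[start] += restart
        p = q
    return p

# Toy protein network: nodes 0-2 form a triangle, node 3 hangs off node 2.
adj = [[0, 1, 1, 0],
       [1, 0, 1, 0],
       [1, 1, 0, 1],
       [0, 0, 1, 0]]
scores = random_walk_with_restart(adj, start=0)
```

Nodes tightly connected to the start accumulate probability mass, so thresholding the scores yields a candidate local cluster around the seed.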

  14. Modeling the Movement of Homicide by Type to Inform Public Health Prevention Efforts

    PubMed Central

    Grady, Sue; Pizarro, Jesenia M.; Melde, Chris

    2015-01-01

    Objectives. We modeled the spatiotemporal movement of hotspot clusters of homicide by motive in Newark, New Jersey, to investigate whether different homicide types have different patterns of clustering and movement. Methods. We obtained homicide data from the Newark Police Department Homicide Unit’s investigative files from 1997 through 2007 (n = 560). We geocoded the address at which each homicide victim was found and recorded the date of and the motive for the homicide. We used cluster detection software to model the spatiotemporal movement of statistically significant homicide clusters by motive, using census tract and month of occurrence as the spatial and temporal units of analysis. Results. Gang-motivated homicides showed evidence of clustering and diffusion through Newark. Additionally, gang-motivated homicide clusters overlapped to a degree with revenge and drug-motivated homicide clusters. Escalating dispute and nonintimate familial homicides clustered; however, there was no evidence of diffusion. Intimate partner and robbery homicides did not cluster. Conclusions. By tracking how homicide types diffuse through communities and determining which places have ongoing or emerging homicide problems by type, we can better inform the deployment of prevention and intervention efforts. PMID:26270315

  15. Clustering Multivariate Time Series Using Hidden Markov Models

    PubMed Central

    Ghassempour, Shima; Girosi, Federico; Maeder, Anthony

    2014-01-01

    In this paper we describe an algorithm for clustering multivariate time series with variables taking both categorical and continuous values. Time series of this type are frequent in health care, where they represent the health trajectories of individuals. The problem is challenging because categorical variables make it difficult to define a meaningful distance between trajectories. We propose an approach based on Hidden Markov Models (HMMs), where we first map each trajectory into an HMM, then define a suitable distance between HMMs and finally proceed to cluster the HMMs with a method based on a distance matrix. We test our approach on a simulated, but realistic, data set of 1,255 trajectories of individuals of age 45 and over, on a synthetic validation set with known clustering structure, and on a smaller set of 268 trajectories extracted from the longitudinal Health and Retirement Survey. The proposed method can be implemented quite simply using standard packages in R and Matlab and may be a good candidate for solving the difficult problem of clustering multivariate time series with categorical variables using tools that do not require advanced statistical knowledge, and therefore are accessible to a wide range of researchers. PMID:24662996

  16. A Deterministic Annealing Approach to Clustering AIRS Data

    NASA Technical Reports Server (NTRS)

    Guillaume, Alexandre; Braverman, Amy; Ruzmaikin, Alexander

    2012-01-01

    We will examine the validity of means and standard deviations as a basis for climate data products. We will explore the conditions under which these two simple statistics are inadequate summaries of the underlying empirical probability distributions by contrasting them with a nonparametric clustering method, the Deterministic Annealing technique.
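    Deterministic annealing can be sketched in one dimension with two prototypes: soft memberships p(c|x) proportional to exp(-(x - y_c)^2 / T) are alternated with membership-weighted prototype updates while the temperature T is lowered, so assignments harden gradually instead of committing early. The data, temperature schedule, and prototype count below are invented for illustration.

```python
import math

def deterministic_annealing(xs, t_start=10.0, t_min=0.01, cool=0.5):
    """Two-prototype deterministic annealing in 1-D: alternate soft
    assignments and weighted-mean prototype updates, cooling geometrically."""
    y = [min(xs), max(xs)]          # initial prototypes
    t = t_start
    while t > t_min:
        for _ in range(20):         # fixed-point iterations at this temperature
            num = [0.0, 0.0]
            den = [0.0, 0.0]
            for x in xs:
                w = [math.exp(-(x - yc) ** 2 / t) for yc in y]
                z = sum(w)
                for c in range(2):
                    num[c] += x * w[c] / z   # membership-weighted sum
                    den[c] += w[c] / z       # total membership mass
            y = [num[c] / den[c] for c in range(2)]
        t *= cool
    return y

# Hypothetical bimodal measurements (e.g. two regimes in a retrieved quantity).
sample = [250.0, 251.0, 252.0, 280.0, 281.0, 282.0]
protos = deterministic_annealing(sample)
```

For a unimodal distribution the two prototypes stay together at the mean, while for bimodal data they separate onto the modes, which is exactly the diagnostic contrast with a single mean-and-standard-deviation summary.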

  17. Linnorm: improved statistical analysis for single cell RNA-seq expression data

    PubMed Central

    Yip, Shun H.; Wang, Panwen; Kocher, Jean-Pierre A.; Sham, Pak Chung

    2017-01-01

    Abstract Linnorm is a novel normalization and transformation method for the analysis of single cell RNA sequencing (scRNA-seq) data. Linnorm is developed to remove technical noise while preserving biological variation in scRNA-seq data, such that existing statistical methods can be improved. Using real scRNA-seq data, we compared Linnorm with existing normalization methods, including NODES, SAMstrt, SCnorm, scran, DESeq and TMM. Linnorm shows advantages in speed, technical noise removal and preservation of cell heterogeneity, which can improve existing methods in the discovery of novel subtypes, pseudo-temporal ordering of cells, clustering analysis, etc. Linnorm also performs better than existing DEG analysis methods, including BASiCS, NODES, SAMstrt, Seurat and DESeq2, in false positive rate control and accuracy. PMID:28981748

  18. Anharmonic effects in the quantum cluster equilibrium method

    NASA Astrophysics Data System (ADS)

    von Domaros, Michael; Perlt, Eva

    2017-03-01

    The well-established quantum cluster equilibrium (QCE) model provides a statistical thermodynamic framework to apply high-level ab initio calculations of finite cluster structures to macroscopic liquid phases using the partition function. So far, the harmonic approximation has been applied throughout the calculations. In this article, we apply an important correction in the evaluation of the one-particle partition function and account for anharmonicity. Therefore, we implemented an analytical approximation to the Morse partition function and the derivatives of its logarithm with respect to temperature, which are required for the evaluation of thermodynamic quantities. This anharmonic QCE approach has been applied to liquid hydrogen chloride and cluster distributions, and the molar volume, the volumetric thermal expansion coefficient, and the isobaric heat capacity have been calculated. An improved description for all properties is observed if anharmonic effects are considered.
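    For reference, the bound-state energies of the Morse oscillator and the resulting finite vibrational partition function take the standard textbook form below; the paper's analytical closed-form approximation to this sum, and the temperature derivatives of its logarithm, are not reproduced here.

```latex
E_n = \hbar\omega_e\left(n+\tfrac{1}{2}\right)
      - \frac{\left[\hbar\omega_e\left(n+\tfrac{1}{2}\right)\right]^{2}}{4D_e},
\qquad
q_{\mathrm{Morse}}(T) = \sum_{n=0}^{n_{\max}} e^{-E_n/k_{B}T},
\qquad
n_{\max} = \left\lfloor \frac{2D_e}{\hbar\omega_e} - \tfrac{1}{2} \right\rfloor
```

Here $D_e$ is the well depth and $\omega_e$ the harmonic frequency; the finite upper limit $n_{\max}$ (the number of bound states) is what distinguishes the anharmonic sum from the infinite harmonic-oscillator series.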

  19. DENBRAN: A basic program for a significance test for multivariate normality of clusters from branching patterns in dendrograms

    NASA Astrophysics Data System (ADS)

    Sneath, P. H. A.

    A BASIC program is presented for significance tests to determine whether a dendrogram is derived from clustering of points that belong to a single multivariate normal distribution. The significance tests are based on statistics of the Kolmogorov-Smirnov type, obtained by comparing the observed cumulative graph of branch levels with a graph for the hypothesis of multivariate normality. The program also permits testing whether the dendrogram could be from a cluster of lower dimensionality due to character correlations. The program makes provision for three similarity coefficients: (1) Euclidean distances, (2) squared Euclidean distances, and (3) Simple Matching Coefficients; and for five cluster methods: (1) WPGMA, (2) UPGMA, (3) Single Linkage (or Minimum Spanning Trees), (4) Complete Linkage, and (5) Ward's Increase in Sums of Squares. The program is entitled DENBRAN.
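    The Kolmogorov-Smirnov-type comparison at the heart of the test can be sketched as below, with invented branch levels; DENBRAN's actual reference graph is derived from the multivariate-normal hypothesis rather than supplied directly as a sample.

```python
def ks_statistic(observed, reference):
    """Kolmogorov-Smirnov-type statistic: the maximum vertical gap between
    the empirical CDF of observed dendrogram branch levels and the CDF of
    branch levels expected under the null hypothesis."""
    xs = sorted(set(observed) | set(reference))

    def ecdf(sample, x):
        return sum(v <= x for v in sample) / len(sample)

    return max(abs(ecdf(observed, x) - ecdf(reference, x)) for x in xs)

# Hypothetical fusion levels: the observed dendrogram has one late, high-level
# fusion absent from the null reference, suggesting more than one cluster.
obs = [0.2, 0.3, 0.35, 0.9]
ref = [0.2, 0.3, 0.35, 0.4]
d = ks_statistic(obs, ref)
```

A large gap indicates the branching pattern is inconsistent with sampling from a single multivariate normal cluster.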

  20. Shapiro effect as a possible cause of the low-frequency pulsar timing noise in globular clusters

    NASA Astrophysics Data System (ADS)

    Larchenkova, T. I.; Kopeikin, S. M.

    2006-01-01

    A prolonged timing of millisecond pulsars has revealed low-frequency uncorrelated (infrared) noise, presumably of astrophysical origin, in the pulse arrival time (PAT) residuals for some of them. Currently available pulsar timing methods allow the statistical parameters of this noise to be reliably measured by decomposing the PAT residual function into orthogonal Fourier harmonics. In most cases, pulsars in globular clusters show a low-frequency modulation of their rotational phase and spin rate. The relativistic time delay of the pulsar signal in the curved spacetime of randomly distributed and moving globular cluster stars (the Shapiro effect) is suggested as a possible cause of this modulation. Extremely important (from an astrophysical point of view) information about the structure of the globular cluster core, which is inaccessible to study by other observational methods, could be obtained by analyzing the spectral parameters of the low-frequency noise caused by the Shapiro effect and attributable to the random passages of stars near the line of sight to the pulsar. Given the smallness of the aberration corrections that arise from the nonstationarity of the gravitational field of the randomly distributed ensemble of stars under consideration, a formula is derived for the Shapiro effect for a pulsar in a globular cluster. The derived formula is used to calculate the autocorrelation function of the low-frequency pulsar noise, the slope of its power spectrum, and the behavior of the σz statistic that characterizes the spectral properties of this noise in the form of a time function. The Shapiro effect under discussion is shown to manifest itself for large impact parameters as a low-frequency noise of the pulsar spin rate with a spectral index of n = -1.8 that depends weakly on the specific model distribution of stars in the globular cluster. For small impact parameters, the spectral index of the noise is n = -1.5.
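    For orientation, the Shapiro delay induced by a single deflecting star of mass $m$ can be written in one common point-mass form (the paper generalizes this to a randomly distributed, moving ensemble of cluster stars, including the smallness of the aberration corrections):

```latex
\Delta t = -\frac{2Gm}{c^{3}}\,\ln\!\left(1-\cos\theta\right),
\qquad
1-\cos\theta \approx \frac{\theta^{2}}{2}\ \ \text{for small }\theta
```

Here $\theta$ is the angle between the direction to the pulsar and the direction to the deflecting star, so close passages near the line of sight (small impact parameters) produce logarithmically growing delays, which is why random stellar passages imprint a low-frequency noise on the pulse arrival times.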

  1. Survey methods for assessing land cover map accuracy

    USGS Publications Warehouse

    Nusser, S.M.; Klaas, E.E.

    2003-01-01

    The increasing availability of digital photographic materials has fueled efforts by agencies and organizations to generate land cover maps for states, regions, and the United States as a whole. Regardless of the information sources and classification methods used, land cover maps are subject to numerous sources of error. In order to understand the quality of the information contained in these maps, it is desirable to generate statistically valid estimates of accuracy rates describing misclassification errors. We explored a full sample survey framework for creating accuracy assessment study designs that balance statistical and operational considerations in relation to study objectives for a regional assessment of GAP land cover maps. We focused not only on appropriate sample designs and estimation approaches, but on aspects of the data collection process, such as gaining cooperation of land owners and using pixel clusters as an observation unit. The approach was tested in a pilot study to assess the accuracy of Iowa GAP land cover maps. A stratified two-stage cluster sampling design addressed sample size requirements for land covers and the need for geographic spread while minimizing operational effort. Recruitment methods used for private land owners yielded high response rates, minimizing a source of nonresponse error. Collecting data for a 9-pixel cluster centered on the sampled pixel was simple to implement, and provided better information on rarer vegetation classes as well as substantial gains in precision relative to observing data at a single pixel.

  2. An Information-Theoretic-Cluster Visualization for Self-Organizing Maps.

    PubMed

    Brito da Silva, Leonardo Enzo; Wunsch, Donald C

    2018-06-01

    Improved data visualization will be a significant tool to enhance cluster analysis. In this paper, an information-theoretic-based method for cluster visualization using self-organizing maps (SOMs) is presented. The information-theoretic visualization (IT-vis) has the same structure as the unified distance matrix, but instead of depicting Euclidean distances between adjacent neurons, it displays the similarity between the distributions associated with adjacent neurons. Each SOM neuron has an associated subset of the data set whose cardinality controls the granularity of the IT-vis and with which the first- and second-order statistics are computed and used to estimate their probability density functions. These are used to calculate the similarity measure, based on Renyi's quadratic cross entropy and cross information potential (CIP). The introduced visualizations combine the low computational cost and kernel estimation properties of the representative CIP and the data structure representation of a single-linkage-based grouping algorithm to generate an enhanced SOM-based visualization. The visual quality of the IT-vis is assessed by comparing it with other visualization methods for several real-world and synthetic benchmark data sets. Thus, this paper also contains a significant literature survey. The experiments demonstrate the IT-vis cluster revealing capabilities, in which cluster boundaries are sharply captured. Additionally, the information-theoretic visualizations are used to perform clustering of the SOM. Compared with other methods, IT-vis of large SOMs yielded the best results in this paper, for which the quality of the final partitions was evaluated using external validity indices.
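    The cross information potential underlying the IT-vis similarity measure can be estimated with Parzen windows. This one-dimensional sketch uses a Gaussian kernel of invented width on invented samples; the paper applies the estimate to the data subsets associated with adjacent SOM neurons.

```python
import math

def cross_information_potential(xs, ys, sigma=1.0):
    """Parzen estimate of the cross information potential
    V(X, Y) = (1/(N*M)) * sum_i sum_j G(x_i - y_j; 2*sigma^2),
    where G is a Gaussian kernel (kernel variances add under convolution).
    Larger values indicate more overlap between the two densities."""
    s2 = 2.0 * sigma ** 2
    norm = 1.0 / math.sqrt(2.0 * math.pi * s2)
    return sum(norm * math.exp(-(x - y) ** 2 / (2.0 * s2))
               for x in xs for y in ys) / (len(xs) * len(ys))

# Samples from overlapping vs. well-separated distributions.
near = cross_information_potential([0.0, 0.1], [0.0, -0.1])
far = cross_information_potential([0.0, 0.1], [5.0, 5.1])
```

Adjacent neurons whose data distributions overlap (high CIP) are rendered as similar in the visualization, while low CIP marks a cluster boundary.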

  3. Development and selection of Asian-specific humeral implants based on statistical atlas: toward planning minimally invasive surgery.

    PubMed

    Wu, K; Daruwalla, Z J; Wong, K L; Murphy, D; Ren, H

    2015-08-01

    The commercial humeral implants based on the Western population are currently not entirely compatible with Asian patients, due to differences in bone size, shape and structure. Surgeons may have to compromise or use different implants that are less conforming, which may cause complications as well as inconvenience in implant positioning. The construction of Asian humerus atlases of different clusters has therefore been proposed to eradicate this problem and to facilitate planning minimally invasive surgical procedures [6,31]. According to the features of the atlases, new implants could be designed specifically for different patients. Furthermore, an automatic implant selection algorithm has been proposed in order to reduce the complications caused by implant-bone mismatch. Prior to the design of the implant, data clustering and extraction of the relevant features were carried out on the datasets of each gender. The fuzzy C-means clustering method is explored in this paper. In addition, two new schemes of implant selection, namely the Procrustes analysis-based (PA-based) scheme and the group average distance-based (GAD-based) scheme, are proposed to better search the database for matching implants for new patients. Neither algorithm has previously been used in this area, yet both turn out to have excellent performance in implant selection. Additionally, algorithms to calculate the matching scores between various implants and the patient data are proposed to assist the implant selection procedure. The results obtained have indicated the feasibility of the proposed development and selection scheme. The 16 sets of male data were divided into two clusters of 8 subjects each, and the 11 female datasets were divided into two clusters with 5 and 6 subjects, respectively. Based on the features of each cluster, the implants designed by the proposed algorithm fit very well on their reference humeri, and the proposed implant selection procedure allows for a scenario of treating a patient with merely a preoperative anatomical model in order to correctly select the implant with the best fit. Based on leave-one-out validation, it can be concluded that both the PA-based and GAD-based methods achieve excellent performance on the implant selection problem. The accuracy and average execution time of the PA-based method were 100% and 0.132 s, respectively, while those of the GAD-based method were 100% and 0.058 s. The GAD-based method therefore outperformed the PA-based method in terms of execution speed. The primary contributions of this paper include methods for the development of Asian-, gender- and cluster-specific implants based on shape features, and for the selection of the best-fit implants for future patients according to their features. To the best of our knowledge, this is the first work that proposes automatic implant design and selection for Asian patients based on features extracted from cluster-specific statistical atlases.
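    Procrustes-style shape matching can be sketched in two dimensions with a closed-form optimal rotation. This is a simplified stand-in for the PA-based scheme: it handles translation and rotation only, omits the scale-normalization step of full Procrustes analysis, and the point sets below are invented rather than actual humeral landmarks.

```python
import math

def procrustes_2d(ref, pts):
    """Procrustes-style matching score in 2-D: centre both point sets,
    apply the closed-form optimal rotation of `pts` onto `ref`, and
    return the residual root-mean-square distance (lower = better fit)."""
    def centre(ps):
        cx = sum(p[0] for p in ps) / len(ps)
        cy = sum(p[1] for p in ps) / len(ps)
        return [(x - cx, y - cy) for x, y in ps]

    a, b = centre(ref), centre(pts)
    # Optimal rotation angle from the 2-D cross-covariance terms.
    num = sum(ay * bx - ax * by for (ax, ay), (bx, by) in zip(a, b))
    den = sum(ax * bx + ay * by for (ax, ay), (bx, by) in zip(a, b))
    th = math.atan2(num, den)
    rot = [(bx * math.cos(th) - by * math.sin(th),
            bx * math.sin(th) + by * math.cos(th)) for bx, by in b]
    return math.sqrt(sum((ax - rx) ** 2 + (ay - ry) ** 2
                         for (ax, ay), (rx, ry) in zip(a, rot)) / len(a))

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
rotated = [(-y, x) for x, y in square]           # same shape, rotated 90 deg
sheared = [(x + 0.5 * y, y) for x, y in square]  # genuinely different shape
good = procrustes_2d(square, rotated)
bad = procrustes_2d(square, sheared)
```

Ranking candidate implants by such a residual score, lowest first, is the essence of selecting the best-fitting implant for a new patient.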

  4. The halo Boltzmann equation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Biagetti, Matteo; Desjacques, Vincent; Kehagias, Alex

    2016-04-01

    Dark matter halos are the building blocks of the universe as they host galaxies and clusters. The knowledge of the clustering properties of halos is therefore essential for the understanding of the galaxy statistical properties. We derive an effective halo Boltzmann equation which can be used to describe the halo clustering statistics. In particular, we show how the halo Boltzmann equation encodes a statistically biased gravitational force which generates a bias in the peculiar velocities of virialized halos with respect to the underlying dark matter, as recently observed in N-body simulations.

  5. Verification of Eulerian-Eulerian and Eulerian-Lagrangian simulations for fluid-particle flows

    NASA Astrophysics Data System (ADS)

    Kong, Bo; Patel, Ravi G.; Capecelatro, Jesse; Desjardins, Olivier; Fox, Rodney O.

    2017-11-01

    In this work, we study the performance of three simulation techniques for fluid-particle flows: (1) a volume-filtered Euler-Lagrange approach (EL), (2) a quadrature-based moment method using the anisotropic Gaussian closure (AG), and (3) a traditional two-fluid model (TFM). By simulating two problems, particles in frozen homogeneous isotropic turbulence (HIT) and cluster-induced turbulence (CIT), we find that the convergence of the methods under grid refinement depends on the simulation method and the specific problem, with CIT simulations facing fewer difficulties than HIT. Although EL converges under refinement for both HIT and CIT, its statistical results exhibit dependence on the techniques used to extract statistics for the particle phase. For HIT, converging both Eulerian-Eulerian (EE) methods (TFM and AG) poses challenges, while for CIT, AG and EL produce similar results. Overall, all three methods face challenges when trying to extract converged, parameter-independent statistics due to the presence of shocks in the particle phase. Funded by the National Science Foundation and the National Energy Technology Laboratory.

  6. The use of the temporal scan statistic to detect methicillin-resistant Staphylococcus aureus clusters in a community hospital.

    PubMed

    Faires, Meredith C; Pearl, David L; Ciccotelli, William A; Berke, Olaf; Reid-Smith, Richard J; Weese, J Scott

    2014-07-08

    In healthcare facilities, conventional surveillance techniques using rule-based guidelines may result in under- or over-reporting of methicillin-resistant Staphylococcus aureus (MRSA) outbreaks, as these guidelines are generally unvalidated. The objectives of this study were to investigate the utility of the temporal scan statistic for detecting MRSA clusters, to validate clusters using molecular techniques and hospital records, and to determine significant differences in the rate of MRSA cases using regression models. Patients admitted to a community hospital between August 2006 and February 2011, and identified with MRSA >48 hours following hospital admission, were included in this study. Between March 2010 and February 2011, MRSA specimens were obtained for spa typing. MRSA clusters were investigated using a retrospective temporal scan statistic. Tests were conducted on a monthly scale, and significant clusters were compared to MRSA outbreaks identified by hospital personnel. Associations between the rate of MRSA cases and the variables year, month, and season were investigated using a negative binomial regression model. During the study period, 735 MRSA cases were identified and 167 MRSA isolates were spa typed. Nine different spa types were identified, with spa type 2/t002 (88.6%) the most prevalent. The temporal scan statistic identified significant MRSA clusters at the hospital (n=2), service (n=16), and ward (n=10) levels (P ≤ 0.05). Seven clusters were concordant with nine MRSA outbreaks identified by hospital staff. For the remaining clusters, seven events may have been equivalent to true outbreaks and six clusters demonstrated possible transmission events. The regression analysis indicated that years 2009-2011, compared to 2006, and the months of March and April, compared to January, were associated with an increase in the rate of MRSA cases (P ≤ 0.05). The application of the temporal scan statistic identified several MRSA clusters that were not detected by hospital personnel. The identification of specific years and months with increased MRSA rates may be attributable to several hospital-level factors, including the presence of other pathogens. Within hospitals, the incorporation of the temporal scan statistic into standard surveillance techniques is a valuable tool for healthcare workers to evaluate surveillance strategies and aid in the identification of MRSA clusters.
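    A retrospective temporal scan of this kind can be sketched with the standard Poisson likelihood-ratio statistic evaluated over sliding windows of months; in practice the maximum statistic is compared against a Monte Carlo null distribution to obtain a P value, a step omitted here. The monthly counts below are invented.

```python
import math

def temporal_scan(counts, max_len=6):
    """Kulldorff-style temporal scan: slide windows of 1..max_len periods
    over the counts, score each window where observed cases exceed the
    expectation under uniform risk with the Poisson log-likelihood ratio,
    and return the highest-scoring window as (start, end), score."""
    total = sum(counts)
    n = len(counts)
    best, best_llr = None, 0.0
    for length in range(1, max_len + 1):
        for start in range(n - length + 1):
            c = sum(counts[start:start + length])
            e = total * length / n            # expected count under the null
            if c <= e or c == total:
                continue
            llr = (c * math.log(c / e)
                   + (total - c) * math.log((total - c) / (total - e)))
            if llr > best_llr:
                best, best_llr = (start, start + length), llr
    return best, best_llr

# Hypothetical monthly MRSA counts with an excess in months 6-8 (0-indexed).
monthly = [2, 3, 2, 3, 2, 3, 10, 12, 11, 3, 2, 3]
window, llr = temporal_scan(monthly)
```

The scan correctly singles out the three elevated months as the most likely cluster, the same logic applied at the hospital, service, and ward levels in the study.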

  7. Geographic Clusters of Basal Cell Carcinoma in a Northern California Health Plan Population.

    PubMed

    Ray, G Thomas; Kulldorff, Martin; Asgari, Maryam M

    2016-11-01

    Rates of skin cancer, including basal cell carcinoma (BCC), the most common cancer, have been increasing over the past three decades. A better understanding of the geographic clustering of BCCs can help target screening and prevention efforts. We present a methodology to identify spatial clusters of BCC and apply it to a northern California population. This retrospective study used a BCC registry to determine rates of BCC by census block group, and used spatial scan statistics to identify statistically significant geographic clusters of BCCs, adjusting for age, sex, and socioeconomic status. The study population consisted of white, non-Hispanic members of Kaiser Permanente Northern California during the years 2011 and 2012. The main outcome was statistically significant geographic clusters of BCC as determined by spatial scan statistics. Spatial analysis of 28,408 individuals who received a diagnosis of at least 1 BCC in 2011 or 2012 revealed distinct geographic areas with elevated BCC rates. Among the 14 counties studied, BCC incidence ranged from 661 to 1598 per 100,000 person-years. After adjustment for age, sex, and neighborhood socioeconomic status, a pattern of 5 discrete geographic clusters emerged, with relative risks ranging from 1.12 (95% CI, 1.03-1.21; P = .006) for a cluster in eastern Sonoma and northern Napa Counties to 1.40 (95% CI, 1.15-1.71; P < .001) for a cluster in east Contra Costa and west San Joaquin Counties, compared with persons residing outside each cluster. In this study of a northern California population, we identified several geographic clusters with modestly elevated incidence of BCC. Knowledge of geographic clusters can help inform future research on the underlying etiology of the clustering, including factors related to the environment, health care access, or other characteristics of the resident population, and can help target screening efforts to areas of highest yield.

  8. Spatial, temporal and spatio-temporal clusters of measles incidence at the county level in Guangxi, China during 2004-2014: flexibly shaped scan statistics.

    PubMed

    Tang, Xianyan; Geater, Alan; McNeil, Edward; Deng, Qiuyun; Dong, Aihu; Zhong, Ge

    2017-04-04

    Outbreaks of measles re-emerged in Guangxi province during 2013-2014, where measles again became a major public health concern. A better understanding of the patterns of measles cases would help in identifying high-risk areas and periods for optimizing preventive strategies, yet these patterns remain largely unknown. Thus, this study aimed to determine the patterns of measles clusters in space, time and space-time at the county level over the period 2004-2014 in Guangxi. Annual data on measles cases and population sizes for each county were obtained from Guangxi CDC and Guangxi Bureau of Statistics, respectively. Epidemic curves and Kulldorff's temporal scan statistics were used to identify seasonal peaks and high-risk periods. Tango's flexible scan statistics were implemented to determine irregular spatial clusters. Spatio-temporal clusters in elliptical cylinder shapes were detected by Kulldorff's scan statistics. Population attributable risk percent (PAR%) of children aged ≤24 months was used to identify regions with a heavy burden of measles. Seasonal peaks occurred between April and June, and a temporal measles cluster was detected in 2014. Spatial clusters were identified in West, Southwest and North Central Guangxi. Three phases of spatio-temporal clusters with high relative risk were detected: Central Guangxi during 2004-2005, Midwest Guangxi in 2007, and West and Southwest Guangxi during 2013-2014. Regions with high PAR% were mainly clustered in West, Southwest, North and Central Guangxi. A temporal uptrend of measles incidence existed in Guangxi between 2010 and 2014, while a downtrend existed during 2004-2009. The hotspots shifted from Central to West and Southwest Guangxi, regions overburdened with measles. Thus, intensifying surveillance of timeliness and completeness of routine vaccination and implementing supplementary immunization activities for measles should be prioritized in these regions.

  9. Statistical segmentation of multidimensional brain datasets

    NASA Astrophysics Data System (ADS)

    Desco, Manuel; Gispert, Juan D.; Reig, Santiago; Santos, Andres; Pascau, Javier; Malpica, Norberto; Garcia-Barreno, Pedro

    2001-07-01

    This paper presents an automatic segmentation procedure for MRI neuroimages that overcomes part of the problems involved in multidimensional clustering techniques like partial volume effects (PVE), processing speed and difficulty of incorporating a priori knowledge. The method is a three-stage procedure: 1) Exclusion of background and skull voxels using threshold-based region growing techniques with fully automated seed selection. 2) Expectation Maximization algorithms are used to estimate the probability density function (PDF) of the remaining pixels, which are assumed to be mixtures of Gaussians. These pixels can then be classified into cerebrospinal fluid (CSF), white matter and grey matter. Using this procedure, our method takes advantage of the full covariance matrix (instead of the diagonal) for the joint PDF estimation. On the other hand, logistic discrimination techniques are more robust against violation of multi-Gaussian assumptions. 3) A priori knowledge is added using Markov Random Field techniques. The algorithm has been tested with a dataset of 30 brain MRI studies (co-registered T1 and T2 MRI). Our method was compared with clustering techniques and with template-based statistical segmentation, using manual segmentation as a gold-standard. Our results were more robust and closer to the gold-standard.
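
    Stage 2, an EM fit of a multivariate Gaussian mixture using the full covariance matrix, can be sketched with scikit-learn (a hedged illustration on synthetic two-channel "T1/T2" intensities, not the authors' implementation; the class means and covariances below are made up):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for co-registered T1/T2 voxel intensities: three
# tissue classes (CSF, grey matter, white matter) with correlated channels.
n = 500
csf   = rng.multivariate_normal([1.0, 3.0], [[0.10, 0.05], [0.05, 0.10]], n)
grey  = rng.multivariate_normal([2.5, 2.0], [[0.10, 0.04], [0.04, 0.10]], n)
white = rng.multivariate_normal([4.0, 1.0], [[0.10, 0.05], [0.05, 0.10]], n)
voxels = np.vstack([csf, grey, white])

# EM estimation of the joint PDF using the full covariance matrix
# (covariance_type="full"), rather than only the diagonal.
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
labels = gmm.fit_predict(voxels)
posteriors = gmm.predict_proba(voxels)  # soft class memberships per voxel
```

    Because the channels are correlated, `covariance_type="full"` models the tilt of each class cloud, which a diagonal covariance cannot.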

  10. Mapping the Indonesian territory, based on pollution, social demography and geographical data, using self organizing feature map

    NASA Astrophysics Data System (ADS)

    Hernawati, Kuswari; Insani, Nur; Bambang S. H., M.; Nur Hadi, W.; Sahid

    2017-08-01

    This research aims to map the 33 (thirty-three) provinces in Indonesia, based on data on air, water and soil pollution, as well as social demography and geography data, into a clustered model. The method used in this study was an unsupervised method based on the Kohonen Self-Organizing Feature Map (SOFM). The method works by providing the design parameters for the model based on data related directly or indirectly to pollution, namely demographic and social data, pollution levels of air, water and soil, as well as the geographical situation of each province. The parameters used consist of 19 features/characteristics, including the human development index, the number of vehicles, the availability of plant water absorption and flood prevention, as well as the geographic and demographic situation. The data used were secondary data from the Central Statistics Agency (BPS), Indonesia. The data are mapped by the SOFM from a high-dimensional vector space into a two-dimensional vector space according to closeness of location in terms of Euclidean distance. The resulting outputs are represented as clustered groupings. The thirty-three provinces are grouped into five clusters, where each cluster has different features/characteristics and level of pollution. The result can be used to help efforts at prevention and resolution of pollution problems in each cluster in an effective and efficient way.
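
    The SOFM described above maps high-dimensional feature vectors onto a two-dimensional grid by repeatedly finding each sample's best-matching unit and pulling that unit and its grid neighbours toward the sample. A minimal sketch of the training loop (illustrative only; the grid size, learning-rate and neighbourhood schedules are my assumptions, not the study's settings):

```python
import numpy as np

def train_som(data, grid=(5, 5), epochs=200, lr0=0.5, sigma0=2.0, seed=0):
    """Train a minimal Kohonen SOFM mapping feature vectors onto a 2-D grid."""
    rng = np.random.default_rng(seed)
    h, w = grid
    coords = np.array([(i, j) for i in range(h) for j in range(w)], float)
    weights = rng.random((h * w, data.shape[1]))
    for t in range(epochs):
        lr = lr0 * (1 - t / epochs)              # decaying learning rate
        sigma = sigma0 * (1 - t / epochs) + 0.5  # shrinking neighbourhood
        for x in rng.permutation(data):
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))  # best-matching unit
            # Gaussian neighbourhood on the grid pulls nearby units toward x
            dist2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
            influence = np.exp(-dist2 / (2 * sigma ** 2))
            weights += lr * influence[:, None] * (x - weights)
    return weights, coords

def map_units(data, weights):
    """Assign each input vector to its best-matching grid unit."""
    return np.array([np.argmin(((weights - x) ** 2).sum(axis=1)) for x in data])
```

    Provinces assigned to the same or adjacent grid units then form the clusters reported in the study.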

  11. A nonparametric method to generate synthetic populations to adjust for complex sampling design features.

    PubMed

    Dong, Qi; Elliott, Michael R; Raghunathan, Trivellore E

    2014-06-01

    Outside of the survey sampling literature, samples are often assumed to be generated by a simple random sampling process that produces independent and identically distributed (IID) samples. Many statistical methods are developed largely in this IID world. Application of these methods to data from complex sample surveys without making allowance for the survey design features can lead to erroneous inferences. Hence, much time and effort have been devoted to developing statistical methods to analyze complex survey data and account for the sample design. This issue is particularly important when generating synthetic populations using finite population Bayesian inference, as is often done in missing data or disclosure risk settings, or when combining data from multiple surveys. By extending previous work in the finite population Bayesian bootstrap literature, we propose a method to generate synthetic populations from a posterior predictive distribution in a fashion that inverts the complex sampling design features and generates simple random samples from a superpopulation point of view, adjusting the complex data so that they can be analyzed as simple random samples. We consider a simulation study with a stratified, clustered unequal-probability of selection sample design, and use the proposed nonparametric method to generate synthetic populations for the 2006 National Health Interview Survey (NHIS), and the Medical Expenditure Panel Survey (MEPS), which are stratified, clustered unequal-probability of selection sample designs.

  12. A nonparametric method to generate synthetic populations to adjust for complex sampling design features

    PubMed Central

    Dong, Qi; Elliott, Michael R.; Raghunathan, Trivellore E.

    2017-01-01

    Outside of the survey sampling literature, samples are often assumed to be generated by a simple random sampling process that produces independent and identically distributed (IID) samples. Many statistical methods are developed largely in this IID world. Application of these methods to data from complex sample surveys without making allowance for the survey design features can lead to erroneous inferences. Hence, much time and effort have been devoted to developing statistical methods to analyze complex survey data and account for the sample design. This issue is particularly important when generating synthetic populations using finite population Bayesian inference, as is often done in missing data or disclosure risk settings, or when combining data from multiple surveys. By extending previous work in the finite population Bayesian bootstrap literature, we propose a method to generate synthetic populations from a posterior predictive distribution in a fashion that inverts the complex sampling design features and generates simple random samples from a superpopulation point of view, adjusting the complex data so that they can be analyzed as simple random samples. We consider a simulation study with a stratified, clustered unequal-probability of selection sample design, and use the proposed nonparametric method to generate synthetic populations for the 2006 National Health Interview Survey (NHIS), and the Medical Expenditure Panel Survey (MEPS), which are stratified, clustered unequal-probability of selection sample designs. PMID:29200608

  13. Identification of atypical flight patterns

    NASA Technical Reports Server (NTRS)

    Statler, Irving C. (Inventor); Ferryman, Thomas A. (Inventor); Amidan, Brett G. (Inventor); Whitney, Paul D. (Inventor); White, Amanda M. (Inventor); Willse, Alan R. (Inventor); Cooley, Scott K. (Inventor); Jay, Joseph Griffith (Inventor); Lawrence, Robert E. (Inventor); Mosbrucker, Chris (Inventor)

    2005-01-01

    Method and system for analyzing aircraft data, including multiple selected flight parameters for a selected phase of a selected flight, and for determining when the selected phase of the selected flight is atypical, when compared with corresponding data for the same phase for other similar flights. A flight signature is computed using continuous-valued and discrete-valued flight parameters for the selected flight parameters and is optionally compared with a statistical distribution of other observed flight signatures, yielding atypicality scores for the same phase for other similar flights. A cluster analysis is optionally applied to the flight signatures to define an optimal collection of clusters. A level of atypicality for a selected flight is estimated, based upon an index associated with the cluster analysis.

  14. A metric to search for relevant words

    NASA Astrophysics Data System (ADS)

    Zhou, Hongding; Slater, Gary W.

    2003-11-01

    We propose a new metric to evaluate and rank the relevance of words in a text. The method uses the density fluctuations of a word to compute an index that measures its degree of clustering. Highly significant words tend to form clusters, while common words are essentially uniformly spread in a text. If a word is not rare, the metric is stable when we move any individual occurrence of this word in the text. Furthermore, we prove that the metric always increases when words are moved to form larger clusters, or when several independent documents are merged. Using the Holy Bible as an example, we show that our approach reduces the significance of common words when compared to a recently proposed statistical metric.
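
    The intuition behind this record, that significant words form clusters while common words spread uniformly, can be illustrated with a simplified index based on the fluctuations of inter-occurrence gaps (this is my stand-in for exposition, not the authors' exact metric):

```python
import numpy as np

def clustering_index(text, word):
    """Coefficient of variation of the gaps between occurrences of `word`.

    For a word spread roughly uniformly at random the gaps behave like
    exponential waiting times (CV near 1, or near 0 if perfectly regular);
    strongly clustered words mix many tiny gaps with a few huge ones,
    pushing the CV well above 1.
    """
    tokens = text.lower().split()
    pos = np.array([i for i, t in enumerate(tokens) if t == word])
    if len(pos) < 3:
        return 0.0  # too few occurrences to measure fluctuations
    gaps = np.diff(pos)
    return gaps.std() / gaps.mean()
```

    A burst of occurrences followed by a long silence yields a much larger index than the same number of occurrences spaced evenly, which is the behaviour the paper's metric is designed to reward.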

  15. Detecting space-time cancer clusters using residential histories

    NASA Astrophysics Data System (ADS)

    Jacquez, Geoffrey M.; Meliker, Jaymie R.

    2007-04-01

    Methods for analyzing geographic clusters of disease typically ignore the space-time variability inherent in epidemiologic datasets, do not adequately account for known risk factors (e.g., smoking and education) or covariates (e.g., age, gender, and race), and do not permit investigation of the latency window between exposure and disease. Our research group recently developed Q-statistics for evaluating space-time clustering in cancer case-control studies with residential histories. This technique relies on time-dependent nearest neighbor relationships to examine clustering at any moment in the life-course of the residential histories of cases relative to that of controls. In addition, in place of the widely used null hypothesis of spatial randomness, each individual's probability of being a case is instead based on his/her risk factors and covariates. Case-control clusters will be presented using residential histories of 220 bladder cancer cases and 440 controls in Michigan. In preliminary analyses of this dataset, smoking, age, gender, race and education were sufficient to explain the majority of the clustering of residential histories of the cases. Clusters of unexplained risk, however, were identified surrounding the business address histories of 10 industries that emit known or suspected bladder cancer carcinogens. The clustering of 5 of these industries began in the 1970's and persisted through the 1990's. This systematic approach for evaluating space-time clustering has the potential to generate novel hypotheses about environmental risk factors. These methods may be extended to detect differences in space-time patterns of any two groups of people, making them valuable for security intelligence and surveillance operations.

  16. Evaluation of sliding baseline methods for spatial estimation for cluster detection in the biosurveillance system

    PubMed Central

    Xing, Jian; Burkom, Howard; Moniz, Linda; Edgerton, James; Leuze, Michael; Tokars, Jerome

    2009-01-01

    Background The Centers for Disease Control and Prevention's (CDC's) BioSense system provides near-real time situational awareness for public health monitoring through analysis of electronic health data. Determination of anomalous spatial and temporal disease clusters is a crucial part of the daily disease monitoring task. Our study focused on finding useful anomalies at manageable alert rates according to available BioSense data history. Methods The study dataset included more than 3 years of daily counts of military outpatient clinic visits for respiratory and rash syndrome groupings. We applied four spatial estimation methods in implementations of space-time scan statistics cross-checked in Matlab and C. We compared the utility of these methods according to the resultant background cluster rate (a false alarm surrogate) and sensitivity to injected cluster signals. The comparison runs used a spatial resolution based on the facility zip code in the patient record and a finer resolution based on the residence zip code. Results Simple estimation methods that account for day-of-week (DOW) data patterns yielded a clear advantage both in background cluster rate and in signal sensitivity. A 28-day baseline gave the most robust results for this estimation; the preferred baseline is long enough to remove daily fluctuations but short enough to reflect recent disease trends and data representation. Background cluster rates were lower for the rash syndrome counts than for the respiratory counts, likely because of seasonality and the large scale of the respiratory counts. Conclusion The spatial estimation method should be chosen according to characteristics of the selected data streams. In this dataset with strong day-of-week effects, the overall best detection performance was achieved using subregion averages over a 28-day baseline stratified by weekday or weekend/holiday behavior. Changing the estimation method for particular scenarios involving different spatial resolution or other syndromes can yield further improvement. PMID:19615075
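
    The best-performing configuration above, subregion averages over a sliding 28-day baseline stratified into weekday versus weekend/holiday strata, can be sketched as follows (an illustration of the estimation idea only; function and variable names are mine):

```python
import numpy as np

def expected_counts(counts, is_weekend, baseline=28):
    """Expected count for each day from a sliding baseline of the prior
    `baseline` days, using only days in the same stratum (weekday vs
    weekend/holiday) as the day being estimated.

    counts:     1-D array of daily visit counts
    is_weekend: boolean array, True for weekend/holiday days
    """
    counts = np.asarray(counts, float)
    expected = np.full(len(counts), np.nan)  # no estimate until a full baseline
    for t in range(baseline, len(counts)):
        window = np.arange(t - baseline, t)
        same_stratum = window[is_weekend[window] == is_weekend[t]]
        expected[t] = counts[same_stratum].mean()
    return expected
```

    The observed count can then be compared with this expectation (e.g. inside a space-time scan statistic) so that routine weekday/weekend swings do not trigger clusters.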

  17. Privacy Protection Versus Cluster Detection in Spatial Epidemiology

    PubMed Central

    Olson, Karen L.; Grannis, Shaun J.; Mandl, Kenneth D.

    2006-01-01

    Objectives. Patient data that includes precise locations can reveal patients’ identities, whereas data aggregated into administrative regions may preserve privacy and confidentiality. We investigated the effect of varying degrees of address precision (exact latitude and longitude vs the center points of zip code or census tracts) on detection of spatial clusters of cases. Methods. We simulated disease outbreaks by adding supplementary spatially clustered emergency department visits to authentic hospital emergency department syndromic surveillance data. We identified clusters with a spatial scan statistic and evaluated detection rate and accuracy. Results. More clusters were identified, and clusters were more accurately detected, when exact locations were used. That is, these clusters contained at least half of the simulated points and involved few additional emergency department visits. These results were especially apparent when the synthetic clustered points crossed administrative boundaries and fell into multiple zip code or census tracts. Conclusions. The spatial cluster detection algorithm performed better when addresses were analyzed as exact locations than when they were analyzed as center points of zip code or census tracts, particularly when the clustered points crossed administrative boundaries. Use of precise addresses offers improved performance, but this practice must be weighed against privacy concerns in the establishment of public health data exchange policies. PMID:17018828

  18. [Temporal-spatial analysis of bacillary dysentery in the Three Gorges Area of China, 2005-2016].

    PubMed

    Zhang, P; Zhang, J; Chang, Z R; Li, Z J

    2018-01-10

    Objective: To analyze the spatial and temporal distributions of bacillary dysentery in Chongqing, Yichang and Enshi (the Three Gorges Area) from 2005 to 2016, and provide evidence for the disease prevention and control. Methods: The incidence data of bacillary dysentery in the Three Gorges Area during this period were collected from National Notifiable Infectious Disease Reporting System. The spatial-temporal scan statistic was conducted with software SaTScan 9.4 and bacillary dysentery clusters were visualized with software ArcGIS 10.3. Results: A total of 126 196 cases were reported in the Three Gorges Area during 2005-2016, with an average incidence rate of 29.67/100 000. The overall incidence was in a downward trend, with an average annual decline rate of 4.74%. Cases occurred all the year round but with an obvious seasonal increase between May and October. Among the reported cases, 44.71% (56 421/126 196) were children under 5-year-old, the cases in children outside child care settings accounted for 41.93% (52 918/126 196) of the total. The incidence rates in districts of Yuzhong, Dadukou, Jiangbei, Shapingba, Jiulongpo, Nanan, Yubei, Chengkou of Chongqing and districts of Xiling and Wujiagang of Yichang city of Hubei province were high, ranging from 60.20/100 000 to 114.81/100 000. Spatial-temporal scan statistic for the spatial and temporal distributions of bacillary dysentery during this period revealed that the temporal distribution was during May-October, and there were 12 class Ⅰ clusters, 35 class Ⅱ clusters, and 9 clusters without statistical significance in counties with high incidence. All the class Ⅰ clusters were in urban area of Chongqing (Yuzhong, Dadukou, Jiangbei, Shapingba, Jiulongpo, Nanan, Beibei, Yubei, Banan) and surrounding counties, and the class Ⅱ clusters transformed from concentrated distribution to scattered distribution. 
    Conclusions: Temporal and spatial clusters of bacillary dysentery incidence existed in the Three Gorges Area during 2005-2016. It is necessary to strengthen bacillary dysentery prevention and control in the urban areas of Chongqing and Yichang.

  19. Connecting optical and X-ray tracers of galaxy cluster relaxation

    NASA Astrophysics Data System (ADS)

    Roberts, Ian D.; Parker, Laura C.; Hlavacek-Larrondo, Julie

    2018-04-01

    Substantial effort has been devoted to determining the ideal proxy for quantifying the morphology of the hot intracluster medium in clusters of galaxies. These proxies, based on X-ray emission, typically require expensive, high-quality X-ray observations, making them difficult to apply to large surveys of groups and clusters. Here, we compare optical relaxation proxies with X-ray asymmetries and centroid shifts for a sample of Sloan Digital Sky Survey clusters with high-quality, archival X-ray data from Chandra and XMM-Newton. The three optical relaxation measures considered are the shape of the member-galaxy projected velocity distribution - measured by the Anderson-Darling (AD) statistic, the stellar mass gap between the most-massive and second-most-massive cluster galaxy, and the offset between the most-massive galaxy (MMG) position and the luminosity-weighted cluster centre. The AD statistic and stellar mass gap correlate significantly with X-ray relaxation proxies, with the AD statistic being the stronger correlator. Conversely, we find no evidence for a correlation between X-ray asymmetry or centroid shift and the MMG offset. High-mass clusters (Mhalo > 1014.5 M⊙) in this sample have X-ray asymmetries, centroid shifts, and Anderson-Darling statistics which are systematically larger than for low-mass systems. Finally, considering the dichotomy of Gaussian and non-Gaussian clusters (measured by the AD test), we show that the probability of being a non-Gaussian cluster correlates significantly with X-ray asymmetry but only shows a marginal correlation with centroid shift. These results confirm the shape of the radial velocity distribution as a useful proxy for cluster relaxation, which can then be applied to large redshift surveys lacking extensive X-ray coverage.
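
    The Anderson-Darling test of the member-galaxy velocity distribution is readily available in SciPy. A toy sketch contrasting a relaxed (single-Gaussian) and a merging (bimodal) cluster, with made-up velocity dispersions chosen only for illustration:

```python
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(0)

# Line-of-sight velocities (km/s) for two toy clusters: a relaxed one
# (single Gaussian) and a merging one (two superposed subclusters).
relaxed = rng.normal(0, 800, 150)
merging = np.concatenate([rng.normal(-700, 300, 75),
                          rng.normal(700, 300, 75)])

# The AD statistic grows as the velocity distribution departs from Gaussian,
# so merging (non-Gaussian) systems score higher than relaxed ones.
a_relaxed = anderson(relaxed, dist="norm").statistic
a_merging = anderson(merging, dist="norm").statistic
```

    Comparing the statistic with the critical values returned by `anderson` gives the Gaussian/non-Gaussian dichotomy used in the paper.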

  20. Efficient ensemble forecasting of marine ecology with clustered 1D models and statistical lateral exchange: application to the Red Sea

    NASA Astrophysics Data System (ADS)

    Dreano, Denis; Tsiaras, Kostas; Triantafyllou, George; Hoteit, Ibrahim

    2017-07-01

    Forecasting the state of large marine ecosystems is important for many economic and public health applications. However, advanced three-dimensional (3D) ecosystem models, such as the European Regional Seas Ecosystem Model (ERSEM), are computationally expensive, especially when implemented within an ensemble data assimilation system requiring several parallel integrations. As an alternative to 3D ecological forecasting systems, we propose to implement a set of regional one-dimensional (1D) water-column ecological models that run at a fraction of the computational cost. The 1D model domains are determined using a Gaussian mixture model (GMM)-based clustering method and satellite chlorophyll-a (Chl-a) data. Regionally averaged Chl-a data is assimilated into the 1D models using the singular evolutive interpolated Kalman (SEIK) filter. To laterally exchange information between subregions and improve the forecasting skills, we introduce a new correction step to the assimilation scheme, in which we assimilate a statistical forecast of future Chl-a observations based on information from neighbouring regions. We apply this approach to the Red Sea and show that the assimilative 1D ecological models can forecast surface Chl-a concentration with high accuracy. The statistical assimilation step further improves the forecasting skill by as much as 50%. This general approach of clustering large marine areas and running several interacting 1D ecological models is very flexible. It allows many combinations of clustering, filtering and regression techniques to be used and can be applied to build efficient forecasting systems in other large marine ecosystems.

  1. Constraining the mass–richness relationship of redMaPPer clusters with angular clustering

    DOE PAGES

    Baxter, Eric J.; Rozo, Eduardo; Jain, Bhuvnesh; ...

    2016-08-04

    The potential of using cluster clustering for calibrating the mass–richness relation of galaxy clusters has been recognized theoretically for over a decade. In this paper, we demonstrate the feasibility of this technique to achieve high-precision mass calibration using redMaPPer clusters in the Sloan Digital Sky Survey North Galactic Cap. By including cross-correlations between several richness bins in our analysis, we significantly improve the statistical precision of our mass constraints. The amplitude of the mass–richness relation is constrained to 7 per cent statistical precision by our analysis. However, the error budget is systematics dominated, reaching a 19 per cent total error that is dominated by theoretical uncertainty in the bias–mass relation for dark matter haloes. We confirm the result from Miyatake et al. that the clustering amplitude of redMaPPer clusters depends on galaxy concentration as defined therein, and we provide additional evidence that this dependence cannot be sourced by mass dependences: some other effect must account for the observed variation in clustering amplitude with galaxy concentration. Assuming that the observed dependence of redMaPPer clustering on galaxy concentration is a form of assembly bias, we find that such effects introduce a systematic error on the amplitude of the mass–richness relation that is comparable to the error bar from statistical noise. Finally, the results presented here demonstrate the power of cluster clustering for mass calibration and cosmology provided the current theoretical systematics can be ameliorated.

  2. ICAP - An Interactive Cluster Analysis Procedure for analyzing remotely sensed data

    NASA Technical Reports Server (NTRS)

    Wharton, S. W.; Turner, B. J.

    1981-01-01

    An Interactive Cluster Analysis Procedure (ICAP) was developed to derive classifier training statistics from remotely sensed data. ICAP differs from conventional clustering algorithms by allowing the analyst to optimize the cluster configuration by inspection, rather than by manipulating process parameters. Control of the clustering process alternates between the algorithm, which creates new centroids and forms clusters, and the analyst, who can evaluate and elect to modify the cluster structure. Clusters can be deleted, or lumped together pairwise, or new centroids can be added. A summary of the cluster statistics can be requested to facilitate cluster manipulation. The principal advantage of this approach is that it allows prior information (when available) to be used directly in the analysis, since the analyst interacts with ICAP in a straightforward manner, using basic terms with which he is more likely to be familiar. Results from testing ICAP showed that an informed use of ICAP can improve classification, as compared to an existing cluster analysis procedure.

  3. An integrated workflow for robust alignment and simplified quantitative analysis of NMR spectrometry data.

    PubMed

    Vu, Trung N; Valkenborg, Dirk; Smets, Koen; Verwaest, Kim A; Dommisse, Roger; Lemière, Filip; Verschoren, Alain; Goethals, Bart; Laukens, Kris

    2011-10-20

    Nuclear magnetic resonance spectroscopy (NMR) is a powerful technique to reveal and compare quantitative metabolic profiles of biological tissues. However, chemical and physical sample variations make the analysis of the data challenging, and typically require the application of a number of preprocessing steps prior to data interpretation. For example, noise reduction, normalization, baseline correction, peak picking, spectrum alignment and statistical analysis are indispensable components in any NMR analysis pipeline. We introduce a novel suite of informatics tools for the quantitative analysis of NMR metabolomic profile data. The core of the processing cascade is a novel peak alignment algorithm, called hierarchical Cluster-based Peak Alignment (CluPA). The algorithm aligns a target spectrum to the reference spectrum in a top-down fashion by building a hierarchical cluster tree from peak lists of reference and target spectra and then dividing the spectra into smaller segments based on the most distant clusters of the tree. To reduce the computational time to estimate the spectral misalignment, the method makes use of Fast Fourier Transformation (FFT) cross-correlation. Since the method returns a high-quality alignment, we can propose a simple methodology to study the variability of the NMR spectra. For each aligned NMR data point the ratio of the between-group and within-group sum of squares (BW-ratio) is calculated to quantify the difference in variability between and within predefined groups of NMR spectra. This differential analysis is related to the calculation of the F-statistic or a one-way ANOVA, but without distributional assumptions. Statistical inference based on the BW-ratio is achieved by bootstrapping the null distribution from the experimental data. The workflow performance was evaluated using a previously published dataset. Correlation maps, spectral and grey scale plots show clear improvements in comparison to other methods, and the down-to-earth quantitative analysis works well for the CluPA-aligned spectra. The whole workflow is embedded into a modular and statistically sound framework that is implemented as an R package called "speaq" ("spectrum alignment and quantitation"), which is freely available from http://code.google.com/p/speaq/.
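
    Estimating a spectral misalignment via FFT cross-correlation, as CluPA does within each segment, rests on the convolution theorem. A hedged sketch (simplified: real NMR alignment also involves peak picking and a cap on admissible shifts; the function name is mine):

```python
import numpy as np

def estimate_shift(reference, target):
    """Return the lag to apply (via np.roll) to `target` so it best aligns
    with `reference`, estimated by FFT-based cross-correlation."""
    n = len(reference)
    # Zero-pad to 2n so the circular correlation equals the linear one.
    f_ref = np.fft.rfft(reference, 2 * n)
    f_tgt = np.fft.rfft(target, 2 * n)
    xcorr = np.fft.irfft(f_ref * np.conj(f_tgt), 2 * n)
    # Map FFT bin indices to signed lags in [-n, n).
    lags = np.concatenate([np.arange(n), np.arange(-n, 0)])
    return int(lags[np.argmax(xcorr)])
```

    For long spectra this costs O(n log n) instead of the O(n^2) of a direct correlation, which is why CluPA uses it to keep alignment fast.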

  4. Effect of a new motorway on social-spatial patterning of road traffic accidents: A retrospective longitudinal natural experimental study

    PubMed Central

    Mitchell, Richard; Ogilvie, David

    2017-01-01

    Background The World Health Organisation reports that road traffic accidents could become the seventh leading cause of death globally by 2030. Accidents often occur in spatial clusters and, generally, there are more accidents in less advantaged areas. Infrastructure changes, such as new roads, can affect the locations and magnitude of accident clusters but evidence of impact is lacking. A new 5-mile motorway extension was opened in 2011 in Glasgow, Scotland. Previous research found no impact on the number of accidents but did not consider their spatial location or socio-economic setting. We evaluated impacts on these, both locally and city-wide. Methods We used STATS19 data covering the period 2008 to 2014 and describing the location and details of all reported accidents involving a personal injury. Poisson-based continuous scan statistics were used to detect spatial clusters of accidents and any change in these over time. Change in the socio-economic distribution of accident cluster locations during the study period was also assessed. Results In each year accidents were strongly clustered, with statistically significant clusters more likely to occur in socio-economically deprived areas. There was no significant shift in the magnitude or location of accident clusters during motorway construction or following opening, either locally or city-wide. There was also no impact on the socio-economic patterning of accident cluster locations. Conclusions Although urban infrastructure changes occur constantly, all around the world, this is the first study to evaluate the impact of such changes on road accident clusters. Despite expectations to the contrary from both proponents and opponents of the M74 extension, we found no beneficial or adverse change in the socio-spatial distribution of accidents associated with its construction, opening or operation. Our approach and findings can help inform urban planning internationally. PMID:28880956

  5. Spatial clustering and local risk of leprosy in São Paulo, Brazil.

    PubMed

    Ramos, Antônio Carlos Vieira; Yamamura, Mellina; Arroyo, Luiz Henrique; Popolin, Marcela Paschoal; Chiaravalloti Neto, Francisco; Palha, Pedro Fredemir; Uchoa, Severina Alice da Costa; Pieri, Flávia Meneguetti; Pinto, Ione Carvalho; Fiorati, Regina Célia; Queiroz, Ana Angélica Rêgo de; Belchior, Aylana de Souza; Dos Santos, Danielle Talita; Garcia, Maria Concebida da Cunha; Crispim, Juliane de Almeida; Alves, Luana Seles; Berra, Thaís Zamboni; Arcêncio, Ricardo Alexandre

    2017-02-01

    Although the detection rate is decreasing, the proportion of new cases with WHO grade 2 disability (G2D) is increasing, creating concern among policy makers and the Brazilian government. This study aimed to identify spatial clustering of leprosy and classify high-risk areas in a major leprosy cluster using the SaTScan method. Data were obtained including all leprosy cases diagnosed between January 2006 and December 2013. In addition to the clinical variables, information was also gathered regarding the G2D of the patient at diagnosis and after treatment. The spatial scan statistic, developed by Kulldorff and Nagarwalla, was used to identify spatial clustering and to measure the local risk (relative risk, RR) of leprosy. Maps considering these risks and their confidence intervals were constructed. A total of 434 cases were identified, including 188 (43.31%) borderline leprosy and 101 (23.28%) lepromatous leprosy cases. There was a predominance of males, with ages ranging from 15 to 59 years, and 51 patients (11.75%) presented G2D. Two significant spatial clusters and three significant spatial-temporal clusters were also observed. The main spatial cluster (p = 0.000) contained 90 census tracts, a population of approximately 58,438 inhabitants, a detection rate of 22.6 cases per 100,000 people and an RR of approximately 3.41 (95%CI = 2.721-4.267). Regarding the spatial-temporal clusters, two clusters were observed, with RR ranging between 24.35 (95%CI = 11.133-52.984) and 15.24 (95%CI = 10.114-22.919). These findings could contribute to improvements in policies and programming, aiming for the eradication of leprosy in Brazil. The spatial scan statistic was found to be an interesting resource for health managers and healthcare professionals to map the vulnerability of areas in terms of leprosy transmission risk and areas of underreporting.

  6. Spatiotemporal Analysis of the Ebola Hemorrhagic Fever in West Africa in 2014

    NASA Astrophysics Data System (ADS)

    Xu, M.; Cao, C. X.; Guo, H. F.

    2017-09-01

    Ebola hemorrhagic fever (EHF) is an acute hemorrhagic disease caused by the Ebola virus, which is highly contagious. This paper aimed to explore the possible clustering of EHF cases in West Africa in 2014 and to identify endemic areas and their trends by means of space-time analysis. We mapped the distribution of EHF incidence and explored statistically significant spatial, temporal and space-time disease clusters. We used hotspot analysis to find the spatial clustering pattern on the basis of the actual outbreak cases. Spatial-temporal cluster analysis was used to analyse the spatial and temporal distribution of disease agglomeration and to examine whether that distribution is statistically significant. Local clusters were investigated using Kulldorff's scan statistic approach. The results reveal that the epidemic mainly clustered in the western part of Africa near the North Atlantic, with an obvious regional distribution. For the current epidemic, we found areas of high EVD incidence by means of spatial cluster analysis.

  7. Hot spot detection and spatio-temporal dispersion of dengue fever in Hanoi, Vietnam

    PubMed Central

    Toan, Do Thi Thanh; Hu, Wenbiao; Thai, Pham Quang; Hoat, Luu Ngoc; Wright, Pamela; Martens, Pim

    2013-01-01

    Introduction: Dengue fever (DF) in Vietnam remains a serious emerging arboviral disease, which generates significant concern among international health authorities. Incidence rates of DF have increased significantly during the last few years in many provinces and cities, especially Hanoi. The purpose of this study was to detect DF hot spots and identify the dynamic dispersion of DF over the period between 2004 and 2009 in Hanoi, Vietnam. Methods: Daily data on DF cases and population data for each postcode area of Hanoi between January 1998 and December 2009 were obtained from the Hanoi Center for Preventive Health and the General Statistics Office of Vietnam. Moran's I statistic was used to assess the spatial autocorrelation of reported DF. Spatial scan statistics and logistic regression were used to identify space–time clusters and the dispersion of DF. Results: The study revealed a clear trend of geographic expansion of DF transmission in Hanoi through the study periods (OR 1.17, 95% CI 1.02–1.34). The spatial scan statistics showed that 6/14 (42.9%) districts in Hanoi had significant cluster patterns, which lasted 29 days and were limited to a radius of 1,000 m. The study also demonstrated that most DF cases occurred between June and November, during which rainfall and temperatures are highest. Conclusions: There is evidence for the existence of statistically significant clusters of DF in Hanoi, and the geographical distribution of DF has expanded over recent years. This finding provides a foundation for further investigation into the social and environmental factors responsible for changing disease patterns, and provides data to inform program planning for DF control. PMID:23364076
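
    Global Moran's I, used above to assess spatial autocorrelation, can be computed directly from a vector of area-level values and a spatial weight matrix. A minimal pure-Python sketch (the function name is illustrative; libraries such as PySAL provide tested implementations with significance testing):

```python
def morans_i(values, weights):
    """Global Moran's I for spatial autocorrelation.

    values: list of observations, one per areal unit
    weights: n x n spatial weight matrix (weights[i][j] > 0 if units i, j
             are neighbours; the diagonal should be zero)
    Positive values indicate that similar observations cluster in space.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]                      # deviations from the mean
    w_sum = sum(sum(row) for row in weights)              # total weight
    num = sum(weights[i][j] * dev[i] * dev[j]
              for i in range(n) for j in range(n))        # cross-products of neighbours
    den = sum(d * d for d in dev)                         # variance term
    return (n / w_sum) * (num / den)
```

    On a four-unit chain, like values next to each other give a positive I, while perfectly alternating values give I = -1.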

  8. Multivariate analysis: A statistical approach for computations

    NASA Astrophysics Data System (ADS)

    Michu, Sachin; Kaushik, Vandana

    2014-10-01

    Multivariate analysis is a statistical approach commonly used in automotive diagnosis, educational evaluation, cluster evaluation in finance, and, more recently, the health-related professions. The objective of the paper is to provide a detailed exploratory discussion of factor analysis (FA) in image retrieval and correlation analysis (CA) of network traffic. Image retrieval methods aim to retrieve relevant images from a collected database based on their content. The problem is made more difficult by the high dimension of the variable space in which the images are represented. Multivariate correlation analysis proposes an anomaly detection and analysis method based on the correlation coefficient matrix. Anomalous behaviours in the network include various attacks such as DDoS attacks and network scanning.

  9. Daily Reportable Disease Spatiotemporal Cluster Detection, New York City, New York, USA, 2014-2015.

    PubMed

    Greene, Sharon K; Peterson, Eric R; Kapell, Deborah; Fine, Annie D; Kulldorff, Martin

    2016-10-01

    Each day, the New York City Department of Health and Mental Hygiene uses the free SaTScan software to apply prospective space-time permutation scan statistics to strengthen early outbreak detection for 35 reportable diseases. This method prompted early detection of outbreaks of community-acquired legionellosis and shigellosis.

  10. Automatic Coding of Short Text Responses via Clustering in Educational Assessment

    ERIC Educational Resources Information Center

    Zehner, Fabian; Sälzer, Christine; Goldhammer, Frank

    2016-01-01

    Automatic coding of short text responses opens new doors in assessment. We implemented and integrated baseline methods of natural language processing and statistical modelling by means of software components that are available under open licenses. The accuracy of automatic text coding is demonstrated by using data collected in the "Programme…

  11. The MUSE-Wide survey: detection of a clustering signal from Lyman α emitters in the range 3 < z < 6

    NASA Astrophysics Data System (ADS)

    Diener, C.; Wisotzki, L.; Schmidt, K. B.; Herenz, E. C.; Urrutia, T.; Garel, T.; Kerutt, J.; Saust, R. L.; Bacon, R.; Cantalupo, S.; Contini, T.; Guiderdoni, B.; Marino, R. A.; Richard, J.; Schaye, J.; Soucail, G.; Weilbacher, P. M.

    2017-11-01

    We present a clustering analysis of a sample of 238 Ly α emitters at redshift 3 ≲ z ≲ 6 from the MUSE-Wide survey. This survey mosaics extragalactic legacy fields with 1 h MUSE pointings to detect statistically relevant samples of emission line galaxies. We analysed the first-year observations from MUSE-Wide, making use of the clustering signal in the line-of-sight direction. This method relies on comparing pair counts at close redshifts for a fixed transverse distance and thus exploits the full potential of the redshift range covered by our sample. A clear clustering signal with a correlation length of r0 = 2.9 (+1.0/-1.1) Mpc (comoving) is detected. Whilst this result is based on only about a quarter of the full survey size, it already shows the immense potential of MUSE for efficiently observing and studying the clustering of Ly α emitters.

  12. Mobility of large clusters on a semiconductor surface: Kinetic Monte Carlo simulation results

    NASA Astrophysics Data System (ADS)

    Esen, M.; Tüzemen, A. T.; Ozdemir, M.

    2016-01-01

    The mobility of clusters on a semiconductor surface is studied as a function of cluster size and temperature by the kinetic Monte Carlo method. The cluster resides on the surface of a square grid. Kinetic processes such as the diffusion of single particles on the surface, their attachment to and detachment from clusters, and the diffusion of particles along cluster edges are considered. The clusters considered in this study consist of 150-6000 atoms per cluster on average. A statistical probability of motion in each direction is assigned to each particle, where a particle with four nearest neighbors is assumed to be immobile. The mobility of a cluster is found from the root mean square displacement of the center of mass of the cluster as a function of time. It is found that the diffusion coefficient of clusters scales as D = A(T)N^α, where N is the average number of particles in the cluster, A(T) is a temperature-dependent constant and α is a parameter in the range -0.75 < α < -0.64. The value of α is found to be independent of the cluster sizes and temperature values (170-220 K) considered in this study. As diffusion along the perimeter of the cluster becomes prohibitive, the exponent approaches a value of -0.5. The diffusion coefficient is found to change by one order of magnitude as a function of cluster size.
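
    The reported scaling D = A(T)N^α is conventionally estimated by linear regression in log-log space, since log D = log A + α log N. A small sketch with my own function names, fitted here on noiseless synthetic data:

```python
import math

def fit_power_law(sizes, diffusion):
    """Estimate (alpha, A) in D = A * N**alpha by least squares in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(d) for d in diffusion]
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # slope of the log-log regression line is the exponent alpha
    alpha = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    log_a = my - alpha * mx            # intercept gives log A
    return alpha, math.exp(log_a)
```

    With exact data D = 2 N^(-0.7) over the paper's size range (150-6000), the fit recovers α = -0.7 and A = 2 to machine precision; with simulated MSD data the same fit gives the noisy estimate.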

  13. Molecular heterogeneity at the network level: high-dimensional testing, clustering and a TCGA case study.

    PubMed

    Städler, Nicolas; Dondelinger, Frank; Hill, Steven M; Akbani, Rehan; Lu, Yiling; Mills, Gordon B; Mukherjee, Sach

    2017-09-15

    Molecular pathways and networks play a key role in basic and disease biology. An emerging notion is that networks encoding patterns of molecular interplay may themselves differ between contexts, such as cell type, tissue or disease (sub)type. However, while statistical testing of differences in mean expression levels has been extensively studied, testing of network differences remains challenging. Furthermore, since network differences could provide important and biologically interpretable information to identify molecular subgroups, there is a need to consider the unsupervised task of learning subgroups and networks that define them. This is a nontrivial clustering problem, with neither subgroups nor subgroup-specific networks known at the outset. We leverage recent ideas from high-dimensional statistics for testing and clustering in the network biology setting. The methods we describe can be applied directly to most continuous molecular measurements, and networks do not need to be specified beforehand. We illustrate the ideas and methods in a case study using protein data from The Cancer Genome Atlas (TCGA). This provides evidence that patterns of interplay between signalling proteins differ significantly between cancer types. Furthermore, we show how the proposed approaches can be used to learn subtypes and the molecular networks that define them. The methods are available as the Bioconductor package nethet. Contact: staedler.n@gmail.com or sach.mukherjee@dzne.de. Supplementary data are available at Bioinformatics online. © The Author(s) 2017. Published by Oxford University Press.

  14. Chemical indices and methods of multivariate statistics as a tool for odor classification.

    PubMed

    Mahlke, Ingo T; Thiesen, Peter H; Niemeyer, Bernd

    2007-04-01

    Industrial and agricultural off-gas streams comprise numerous volatile compounds, many of which have substantially different odorous properties. State-of-the-art waste-gas treatment includes the characterization of these molecules and is directed, where possible, at either the avoidance of such odorants during processing or the use of existing standardized air purification techniques such as bioscrubbing or afterburning, which, however, often show low efficiency from an ecological and economic standpoint. Selective odor separation from the off-gas streams could ease many of these disadvantages but is not yet widely applicable. Thus, the aim of this paper is to identify possible model substances for selective odor separation research from 155 volatile molecules, mainly originating from livestock facilities, fat refineries, and cocoa and coffee production, by knowledge-based methods. All compounds are examined with regard to their structure and information content using topological and information-theoretical indices. The resulting data are arranged in an observation matrix, and similarities between the substances are computed. Principal component analysis and k-means cluster analysis are conducted, showing that clustering of the index data can capture odor information that correlates well with molecular composition and molecular shape. Quantitative molecular description, along with the application of such statistical methods, therefore provides a good tool for classifying malodorant structural properties, with no thermodynamic data needed. The approximately similar shapes of odorous compounds within the clusters suggest a fair choice of possible model molecules.
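
    The k-means step applied to such an observation matrix can be sketched in a few lines of plain Python, treating each substance as a vector of index values. This is an illustrative sketch, not the authors' implementation:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means on lists of feature vectors (e.g. topological index values).

    Returns the final cluster centers and the grouped points.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)                 # initialize from the data
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest center (squared Euclidean)
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            groups[j].append(p)
        for i, g in enumerate(groups):
            if g:                                   # empty clusters keep their center
                centers[i] = tuple(sum(dim) / len(g) for dim in zip(*g))
    return centers, groups
```

    On two well-separated blobs of index vectors the algorithm recovers the two groups regardless of the random initialization.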

  15. A curvature-based weighted fuzzy c-means algorithm for point clouds de-noising

    NASA Astrophysics Data System (ADS)

    Cui, Xin; Li, Shipeng; Yan, Xiutian; He, Xinhua

    2018-04-01

    In order to remove noise from three-dimensional scattered point clouds and smooth the data without damaging sharp geometric features, a novel algorithm is proposed in this paper. A feature-preserving weight is added to the fuzzy c-means algorithm, yielding a curvature-weighted fuzzy c-means clustering algorithm. Firstly, large-scale outliers are removed using statistics of the points within a radius-r neighbourhood. Then, the algorithm estimates the curvature of the point cloud data by conicoid (paraboloid) fitting and computes a curvature feature value. Finally, the proposed clustering algorithm is applied to calculate the weighted cluster centers, which are taken as the new points. The experimental results show that this approach handles noise of different scales and intensities in point clouds efficiently and with high precision, while preserving features. It is also robust to different noise models.
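
    A curvature-weighted variant of fuzzy c-means can be sketched by keeping the standard membership update and inserting per-point weights into the center update, so that high-curvature (feature) points pull the centers harder. The 2-D sketch below is written under that assumption; the function names and details are illustrative, not the authors' code:

```python
def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def update_memberships(points, centers, m=2.0):
    """Standard FCM membership update: u[i][k] in [0, 1], one row per cluster."""
    u = []
    for c in centers:
        row = []
        for x in points:
            d_i = dist(x, c)
            if d_i == 0:
                row.append(1.0)          # point coincides with this center
                continue
            s = sum((d_i / max(dist(x, c2), 1e-12)) ** (2.0 / (m - 1.0))
                    for c2 in centers)
            row.append(1.0 / s)
        u.append(row)
    return u

def update_centers(points, u, weights, m=2.0):
    """Weighted center update: the per-point weight (e.g. a curvature feature
    value) multiplies the fuzzified membership."""
    centers = []
    for row in u:
        num = [0.0, 0.0]
        den = 0.0
        for x, uik, w in zip(points, row, weights):
            coef = w * (uik ** m)
            num[0] += coef * x[0]
            num[1] += coef * x[1]
            den += coef
        centers.append((num[0] / den, num[1] / den))
    return centers
```

    Alternating the two updates for a few iterations on two separated 2-D groups drives the centers to the group means, with the weights biasing them toward high-weight points.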

  16. Removal of impulse noise clusters from color images with local order statistics

    NASA Astrophysics Data System (ADS)

    Ruchay, Alexey; Kober, Vitaly

    2017-09-01

    This paper proposes a novel algorithm for restoring images corrupted with clusters of impulse noise. The noise clusters often occur when the probability of impulse noise is very high. The proposed noise removal algorithm consists of detection of bulky impulse noise in three color channels with local order statistics followed by removal of the detected clusters by means of vector median filtering. With the help of computer simulation we show that the proposed algorithm is able to effectively remove clustered impulse noise. The performance of the proposed algorithm is compared in terms of image restoration metrics with that of common successful algorithms.
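
    The vector median used for cluster removal, i.e. the window pixel minimizing the summed distance to all other pixels in the window, can be sketched as follows (an illustrative sketch treating pixels as RGB tuples, not the authors' implementation):

```python
def vector_median(window):
    """Return the RGB vector in the window that minimizes the summed
    Euclidean distance to all other vectors in the window.

    Unlike a per-channel median, the result is always an actual pixel
    from the window, so no new colors are invented.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(window, key=lambda p: sum(dist(p, q) for q in window))
```

    An impulse outlier inside an otherwise homogeneous window is never selected, which is why the filter suppresses detected noise clusters without smearing edges the way a linear filter would.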

  17. The VMC survey. XXVIII. Improved measurements of the proper motion of the Galactic globular cluster 47 Tucanae

    NASA Astrophysics Data System (ADS)

    Niederhofer, Florian; Cioni, Maria-Rosa L.; Rubele, Stefano; Schmidt, Thomas; Bekki, Kenji; de Grijs, Richard; Emerson, Jim; Ivanov, Valentin D.; Oliveira, Joana M.; Petr-Gotzens, Monika G.; Ripepi, Vincenzo; Sun, Ning-Chen; van Loon, Jacco Th.

    2018-05-01

    We use deep multi-epoch point-spread function (PSF) photometry taken with the Visible and Infrared Survey Telescope for Astronomy (VISTA) to measure and analyze the proper motions of stars within the Galactic globular cluster 47 Tucanae (47 Tuc, NGC 104). The observations are part of the ongoing near-infrared VISTA survey of the Magellanic Cloud system (VMC). The data analyzed in this study correspond to one VMC tile, which covers a total sky area of 1.77 deg2. Absolute proper motions with respect to 9070 background galaxies are calculated from a linear regression model applied to the positions of stars in 11 epochs in the Ks filter. The data extend over a total time baseline of about 17 months. We found an overall median proper motion of the stars within 47 Tuc of (μαcos(δ), μδ) = (+5.89 ± 0.02 (statistical) ± 0.13 (systematic), -2.14 ± 0.02 (statistical) ± 0.08 (systematic)) mas yr-1, based on the measurements of ~35 000 individual sources between 5' and 42' from the cluster center. We compared our result to the proper motions from the newest US Naval Observatory CCD Astrograph Catalog (UCAC5), which includes data from the Gaia data release 1. Selecting cluster members (~2700 stars), we found a median proper motion of (μαcos(δ), μδ) = (+5.30 ± 0.03 (statistical) ± 0.70 (systematic), -2.70 ± 0.03 (statistical) ± 0.70 (systematic)) mas yr-1. Comparing the results with measurements in the literature, we found that the values derived from the VMC data are consistent with the UCAC5 result, and are close to measurements obtained using the Hubble Space Telescope. We combined our proper motion results with radial velocity measurements from the literature and reconstructed the orbit of 47 Tuc, finding that the cluster is on an orbit with a low ellipticity and is confined within the inner 7.5 kpc of the Galaxy.
We show that the use of an increased time baseline in combination with PSF-determined stellar centroids in crowded regions significantly improves the accuracy of the method. In future works, we will apply the methods described here to more VMC tiles to study in detail the kinematics of the Magellanic Clouds. Based on observations made with VISTA at the Paranal Observatory under program ID 179.B-2003.

  18. Cluster analysis of European Y-chromosomal STR haplotypes using the discrete Laplace method.

    PubMed

    Andersen, Mikkel Meyer; Eriksen, Poul Svante; Morling, Niels

    2014-07-01

    The European Y-chromosomal short tandem repeat (STR) haplotype distribution has previously been analysed in various ways. Here, we introduce a new way of analysing population substructure, using a new method based on clustering within the discrete Laplace exponential family that models the probability distribution of the Y-STR haplotypes. Creating a consistent statistical model of the haplotypes enables us to perform a wide range of analyses. Haplotype frequency estimation using the discrete Laplace method has previously been validated. In this paper we investigate how the discrete Laplace method can be used for cluster analysis, thereby further validating it. A practically important fact is that the calculations can be performed on a normal computer. We identified two sub-clusters of the Eastern and Western European Y-STR haplotypes, similar to the results of previous studies. We also compared pairwise distances (between geographically separated samples) with those obtained using the AMOVA method and found good agreement. Further analyses that are impossible with AMOVA were made using the discrete Laplace method: analysis of homogeneity in two different ways and calculation of marginal STR distributions. We found that the Y-STR haplotypes from e.g. Finland were relatively homogeneous, as opposed to the relatively heterogeneous Y-STR haplotypes from e.g. Lublin, Eastern Poland and Berlin, Germany. We demonstrated that the observed distributions of alleles at each locus were similar to the expected ones. We also compared pairwise distances between geographically separated samples from Africa with those obtained using the AMOVA method and found good agreement. Copyright © 2014 Elsevier Ireland Ltd. All rights reserved.
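
    The discrete Laplace distribution at the core of the method assigns to each integer allele a probability proportional to p^|x - mu|, falling off geometrically on both sides of a central allele mu. A minimal sketch of the pmf (parameter names are mine):

```python
def discrete_laplace_pmf(x, p, mu=0):
    """P(X = x) for the discrete Laplace distribution with dispersion
    p in (0, 1), centred at integer mu:

        P(X = x) = ((1 - p) / (1 + p)) * p ** abs(x - mu)

    The normalizing factor (1 - p)/(1 + p) makes the probabilities over
    all integers sum to one.
    """
    return (1.0 - p) / (1.0 + p) * p ** abs(x - mu)
```

    Summing the pmf over a wide integer range recovers 1 (the tails decay geometrically), and the distribution is symmetric about mu, which is what makes it a natural single-locus model for STR repeat numbers around a central allele.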

  19. Linnorm: improved statistical analysis for single cell RNA-seq expression data.

    PubMed

    Yip, Shun H; Wang, Panwen; Kocher, Jean-Pierre A; Sham, Pak Chung; Wang, Junwen

    2017-12-15

    Linnorm is a novel normalization and transformation method for the analysis of single cell RNA sequencing (scRNA-seq) data. Linnorm is developed to remove technical noises and simultaneously preserve biological variations in scRNA-seq data, such that existing statistical methods can be improved. Using real scRNA-seq data, we compared Linnorm with existing normalization methods, including NODES, SAMstrt, SCnorm, scran, DESeq and TMM. Linnorm shows advantages in speed, technical noise removal and preservation of cell heterogeneity, which can improve existing methods in the discovery of novel subtypes, pseudo-temporal ordering of cells, clustering analysis, etc. Linnorm also performs better than existing DEG analysis methods, including BASiCS, NODES, SAMstrt, Seurat and DESeq2, in false positive rate control and accuracy. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.

  20. GazeAppraise v. 0.1

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wilson, Andrew; Haass, Michael; Rintoul, Mark Daniel

    GazeAppraise advances the state of the art of gaze pattern analysis using methods that simultaneously analyze spatial and temporal characteristics of gaze patterns. GazeAppraise enables novel research in visual perception and cognition; for example, using shape features as distinguishing elements to assess individual differences in visual search strategy. Given a set of point-to-point gaze sequences, hereafter referred to as scanpaths, the method constructs multiple descriptive features for each scanpath. Once the scanpath features have been calculated, they are used to form a multidimensional vector representing each scanpath, and cluster analysis is performed on the set of vectors from all scanpaths. An additional benefit of this method is the identification of causal or correlated characteristics of the stimuli, subjects, and visual task through statistical analysis of descriptive metadata distributions within and across clusters.

  1. Multi-Parent Clustering Algorithms from Stochastic Grammar Data Models

    NASA Technical Reports Server (NTRS)

    Mjolsness, Eric; Castano, Rebecca; Gray, Alexander

    1999-01-01

    We introduce a statistical data model and an associated optimization-based clustering algorithm which allows data vectors to belong to zero, one or several "parent" clusters. For each data vector the algorithm makes a discrete decision among these alternatives. Thus, a recursive version of this algorithm would place data clusters in a Directed Acyclic Graph rather than a tree. We test the algorithm with synthetic data generated according to the statistical data model. We also illustrate the algorithm using real data from large-scale gene expression assays.

  2. Pattern Activity Clustering and Evaluation (PACE)

    NASA Astrophysics Data System (ADS)

    Blasch, Erik; Banas, Christopher; Paul, Michael; Bussjager, Becky; Seetharaman, Guna

    2012-06-01

    With the vast amount of network information available on activities of people (i.e. motions, transportation routes, and site visits) there is a need to explore the salient properties of data that detect and discriminate the behavior of individuals. Recent machine learning approaches include methods of data mining, statistical analysis, clustering, and estimation that support activity-based intelligence. We seek to explore contemporary methods in activity analysis using machine learning techniques that discover and characterize behaviors that enable grouping, anomaly detection, and adversarial intent prediction. To evaluate these methods, we describe the mathematics and potential information theory metrics to characterize behavior. A scenario is presented to demonstrate the concept and metrics that could be useful for layered sensing behavior pattern learning and analysis. We leverage work on group tracking, learning and clustering approaches; as well as utilize information theoretical metrics for classification, behavioral and event pattern recognition, and activity and entity analysis. The performance evaluation of activity analysis supports high-level information fusion of user alerts, data queries and sensor management for data extraction, relations discovery, and situation analysis of existing data.

  3. Verification of Eulerian-Eulerian and Eulerian-Lagrangian simulations for turbulent fluid-particle flows

    DOE PAGES

    Patel, Ravi G.; Desjardins, Olivier; Kong, Bo; ...

    2017-09-01

    Here, we present a verification study of three simulation techniques for fluid–particle flows, including an Euler–Lagrange approach (EL) inspired by Jackson's seminal work on fluidized particles, a quadrature-based moment method based on the anisotropic Gaussian closure (AG), and the traditional two-fluid model (TFM). We perform simulations of two problems: particles in frozen homogeneous isotropic turbulence (HIT) and cluster-induced turbulence (CIT). For verification, we evaluate various techniques for extracting statistics from EL and study the convergence properties of the three methods under grid refinement. The convergence is found to depend on the simulation method and on the problem, with CIT simulations posing fewer difficulties than HIT. Specifically, EL converges under refinement for both HIT and CIT, but statistics exhibit dependence on the postprocessing parameters. For CIT, AG produces similar results to EL. For HIT, converging both TFM and AG poses challenges. Overall, extracting converged, parameter-independent Eulerian statistics remains a challenge for all methods.

  5. Coagulation-fragmentation for a finite number of particles and application to telomere clustering in the yeast nucleus

    NASA Astrophysics Data System (ADS)

    Hozé, Nathanaël; Holcman, David

    2012-01-01

    We develop a coagulation-fragmentation model to study a system composed of a small number of stochastic objects moving in a confined domain, that can aggregate upon binding to form local clusters of arbitrary sizes. A cluster can also dissociate into two subclusters with a uniform probability. To study the statistics of clusters, we combine a Markov chain analysis with a partition number approach. Interestingly, we obtain explicit formulas for the size and the number of clusters in terms of hypergeometric functions. Finally, we apply our analysis to study the statistical physics of telomeres (ends of chromosomes) clustering in the yeast nucleus and show that the diffusion-coagulation-fragmentation process can predict the organization of telomeres.
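
    The finite-particle coagulation-fragmentation dynamics can be explored with a simple Monte Carlo sketch: two clusters merge, or a cluster splits uniformly into two subclusters, as in the model above. The rates and function names below are illustrative, not the authors' formulation (which proceeds analytically via a Markov chain and partition numbers):

```python
import random

def coag_frag(n_particles=20, steps=20000, k_coag=1.0, k_frag=1.0, seed=1):
    """Monte Carlo coagulation-fragmentation for a finite particle pool.

    Clusters merge pairwise at relative rate k_coag; a cluster of size s > 1
    splits into two subclusters with a uniform cut at relative rate k_frag.
    Returns the time-averaged number of clusters.
    """
    rng = random.Random(seed)
    clusters = [1] * n_particles          # start fully dissociated
    total = 0
    for _ in range(steps):
        can_coag = len(clusters) > 1
        can_frag = any(s > 1 for s in clusters)
        if can_coag and (not can_frag
                         or rng.random() < k_coag / (k_coag + k_frag)):
            i, j = rng.sample(range(len(clusters)), 2)
            merged = clusters[i] + clusters[j]
            clusters = [s for k, s in enumerate(clusters) if k not in (i, j)]
            clusters.append(merged)
        elif can_frag:
            big = [k for k, s in enumerate(clusters) if s > 1]
            k = rng.choice(big)
            s = clusters.pop(k)
            cut = rng.randint(1, s - 1)   # uniform split into two subclusters
            clusters.extend([cut, s - cut])
        total += len(clusters)
    return total / steps
```

    As expected, fragmentation-dominated dynamics keep many small clusters while coagulation-dominated dynamics condense the pool toward a few large ones; the steady-state statistics are what the paper's hypergeometric formulas describe exactly.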

  6. Statistical analysis of short-term water stress conditions at Riggs Creek OzFlux tower site

    NASA Astrophysics Data System (ADS)

    Azmi, Mohammad; Rüdiger, Christoph; Walker, Jeffrey P.

    2017-10-01

    A large range of indices and proxies is available to describe the water stress conditions of an area for different applications; these have varying capabilities and limitations depending on the prevailing local climatic conditions and land cover. The present study uses a range of high spatio-temporal resolution (daily and sub-daily) data sources to evaluate a number of drought indices (DIs) for the Riggs Creek OzFlux tower site in southeastern Australia. The main aim of this study is to evaluate the statistical characteristics of individual DIs under short-term water stress conditions. In order to derive a more general and therefore representative DI, a new criterion is required to quantify the statistical similarity between each pair of indices, allowing the dominant drought types and their representative DIs to be determined. The results show that monitoring of water stress at this case study area can be achieved by evaluating the individual behaviour of three clusters: (i) vegetation conditions, (ii) water availability and (iii) water consumption. This indicates that it is not necessary to assess all individual DIs one by one to derive a comprehensive and informative data set about the water stress of an area; instead, this can be achieved by analysing one of the DIs from each cluster or deriving a new combinatory index for each cluster, based on established combination methods.

  7. Detection of Single Standing Dead Trees from Aerial Color Infrared Imagery by Segmentation with Shape and Intensity Priors

    NASA Astrophysics Data System (ADS)

    Polewski, P.; Yao, W.; Heurich, M.; Krzystek, P.; Stilla, U.

    2015-03-01

    Standing dead trees, known as snags, are an essential factor in maintaining biodiversity in forest ecosystems. Combined with their role as carbon sinks, this makes for a compelling reason to study their spatial distribution. This paper presents an integrated method to detect and delineate individual dead tree crowns from color infrared aerial imagery. Our approach consists of two steps which incorporate statistical information about prior distributions of both the image intensities and the shapes of the target objects. In the first step, we perform a Gaussian Mixture Model clustering in the pixel color space with priors on the cluster means, obtaining up to 3 components corresponding to dead trees, living trees, and shadows. We then refine the dead tree regions using a level set segmentation method enriched with a generative model of the dead trees' shape distribution as well as a discriminative model of their pixel intensity distribution. The iterative application of the statistical shape template yields the set of delineated dead crowns. The prior information enforces the consistency of the template's shape variation with the shape manifold defined by manually labeled training examples, which makes it possible to separate crowns located in close proximity and prevents the formation of large crown clusters. Also, the statistical information built into the segmentation gives rise to an implicit detection scheme, because the shape template evolves towards an empty contour if not enough evidence for the object is present in the image. We test our method on 3 sample plots from the Bavarian Forest National Park with reference data obtained by manually marking individual dead tree polygons in the images. Our results are scenario-dependent and range from a correctness/completeness of 0.71/0.81 up to 0.77/1, with an average center-of-gravity displacement of 3-5 pixels between the detected and reference polygons.

  8. Permutation testing of orthogonal factorial effects in a language-processing experiment using fMRI.

    PubMed

    Suckling, John; Davis, Matthew H; Ooi, Cinly; Wink, Alle Meije; Fadili, Jalal; Salvador, Raymond; Welchew, David; Sendur, Levent; Maxim, Vochita; Bullmore, Edward T

    2006-05-01

    The block-paradigm of the Functional Image Analysis Contest (FIAC) dataset was analysed with the Brain Activation and Morphological Mapping software. Permutation methods in the wavelet domain were used for inference on cluster-based test statistics of orthogonal contrasts relevant to the factorial design of the study, namely: the average response across all active blocks, the main effect of speaker, the main effect of sentence, and the interaction between sentence and speaker. Extensive activation was seen with all these contrasts. In particular, different vs. same-speaker blocks produced elevated activation in bilateral regions of the superior temporal lobe and repetition suppression for linguistic materials (same vs. different-sentence blocks) in left inferior frontal regions. These are regions previously reported in the literature. Additional regions were detected in this study, perhaps due to the enhanced sensitivity of the methodology. Within-block sentence suppression was tested post-hoc by regression of an exponential decay model onto the extracted time series from the left inferior frontal gyrus, but no strong evidence of such an effect was found. The significance levels set for the activation maps are P-values at which we expect <1 false-positive cluster per image. Nominal type I error control was verified by empirical testing of a test statistic corresponding to a randomly ordered design matrix. The small size of the BOLD effect necessitates sensitive methods of detection of brain activation. Permutation methods permit the necessary flexibility to develop novel test statistics to meet this challenge.
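
    The permutation logic for a generic two-sample test statistic can be sketched as follows. This is a simplified illustration of the principle, not the study's wavelet-domain cluster-level procedure; the statistic is passed in as a callable so any contrast can be plugged in:

```python
import random

def permutation_pvalue(stat, group_a, group_b, n_perm=5000, seed=0):
    """One-sided two-sample permutation p-value for an arbitrary statistic.

    stat: callable taking (sample_a, sample_b) and returning a scalar;
          larger values indicate a stronger effect.
    Under the null hypothesis the group labels are exchangeable, so the
    observed statistic is compared against its relabelled distribution.
    """
    rng = random.Random(seed)
    observed = stat(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    na = len(group_a)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                         # random relabelling
        if stat(pooled[:na], pooled[na:]) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)               # add-one for validity
```

    With a mean-difference statistic, well-separated groups yield a small p-value while identical groups do not; exactly this exchangeability argument justifies testing a randomly ordered design matrix for type I error control, as done in the paper.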

  9. Percolation Analysis as a Tool to Describe the Topology of the Large Scale Structure of the Universe

    NASA Astrophysics Data System (ADS)

    Yess, Capp D.

    1997-09-01

    Percolation analysis is the study of the properties of clusters. In cosmology, it is the statistics of the size and number of clusters. This thesis presents a refinement of percolation analysis and its application to astronomical data. An overview of the standard model of the universe and the development of large scale structure is presented in order to place the study in historical and scientific context. Then, using percolation statistics, we demonstrate for the first time the universal character of a network pattern in the real-space mass distributions resulting from nonlinear gravitational instability of initial Gaussian fluctuations. We also find that the maximum of the number-of-clusters statistic in the evolved, nonlinear distributions is determined by the effective slope of the power spectrum. Next, we present percolation analyses of Wiener Reconstructions of the IRAS 1.2 Jy Redshift Survey. There are ten reconstructions of galaxy density fields in real space spanning the range β = 0.1 to 1.0, where β = Ω^0.6/b, Ω is the present dimensionless density, and b is the linear bias factor. Our method uses the growth of the largest cluster statistic to characterize the topology of a density field, where Gaussian randomized versions of the reconstructions are used as standards for analysis. For the reconstruction volume of radius R ≈ 100 h^-1 Mpc, percolation analysis reveals a slight 'meatball' topology for the real-space galaxy distribution of the IRAS survey. Finally, we employ a percolation technique developed for pointwise distributions to analyze two-dimensional projections of the three northern and three southern slices in the Las Campanas Redshift Survey, and then give consideration to further study of the methodology, errors and application of percolation.
We track the growth of the largest cluster as a topological indicator to a depth of 400 h^-1 Mpc, and report an unambiguous signal, with high signal-to-noise ratio, indicating a network topology which in two dimensions is indicative of a filamentary distribution. It is hoped that one day percolation analysis can characterize the structure of the universe to a degree that will aid theorists in confidently describing the nature of our world.
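    The "growth of the largest cluster" statistic can be illustrated on a toy two-dimensional Gaussian random field: threshold the field, label connected regions, and track the fraction of over-threshold cells captured by the largest cluster as the threshold is lowered. This is only a schematic of the percolation idea, not the survey analysis (the Gaussian-randomized comparison fields are omitted).

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(1)
field = ndimage.gaussian_filter(rng.normal(size=(128, 128)), sigma=3)
field = (field - field.mean()) / field.std()   # normalize to unit variance

def largest_cluster_fraction(field, threshold):
    """Fraction of over-threshold cells in the largest connected cluster."""
    mask = field > threshold
    labels, n_clusters = ndimage.label(mask)
    if n_clusters == 0:
        return 0.0
    sizes = np.bincount(labels.ravel())[1:]    # drop the background label 0
    return sizes.max() / mask.sum()

# Lowering the threshold grows and merges clusters until one percolates
# and absorbs most of the over-threshold cells.
fracs = [largest_cluster_fraction(field, t) for t in (1.0, 0.5, 0.0, -0.5)]
print([round(f, 2) for f in fracs])
```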

  10. Modest validity and fair reproducibility of dietary patterns derived by cluster analysis.

    PubMed

    Funtikova, Anna N; Benítez-Arciniega, Alejandra A; Fitó, Montserrat; Schröder, Helmut

    2015-03-01

    Cluster analysis is widely used to analyze dietary patterns. We aimed to analyze the validity and reproducibility of dietary patterns defined by cluster analysis of data from a food frequency questionnaire (FFQ). We hypothesized that the dietary patterns derived by cluster analysis have fair to modest reproducibility and validity. Dietary data were collected from 107 individuals from a population-based survey, by an FFQ at baseline (FFQ1) and after 1 year (FFQ2), and by twelve 24-hour dietary recalls (24-HDR). Reproducibility and validity were measured by comparing clusters obtained from the FFQ1 and FFQ2 and from the FFQ2 and 24-HDR (reference method), respectively. Cluster analysis identified a "fruits & vegetables" and a "meat" pattern in each dietary data source. Cluster membership was concordant for 66.7% of participants in FFQ1 and FFQ2 (reproducibility), and for 67.0% in FFQ2 and 24-HDR (validity). Spearman correlation analysis showed reasonable reproducibility but lower validity, in both cases most markedly for the "fruits & vegetables" pattern. The κ statistic revealed fair validity and reproducibility of the clusters. Our findings indicate reasonable reproducibility and fair to modest validity of dietary patterns derived by cluster analysis. Copyright © 2015 Elsevier Inc. All rights reserved.
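    The agreement measures used here (percent concordance and the κ statistic) are straightforward to reproduce. The sketch below uses hypothetical cluster memberships for 107 participants with roughly two-thirds agreement built in, not the study's data.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(2)
# Hypothetical cluster memberships (0 = "fruits & vegetables", 1 = "meat")
# for the same 107 participants from two instruments, with ~2/3 agreement.
ffq1 = rng.integers(0, 2, size=107)
flip = rng.random(107) < 1 / 3
ffq2 = np.where(flip, 1 - ffq1, ffq1)

concordance = np.mean(ffq1 == ffq2)      # raw percent agreement
kappa = cohen_kappa_score(ffq1, ffq2)    # chance-corrected agreement
print(round(concordance, 3), round(kappa, 3))
```

    Note that κ is always lower than raw concordance when agreement exceeds chance, which is why the study reports both.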

  11. The Common Prescription Patterns Based on the Hierarchical Clustering of Herb-Pairs Efficacies

    PubMed Central

    2016-01-01

    Prescription patterns are rules or regularities used to generate, recognize, or judge a prescription. Most existing studies have focused on specific prescription patterns for particular diseases or syndromes, while little attention has been paid to the common patterns, which reflect a global view of the regularities of prescriptions. In this paper, we designed a method, CPPM, to find the common prescription patterns. CPPM is based on the hierarchical clustering of herb-pair efficacies (HPEs). Firstly, the HPEs were hierarchically clustered; secondly, the individual herbs were labeled by the HPEC (the clusters of HPEs); then the prescription patterns were extracted from the combinations of HPEC; finally, the common patterns were recognized statistically. The results showed that HPEs have a hierarchical clustering structure. When the clustering level is 2 and the HPEs are classified into two clusters, the common prescription patterns are obvious. Among 332 candidate prescriptions, 319 follow the common patterns. The patterns can be described as follows: if a prescription contains herbs of one cluster (C1), it is very likely to also contain herbs of the other cluster (C2), whereas a prescription containing herbs of C2 may have no herbs of C1. Finally, we discuss how the common patterns are mathematically coincident with the Blood-Qi theory. PMID:27190534
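    The core step, agglomerative clustering of HPE feature vectors cut at a level that yields two clusters, can be sketched with SciPy. The HPE vectors below are synthetic stand-ins, since the paper's efficacy encoding is not reproduced here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
# Hypothetical herb-pair efficacy (HPE) feature vectors with two latent groups.
hpes = np.vstack([rng.normal(0, 0.5, (20, 4)), rng.normal(3, 0.5, (25, 4))])

# Agglomerative clustering, then cut the dendrogram into 2 clusters
# (the level at which the paper reports the common patterns become obvious).
tree = linkage(hpes, method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
print(sorted(set(labels)), np.bincount(labels)[1:])
```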

  12. Multivariate statistical analysis: Principles and applications to coorbital streams of meteorite falls

    NASA Technical Reports Server (NTRS)

    Wolf, S. F.; Lipschutz, M. E.

    1993-01-01

    Multivariate statistical analysis techniques (linear discriminant analysis and logistic regression) can provide powerful discrimination tools which are generally unfamiliar to the planetary science community. Fall parameters were used to identify a group of 17 H chondrites (Cluster 1) that were part of a coorbital stream which intersected Earth's orbit in May, from 1855 to 1895, and can be distinguished from all other H chondrite falls. Using multivariate statistical techniques, it was demonstrated that by a totally different criterion, the labile trace element contents - hence thermal histories - of 13 Cluster 1 meteorites are distinguishable from those of 45 non-Cluster 1 H chondrites. Here, we focus upon the principles of multivariate statistical techniques and illustrate their application using non-meteoritic and meteoritic examples.
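    Linear discriminant analysis of this kind can be sketched with scikit-learn. The trace-element values below are fabricated purely to illustrate the 13-vs-45 group comparison; they are not the meteorite data.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(4)
# Hypothetical labile trace-element contents (log scale) for 13 "Cluster 1"
# and 45 other H chondrites; the mean shift mimics different thermal histories.
cluster1 = rng.normal([1.0, 0.8, 1.2], 0.3, size=(13, 3))
others = rng.normal([0.4, 0.3, 0.6], 0.3, size=(45, 3))
X = np.vstack([cluster1, others])
y = np.array([1] * 13 + [0] * 45)

lda = LinearDiscriminantAnalysis().fit(X, y)
accuracy = lda.score(X, y)   # resubstitution accuracy on the training data
print(round(accuracy, 3))
```

    A real analysis would of course use cross-validation rather than resubstitution accuracy.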

  13. Clustering, randomness and regularity in cloud fields. I - Theoretical considerations. II - Cumulus cloud fields

    NASA Technical Reports Server (NTRS)

    Weger, R. C.; Lee, J.; Zhu, Tianri; Welch, R. M.

    1992-01-01

    The current controversy over regularity vs. clustering in cloud fields is examined by means of analysis and simulation studies based upon nearest-neighbor cumulative distribution statistics. It is shown that the Poisson representation of random point processes is superior to pseudorandom-number-generated models, which bias the observed nearest-neighbor statistics towards regularity. The interpretation of these nearest-neighbor statistics is discussed for many cases of superpositions of clustering, randomness, and regularity. A detailed analysis is carried out of cumulus cloud field spatial distributions based upon Landsat, AVHRR, and Skylab data, showing that, when both large and small clouds are included in the cloud field distributions, the cloud field always has a strong clustering signal.
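    The nearest-neighbor statistic at the heart of this analysis can be sketched directly: compare the nearest-neighbor distances of a uniform random point set with those of a clustered set. The point patterns below are synthetic, not cloud data.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(5)
n = 400
# Poisson-like random "cloud" positions in the unit square.
random_pts = rng.random((n, 2))
# Clustered field: the same number of points packed around 8 centers.
centers = rng.random((8, 2))
clustered_pts = centers[rng.integers(0, 8, n)] + rng.normal(0, 0.01, (n, 2))

def mean_nn_distance(pts):
    """Mean distance from each point to its nearest neighbor."""
    d, _ = cKDTree(pts).query(pts, k=2)   # k=2: first hit is the point itself
    return d[:, 1].mean()

# Clustering pulls nearest-neighbor distances well below the random case;
# regularity would push them above it.
print(mean_nn_distance(clustered_pts) < mean_nn_distance(random_pts))
```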

  14. Bayesian clustering of DNA sequences using Markov chains and a stochastic partition model.

    PubMed

    Jääskinen, Väinö; Parkkinen, Ville; Cheng, Lu; Corander, Jukka

    2014-02-01

    In many biological applications it is necessary to cluster DNA sequences into groups that represent underlying organismal units, such as named species or genera. In metagenomics this grouping typically needs to be achieved on the basis of relatively short sequences which contain different types of errors, making the use of a statistical modeling approach desirable. Here we introduce a novel method for this purpose by developing a stochastic partition model that clusters Markov chains of a given order. The model is based on a Dirichlet process prior, and we use conjugate priors for the Markov chain parameters, which enables an analytical expression for comparing the marginal likelihoods of any two partitions. To find a good candidate for the posterior mode in the partition space, we use a hybrid computational approach which combines the EM algorithm with a greedy search. This is demonstrated to be faster than earlier clustering methods suggested for the metagenomics application, while yielding highly accurate results. Our model is fairly generic and could also be used for clustering other types of sequence data for which Markov chains provide a reasonable way to compress information, as illustrated by experiments on shotgun sequence type data from an Escherichia coli strain.
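    The analytical marginal likelihood that drives such partition comparisons can be sketched for first-order chains with independent Dirichlet(1) priors on the transition-matrix rows. This is a simplification of the paper's model (the Dirichlet process prior over partitions and the EM/greedy search are omitted): merging two sequences drawn from the same chain should raise the marginal likelihood more than merging sequences from different chains.

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(10)

def sample_chain(P, length):
    """Sample a state sequence from transition matrix P, starting in state 0."""
    s = [0]
    for _ in range(length - 1):
        s.append(rng.choice(len(P), p=P[s[-1]]))
    return s

def log_marginal(seqs, k=4, alpha=1.0):
    """Log marginal likelihood of pooled order-1 transition counts under
    independent Dirichlet(alpha) priors on each transition-matrix row."""
    counts = np.zeros((k, k))
    for s in seqs:
        for a, b in zip(s, s[1:]):
            counts[a, b] += 1
    lm = 0.0
    for row in counts:
        lm += lgamma(k * alpha) - lgamma(k * alpha + row.sum())
        lm += sum(lgamma(alpha + c) - lgamma(alpha) for c in row)
    return lm

def merge_gain(u, v):
    """Evidence for putting u and v in one cluster vs. two singletons."""
    return log_marginal([u, v]) - log_marginal([u]) - log_marginal([v])

P_sticky = np.full((4, 4), 0.1) + 0.6 * np.eye(4)  # self-transition-heavy chain
P_uniform = np.full((4, 4), 0.25)                  # memoryless chain
x1, x2 = sample_chain(P_sticky, 400), sample_chain(P_sticky, 400)
y1 = sample_chain(P_uniform, 400)

# Sequences from the same chain favor merging more than a mixed pair does.
print(merge_gain(x1, x2) > merge_gain(x1, y1))
```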

  15. Direct Standard-Free Quantitation of Tamiflu® and Other Pharmaceutical Tablets using Clustering Agents with Electrospray Ionization Mass Spectrometry

    PubMed Central

    Flick, Tawnya G.; Leib, Ryan D.; Williams, Evan R.

    2010-01-01

    Accurate and rapid quantitation is advantageous to identify counterfeit and substandard pharmaceutical drugs. A standard-free electrospray ionization mass spectrometry method is used to directly determine the dosage in the prescription and over-the-counter drugs, Tamiflu®, Sudafed®, and Dramamine®. A tablet of each drug was dissolved in aqueous solution, filtered, and introduced into solutions containing a known concentration of either L-tryptophan, L-phenylalanine or prednisone as clustering agents. The active ingredient(s) incorporates statistically into large clusters of the clustering agent where effects of differential ionization/detection are substantially reduced. From the abundances of large clusters, the dosages of the active ingredients in each of the tablets were determined to typically better than 20% accuracy even when the ionization/detection efficiency of the individual components differed by over 100×. Although this unorthodox method for quantitation is not as accurate as using conventional standards, it has the advantages that it is fast, it can be applied to mixtures where the identities of the analytes are unknown, and it can be used when suitable standards may not be readily available, such as schedule I or II controlled substances or new designer drugs that have not previously been identified. PMID:20092258

  16. a Three-Step Spatial-Temporal Clustering Method for Human Activity Pattern Analysis

    NASA Astrophysics Data System (ADS)

    Huang, W.; Li, S.; Xu, S.

    2016-06-01

    How people move in cities and what they do in various locations at different times form human activity patterns. Human activity patterns play a key role in urban planning, traffic forecasting, public health and safety, emergency response, friend recommendation, and so on. Therefore, scholars from different fields, such as social science, geography, transportation, physics and computer science, have made great efforts to model and analyse human activity or mobility patterns. One of the essential tasks in such studies is to find the locations or places where individuals stay to perform some kind of activity, before further activity pattern analysis. In the era of Big Data, the emergence of social media along with wearable devices enables human activity data to be collected more easily and efficiently. Furthermore, the dimension of the accessible human activity data has been extended from two or three (space, or space-time) to four (space, time and semantics). More specifically, not only the location and time at which people stay are collected, but also what people "say" at a given location and time can be obtained. The characteristics of these datasets shed new light on the analysis of human mobility, and new methodologies should accordingly be developed to handle them. Traditional methods such as neural networks, statistics and clustering have been applied to study human activity patterns using geosocial media data. Among them, clustering methods have been widely used to analyse spatiotemporal patterns. However, to the best of our knowledge, few clustering algorithms have been specifically developed for handling datasets that contain spatial, temporal and semantic aspects all together. In this work, we propose a three-step human activity clustering method based on space, time and semantics to fill this gap. One-year Twitter data, posted in Toronto, Canada, is used to test the clustering-based method.
The results show that approximately 55% of the spatiotemporal clusters, although distributed in different locations, can eventually be grouped into the same type of cluster once the semantic aspect is considered.

  17. Accelerating Information Retrieval from Profile Hidden Markov Model Databases.

    PubMed

    Tamimi, Ahmad; Ashhab, Yaqoub; Tamimi, Hashem

    2016-01-01

    Profile Hidden Markov Model (Profile-HMM) is an efficient statistical approach to represent protein families. Currently, several databases maintain valuable protein sequence information as profile-HMMs. There is an increasing interest in improving the efficiency of searching Profile-HMM databases to detect sequence-profile or profile-profile homology. However, most efforts to enhance searching efficiency have focused on improving the alignment algorithms. Although the performance of these algorithms is fairly acceptable, the growing size of these databases, as well as the increasing demand for batch query searching, are strong motivations that call for further enhancement of information retrieval from profile-HMM databases. This work presents a heuristic method to accelerate the current profile-HMM homology searching approaches. The method works by cluster-based remodeling of the database to reduce the search space, rather than by modifying the alignment algorithms. Using different clustering techniques, 4284 TIGRFAMs profiles were clustered based on their similarities, and a representative was assigned for each cluster. To enhance sensitivity, we proposed an extended step that allows overlapping among clusters. A validation benchmark of 6000 randomly selected protein sequences was used to query the clustered profiles. To evaluate the efficiency of our approach, speed and recall values were measured and compared with the sequential search approach. Using hierarchical, k-means, and connected component clustering techniques followed by the extended overlapping step, we obtained an average reduction in time of 41% and an average recall of 96%. Our results demonstrate that representation of profile-HMMs using a clustering-based approach can significantly accelerate data retrieval from profile-HMM databases.

  18. Systematic Correlation Matrix Evaluation (SCoMaE) - a bottom-up, science-led approach to identifying indicators

    NASA Astrophysics Data System (ADS)

    Mengis, Nadine; Keller, David P.; Oschlies, Andreas

    2018-01-01

    This study introduces the Systematic Correlation Matrix Evaluation (SCoMaE) method, a bottom-up approach which combines expert judgment and statistical information to systematically select transparent, nonredundant indicators for a comprehensive assessment of the state of the Earth system. The method consists of two basic steps: (1) the calculation of a correlation matrix among variables relevant for a given research question and (2) the systematic evaluation of the matrix, to identify clusters of variables with similar behavior and respective mutually independent indicators. Optional further analysis steps include (3) the interpretation of the identified clusters, enabling a learning effect from the selection of indicators, (4) testing the robustness of identified clusters with respect to changes in forcing or boundary conditions, (5) enabling a comparative assessment of varying scenarios by constructing and evaluating a common correlation matrix, and (6) the inclusion of expert judgment, for example, to prescribe indicators, to allow for considerations other than statistical consistency. The example application of the SCoMaE method to Earth system model output forced by different CO2 emission scenarios reveals the necessity of reevaluating indicators identified in a historical scenario simulation for an accurate assessment of an intermediate-high, as well as a business-as-usual, climate change scenario simulation. This necessity arises from changes in prevailing correlations in the Earth system under varying climate forcing. For a comparative assessment of the three climate change scenarios, we construct and evaluate a common correlation matrix, in which we identify robust correlations between variables across the three considered scenarios.
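    Steps (1) and (2) can be sketched by clustering a correlation matrix with an agglomerative algorithm, using 1 - |r| as the distance between variables. The five synthetic time series below stand in for model output; variables sharing a driver should land in the same cluster, from which one indicator each would then be chosen.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(6)
# Hypothetical Earth-system time series: variables 0-2 share one driver,
# variables 3-4 share another, so two clusters should emerge.
t = rng.normal(size=200)
s = rng.normal(size=200)
data = np.column_stack([t + 0.2 * rng.normal(size=200) for _ in range(3)]
                       + [s + 0.2 * rng.normal(size=200) for _ in range(2)])

corr = np.corrcoef(data, rowvar=False)             # step 1: correlation matrix
dist = squareform(1 - np.abs(corr), checks=False)  # distance = 1 - |r|
labels = fcluster(linkage(dist, method="average"), t=2, criterion="maxclust")
print(labels)   # variables 0-2 and 3-4 fall into separate clusters
```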

  19. Geospatial clustering in sugar-sweetened beverage consumption among Boston youth.

    PubMed

    Tamura, Kosuke; Duncan, Dustin T; Athens, Jessica K; Bragg, Marie A; Rienti, Michael; Aldstadt, Jared; Scott, Marc A; Elbel, Brian

    2017-09-01

    The objective was to detect geospatial clustering of sugar-sweetened beverage (SSB) intake in Boston adolescents (age = 16.3 ± 1.3 years [range: 13-19]; female = 56.1%; White = 10.4%, Black = 42.6%, Hispanics = 32.4%, and others = 14.6%) using spatial scan statistics. We used data on self-reported SSB intake from the 2008 Boston Youth Survey Geospatial Dataset (n = 1292). Two binary variables were created: consumption of SSB (never versus any) on (1) soda and (2) other sugary drinks (e.g., lemonade). A Bernoulli spatial scan statistic was used to identify geospatial clusters of soda and other sugary drinks in unadjusted models and models adjusted for age, gender, and race/ethnicity. There was no statistically significant clustering of soda consumption in the unadjusted model. In contrast, a cluster of non-soda SSB consumption emerged in the middle of Boston (relative risk = 1.20, p = .005), indicating that adolescents within the cluster had a 20% higher probability of reporting non-soda SSB intake than outside the cluster. The cluster was no longer significant in the adjusted model, suggesting spatial variation in non-soda SSB drink intake correlates with the geographic distribution of students by race/ethnicity, age, and gender.

  20. K-means cluster analysis of tourist destination in special region of Yogyakarta using spatial approach and social network analysis (a case study: post of @explorejogja instagram account in 2016)

    NASA Astrophysics Data System (ADS)

    Iswandhani, N.; Muhajir, M.

    2018-03-01

    This research was conducted in the Department of Statistics, Islamic University of Indonesia. The data used are primary data obtained from posts of the @explorejogja Instagram account from January until December 2016. The @explorejogja account features many tourist destinations that can be visited by tourists from both the country and abroad; it is therefore useful to form clusters of these destinations based on the number of likes from Instagram users, taken here as a measure of popularity. The purpose of this research is to map the distribution of the most popular tourist spots, form clusters of tourist destinations, and identify the centers of popularity of tourist destinations based on the @explorejogja Instagram account in 2016. The statistical analyses used are descriptive statistics, k-means clustering, and social network analysis. The results of this research comprise the top 10 most popular destinations in Yogyakarta, a map of HTML-based tourist destination distribution consisting of 121 tourist destination points, three clusters (cluster 1 with 52 destinations, cluster 2 with 9 destinations and cluster 3 with 60 destinations), and the central popularity of tourist destinations in the special region of Yogyakarta by district.
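    The k-means step can be sketched with scikit-learn. The destination coordinates and like counts below are randomly generated stand-ins for the 121 real destination points (the coordinate ranges merely bracket the Yogyakarta region), so the paper's 52/9/60 split will not be reproduced.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
# Random stand-ins for the 121 destination points: latitude, longitude
# and log like-counts.
X = np.column_stack([
    rng.uniform(-8.1, -7.6, 121),    # latitude
    rng.uniform(110.1, 110.6, 121),  # longitude
    rng.normal(6.0, 1.5, 121),       # log number of likes
])
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize before k-means

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(np.bincount(km.labels_))   # sizes of the three clusters
```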

  1. Sample size calculations for the design of cluster randomized trials: A summary of methodology.

    PubMed

    Gao, Fei; Earnest, Arul; Matchar, David B; Campbell, Michael J; Machin, David

    2015-05-01

    Cluster randomized trial designs are growing in popularity in, for example, cardiovascular medicine research and other clinical areas, and this has stimulated parallel statistical developments concerned with the design and analysis of these trials. Nevertheless, reviews suggest that design issues associated with cluster randomized trials are often poorly appreciated, and there remain inadequacies in, for example, describing how the trial size is determined and how the associated results are presented. In this paper, our aim is to provide pragmatic guidance for researchers on the methods of calculating sample sizes. We focus attention on designs with the primary purpose of comparing two interventions with respect to continuous, binary, ordered categorical, incidence rate and time-to-event outcome variables. Issues of aggregate and non-aggregate cluster trials, adjustment for variation in cluster size and the effect size are detailed. The problem of establishing the anticipated magnitude of between- and within-cluster variation to enable planning values of the intra-cluster correlation coefficient and the coefficient of variation of cluster size is also described. Illustrative examples of calculations of trial sizes for each endpoint type are included. Copyright © 2015 Elsevier Inc. All rights reserved.
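    For the continuous-outcome case, the usual calculation inflates the individually randomized sample size by a design effect built from the intra-cluster correlation coefficient (ICC), with a coefficient-of-variation correction for unequal cluster sizes. The sketch below hardwires two-sided alpha = 0.05 and 80% power, and uses one standard formulation of the design effect, not necessarily the paper's exact formulae.

```python
import math

def n_per_arm_individual(delta, sd):
    """Two-arm comparison of means, two-sided alpha = 0.05, 80% power."""
    z = 1.959964 + 0.841621            # z_{alpha/2} + z_{power}
    return z ** 2 * 2 * sd ** 2 / delta ** 2

def clusters_per_arm(delta, sd, m, icc, cv=0.0):
    """Clusters of mean size m needed per arm, inflated by the design
    effect; cv (coefficient of variation of cluster size) applies one
    common correction for unequal cluster sizes."""
    design_effect = 1 + ((cv ** 2 + 1) * m - 1) * icc
    return math.ceil(n_per_arm_individual(delta, sd) * design_effect / m)

# Detect a 0.3-SD mean difference with clusters of 20 and ICC 0.05:
print(clusters_per_arm(delta=0.3, sd=1.0, m=20, icc=0.05))
# Unequal cluster sizes (cv = 0.4) push the requirement up further:
print(clusters_per_arm(delta=0.3, sd=1.0, m=20, icc=0.05, cv=0.4))
```

    Even a modest ICC roughly doubles the required sample size here, which is why the paper stresses establishing planning values for the ICC and the coefficient of variation.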

  2. Unbiased methods for removing systematics from galaxy clustering measurements

    NASA Astrophysics Data System (ADS)

    Elsner, Franz; Leistedt, Boris; Peiris, Hiranya V.

    2016-02-01

    Measuring the angular clustering of galaxies as a function of redshift is a powerful method for extracting information from the three-dimensional galaxy distribution. The precision of such measurements will dramatically increase with ongoing and future wide-field galaxy surveys. However, these are also increasingly sensitive to observational and astrophysical contaminants. Here, we study the statistical properties of three methods proposed for controlling such systematics - template subtraction, basic mode projection, and extended mode projection - all of which make use of externally supplied template maps, designed to characterize and capture the spatial variations of potential systematic effects. Based on a detailed mathematical analysis, and in agreement with simulations, we find that the template subtraction method in its original formulation returns biased estimates of the galaxy angular clustering. We derive closed-form expressions that should be used to correct results for this shortcoming. Turning to the basic mode projection algorithm, we prove it to be free of any bias, whereas we conclude that results computed with extended mode projection are biased. Within a simplified setup, we derive analytical expressions for the bias and discuss the options for correcting it in more realistic configurations. Common to all three methods is an increased estimator variance induced by the cleaning process, albeit at different levels. These results enable unbiased high-precision clustering measurements in the presence of spatially varying systematics, an essential step towards realizing the full potential of current and planned galaxy surveys.
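    Basic mode projection, the one method the analysis proves free of bias, amounts to projecting the data onto the subspace orthogonal to the template maps. A one-template toy sketch, with random vectors standing in for pixelized maps:

```python
import numpy as np

rng = np.random.default_rng(9)
npix = 2000
signal = rng.normal(size=npix)       # "true" galaxy overdensity map
template = rng.normal(size=npix)     # systematics template map
observed = signal + 0.8 * template   # contaminated observation

# Basic mode projection: remove the component of the data lying along the
# template(s), i.e. project onto the subspace orthogonal to the templates.
T = template[:, None]                # (npix, n_templates)
coeff = np.linalg.lstsq(T, observed, rcond=None)[0]
cleaned = observed - T @ coeff

contamination_before = np.corrcoef(observed, template)[0, 1]
contamination_after = np.corrcoef(cleaned, template)[0, 1]
print(round(contamination_before, 2), round(contamination_after, 4))
```

    The projection also removes whatever part of the true signal happens to align with the template, which is one way to see the increased estimator variance the abstract attributes to the cleaning process.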

  3. Brain vascular image segmentation based on fuzzy local information C-means clustering

    NASA Astrophysics Data System (ADS)

    Hu, Chaoen; Liu, Xia; Liang, Xiao; Hui, Hui; Yang, Xin; Tian, Jie

    2017-02-01

    Light sheet fluorescence microscopy (LSFM) is a powerful optical-resolution fluorescence microscopy technique which enables observation of the mouse brain vascular network at cellular resolution. However, micro-vessel structures show intensity inhomogeneity in LSFM images, which makes it difficult to extract line structures. In this work, we developed a vascular image segmentation method that enhances vessel details, which should be useful for estimating statistics such as micro-vessel density. Since the eigenvalues of the Hessian matrix and their signs describe different geometric structures in images, they can be used to construct a vascular similarity function and enhance line signals; the main idea of our method is to cluster the pixel values of the enhanced image. Our method contains three steps: 1) calculate the multiscale gradients and the differences between eigenvalues of the Hessian matrix; 2) to generate the enhanced micro-vessel structures, train a feed-forward neural network on 2.26 million pixels to model the correlations between multi-scale gradients and the differences between eigenvalues; 3) apply fuzzy local information c-means clustering (FLICM) to cluster the pixel values of the enhanced image. To verify the feasibility and effectiveness of this method, mouse brain vascular images were acquired by a commercial light-sheet microscope in our lab. In the segmentation experiments, the Dice similarity coefficient reached 85%. The results illustrate that our approach to extracting the line structures of blood vessels dramatically improves the vascular image and enables accurate extraction of blood vessels in LSFM images.

  4. Use of a spatial scan statistic to identify clusters of births occurring outside Ghanaian health facilities for targeted intervention.

    PubMed

    Bosomprah, Samuel; Dotse-Gborgbortsi, Winfred; Aboagye, Patrick; Matthews, Zoe

    2016-11-01

    To identify and evaluate clusters of births that occurred outside health facilities in Ghana for targeted intervention. A retrospective study was conducted using a convenience sample of live births registered in Ghanaian health facilities from January 1 to December 31, 2014. Data were extracted from the district health information system. A spatial scan statistic was used to investigate clusters of home births through a discrete Poisson probability model. Scanning with a circular spatial window was conducted only for clusters with high rates of such deliveries. The district was used as the geographic unit of analysis. The likelihood P value was estimated using Monte Carlo simulations. Ten statistically significant clusters with a high rate of home birth were identified. The relative risks ranged from 1.43 ("least likely" cluster; P=0.001) to 1.95 ("most likely" cluster; P=0.001). The relative risks of the top five "most likely" clusters ranged from 1.68 to 1.95; these clusters were located in the Ashanti, Brong Ahafo, Western, Eastern, and Greater Accra regions. Health facility records, geospatial techniques, and geographic information systems provided locally relevant information to assist policy makers in delivering targeted interventions to small geographic areas. Copyright © 2016 International Federation of Gynecology and Obstetrics. Published by Elsevier Ireland Ltd. All rights reserved.

  5. A new scoring system in Cystic Fibrosis: statistical tools for database analysis - a preliminary report.

    PubMed

    Hafen, G M; Hurst, C; Yearwood, J; Smith, J; Dzalilov, Z; Robinson, P J

    2008-10-05

    Cystic fibrosis is the most common fatal genetic disorder in the Caucasian population. Scoring systems for assessment of cystic fibrosis disease severity have been used for almost 50 years, without being adapted to the milder phenotype of the disease in the 21st century. The aim of the current project is to develop a new scoring system using a database and employing various statistical tools. This study protocol reports the development of the statistical tools required to create such a scoring system. The evaluation is based on the cystic fibrosis database of the cohort at the Royal Children's Hospital in Melbourne. Initially, unsupervised clustering of all data records was performed using a range of clustering algorithms, in particular incremental clustering algorithms. The clusters obtained were characterised using rules from decision trees, and the results were examined by clinicians. In order to obtain a clearer definition of classes, expert opinion of each individual's clinical severity was sought. After data preparation, including expert opinion of each individual's clinical severity on a 3-point scale (mild, moderate and severe disease), two multivariate techniques were used throughout the analysis to establish a method with better success in feature selection and model derivation: 'Canonical Analysis of Principal Coordinates' (CAP) and 'Linear Discriminant Analysis' (DA). A 3-step procedure was performed: (1) selection of features; (2) extraction of 5 severity classes from the 3 severity classes defined by expert opinion; and (3) establishment of calibration datasets.
(1) Feature selection: CAP has a more effective "modelling" focus than DA. (2) Extraction of 5 severity classes: after variables were identified as important in discriminating contiguous CF severity groups on the 3-point scale (mild/moderate and moderate/severe), Discriminant Functions (DF) were used to determine the new groups: mild, intermediate moderate, moderate, intermediate severe and severe disease. (3) The generated confusion tables showed a misclassification rate of 19.1% for males and 16.5% for females, with the majority of misallocations into adjacent severity classes, particularly for males. Our preliminary data show that using CAP for feature selection and linear DA to derive the actual model in a CF database might be helpful in developing a scoring system. However, there are several limitations: in particular, more data entry points are needed to finalize a score, and the statistical tools have to be further refined and validated by re-running the statistical methods on a larger dataset.

  6. Free-energy landscapes from adaptively biased methods: Application to quantum systems

    NASA Astrophysics Data System (ADS)

    Calvo, F.

    2010-10-01

    Several parallel adaptive biasing methods are applied to the calculation of free-energy pathways along reaction coordinates, choosing as a difficult example the double-funnel landscape of the 38-atom Lennard-Jones cluster. In the case of classical statistics, the Wang-Landau and adaptively biased molecular-dynamics (ABMD) methods are both found to be efficient, provided that multiple walkers and replication and deletion schemes are used. An extension of the ABMD technique to quantum systems, implemented through the path-integral MD framework, is presented and tested on Ne38 against the quantum superposition method.

  7. Statistical methods for astronomical data with upper limits. II - Correlation and regression

    NASA Technical Reports Server (NTRS)

    Isobe, T.; Feigelson, E. D.; Nelson, P. I.

    1986-01-01

    Statistical methods for calculating correlations and regressions in bivariate censored data, where the dependent variable can have upper or lower limits, are presented. Cox's regression and the generalization of Kendall's rank correlation coefficient provide significance levels for correlations, while the EM algorithm, under the assumption of normally distributed errors, and its nonparametric analog using the Kaplan-Meier estimator give estimates of the slope of a regression line. Monte Carlo simulations demonstrate that survival analysis is reliable in determining correlations between luminosities at different bands. Survival analysis is applied to CO emission in infrared galaxies, X-ray emission in radio galaxies, H-alpha emission in cooling cluster cores, and radio emission in Seyfert galaxies.

  8. Getting the big picture in community science: methods that capture context.

    PubMed

    Luke, Douglas A

    2005-06-01

    Community science has a rich tradition of using theories and research designs that are consistent with its core value of contextualism. However, a survey of empirical articles published in the American Journal of Community Psychology shows that community scientists utilize a narrow range of statistical tools that are not well suited to assess contextual data. Multilevel modeling, geographic information systems (GIS), social network analysis, and cluster analysis are recommended as useful tools to address contextual questions in community science. An argument for increased methodological consilience is presented, where community scientists are encouraged to adopt statistical methodology that is capable of modeling a greater proportion of the data than is typical with traditional methods.

  9. Space-Time Analysis of Testicular Cancer Clusters Using Residential Histories: A Case-Control Study in Denmark

    PubMed Central

    Sloan, Chantel D.; Nordsborg, Rikke B.; Jacquez, Geoffrey M.; Raaschou-Nielsen, Ole; Meliker, Jaymie R.

    2015-01-01

    Though the etiology is largely unknown, testicular cancer incidence has seen significant recent increases in northern Europe and throughout many Western regions. Testicular cancer is the most common cancer in males under age 40, and age-period-cohort models have posited exposures in utero or in early childhood as possible causes of increased risk. Some of these factors may be tied to geography through associations with behavioral, cultural, sociodemographic or built-environment characteristics. If so, this could result in detectable geographic clusters of cases that could lead to hypotheses regarding environmental targets for intervention. Given the latency period between exposure to an environmental carcinogen and testicular cancer diagnosis, mobility histories are beneficial for spatial cluster analyses. Nearest-neighbor-based Q-statistics allow changes in residency to be incorporated into spatial disease cluster detection. Using these methods, a space-time cluster analysis was conducted on a population-wide case-control population selected from the Danish Cancer Registry, with mobility histories since 1971 extracted from the Danish Civil Registration System. Cases (N=3297) were diagnosed between 1991 and 2003, and two sets of controls (N=3297 each) matched on sex and date of birth were included in the study. We also examined spatial patterns in maternal residential history for those cases and controls born in 1971 or later (N=589 case-control pairs). Several small clusters were detected when aligning individuals by year prior to diagnosis, age at diagnosis and calendar year of diagnosis. However, the largest of these clusters contained only 2 statistically significant individuals at their centers, and the clusters were not replicated in SaTScan spatial-only analyses, which are less susceptible to multiple-testing bias. We found little evidence of local clusters in residential histories of testicular cancer cases in this Danish population.
PMID:25756204
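
    A minimal sketch of the nearest-neighbor idea behind the Q-statistics (hypothetical coordinates; the actual method additionally aligns residential histories in time and assesses significance against matched controls):

```python
import numpy as np

def local_q(coords, is_case, k):
    """For each case, count how many of its k nearest neighbours are
    also cases -- the local building block of the Q-statistics."""
    q = []
    for i in np.where(is_case)[0]:
        d = np.linalg.norm(coords - coords[i], axis=1)
        d[i] = np.inf                      # exclude self
        nn = np.argsort(d)[:k]             # indices of the k nearest points
        q.append(int(is_case[nn].sum()))
    return q
```

    Large local counts flag cases embedded in case-dense neighbourhoods, i.e. candidate cluster centers.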

  11. Identifying seizure clusters in patients with psychogenic nonepileptic seizures.

    PubMed

    Baird, Grayson L; Harlow, Lisa L; Machan, Jason T; Thomas, Dave; LaFrance, W C

    2017-08-01

    The present study explored how seizure clusters may be defined for those with psychogenic nonepileptic seizures (PNES), a topic on which there is a paucity of literature. The sample was drawn from a multisite randomized clinical trial for PNES; seizure data are from participants' seizure diaries. Three possible cluster definitions were examined: (1) the common clinical definition, where ≥3 seizures in a day is considered a cluster, along with two novel statistical definitions, where ≥3 seizures in a day are considered a cluster only if the observed number of seizures statistically exceeds what would be expected relative to (2) the patient's average seizure rate prior to the trial, or (3) the patient's observed seizure rate for the previous seven days. Prevalence of clusters was 62-68% and occurrence rate of clusters was 6-19%, depending on the cluster definition used. Based on these data, clusters seem to be common in patients with PNES, and more research is needed to identify whether clusters are related to triggers and outcomes. Copyright © 2017 Elsevier Inc. All rights reserved.
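
    The statistical definitions can be sketched as a one-sided Poisson exceedance test (an illustrative reading, not the trial's exact procedure; `baseline_rate` stands for either the pre-trial daily rate or the previous-seven-day daily rate):

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), from the cumulative pmf."""
    return 1.0 - sum(math.exp(-lam) * lam ** i / math.factorial(i)
                     for i in range(k))

def is_statistical_cluster(n_today, baseline_rate, alpha=0.05):
    """>=3 seizures in a day counts as a cluster only if that count
    significantly exceeds what the baseline daily rate predicts."""
    return n_today >= 3 and poisson_sf(n_today, baseline_rate) < alpha
```

    Under this reading, a patient averaging three seizures a day would not register a "cluster" at three seizures, while a patient averaging one every other day would.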

  12. Pixel Color Clustering of Multi-Temporally Acquired Digital Photographs of a Rice Canopy by Luminosity-Normalization and Pseudo-Red-Green-Blue Color Imaging

    PubMed Central

    Doi, Ryoichi; Arif, Chusnul

    2014-01-01

    Red-green-blue (RGB) channels of RGB digital photographs were loaded with luminosity-adjusted R, G, and completely white grayscale images, respectively (RGwhtB method), or R, G, and R + G (RGB yellow) grayscale images, respectively (RGrgbyB method), to adjust the brightness of the entire area of multi-temporally acquired color digital photographs of a rice canopy. From the RGwhtB or RGrgbyB pseudocolor image, cyan, magenta, CMYK yellow, black, L*, a*, and b* grayscale images were prepared. Using these grayscale images and R, G, and RGB yellow grayscale images, the luminosity-adjusted pixels of the canopy photographs were statistically clustered. With the RGrgbyB and the RGwhtB methods, seven and five major color clusters were given, respectively. The RGrgbyB method showed clear differences among three rice growth stages, and the vegetative stage was further divided into two substages. The RGwhtB method could not clearly discriminate between the second vegetative and midseason stages. The relative advantages of the RGrgbyB method were attributed to the R, G, B, magenta, yellow, L*, and a* grayscale images that contained richer information to show the colorimetrical differences among objects than those of the RGwhtB method. The comparison of rice canopy colors at different time points was enabled by the pseudocolor imaging method. PMID:25302325
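
    An illustrative reading of the RGrgbyB channel loading with NumPy (averaging R and G into the "RGB yellow" grayscale is an assumption made here to keep values in the 0-255 range; it is not stated in the abstract):

```python
import numpy as np

def rgrgbyb(img):
    """Load the blue channel with an R+G ('RGB yellow') grayscale while
    keeping R and G -- an illustrative reading of the RGrgbyB scheme."""
    r = img[..., 0].astype(np.float64)
    g = img[..., 1].astype(np.float64)
    yellow = np.clip((r + g) / 2.0, 0, 255)   # assumed normalization
    return np.stack([r, g, yellow], axis=-1).astype(np.uint8)
```

    The resulting pseudocolor image can then be converted to CMYK or L*a*b* grayscales for the pixel clustering step described above.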

  13. Generic Feature Selection with Short Fat Data

    PubMed Central

    Clarke, B.; Chu, J.-H.

    2014-01-01

    Consider a regression problem in which there are many more explanatory variables than data points, i.e., p ≫ n. Essentially, inference is impossible without reducing the number of variables. So, we group the p explanatory variables into blocks by clustering, evaluate statistics on the blocks, and then regress the response on these statistics under a penalized error criterion to obtain estimates of the regression coefficients. We examine the performance of this approach for a variety of choices of n, p, classes of statistics, clustering algorithms, penalty terms, and data types. When n is not large, the discrimination over the number of statistics is weak, but computations suggest regressing on approximately [n/K] statistics, where K is the number of blocks formed by a clustering algorithm. Small deviations from this are observed when the blocks of variables are of very different sizes. Larger deviations are observed when the penalty term is an Lq norm with high enough q. PMID:25346546
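
    The two-stage recipe (cluster the p variables into K blocks, compute one statistic per block, then run a penalized regression) can be sketched with synthetic data; the k-means on column vectors and the ridge penalty are illustrative choices among the alternatives the paper compares:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, K = 30, 200, 5              # "short, fat": p >> n

X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:20] = 1.0                   # only the first 20 variables matter
y = X @ beta + rng.normal(scale=0.1, size=n)

def kmeans_cols(X, K, iters=20):
    """Stage 1: group the p columns into K blocks via plain k-means."""
    V = X.T                                            # one row per variable
    centers = V[rng.choice(len(V), K, replace=False)].copy()
    for _ in range(iters):
        d = ((V[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)
        for k in range(K):
            if (labels == k).any():
                centers[k] = V[labels == k].mean(0)
    return labels

labels = kmeans_cols(X, K)

# Stage 2: one statistic per block (the block mean), then ridge-penalized
# least squares of y on the K block statistics.
S = np.column_stack([X[:, labels == k].mean(1) if (labels == k).any()
                     else np.zeros(n) for k in range(K)])
lam = 1.0
coef = np.linalg.solve(S.T @ S + lam * np.eye(K), S.T @ y)
```

    Regressing on K ≈ n/6 block statistics instead of p = 200 raw columns makes the least-squares problem well-posed, which is the point of the reduction.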

  14. Stream-Dashboard: A Big Data Stream Clustering Framework with Applications to Social Media Streams

    ERIC Educational Resources Information Center

    Hawwash, Basheer

    2013-01-01

    Data mining is concerned with detecting patterns of data in raw datasets, which are then used to unearth knowledge that might not have been discovered using conventional querying or statistical methods. This discovered knowledge has been used to empower decision makers in countless applications spanning across many multi-disciplinary areas…

  15. Clustering Coefficients for Correlation Networks.

    PubMed

    Masuda, Naoki; Sakaki, Michiko; Ezaki, Takahiro; Watanabe, Takamitsu

    2018-01-01

    Graph theory is a useful tool for deciphering structural and functional networks of the brain on various spatial and temporal scales. The clustering coefficient quantifies the abundance of connected triangles in a network and is a major descriptive statistic of networks. For example, it finds an application in the assessment of small-worldness of brain networks, which is affected by attentional and cognitive conditions, age, psychiatric disorders and so forth. However, it remains unclear how the clustering coefficient should be measured in a correlation-based network, which is among the major representations of brain networks. In the present article, we propose clustering coefficients tailored to correlation matrices. The key idea is to use three-way partial correlation or partial mutual information to measure the strength of the association between the two neighboring nodes of a focal node relative to the amount of pseudo-correlation expected from indirect paths between the nodes. Our method avoids the difficulties of previous applications of clustering-coefficient (and other) measures in defining correlational networks, i.e., thresholding on the correlation value, discarding of negative correlation values, the pseudo-correlation problem, and full partial correlation matrices whose estimation is computationally difficult. For proof of concept, we apply the proposed clustering coefficient measures to functional magnetic resonance imaging data obtained from healthy participants of various ages and compare them with conventional clustering coefficients. We show that the clustering coefficients decline with age. The proposed clustering coefficients are more strongly correlated with age than the conventional ones are. We also show that the local variants of the proposed clustering coefficients (i.e., abundance of triangles around a focal node) are useful in characterizing individual nodes. In contrast, the conventional local clustering coefficients were strongly correlated with, and therefore may be confounded by, the node's connectivity. The proposed methods are expected to help us understand clustering, and the lack thereof, in correlational brain networks, such as those derived from functional time series and across-participant correlation in neuroanatomical properties.
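
    A simplified sketch of the key idea (not the paper's exact estimator): measure the association between two neighbors of a focal node by their partial correlation given that node, which subtracts exactly the pseudo-correlation expected from the indirect path, and average over neighbor pairs:

```python
import numpy as np

def partial_corr(r_jk, r_ij, r_ik):
    """Correlation between nodes j and k with the focal node i partialled
    out, removing the indirect-path pseudo-correlation r_ij * r_ik."""
    return (r_jk - r_ij * r_ik) / np.sqrt((1 - r_ij ** 2) * (1 - r_ik ** 2))

def local_clustering(R, i):
    """Weighted average partial correlation over pairs of i's neighbours,
    a simplified variant of the proposed coefficient; |r_ij * r_ik| is
    used here as the pair weight (an assumption of this sketch)."""
    n = R.shape[0]
    num = den = 0.0
    for j in range(n):
        for k in range(j + 1, n):
            if i in (j, k):
                continue
            w = abs(R[i, j] * R[i, k])
            num += w * partial_corr(R[j, k], R[i, j], R[i, k])
            den += w
    return num / den if den else 0.0
```

    No thresholding of R is needed and negative correlations enter with their sign, which is the practical advantage claimed above.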

  17. 3D variational brain tumor segmentation on a clustered feature set

    NASA Astrophysics Data System (ADS)

    Popuri, Karteek; Cobzas, Dana; Jagersand, Martin; Shah, Sirish L.; Murtha, Albert

    2009-02-01

    Tumor segmentation from MRI data is a particularly challenging and time consuming task. Tumors have a large diversity in shape and appearance with intensities overlapping the normal brain tissues. In addition, an expanding tumor can also deflect and deform nearby tissue. Our work addresses these last two difficult problems. We use the available MRI modalities (T1, T1c, T2) and their texture characteristics to construct a multi-dimensional feature set. Further, we extract clusters which provide a compact representation of the essential information in these features. The main idea in this paper is to incorporate these clustered features into the 3D variational segmentation framework. In contrast to the previous variational approaches, we propose a segmentation method that evolves the contour in a supervised fashion. The segmentation boundary is driven by the learned inside and outside region voxel probabilities in the cluster space. We incorporate prior knowledge about the normal brain tissue appearance, during the estimation of these region statistics. In particular, we use a Dirichlet prior that discourages the clusters in the ventricles to be in the tumor and hence better disambiguate the tumor from brain tissue. We show the performance of our method on real MRI scans. The experimental dataset includes MRI scans, from patients with difficult instances, with tumors that are inhomogeneous in appearance, small in size and in proximity to the major structures in the brain. Our method shows good results on these test cases.

  18. Clustering of health-related behaviors among early and mid-adolescents in Tuscany: results from a representative cross-sectional study

    PubMed Central

    Lazzeri, Giacomo; Panatto, Donatella; Domnich, Alexander; Arata, Lucia; Pammolli, Andrea; Simi, Rita; Giacchi, Mariano Vincenzo; Amicizia, Daniela; Gasparini, Roberto

    2018-01-01

    Background A large body of literature suggests that adolescents’ health-related behaviors tend to occur in clusters, and the understanding of such behavioral clustering may have direct implications for the effective tailoring of health-promotion interventions. Despite the usefulness of analyzing clustering, Italian data on this topic are scant. This study aimed to evaluate the clustering patterns of health-related behaviors. Methods The present study is based on data from the Health Behaviors in School-aged Children (HBSC) study conducted in Tuscany in 2010, which involved 3291 11-, 13- and 15-year-olds. To aggregate students’ data on 22 health-related behaviors, factor analysis and subsequent cluster analysis were performed. Results Factor analysis revealed eight factors, which were dubbed in accordance with their main traits: ‘Alcohol drinking’, ‘Smoking’, ‘Physical activity’, ‘Screen time’, ‘Signs & symptoms’, ‘Healthy eating’, ‘Violence’ and ‘Sweet tooth’. These factors explained 67% of variance and underwent cluster analysis. A six-cluster κ-means solution was established with a 93.8% level of classification validity. The between-cluster differences in both mean age and gender distribution were highly statistically significant. Conclusions Health-compromising behaviors are common among Tuscan teens and occur in distinct clusters. These results may be used by schools, health-promotion authorities and other stakeholders to design and implement tailored preventive interventions in Tuscany. PMID:27908972

  19. Just the right age: well-clustered exposure ages from a global glacial 10Be compilation

    NASA Astrophysics Data System (ADS)

    Heyman, Jakob; Margold, Martin

    2017-04-01

    Cosmogenic exposure dating has been used extensively for defining glacial chronologies, in both ice-sheet and alpine settings, and the global set of published ages today reaches well beyond 10,000 samples. Over the last few years, a number of important developments have improved the measurements (with well-defined AMS standards) and exposure-age calculations (with updated data and methods for calculating production rates), in the best case enabling high-precision dating of past glacial events. A remaining problem, however, is the fact that a large portion of all dated samples have been affected by prior and/or incomplete exposure, yielding erroneous exposure ages under the standard assumptions. One way to address this issue is to use only exposure ages that can confidently be considered unaffected by prior/incomplete exposure, such as groups of samples with statistically identical ages. Here we use objective statistical criteria to identify groups of well-clustered exposure ages from the global glacial "expage" 10Be compilation. Out of ~1700 groups with at least 3 individual samples, ~30% are well-clustered, increasing to ~45% if outlier rejection of at most 1/3 of the samples is allowed (still requiring a minimum of 3 well-clustered ages). The dataset of well-clustered ages is heavily dominated by ages <30 ka, showing that well-defined cosmogenic chronologies exist primarily for the last glaciation. We observe a large-scale global synchronicity in the timing of the last deglaciation from ~20 to 10 ka. There is also a general correlation between the timing of deglaciation and latitude (or the size of the individual ice mass), with earlier deglaciation at lower latitudes and later deglaciation towards the poles. Grouping the data into regions and comparing with available paleoclimate data, we can start to untangle regional differences in the last deglaciation and the climate events controlling the ice mass loss. The extensive dataset and the statistical analysis enable an unprecedented global view of the last deglaciation.
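
    One objective criterion of the kind described (an assumption of this sketch, not necessarily the exact "expage" rule) is a reduced chi-square test: a group is well-clustered when its scatter about the error-weighted mean is consistent with the individual measurement uncertainties:

```python
def well_clustered(ages, sigmas, crit=2.0):
    """Reduced chi-square criterion for a group of exposure ages.

    ages:   exposure ages (e.g. in ka)
    sigmas: 1-sigma uncertainties for each age
    crit:   reduced chi-square cutoff (hypothetical default)
    """
    n = len(ages)
    if n < 3:                                  # require >= 3 samples
        return False
    w = [1.0 / s ** 2 for s in sigmas]
    mean = sum(wi * a for wi, a in zip(w, ages)) / sum(w)
    chi2 = sum(((a - mean) / s) ** 2 for a, s in zip(ages, sigmas))
    return chi2 / (n - 1) < crit
```

    Groups failing the test are candidates for prior or incomplete exposure; the outlier-rejection variant would retry after dropping up to one third of the samples.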

  20. Direct statistical modeling and its implications for predictive mapping in mining exploration

    NASA Astrophysics Data System (ADS)

    Sterligov, Boris; Gumiaux, Charles; Barbanson, Luc; Chen, Yan; Cassard, Daniel; Cherkasov, Sergey; Zolotaya, Ludmila

    2010-05-01

    Recent advances in geosciences make more and more multidisciplinary data available for mining exploration. This has allowed the development of methodologies for computing forecast ore maps from the statistical combination of such different input parameters, all based on inverse problem theory. Numerous statistical methods (e.g. the algebraic method, weights of evidence, the Siris method, etc.), with varying degrees of complexity in their development and implementation, have been proposed and/or adapted for ore-geology purposes. In the literature, such approaches are often presented through applications to natural examples, and the results obtained can present specificities due to local characteristics. Moreover, though crucial for statistical computations, the "minimum requirements" for input parameters (minimum number of data points, spatial distribution of objects, etc.) are often only poorly expressed. Problems therefore often arise when one has to choose between one method and another for a specific question. In this study, a direct statistical modeling approach is developed in order to i) evaluate the constraints on the input parameters and ii) test the validity of different existing inversion methods. The approach particularly focuses on the analysis of spatial relationships between the locations of points and various objects (e.g. polygons and/or polylines), which is particularly well adapted to constrain the influence of intrusive bodies - such as a granite - and faults or ductile shear zones on the spatial location of ore deposits (point objects). The method is designed to ensure a-dimensionality with respect to scale. In this approach, both the spatial distribution and the topology of objects (polygons and polylines) can be parametrized by the user (e.g. density of objects, length, surface, orientation, clustering). Then, the distance of points with respect to a given type of object (polygons or polylines) is given by a probability distribution. The location of points is computed assuming either independence or different grades of dependency between the two probability distributions. The results show that i) the mean polygon surface, the mean polyline length, the number of objects and their clustering are critical, and ii) the validity of the different tested inversion methods strongly depends on the relative importance of, and the dependency between, the parameters used. In addition, this combined approach of direct and inverse modeling offers an opportunity to test the robustness of the inferred point-distribution laws with respect to the quality of the input data set.
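
    The direct-modeling step can be sketched by generating synthetic "deposits" whose distance to a straight "fault" follows a chosen probability law (all values hypothetical), against which an inversion method can then be scored:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic 'deposits' placed around a straight 'fault' (the x-axis),
# with fault-normal distances drawn from an exponential law.
n_pts, mean_dist = 500, 2.0
x = rng.uniform(0.0, 100.0, n_pts)
d = rng.exponential(mean_dist, n_pts)          # distance-to-fault law
side = rng.choice([-1.0, 1.0], n_pts)          # either side of the fault
points = np.column_stack([x, side * d])

# A candidate inversion method can then be scored on how well it
# recovers the imposed control; here the mean distance estimates it.
est_mean_dist = np.abs(points[:, 1]).mean()
```

    Because the true distance law is imposed by construction, any inversion method's output can be checked against it, which is the point of direct modeling.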

  1. Conceptual and statistical problems associated with the use of diversity indices in ecology.

    PubMed

    Barrantes, Gilbert; Sandoval, Luis

    2009-09-01

    Diversity indices, particularly the Shannon-Wiener index, have been used extensively in analyzing patterns of diversity at different geographic and ecological scales. These indices have serious conceptual and statistical problems which make comparisons of species richness or species abundances across communities nearly impossible. There is often no single statistical method that retains all the information needed to answer even a simple question. However, multivariate analyses, such as cluster analyses or multiple regressions, could be used instead of diversity indices. More complex multivariate analyses, such as Canonical Correspondence Analysis, provide very valuable information on the environmental variables associated with the presence and abundance of the species in a community. In addition, particular hypotheses associated with changes in species richness across localities, or changes in the abundance of one or a group of species, can be tested using univariate, bivariate, and/or rarefaction statistical tests. The rarefaction method has proved robust for standardizing all samples to a common size. Even the simplest approach, reporting the number of species per taxonomic category, possibly provides more information than a diversity index value.
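
    The rarefaction method recommended above has a closed form (Hurlbert's formula) and can be computed exactly:

```python
from math import comb

def rarefied_richness(counts, n):
    """Hurlbert's rarefaction: expected number of species observed in a
    random subsample of n individuals, enabling fair comparisons of
    richness across samples of different sizes.

    counts: individuals per species, e.g. [12, 5, 1]
    """
    N = sum(counts)
    # P(species i absent from the subsample) = C(N - N_i, n) / C(N, n)
    return sum(1.0 - comb(N - Ni, n) / comb(N, n) for Ni in counts)
```

    Rarefying every sample to the size of the smallest one removes the sample-size bias that makes raw richness (and many diversity indices) incomparable across communities.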

  2. Synthesis of instrumentally and historically recorded earthquakes and studying their spatial statistical relationship (A case study: Dasht-e-Biaz, Eastern Iran)

    NASA Astrophysics Data System (ADS)

    Jalali, Mohammad; Ramazi, Hamidreza

    2018-06-01

    Earthquake catalogues are the main source of statistical seismology for long-term studies of earthquake occurrence. Therefore, studying spatiotemporal problems is important to reduce the related uncertainties in statistical seismology studies. A statistical tool, the time-normalization method, was applied to revise the time-frequency relationship in one of the most active regions of Asia, eastern Iran and western Afghanistan (a and b were calculated as approximately 8.84 and 1.99 on the exponential scale, not the logarithmic scale). A geostatistical simulation method was further utilized to reduce the uncertainties in the spatial domain, producing a representative synthetic catalogue with 5361 events. The synthetic database is classified using a Geographical Information System, GIS, based on simulated magnitudes to reveal the underlying seismicity patterns. Although some regions with high seismicity correspond to known faults, significantly, as far as seismic patterns are concerned, the new method highlights possible locations of interest that have not been previously identified. It also reveals some previously unrecognized lineations and clusters in likely future strain release.

  3. Health-risk behaviour in Croatia.

    PubMed

    Bécue-Bertaut, Mónica; Kern, Josipa; Hernández-Maldonado, Maria-Luisa; Juresa, Vesna; Vuletic, Silvije

    2008-02-01

    To identify the health-risk behaviour of various homogeneous clusters of individuals. The study was conducted in 13 of the 20 Croatian counties and in Zagreb, the Croatian capital. In the first stage, general practices were selected in each county. The second-stage sample was created by drawing a random subsample of 10% of the patients registered at each selected general practice. The sample was divided into seven homogenous clusters using statistical methodology, combining multiple factor analysis with a hybrid clustering method. Seven homogeneous clusters were identified, three composed of males and four composed of females, based on statistically significant differences between selected characteristics (P<0.001). Although, in general, self-assessed health declined with age, significant variations were observed within specific age intervals. Higher levels of self-assessed health were associated with higher levels of education and/or socio-economic status. Many individuals, especially females, who self-reported poor health were heavy consumers of sleeping pills. Males and females reported different health-risk behaviours related to lifestyle, diet and use of the healthcare system. Heavy alcohol and tobacco use, unhealthy diet, risky physical activity and non-use of the healthcare system influenced self-assessed health in males. Females were slightly less satisfied with their health than males of the same age and educational level. Even highly educated females who took preventive healthcare tests and ate a healthy diet reported a less satisfactory self-assessed level of health than expected. Sociodemographic characteristics, life style, self-assessed health and use of the healthcare system were used in the identification of seven homogeneous population clusters. A comprehensive analysis of these clusters suggests health-related prevention and intervention efforts geared towards specific populations.

  4. Improving the Statistical Modeling of the TRMM Extreme Precipitation Monitoring System

    NASA Astrophysics Data System (ADS)

    Demirdjian, L.; Zhou, Y.; Huffman, G. J.

    2016-12-01

    This project improves upon an existing extreme precipitation monitoring system based on the Tropical Rainfall Measuring Mission (TRMM) daily product (3B42) using new statistical models. The proposed system utilizes a regional modeling approach, where data from similar grid locations are pooled to increase the quality and stability of the resulting model parameter estimates to compensate for the short data record. The regional frequency analysis is divided into two stages. In the first stage, the region defined by the TRMM measurements is partitioned into approximately 27,000 non-overlapping clusters using a recursive k-means clustering scheme. In the second stage, a statistical model is used to characterize the extreme precipitation events occurring in each cluster. Instead of utilizing the block-maxima approach used in the existing system, where annual maxima are fit to the Generalized Extreme Value (GEV) probability distribution at each cluster separately, the present work adopts the peak-over-threshold (POT) method of classifying points as extreme if they exceed a pre-specified threshold. Theoretical considerations motivate the use of the Generalized-Pareto (GP) distribution for fitting threshold exceedances. The fitted parameters can be used to construct simple and intuitive average recurrence interval (ARI) maps which reveal how rare a particular precipitation event is given its spatial location. The new methodology eliminates much of the random noise that was produced by the existing models due to a short data record, producing more reasonable ARI maps when compared with NOAA's long-term Climate Prediction Center (CPC) ground based observations. The resulting ARI maps can be useful for disaster preparation, warning, and management, as well as increased public awareness of the severity of precipitation events. Furthermore, the proposed methodology can be applied to various other extreme climate records.
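
    A minimal peak-over-threshold sketch of the kind of calculation described (method-of-moments fitting of the Generalized Pareto is used here for brevity; an operational system would more likely use maximum likelihood or L-moments):

```python
import numpy as np

def gpd_return_level(data, u, obs_per_year, T):
    """Fit a Generalized Pareto to the excesses over threshold u by the
    method of moments, then invert for the level with average
    recurrence interval T (in years)."""
    exc = data[data > u] - u
    m, v = exc.mean(), exc.var(ddof=1)
    xi = 0.5 * (1.0 - m * m / v)               # shape (MOM estimate)
    sigma = m * (1.0 - xi)                     # scale (MOM estimate)
    years = len(data) / obs_per_year           # length of record
    lam = len(exc) / years                     # exceedance rate per year
    if abs(xi) < 1e-9:                         # exponential limit
        return u + sigma * np.log(lam * T)
    return u + (sigma / xi) * ((lam * T) ** xi - 1.0)
```

    Evaluating this per spatial cluster, with daily TRMM totals as `data`, yields the kind of ARI map the abstract describes: the T-year level at each location tells how rare a given accumulation is there.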

  5. Regional variation in the severity of pesticide exposure outcomes: applications of geographic information systems and spatial scan statistics.

    PubMed

    Sudakin, Daniel L; Power, Laura E

    2009-03-01

    Geographic information systems and spatial scan statistics have been utilized to assess regional clustering of symptomatic pesticide exposures reported to a state Poison Control Center (PCC) during a single year. In the present study, we analyzed five subsequent years of PCC data to test whether there are significant geographic differences in pesticide exposure incidents resulting in serious (moderate, major, and fatal) medical outcomes. A PCC provided the data on unintentional pesticide exposures for the time period 2001-2005. The geographic location of the caller, the location where the exposure occurred, the exposure route, and the medical outcome were abstracted. There were 273 incidents resulting in moderate effects (n = 261), major effects (n = 10), or fatalities (n = 2). Spatial scan statistics identified a geographic area consisting of two adjacent counties (one urban, one rural), where statistically significant clustering of serious outcomes was observed. The relative risk of moderate, major, and fatal outcomes was 2.0 in this spatial cluster (p = 0.0005). PCC data, geographic information systems, and spatial scan statistics can identify clustering of serious outcomes from human exposure to pesticides. These analyses may be useful for public health officials to target preventive interventions. Further investigation is warranted to understand better the potential explanations for geographical clustering, and to assess whether preventive interventions have an impact on reducing pesticide exposure incidents resulting in serious medical outcomes.
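
    The core of a Poisson spatial scan statistic is the log-likelihood ratio of a candidate cluster, sketched below; in practice the zone with the maximum ratio is found by scanning many candidate circles, and its p-value comes from Monte Carlo replications under the null, both omitted here:

```python
import math

def poisson_llr(c_in, n_in, c_tot, n_tot):
    """Kulldorff-style log-likelihood ratio for a candidate zone with
    c_in cases among n_in people, against overall totals c_tot, n_tot."""
    e_in = c_tot * n_in / n_tot          # expected cases in the zone
    if c_in <= e_in:                     # scan only for elevated risk
        return 0.0
    c_out, e_out = c_tot - c_in, c_tot - e_in
    out_term = c_out * math.log(c_out / e_out) if c_out > 0 else 0.0
    return c_in * math.log(c_in / e_in) + out_term
```

    The two-county cluster above, with a relative risk of 2.0, is exactly the kind of zone this ratio ranks highly.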

  6. An investigation on thermal patterns in Iran based on spatial autocorrelation

    NASA Astrophysics Data System (ADS)

    Fallah Ghalhari, Gholamabbas; Dadashi Roudbari, Abbasali

    2018-02-01

    The present study aimed at investigating temporal-spatial patterns and monthly patterns of temperature in Iran using new spatial statistical methods such as cluster and outlier analysis and hotspot analysis. To do so, climatic parameters, namely the monthly average temperatures of 122 synoptic stations, were assessed. Statistical analysis showed that January, at 120.75%, had the most fluctuation among the studied months. Global Moran's Index revealed that yearly changes of temperature in Iran follow a strongly spatially clustered pattern. Findings showed that the strongest thermal cluster pattern in Iran, 0.975388, occurred in May. Cluster and outlier analyses showed that thermal homogeneity in Iran decreases in cold months and increases in warm months, owing to the radiation angle and the synoptic systems that strongly influence thermal order in Iran. Elevation, however, plays the most notable role, as shown by a geographically weighted regression model. Hotspot analysis showed that hot thermal patterns (very hot, hot, and semi-hot) were dominant in the south, covering 33.5% of the area (about 552,145.3 km2). Regions such as mountain-foot zones and lowlands lack any significant spatial autocorrelation (25.2%, about 415,345.1 km2). The remainder is the cold thermal area (very cold, cold, and semi-cold), covering about 25.2% (about 552,145.3 km2) of the whole area of Iran.
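    Global Moran's I, the spatial autocorrelation index used above, is straightforward to compute. A minimal sketch on made-up one-dimensional station data; the weight matrix and values are illustrative assumptions, not the study's station network:

```python
import numpy as np

def morans_i(values, weights):
    """Global Moran's I: spatial autocorrelation of `values` under a
    spatial weights matrix `weights` (here simple binary adjacency)."""
    x = np.asarray(values, dtype=float)
    z = x - x.mean()
    w = np.asarray(weights, dtype=float)
    num = (w * np.outer(z, z)).sum()
    den = (z ** 2).sum()
    return (len(x) / w.sum()) * (num / den)

# Toy 1-D "station" layout with neighbours = adjacent stations.
temps = np.array([10.0, 11.0, 12.0, 25.0, 26.0, 27.0])  # spatially clustered
n = len(temps)
w = np.zeros((n, n))
for i in range(n - 1):
    w[i, i + 1] = w[i + 1, i] = 1.0  # adjacency weights
```

    Positive values indicate a clustered pattern (similar values adjacent), negative values a dispersed, checkerboard-like pattern.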

  7. Identification and characterization of near-fatal asthma phenotypes by cluster analysis.

    PubMed

    Serrano-Pariente, J; Rodrigo, G; Fiz, J A; Crespo, A; Plaza, V

    2015-09-01

    Near-fatal asthma (NFA) is a heterogeneous clinical entity, and several profiles of patients have been described according to different clinical, pathophysiological and histological features. However, no previous studies have identified in an unbiased way--using statistical methods such as cluster analysis--different phenotypes of NFA. Therefore, the aim of the present study was to identify and characterize phenotypes of near-fatal asthma using a cluster analysis. Over a period of 2 years, 33 Spanish hospitals enrolled 179 asthmatics admitted for an episode of NFA. A cluster analysis using a two-step algorithm was performed on data from 84 of these cases. The analysis defined three clusters of patients with NFA: cluster 1, the largest, including older patients with clinical and therapeutic criteria of severe asthma; cluster 2, with a high proportion of respiratory arrest (68%), impaired consciousness level (82%) and mechanical ventilation (93%); and cluster 3, which included younger patients, characterized by insufficient anti-inflammatory treatment and frequent sensitization to Alternaria alternata and soybean. These results identify specific asthma phenotypes involved in NFA, confirming in part previous findings observed in studies with a clinical approach. The identification of patients with a specific NFA phenotype could suggest interventions to prevent future severe asthma exacerbations. © 2015 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.

  8. Cancer detection based on Raman spectra super-paramagnetic clustering

    NASA Astrophysics Data System (ADS)

    González-Solís, José Luis; Guizar-Ruiz, Juan Ignacio; Martínez-Espinosa, Juan Carlos; Martínez-Zerega, Brenda Esmeralda; Juárez-López, Héctor Alfonso; Vargas-Rodríguez, Héctor; Gallegos-Infante, Luis Armando; González-Silva, Ricardo Armando; Espinoza-Padilla, Pedro Basilio; Palomares-Anda, Pascual

    2016-08-01

    The clustering of Raman spectra of serum samples is analyzed using the super-paramagnetic clustering technique based on the Potts spin model. We investigated the clustering of biochemical networks by using Raman data that define edge lengths in the network, where the interactions are functions of the individual band intensities of the Raman spectra. For this study, we used two control groups of 58 and 102 Raman spectra, and 160, 150 and 42 Raman spectra of serum samples from breast cancer, cervical cancer and leukemia patients, respectively. The spectra were collected from patients from different hospitals in Mexico. Using the super-paramagnetic clustering technique, we identified the most natural and compact clusters, allowing us to discriminate between control and cancer patients. Of special interest was the leukemia case, where the nearly hierarchical structure observed allowed identification of each patient's leukemia type. The goal of this study is to apply a model from statistical physics, the super-paramagnetic technique, to find the natural clusters that allow us to design a cancer detection method. To the best of our knowledge, this is the first report of preliminary results evaluating the usefulness of super-paramagnetic clustering in spectroscopy, where it is used for classification of spectra.

  9. Regional health care planning: a methodology to cluster facilities using community utilization patterns

    PubMed Central

    2013-01-01

    Background Community-based health care planning and regulation necessitates grouping facilities and areal units into regions of similar health care use. Limited research has explored the methodologies used in creating these regions. We offer a new methodology that clusters facilities based on similarities in patient utilization patterns and geographic location. Our case study focused on Hospital Groups in Michigan, the allocation units used for predicting future inpatient hospital bed demand in the state’s Bed Need Methodology. The scientific, practical, and political concerns that were considered throughout the formulation and development of the methodology are detailed. Methods The clustering methodology employs a 2-step K-means + Ward’s clustering algorithm to group hospitals. The final number of clusters is selected using a heuristic that integrates both a statistical-based measure of cluster fit and characteristics of the resulting Hospital Groups. Results Using recent hospital utilization data, the clustering methodology identified 33 Hospital Groups in Michigan. Conclusions Despite being developed within the politically charged climate of Certificate of Need regulation, we have provided an objective, replicable, and sustainable methodology to create Hospital Groups. Because the methodology is built upon theoretically sound principles of clustering analysis and health care service utilization, it is highly transferable across applications and suitable for grouping facilities or areal units. PMID:23964905
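    A 2-step K-means + Ward's procedure of the kind described can be sketched with SciPy: K-means first compresses the facilities into many small prototype groups, and Ward's hierarchical clustering then merges the prototypes into the final groups. The data, cluster counts, and feature construction below are illustrative assumptions, not the Michigan Bed Need Methodology's actual inputs.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Hypothetical utilization profiles (rows: hospitals, cols: patient-origin shares).
X = np.vstack([rng.normal(c, 0.1, size=(30, 4)) for c in (0.0, 1.0, 2.0)])

# Step 1: K-means compresses the 90 hospitals into 12 small prototype groups.
centroids, proto = kmeans2(X, 12, seed=2, minit="++")

# Step 2: Ward's hierarchical clustering merges prototypes into 3 final groups.
Z = linkage(centroids, method="ward")
proto_group = fcluster(Z, t=3, criterion="maxclust")

# Each hospital inherits the final group of its prototype.
groups = proto_group[proto]
```

    In the paper's heuristic, the final number of groups would be chosen by combining a statistical fit measure with characteristics of the resulting groups rather than being fixed in advance.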

  10. Simulations of the Formation and Evolution of X-ray Clusters

    NASA Astrophysics Data System (ADS)

    Bryan, G. L.; Klypin, A.; Norman, M. L.

    1994-05-01

    We describe results from a set of Omega = 1 Cold plus Hot Dark Matter (CHDM) and Cold Dark Matter (CDM) simulations. We examine the formation and evolution of X-ray clusters in a cosmological setting with sufficient numbers to perform statistical analysis. We find that CDM, normalized to COBE, seems to produce too many large clusters, both in terms of the luminosity (dn/dL) and temperature (dn/dT) functions. The CHDM simulation produces fewer clusters, and the temperature distribution (our numerically most secure result) matches observations where they overlap. The computed cluster luminosity function drops below observations, but we are almost surely underestimating the X-ray luminosity. Because of the lower fluctuations in CHDM, there are only a small number of bright clusters in our simulation volume; however, we can use the simulated clusters to fix the relation between temperature and velocity dispersion, allowing us to use collisionless N-body codes to probe larger length scales with correspondingly brighter clusters. The hydrodynamic simulations have been performed with a hybrid particle-mesh scheme for the dark matter and a high resolution grid-based piecewise parabolic method for the adiabatic gas dynamics. This combination has been implemented for massively parallel computers, allowing us to achieve grids as large as 512^3.

  11. Earthquake Declustering via a Nearest-Neighbor Approach in Space-Time-Magnitude Domain

    NASA Astrophysics Data System (ADS)

    Zaliapin, I. V.; Ben-Zion, Y.

    2016-12-01

    We propose a new method for earthquake declustering based on nearest-neighbor analysis of earthquakes in the space-time-magnitude domain. The nearest-neighbor approach was recently applied to a variety of seismological problems that validate the general utility of the technique and reveal the existence of several different robust types of earthquake clusters. Notably, it was demonstrated that clustering associated with the largest earthquakes is statistically different from that of small-to-medium events. In particular, the characteristic bimodality of the nearest-neighbor distances that helps separate clustered and background events is often violated in the vicinity of the largest earthquakes, which is dominated by triggered events. This prevents using a simple threshold between the two modes of the nearest-neighbor distance distribution for declustering. The current study resolves this problem, thereby extending the nearest-neighbor approach to the problem of earthquake declustering. The proposed technique is applied to the seismicity of different areas in California (San Jacinto, Coso, Salton Sea, Parkfield, Ventura, Mojave, etc.), as well as to global seismicity, to demonstrate its stability and efficiency in treating various clustering types. The results are compared with those of alternative declustering methods.
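    The space-time-magnitude nearest-neighbor proximity underlying this approach can be sketched directly. The b-value, fractal dimension d, and toy catalogue below are illustrative assumptions (not the study's California or global catalogues), and this sketch computes only the proximities, not the full declustering.

```python
import numpy as np

def nn_proximity(times, x, y, mags, b=1.0, d=1.6):
    """Nearest-neighbor proximity in the space-time-magnitude domain:
    eta_ij = t_ij * r_ij**d * 10**(-b * m_i), minimised over earlier events i.
    Returns each event's parent index and proximity (inf for the first event)."""
    n = len(times)
    parent = np.full(n, -1)
    eta = np.full(n, np.inf)
    for j in range(1, n):
        t = times[j] - times[:j]                  # interevent times
        r = np.hypot(x[j] - x[:j], y[j] - y[:j])  # epicentral distances
        e = t * np.maximum(r, 1e-6) ** d * 10.0 ** (-b * mags[:j])
        parent[j] = int(np.argmin(e))
        eta[j] = e[parent[j]]
    return parent, eta

# Toy catalogue: a mainshock at t = 10 followed by two nearby aftershocks.
times = np.array([0.0, 10.0, 10.1, 10.3, 40.0])
x = np.array([0.0, 5.0, 5.05, 5.1, 50.0])
y = np.zeros(5)
mags = np.array([3.0, 6.0, 3.5, 3.2, 3.0])
parent, eta = nn_proximity(times, x, y, mags)
```

    In the bimodal case, small eta values (the aftershocks here) form the clustered mode and large values the background mode; the paper's contribution is handling catalogues where that separation breaks down near the largest events.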

  12. High-resolution land cover classification using low resolution global data

    NASA Astrophysics Data System (ADS)

    Carlotto, Mark J.

    2013-05-01

    A fusion approach is described that combines texture features from high-resolution panchromatic imagery with land cover statistics derived from co-registered low-resolution global databases to obtain high-resolution land cover maps. The method does not require training data or any human intervention. We use an MxN Gabor filter bank consisting of M=16 oriented bandpass filters (0-180°) at N resolutions (3-24 meters/pixel). The size range of these spatial filters is consistent with the typical scale of manmade objects and patterns of cultural activity in imagery. Clustering reduces the complexity of the data by combining pixels that have similar texture into clusters (regions). Texture classification assigns a vector of class likelihoods to each cluster based on its textural properties. Classification is unsupervised and accomplished using a bank of texture anomaly detectors. Class likelihoods are modulated by land cover statistics derived from lower resolution global data over the scene. Preliminary results from a number of Quickbird scenes show our approach is able to classify general land cover features such as roads, built up area, forests, open areas, and bodies of water over a wide range of scenes.
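    A Gabor filter bank of the kind described (oriented bandpass filters at several scales) can be sketched with NumPy/SciPy. A reduced 4-orientation x 2-frequency bank is used here instead of the paper's 16 x N bank, and the test image is synthetic; both are illustrative assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_kernel(freq, theta, sigma=3.0, size=15):
    """Real (cosine-phase) Gabor kernel at spatial frequency `freq`
    (cycles/pixel) and orientation `theta` (radians)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return envelope * np.cos(2 * np.pi * freq * xr)

# A small bank: 2 frequencies x 4 orientations (the paper uses 16 x N).
bank = [gabor_kernel(f, t)
        for f in (0.1, 0.25)
        for t in np.deg2rad([0, 45, 90, 135])]

# Vertical stripes (variation along x) respond most to the matched filter.
img = np.cos(2 * np.pi * 0.25 * np.arange(64))[None, :] * np.ones((64, 1))
energies = [np.mean(fftconvolve(img, k, mode="valid") ** 2) for k in bank]
```

    Per-pixel vectors of such filter energies are the texture features that the clustering step would then group into regions.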

  13. Sequential analysis of hydrochemical data for watershed characterization.

    PubMed

    Thyne, Geoffrey; Güler, Cüneyt; Poeter, Eileen

    2004-01-01

    A methodology for characterizing the hydrogeology of watersheds using hydrochemical data, combining statistical, geochemical, and spatial techniques, is presented. Surface water and ground water base flow and spring runoff samples (180 total) from a single watershed are first classified using hierarchical cluster analysis. The statistical clusters are analyzed for spatial coherence, confirming that the clusters have a geological basis corresponding to topographic flowpaths and showing that the fractured rock aquifer behaves as an equivalent porous medium on the watershed scale. Then principal component analysis (PCA) is used to determine the sources of variation between parameters. The PCA shows that the variations within the dataset are related to variations in calcium, magnesium, SO4, and HCO3, which are derived from natural weathering reactions, and in pH, NO3, and chloride, which indicate anthropogenic impact. PHREEQC modeling is used to quantitatively describe the natural hydrochemical evolution of the watershed and aid in discrimination of samples that have an anthropogenic component. Finally, the seasonal changes in the water chemistry of individual sites were analyzed to better characterize the spatial variability of vertical hydraulic conductivity. The integrated result provides a method to characterize the hydrogeology of the watershed that fully utilizes traditional data.
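    The statistical portion of such a workflow (standardise, Ward's hierarchical clustering, then PCA) can be sketched as follows; the sample matrix and the two hypothetical water facies are invented for illustration and do not reproduce the study's 180-sample dataset.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(3)
# Hypothetical samples x parameters (Ca, Mg, SO4, HCO3, NO3, Cl), two facies.
upstream = rng.normal([20, 5, 10, 80, 1, 4], 2.0, size=(40, 6))
downstream = rng.normal([60, 25, 90, 200, 8, 30], 5.0, size=(40, 6))
X = np.vstack([upstream, downstream])

# Standardise, then apply Ward's hierarchical clustering.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
labels = fcluster(linkage(Xz, method="ward"), t=2, criterion="maxclust")

# PCA via eigendecomposition of the correlation matrix: PC1 carries the
# dominant source of variation between parameters.
evals = np.linalg.eigvalsh(np.cov(Xz.T))
explained = evals[::-1] / evals.sum()
```

    In the paper the cluster memberships are then mapped to check spatial coherence, and the PCA loadings are inspected to separate weathering-related from anthropogenic variation.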

  14. Groundwater flow and hydrogeochemical evolution in the Jianghan Plain, central China

    NASA Astrophysics Data System (ADS)

    Gan, Yiqun; Zhao, Ke; Deng, Yamin; Liang, Xing; Ma, Teng; Wang, Yanxin

    2018-05-01

    Hydrogeochemical analysis and multivariate statistics were applied to identify flow patterns and major processes controlling the hydrogeochemistry of groundwater in the Jianghan Plain, which is located in central Yangtze River Basin (central China) and characterized by intensive surface-water/groundwater interaction. Although HCO3-Ca-(Mg) type water predominated in the study area, the 457 (21 surface water and 436 groundwater) samples were effectively classified into five clusters by hierarchical cluster analysis. The hydrochemical variations among these clusters were governed by three factors from factor analysis. Major components (e.g., Ca, Mg and HCO3) in surface water and groundwater originated from carbonate and silicate weathering (factor 1). Redox conditions (factor 2) influenced the geogenic Fe and As contamination in shallow confined groundwater. Anthropogenic activities (factor 3) primarily caused high levels of Cl and SO4 in surface water and phreatic groundwater. Furthermore, the factor score 1 of samples in the shallow confined aquifer gradually increased along the flow paths. This study demonstrates that enhanced information on hydrochemistry in complex groundwater flow systems, by multivariate statistical methods, improves the understanding of groundwater flow and hydrogeochemical evolution due to natural and anthropogenic impacts.

  15. Statistical analysis of loopy belief propagation in random fields

    NASA Astrophysics Data System (ADS)

    Yasuda, Muneki; Kataoka, Shun; Tanaka, Kazuyuki

    2015-10-01

    Loopy belief propagation (LBP), which is equivalent to the Bethe approximation in statistical mechanics, is a message-passing-type inference method that is widely used to analyze systems based on Markov random fields (MRFs). In this paper, we propose a message-passing-type method to analytically evaluate the quenched average of LBP in random fields by using the replica cluster variation method. The proposed analytical method is applicable to general pairwise MRFs with random fields whose distributions differ from each other and can give the quenched averages of the Bethe free energies over random fields, which are consistent with numerical results. The order of its computational cost is equivalent to that of standard LBP. In the latter part of this paper, we describe the application of the proposed method to Bayesian image restoration, in which we observed that our theoretical results are in good agreement with the numerical results for natural images.
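    For reference, plain sum-product LBP on a small pairwise MRF (the procedure whose quenched average the paper analyses) looks like the sketch below. The potentials and the 3-node loop are illustrative assumptions, and this does not implement the paper's replica cluster variation method.

```python
import numpy as np

def lbp_marginals(unary, pairwise, edges, n_iter=50):
    """Sum-product loopy belief propagation on a pairwise MRF with
    potentials: unary[i] per node, one shared `pairwise` table per edge."""
    n, s = unary.shape
    nbrs = {i: [] for i in range(n)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    msg = {(i, j): np.ones(s) / s for i in nbrs for j in nbrs[i]}
    for _ in range(n_iter):
        new = {}
        for (i, j) in msg:
            prod = unary[i].copy()
            for k in nbrs[i]:
                if k != j:
                    prod = prod * msg[(k, i)]
            m = pairwise.T @ prod          # marginalise over the sender's state
            new[(i, j)] = m / m.sum()
        msg = new
    beliefs = unary.copy()
    for i in range(n):
        for k in nbrs[i]:
            beliefs[i] = beliefs[i] * msg[(k, i)]
    return beliefs / beliefs.sum(axis=1, keepdims=True)

# A 3-node loop with attractive coupling: the biased node pulls its neighbours.
unary = np.array([[0.9, 0.1], [0.5, 0.5], [0.5, 0.5]])
pairwise = np.array([[2.0, 1.0], [1.0, 2.0]])
edges = [(0, 1), (1, 2), (2, 0)]
beliefs = lbp_marginals(unary, pairwise, edges)
```

    Each message-update sweep costs time linear in the number of directed edges, which is the "computational cost of standard LBP" the abstract compares against.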

  16. Depth of interaction decoding of a continuous crystal detector module.

    PubMed

    Ling, T; Lewellen, T K; Miyaoka, R S

    2007-04-21

    We present a clustering method to extract the depth of interaction (DOI) information from an 8 mm thick crystal version of our continuous miniature crystal element (cMiCE) small animal PET detector. This clustering method, based on the maximum-likelihood (ML) method, can effectively build look-up tables (LUT) for different DOI regions. Combined with our statistics-based positioning (SBP) method, which uses a LUT searching algorithm based on the ML method and two-dimensional mean-variance LUTs of light responses from each photomultiplier channel with respect to different gamma ray interaction positions, the position of interaction and DOI can be estimated simultaneously. Data simulated using DETECT2000 were used to help validate our approach. An experiment using our cMiCE detector was designed to evaluate the performance. Two and four DOI region clustering were applied to the simulated data. Two DOI regions were used for the experimental data. The misclassification rate for simulated data is about 3.5% for two DOI regions and 10.2% for four DOI regions. For the experimental data, the rate is estimated to be approximately 25%. By using multi-DOI LUTs, we also observed improvement of the detector spatial resolution, especially for the corner region of the crystal. These results show that our ML clustering method is a consistent and reliable way to characterize DOI in a continuous crystal detector without requiring any modifications to the crystal or detector front end electronics. The ability to characterize the depth-dependent light response function from measured data is a major step forward in developing practical detectors with DOI positioning capability.

  17. A case-association cluster detection and visualisation tool with an application to Legionnaires’ disease

    PubMed Central

    Sansom, P; Copley, V R; Naik, F C; Leach, S; Hall, I M

    2013-01-01

    Statistical methods used in spatio-temporal surveillance of disease are able to identify abnormal clusters of cases but typically do not provide a measure of the degree of association between one case and another. Such a measure would facilitate the assignment of cases to common groups and be useful in outbreak investigations of diseases that potentially share the same source. This paper presents a model-based approach, which on the basis of available location data, provides a measure of the strength of association between cases in space and time and which is used to designate and visualise the most likely groupings of cases. The method was developed as a prospective surveillance tool to signal potential outbreaks, but it may also be used to explore groupings of cases in outbreak investigations. We demonstrate the method by using a historical case series of Legionnaires’ disease amongst residents of England and Wales. PMID:23483594

  18. Clustering, randomness, and regularity in cloud fields. 4. Stratocumulus cloud fields

    NASA Astrophysics Data System (ADS)

    Lee, J.; Chou, J.; Weger, R. C.; Welch, R. M.

    1994-07-01

    To complete the analysis of the spatial distribution of boundary layer cloudiness, the present study focuses on nine stratocumulus Landsat scenes. The results indicate many similarities between stratocumulus and cumulus spatial distributions. Most notably, at full spatial resolution all scenes exhibit a decidedly clustered distribution. The strength of the clustering signal decreases with increasing cloud size; the clusters themselves consist of a few clouds (less than 10), occupy a small percentage of the cloud field area (less than 5%), contain between 20% and 60% of the cloud field population, and are randomly located within the scene. In contrast, stratocumulus in almost every respect are more strongly clustered than are cumulus cloud fields. For instance, stratocumulus clusters contain more clouds per cluster, occupy a larger percentage of the total area, and have a larger percentage of clouds participating in clusters than the corresponding cumulus examples. To investigate clustering at intermediate spatial scales, the local dimensionality statistic is introduced. Results obtained from this statistic provide the first direct evidence for regularity among large (>900 m in diameter) clouds in stratocumulus and cumulus cloud fields, in support of the inhibition hypothesis of Ramirez and Bras (1990). Also, the size compensated point-to-cloud cumulative distribution function statistic is found to be necessary to obtain a consistent description of stratocumulus cloud distributions. A hypothesis regarding the underlying physical mechanisms responsible for cloud clustering is presented. It is suggested that cloud clusters often arise from 4 to 10 triggering events localized within regions less than 2 km in diameter and randomly distributed within the cloud field. As the size of the cloud surpasses the scale of the triggering region, the clustering signal weakens and the larger cloud locations become more random.

  19. Clustering, randomness, and regularity in cloud fields. 4: Stratocumulus cloud fields

    NASA Technical Reports Server (NTRS)

    Lee, J.; Chou, J.; Weger, R. C.; Welch, R. M.

    1994-01-01

    To complete the analysis of the spatial distribution of boundary layer cloudiness, the present study focuses on nine stratocumulus Landsat scenes. The results indicate many similarities between stratocumulus and cumulus spatial distributions. Most notably, at full spatial resolution all scenes exhibit a decidedly clustered distribution. The strength of the clustering signal decreases with increasing cloud size; the clusters themselves consist of a few clouds (less than 10), occupy a small percentage of the cloud field area (less than 5%), contain between 20% and 60% of the cloud field population, and are randomly located within the scene. In contrast, stratocumulus in almost every respect are more strongly clustered than are cumulus cloud fields. For instance, stratocumulus clusters contain more clouds per cluster, occupy a larger percentage of the total area, and have a larger percentage of clouds participating in clusters than the corresponding cumulus examples. To investigate clustering at intermediate spatial scales, the local dimensionality statistic is introduced. Results obtained from this statistic provide the first direct evidence for regularity among large (more than 900 m in diameter) clouds in stratocumulus and cumulus cloud fields, in support of the inhibition hypothesis of Ramirez and Bras (1990). Also, the size compensated point-to-cloud cumulative distribution function statistic is found to be necessary to obtain a consistent description of stratocumulus cloud distributions. A hypothesis regarding the underlying physical mechanisms responsible for cloud clustering is presented. It is suggested that cloud clusters often arise from 4 to 10 triggering events localized within regions less than 2 km in diameter and randomly distributed within the cloud field. As the size of the cloud surpasses the scale of the triggering region, the clustering signal weakens and the larger cloud locations become more random.

  20. SHIPS: Spectral Hierarchical Clustering for the Inference of Population Structure in Genetic Studies

    PubMed Central

    Bouaziz, Matthieu; Paccard, Caroline; Guedj, Mickael; Ambroise, Christophe

    2012-01-01

    Inferring the structure of populations has many applications for genetic research. In addition to providing information for evolutionary studies, it can be used to account for the bias induced by population stratification in association studies. To this end, many algorithms have been proposed to cluster individuals into genetically homogeneous sub-populations. The parametric algorithms, such as Structure, are very popular, but their underlying complexity and high computational cost led to the development of faster parametric alternatives such as Admixture. Alternatives to these methods are the non-parametric approaches. Among this category, AWclust has proven efficient but fails to properly identify population structure for complex datasets. We present in this article a new clustering algorithm called Spectral Hierarchical clustering for the Inference of Population Structure (SHIPS), based on a divisive hierarchical clustering strategy, allowing a progressive investigation of population structure. This method takes genetic data as input to cluster individuals into homogeneous sub-populations and, using the gap statistic, estimates the optimal number of such sub-populations. SHIPS was applied to a set of simulated discrete and admixed datasets and to real SNP datasets from the HapMap and Pan-Asian SNP consortia. The programs Structure, Admixture, AWclust and PCAclust were also investigated in a comparison study. SHIPS and the parametric approach Structure were the most accurate when applied to simulated datasets, both in terms of individual assignments and estimation of the correct number of clusters. The analysis of the results on the real datasets highlighted that the clusterings of SHIPS were the most consistent with the population labels or those produced by the Admixture program. The performance of SHIPS on SNP data, along with its relatively low computational cost and ease of use, makes this method a promising solution for inferring fine-scale genetic patterns. PMID:23077494
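    The gap statistic used by SHIPS to pick the number of sub-populations compares the within-cluster dispersion of the data with that of uniform reference data. Below is a self-contained sketch with a minimal k-means and synthetic 2-D "populations"; the data, reference count, and seeds are illustrative assumptions, not the SHIPS implementation.

```python
import numpy as np

def _kmeans(X, k, seed, n_iter=40):
    """Minimal k-means with ++-style seeding followed by Lloyd iterations."""
    rng = np.random.default_rng(seed)
    cent = [X[rng.integers(len(X))]]
    for _ in range(k - 1):  # seed new centroids far from existing ones
        d2 = np.min([((X - c) ** 2).sum(1) for c in cent], axis=0)
        cent.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    cent = np.array(cent)
    for _ in range(n_iter):
        lab = ((X[:, None] - cent[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(lab == j):
                cent[j] = X[lab == j].mean(0)
    return cent, lab

def log_wk(X, k, seed):
    """log of W_k, the pooled within-cluster sum of squares."""
    cent, lab = _kmeans(X, k, seed)
    return np.log(sum(((X[lab == j] - cent[j]) ** 2).sum() for j in set(lab)))

def gap_statistic(X, k, n_ref=10, seed=0):
    """Tibshirani gap(k) = E[log W_k(uniform reference)] - log W_k(data)."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(0), X.max(0)
    ref = [log_wk(rng.uniform(lo, hi, X.shape), k, seed + 1 + r)
           for r in range(n_ref)]
    return float(np.mean(ref) - log_wk(X, k, seed))

rng = np.random.default_rng(4)
# Three well-separated synthetic "sub-populations".
X = np.vstack([rng.normal(c, 0.2, size=(50, 2)) for c in ((0, 0), (5, 0), (0, 5))])
gaps = {k: gap_statistic(X, k) for k in (2, 3, 4, 5)}
best_k = max(gaps, key=gaps.get)
```

    The gap is largest near the true number of clusters, because splitting a genuine cluster reduces the data's dispersion less than the reference expects.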

  1. Development and optimization of SPECT gated blood pool cluster analysis for the prediction of CRT outcome

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lalonde, Michel, E-mail: mlalonde15@rogers.com; Wassenaar, Richard; Wells, R. Glenn

    2014-07-15

    Purpose: Phase analysis of single photon emission computed tomography (SPECT) radionuclide angiography (RNA) has been investigated for its potential to predict the outcome of cardiac resynchronization therapy (CRT). However, phase analysis may be limited in its potential at predicting CRT outcome, as valuable information may be lost by assuming that time-activity curves (TAC) follow a simple sinusoidal shape. A new method, cluster analysis, is proposed which directly evaluates the TACs and may lead to a better understanding of dyssynchrony patterns and CRT outcome. Cluster analysis algorithms were developed and optimized to maximize their ability to predict CRT response. Methods: 49 patients (N = 27 ischemic etiology) received a SPECT RNA scan as well as positron emission tomography (PET) perfusion and viability scans prior to undergoing CRT. A semiautomated algorithm sampled the left ventricle wall to produce 568 TACs from SPECT RNA data. The TACs were then subjected to two different cluster analysis techniques, K-means and normal average, where several input metrics were also varied to determine the optimal settings for the prediction of CRT outcome. Each TAC was assigned to a cluster group based on the comparison criteria, and global and segmental cluster size and scores were used as measures of dyssynchrony and used to predict response to CRT. A repeated random twofold cross-validation technique was used to train and validate the cluster algorithm. Receiver operating characteristic (ROC) analysis was used to calculate the area under the curve (AUC) and compare results to those obtained for SPECT RNA phase analysis and PET scar size analysis methods. Results: Using the normal average cluster analysis approach, the septal wall produced statistically significant results for predicting CRT outcome in the ischemic population (ROC AUC = 0.73; p < 0.05 vs. equal chance ROC AUC = 0.50), with an optimal operating point of 71% sensitivity and 60% specificity. Cluster analysis results were similar to SPECT RNA phase analysis (ROC AUC = 0.78, p = 0.73 vs. cluster AUC; sensitivity/specificity = 59%/89%) and PET scar size analysis (ROC AUC = 0.73, p = 1.0 vs. cluster AUC; sensitivity/specificity = 76%/67%). Conclusions: A SPECT RNA cluster analysis algorithm was developed for the prediction of CRT outcome. Cluster analysis produced results equivalent to those obtained from Fourier and scar analysis.

  2. A comparison of hierarchical cluster analysis and league table rankings as methods for analysis and presentation of district health system performance data in Uganda.

    PubMed

    Tashobya, Christine K; Dubourg, Dominique; Ssengooba, Freddie; Speybroeck, Niko; Macq, Jean; Criel, Bart

    2016-03-01

    In 2003, the Uganda Ministry of Health introduced the district league table for district health system performance assessment. The league table presents district performance against a number of input, process and output indicators and a composite index to rank districts. This study explores the use of hierarchical cluster analysis for analysing and presenting district health systems performance data and compares this approach with the use of the league table in Uganda. Ministry of Health and district plans and reports, and published documents were used to provide information on the development and utilization of the Uganda district league table. Quantitative data were accessed from the Ministry of Health databases. Statistical analysis was carried out in SPSS version 20, with hierarchical cluster analysis performed using Ward's method. The hierarchical cluster analysis was conducted on the basis of seven clusters determined for each year from 2003 to 2010, ranging from a cluster of good through moderate-to-poor performers. The characteristics and membership of clusters varied from year to year and were determined by the identity and magnitude of performance of the individual variables. Criticisms of the league table include perceived unfairness, as it did not take into consideration district peculiarities, and being oversummarized and not adequately informative. Clustering organizes the many data points into clusters of similar entities according to an agreed set of indicators and can provide a starting point for identifying factors behind the observed performance of districts. Although league table rankings emphasize summation and external control, clustering has the potential to encourage a formative, learning approach. More research is required to shed light on factors behind the observed performance of the different clusters. Other countries, especially low-income countries that share many similarities with Uganda, can learn from these experiences.
© The Author 2015. Published by Oxford University Press in association with The London School of Hygiene and Tropical Medicine.

  3. A comparison of hierarchical cluster analysis and league table rankings as methods for analysis and presentation of district health system performance data in Uganda†

    PubMed Central

    Tashobya, Christine K; Dubourg, Dominique; Ssengooba, Freddie; Speybroeck, Niko; Macq, Jean; Criel, Bart

    2016-01-01

    In 2003, the Uganda Ministry of Health introduced the district league table for district health system performance assessment. The league table presents district performance against a number of input, process and output indicators and a composite index to rank districts. This study explores the use of hierarchical cluster analysis for analysing and presenting district health systems performance data and compares this approach with the use of the league table in Uganda. Ministry of Health and district plans and reports, and published documents were used to provide information on the development and utilization of the Uganda district league table. Quantitative data were accessed from the Ministry of Health databases. Statistical analysis was carried out in SPSS version 20, with hierarchical cluster analysis performed using Ward's method. The hierarchical cluster analysis was conducted on the basis of seven clusters determined for each year from 2003 to 2010, ranging from a cluster of good through moderate-to-poor performers. The characteristics and membership of clusters varied from year to year and were determined by the identity and magnitude of performance of the individual variables. Criticisms of the league table include perceived unfairness, as it did not take into consideration district peculiarities, and being oversummarized and not adequately informative. Clustering organizes the many data points into clusters of similar entities according to an agreed set of indicators and can provide a starting point for identifying factors behind the observed performance of districts. Although league table rankings emphasize summation and external control, clustering has the potential to encourage a formative, learning approach. More research is required to shed light on factors behind the observed performance of the different clusters. Other countries, especially low-income countries that share many similarities with Uganda, can learn from these experiences. PMID:26024882

  4. Cluster-based analysis improves predictive validity of spike-triggered receptive field estimates

    PubMed Central

    Malone, Brian J.

    2017-01-01

    Spectrotemporal receptive field (STRF) characterization is a central goal of auditory physiology. STRFs are often approximated by the spike-triggered average (STA), which reflects the average stimulus preceding a spike. In many cases, the raw STA is subjected to a threshold defined by gain values expected by chance. However, such correction methods have not been universally adopted, and the consequences of specific gain-thresholding approaches have not been investigated systematically. Here, we evaluate two classes of statistical correction techniques, using the resulting STRF estimates to predict responses to a novel validation stimulus. The first, more traditional technique eliminated STRF pixels (time-frequency bins) with gain values expected by chance. This correction method yielded significant increases in prediction accuracy, including when the threshold setting was optimized for each unit. The second technique was a two-step thresholding procedure wherein clusters of contiguous pixels surviving an initial gain threshold were then subjected to a cluster mass threshold based on summed pixel values. This approach significantly improved upon even the best gain-thresholding techniques. Additional analyses suggested that allowing threshold settings to vary independently for excitatory and inhibitory subfields of the STRF resulted in only marginal additional gains, at best. In summary, augmenting reverse correlation techniques with principled statistical correction choices increased prediction accuracy by over 80% for multi-unit STRFs and by over 40% for single-unit STRFs, furthering the interpretational relevance of the recovered spectrotemporal filters for auditory systems analysis. PMID:28877194
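    The two-step gain-plus-cluster-mass thresholding described above can be sketched as follows. The z-scored STA, the embedded subfield, and both threshold settings (2.0 and 20.0) are hypothetical illustrations, not the paper's optimized values.

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(1)
sta = rng.normal(0, 1, (40, 60))      # hypothetical raw STA (z-scored gains)
sta[10:14, 20:30] += 3.0              # embedded excitatory subfield

# step 1: pixel-level gain threshold
mask = np.abs(sta) > 2.0

# step 2: cluster-mass threshold; sum |gain| over each contiguous cluster
# of surviving pixels and keep only sufficiently massive clusters
labels, n = ndimage.label(mask)
mass = ndimage.sum(np.abs(sta), labels, index=np.arange(1, n + 1))
keep = np.isin(labels, np.flatnonzero(mass > 20.0) + 1)
cleaned = np.where(keep, sta, 0.0)
```

    Isolated noise pixels survive step 1 at the expected false-positive rate but are removed in step 2, while the contiguous subfield passes both thresholds.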

  5. [Space-time suicide clustering in the community of Antequera (Spain)].

    PubMed

    Pérez-Costillas, Lucía; Blasco-Fontecilla, Hilario; Benítez, Nicolás; Comino, Raquel; Antón, José Miguel; Ramos-Medina, Valentín; Lopez, Amalia; Palomo, José Luis; Madrigal, Lucía; Alcalde, Javier; Perea-Millá, Emilio; Artieda-Urrutia, Paula; de León-Martínez, Victoria; de Diego Otero, Yolanda

    2015-01-01

    Approximately 3,500 people commit suicide every year in Spain. The main aim of this study is to explore whether spatial and temporal clustering of suicide exists in the region of Antequera (Málaga, Spain). Sample and procedure: All suicides from January 1, 2004 to December 31, 2008 were identified using data from the Forensic Pathology Department of the Institute of Legal Medicine, Málaga (Spain). Geolocalization: Google Earth was used to calculate the coordinates for each suicide decedent's address. Statistical analysis: A spatiotemporal permutation scan statistic and Ripley's K function were used to explore spatiotemporal clustering. Pearson's chi-squared test was used to determine whether there were differences between suicides inside and outside the spatiotemporal clusters. A total of 120 individuals committed suicide within the region of Antequera, of which 96 (80%) were included in our analyses. Statistically significant evidence for 7 spatiotemporal suicide clusters emerged, within critical limits of 0-2.5 km and of the first and second weeks after a suicide (P<.05 in both cases). Not a single subject among the suicides within clusters had been diagnosed with a current psychotic disorder, whereas outside the clusters 20% had this diagnosis (X2=4.13; df=1; P<.05). There are spatiotemporal suicide clusters in the area surrounding Antequera. Patients diagnosed with a current psychotic disorder appear less likely to be influenced by the factors explaining suicide clustering. Copyright © 2013 SEP y SEPB. Published by Elsevier España. All rights reserved.
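    The inside/outside-cluster comparison above is a standard Pearson chi-squared test on a 2x2 table. A sketch with hypothetical counts (not the study's raw data, which are not given in the abstract) looks like this:

```python
import numpy as np
from scipy.stats import chi2_contingency

# hypothetical 2x2 counts: current psychotic disorder (yes/no) among
# suicides inside vs. outside spatiotemporal clusters
table = np.array([[0, 26],    # inside a cluster
                  [14, 56]])  # outside all clusters
chi2, p, dof, expected = chi2_contingency(table, correction=False)
```

    A significant result here indicates that diagnosis frequency differs between cluster members and non-members, the same logic as the X2 test reported above.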

  6. Assessing market uncertainty by means of a time-varying intermittency parameter for asset price fluctuations

    NASA Astrophysics Data System (ADS)

    Rypdal, Martin; Sirnes, Espen; Løvsletten, Ola; Rypdal, Kristoffer

    2013-08-01

    Maximum likelihood estimation techniques for multifractal processes are applied to high-frequency data in order to quantify intermittency in the fluctuations of asset prices. From time records as short as one month these methods permit extraction of a meaningful intermittency parameter λ characterising the degree of volatility clustering. We can therefore study the time evolution of volatility clustering and test the statistical significance of this variability. By analysing data from the Oslo Stock Exchange, and comparing the results with the investment grade spread, we find that the estimates of λ are lower at times of high market uncertainty.

  7. Cluster Analysis of Time-Dependent Crystallographic Data: Direct Identification of Time-Independent Structural Intermediates

    PubMed Central

    Kostov, Konstantin S.; Moffat, Keith

    2011-01-01

    The initial output of a time-resolved macromolecular crystallography experiment is a time-dependent series of difference electron density maps that displays the time-dependent changes in underlying structure as a reaction progresses. The goal is to interpret such data in terms of a small number of crystallographically refinable, time-independent structures, each associated with a reaction intermediate; to establish the pathways and rate coefficients by which these intermediates interconvert; and thereby to elucidate a chemical kinetic mechanism. One strategy toward achieving this goal is to use cluster analysis, a statistical method that groups objects based on their similarity. If the difference electron density at a particular voxel in the time-dependent difference electron density (TDED) maps is sensitive to the presence of one and only one intermediate, then its temporal evolution will exactly parallel the concentration profile of that intermediate with time. The rationale is therefore to cluster voxels with respect to the shapes of their TDEDs, so that each group or cluster of voxels corresponds to one structural intermediate. Clusters of voxels whose TDEDs reflect the presence of two or more specific intermediates can also be identified. From such groupings one can then infer the number of intermediates, obtain their time-independent difference density characteristics, and refine the structure of each intermediate. We review the principles of cluster analysis and clustering algorithms in a crystallographic context, and describe the application of the method to simulated and experimental time-resolved crystallographic data for the photocycle of photoactive yellow protein. PMID:21244840
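    Clustering voxels by the shape of their TDEDs amounts to hierarchical clustering under a correlation distance, which is insensitive to amplitude. A minimal sketch with two hypothetical intermediate concentration profiles:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
t = np.linspace(0, 1, 50)
# hypothetical concentration profiles of two reaction intermediates
profile_a = np.exp(-5 * t)        # intermediate that decays from t = 0
profile_b = t * np.exp(-3 * t)    # intermediate that rises, then decays
voxels = np.vstack([profile_a + rng.normal(0, 0.02, 50) for _ in range(15)]
                   + [profile_b + rng.normal(0, 0.02, 50) for _ in range(15)])

# correlation distance groups voxels by the *shape* of their time course,
# not its amplitude, which is the property the TDED clustering relies on
Z = linkage(voxels, method="average", metric="correlation")
labels = fcluster(Z, t=2, criterion="maxclust")
```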

  8. Response to traumatic brain injury neurorehabilitation through an artificial intelligence and statistics hybrid knowledge discovery from databases methodology.

    PubMed

    Gibert, Karina; García-Rudolph, Alejandro; García-Molina, Alberto; Roig-Rovira, Teresa; Bernabeu, Montse; Tormos, José María

    2008-01-01

    Objective: to develop a classificatory tool to identify different populations of patients with traumatic brain injury based on the characteristics of deficit and response to treatment. Methods: a KDD framework in which, first, descriptive statistics of every variable were produced, followed by data cleaning and selection of relevant variables. The data were then mined using a generalization of Clustering Based on Rules (CIBR), a hybrid AI-and-statistics technique which combines inductive learning (AI) and clustering (statistics). A prior knowledge base (KB) is considered to properly bias the clustering; the semantic constraints implied by the KB hold in the final clusters, guaranteeing the interpretability of the results. A generalization (Exogenous Clustering Based on Rules, ECIBR) is presented, which allows the KB to be defined in terms of variables that are not themselves considered in the clustering process, providing greater flexibility. Several tools, such as the class panel graph, are introduced into the methodology to assist final interpretation. A set of 5 classes was recommended by the system, and interpretation permitted the labeling of profiles. From the medical point of view, the composition of the classes corresponds well with different patterns of increasing response to rehabilitation treatment. All patients who were initially assessable form a single group, while severely impaired patients are subdivided into four profiles with clearly distinct response patterns. Particularly interesting is the partial-response profile, in which patients could not improve executive functions. Meaningful classes were obtained and, from a semantic point of view, the results improved appreciably on classical clustering, supporting our view that hybrid AI-and-statistics techniques are more powerful for KDD than pure ones.

  9. Astrophysical properties of star clusters in the Magellanic Clouds homogeneously estimated by ASteCA

    NASA Astrophysics Data System (ADS)

    Perren, G. I.; Piatti, A. E.; Vázquez, R. A.

    2017-06-01

    Aims: We seek to produce a homogeneous catalog of astrophysical parameters of 239 resolved star clusters, located in the Small and Large Magellanic Clouds, observed in the Washington photometric system. Methods: The cluster sample was processed with the recently introduced Automated Stellar Cluster Analysis (ASteCA) package, which ensures both an automatized and a fully reproducible treatment, together with a statistically based analysis of their fundamental parameters and associated uncertainties. The fundamental parameters determined for each cluster with this tool, via a color-magnitude diagram (CMD) analysis, are metallicity, age, reddening, distance modulus, and total mass. Results: We generated a homogeneous catalog of structural and fundamental parameters for the studied cluster sample and performed a detailed internal error analysis along with a thorough comparison with values taken from 26 published articles. We studied the distribution of cluster fundamental parameters in both Clouds and obtained their age-metallicity relationships. Conclusions: The ASteCA package can be applied to an unsupervised determination of fundamental cluster parameters, which is a task of increasing relevance as more data becomes available through upcoming surveys. A table with the estimated fundamental parameters for the 239 clusters analyzed is only available at the CDS via anonymous ftp to http://cdsarc.u-strasbg.fr (http://130.79.128.5) or via http://cdsarc.u-strasbg.fr/viz-bin/qcat?J/A+A/602/A89

  10. Finding approximate gene clusters with Gecko 3.

    PubMed

    Winter, Sascha; Jahn, Katharina; Wehner, Stefanie; Kuchenbecker, Leon; Marz, Manja; Stoye, Jens; Böcker, Sebastian

    2016-11-16

    Gene-order-based comparison of multiple genomes provides signals for functional analysis of genes and the evolutionary process of genome organization. Gene clusters are regions of co-localized genes on genomes of different species. The rapid increase in sequenced genomes necessitates bioinformatics tools for finding gene clusters in hundreds of genomes. Existing tools are often restricted to few (in many cases, only two) genomes, and often make restrictive assumptions such as short perfect conservation, conserved gene order or monophyletic gene clusters. We present Gecko 3, an open-source software for finding gene clusters in hundreds of bacterial genomes, that comes with an easy-to-use graphical user interface. The underlying gene cluster model is intuitive, can cope with low degrees of conservation as well as misannotations and is complemented by a sound statistical evaluation. To evaluate the biological benefit of Gecko 3 and to exemplify our method, we search for gene clusters in a dataset of 678 bacterial genomes using Synechocystis sp. PCC 6803 as a reference. We confirm detected gene clusters reviewing the literature and comparing them to a database of operons; we detect two novel clusters, which were confirmed by publicly available experimental RNA-Seq data. The computational analysis is carried out on a laptop computer in <40 min. © The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research.

  11. Genotyping and spatial analysis of pulmonary tuberculosis and diabetes cases in the state of Veracruz, Mexico

    PubMed Central

    Blanco-Guillot, Francles; Ferreyra-Reyes, Leticia; Delgado-Sánchez, Guadalupe; Ferreira-Guerrero, Elizabeth; Montero-Campos, Rogelio; Bobadilla-del-Valle, Miriam; Martínez-Gamboa, Rosa Areli; Torres-González, Pedro; Téllez-Vazquez, Norma; Canizales-Quintero, Sergio; Yanes-Lane, Mercedes; Mongua-Rodríguez, Norma; Ponce-de-León, Alfredo; Sifuentes-Osornio, José

    2018-01-01

    Background: Genotyping and georeferencing in tuberculosis (TB) have been used to characterize the distribution of the disease and the occurrence of transmission within specific groups and communities. Objective: The objective of this study was to test the hypothesis that diabetes mellitus (DM) and pulmonary TB may occur in spatial and molecular aggregations. Material and methods: Retrospective cohort study of patients with pulmonary TB. The study area included 12 municipalities in the Sanitary Jurisdiction of Orizaba, Veracruz, México. Patients with acid-fast bacilli in sputum smears and/or Mycobacterium tuberculosis in sputum cultures were recruited from 1995 to 2010. Clinical (standardized questionnaire, physical examination, chest X-ray, blood glucose test and HIV test), microbiological, epidemiological, and molecular evaluations were carried out. Patients were considered "genotype-clustered" if two or more isolates from different patients were identified within 12 months of each other and had six or more IS6110 bands in an identical pattern, or <6 bands with identical IS6110 RFLP patterns and a spoligotype with the same spacer oligonucleotides. Residential and health care center addresses were georeferenced with a handheld GPS unit, and the coordinates were transferred from the GPS files to ArcGIS using ArcMap 9.3. We evaluated global spatial aggregation of patients in IS6110-RFLP/spoligotype clusters using global Moran's I. Since the global distribution was not random, we evaluated "hotspots" using the Getis-Ord Gi* statistic. Using bivariate and multivariate analysis we analyzed sociodemographic, behavioral, clinical and bacteriological conditions associated with "hotspots". We used STATA® v13.1 for all statistical analyses. Results: From 1995 to 2010, 1,370 patients >20 years were diagnosed with pulmonary TB; 33% had DM. The proportion of isolates that were genotyped was 80.7% (n = 1105), of which 31% (n = 342) were grouped in 91 genotype clusters with 2 to 23 patients each; 65.9% of the clusters were small (2 members), involving 35.08% of patients. Twenty-three percent (22.7%) of cases were classified as recent transmission. Moran's I indicated that the distribution of patients in IS6110-RFLP/spoligotype clusters was not random (Moran's I = 0.035468, Z value = 7.0, p = 0.00). Local spatial analysis showed statistically significant spatial aggregation of patients in IS6110-RFLP/spoligotype clusters, identifying "hotspots" and "coldspots". The Gi* statistic showed that the hotspot for spatial clustering was located in Camerino Z. Mendoza municipality; 14.6% (50/342) of patients in genotype clusters were located in a hotspot and, of these, 60% (30/50) lived with DM. Using logistic regression, the statistically significant variables associated with hotspots were DM [adjusted odds ratio (aOR) 7.04, 95% confidence interval (CI) 3.03–16.38] and attending the health center in Camerino Z. Mendoza (aOR 18.04, 95% CI 7.35–44.28). Conclusions: The combination of molecular and epidemiological information with geospatial data allowed us to identify the concurrence of molecular clustering and spatial aggregation of patients with DM and TB. This information may be highly useful for TB control programs. PMID:29534104
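    The global Moran's I used above has a compact closed form. A self-contained sketch on a hypothetical one-dimensional chain of sites (not the study's geography) shows how a clustered pattern yields a strongly positive value:

```python
import numpy as np

def morans_i(values, w):
    """Global Moran's I for a value vector and spatial weight matrix w."""
    z = values - values.mean()
    return len(values) / w.sum() * (w * np.outer(z, z)).sum() / (z ** 2).sum()

# hypothetical 1-D chain of 10 sites; neighbors share an edge (weight 1)
n = 10
w = np.zeros((n, n))
for i in range(n - 1):
    w[i, i + 1] = w[i + 1, i] = 1.0

# strongly clustered pattern: five high-value sites next to five low-value sites
clustered = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=float)
I_obs = morans_i(clustered, w)   # about 0.78, far above the null expectation
```

    Under spatial randomness the expected value is -1/(n-1), about -0.11 here, so values near +0.78 indicate marked aggregation.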

  12. Joint spatial-spectral hyperspectral image clustering using block-diagonal amplified affinity matrix

    NASA Astrophysics Data System (ADS)

    Fan, Lei; Messinger, David W.

    2018-03-01

    The large number of spectral channels in a hyperspectral image (HSI) produces a fine spectral resolution to differentiate between materials in a scene. However, difficult classes that have similar spectral signatures are often confused when only information in the spectral domain is exploited. Therefore, in addition to spectral characteristics, the spatial relationships inherent in HSIs should also be considered for incorporation into classifiers. The growing availability of high spectral and spatial resolution from remote sensors provides rich information for image clustering. Besides the discriminating power of the rich spectrum, contextual information can be extracted from the spatial domain, such as the size and shape of the structure to which a pixel belongs. In recent years, spectral clustering has gained popularity over other clustering methods due to the difficulty of accurate statistical modeling of data in high-dimensional space. Joint spatial-spectral information can be effectively incorporated into the proximity graph for the spectral clustering approach, which provides a better data representation by discovering the inherent lower dimensionality of the input space. We embedded both spectral and spatial information into our proposed local density adaptive affinity matrix, which is able to handle multiscale data by automatically selecting the scale of analysis for every pixel according to its neighborhood of correlated pixels. Furthermore, we explored the "conductivity method," which aims at amplifying the block-diagonal structure of the affinity matrix to further improve the performance of spectral clustering on HSI datasets.
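    Spectral clustering with a joint spatial-spectral affinity can be sketched as follows. The toy two-material scene, the fixed bandwidths, and the use of scikit-learn's SpectralClustering with a precomputed affinity are all simplifying assumptions standing in for the authors' locally adaptive, amplified affinity construction.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(7)
h, w, b = 8, 8, 4
# hypothetical scene: left half is material A, right half is material B
img = np.zeros((h, w, b))
img[:, :4] = [1.0, 0.0, 0.0, 1.0]
img[:, 4:] = [0.0, 1.0, 1.0, 0.0]
img += rng.normal(0, 0.05, img.shape)

X = img.reshape(-1, b)
yy, xx = np.mgrid[0:h, 0:w]
coords = np.column_stack([yy.ravel(), xx.ravel()]).astype(float)

# joint affinity: spectral similarity damped by spatial distance
spec_d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
spat_d2 = ((coords[:, None] - coords[None]) ** 2).sum(-1)
A = np.exp(-spec_d2 / 0.1) * np.exp(-spat_d2 / 50.0)

labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            random_state=0).fit_predict(A)
```

    At realistic image sizes the dense pairwise affinity would be sparsified (e.g., k-nearest neighbors) before the eigendecomposition.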

  13. ALCHEMY: a reliable method for automated SNP genotype calling for small batch sizes and highly homozygous populations

    PubMed Central

    Wright, Mark H.; Tung, Chih-Wei; Zhao, Keyan; Reynolds, Andy; McCouch, Susan R.; Bustamante, Carlos D.

    2010-01-01

    Motivation: The development of new high-throughput genotyping products requires a significant investment in testing and training samples to evaluate and optimize the product before it can be used reliably on new samples. One reason for this is current methods for automated calling of genotypes are based on clustering approaches which require a large number of samples to be analyzed simultaneously, or an extensive training dataset to seed clusters. In systems where inbred samples are of primary interest, current clustering approaches perform poorly due to the inability to clearly identify a heterozygote cluster. Results: As part of the development of two custom single nucleotide polymorphism genotyping products for Oryza sativa (domestic rice), we have developed a new genotype calling algorithm called ‘ALCHEMY’ based on statistical modeling of the raw intensity data rather than modelless clustering. A novel feature of the model is the ability to estimate and incorporate inbreeding information on a per sample basis allowing accurate genotyping of both inbred and heterozygous samples even when analyzed simultaneously. Since clustering is not used explicitly, ALCHEMY performs well on small sample sizes with accuracy exceeding 99% with as few as 18 samples. Availability: ALCHEMY is available for both commercial and academic use free of charge and distributed under the GNU General Public License at http://alchemy.sourceforge.net/ Contact: mhw6@cornell.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:20926420

  14. RELICS: Strong Lens Models for Five Galaxy Clusters from the Reionization Lensing Cluster Survey

    NASA Astrophysics Data System (ADS)

    Cerny, Catherine; Sharon, Keren; Andrade-Santos, Felipe; Avila, Roberto J.; Bradač, Maruša; Bradley, Larry D.; Carrasco, Daniela; Coe, Dan; Czakon, Nicole G.; Dawson, William A.; Frye, Brenda L.; Hoag, Austin; Huang, Kuang-Han; Johnson, Traci L.; Jones, Christine; Lam, Daniel; Lovisari, Lorenzo; Mainali, Ramesh; Oesch, Pascal A.; Ogaz, Sara; Past, Matthew; Paterno-Mahler, Rachel; Peterson, Avery; Riess, Adam G.; Rodney, Steven A.; Ryan, Russell E.; Salmon, Brett; Sendra-Server, Irene; Stark, Daniel P.; Strolger, Louis-Gregory; Trenti, Michele; Umetsu, Keiichi; Vulcani, Benedetta; Zitrin, Adi

    2018-06-01

    Strong gravitational lensing by galaxy clusters magnifies background galaxies, enhancing our ability to discover statistically significant samples of galaxies at z > 6, in order to constrain the high-redshift galaxy luminosity functions. Here, we present the first five lens models out of the Reionization Lensing Cluster Survey (RELICS) Hubble Treasury Program, based on new HST WFC3/IR and ACS imaging of the clusters RXC J0142.9+4438, Abell 2537, Abell 2163, RXC J2211.7–0349, and ACT-CL J0102–4915. The derived lensing magnification is essential for estimating the intrinsic properties of high-redshift galaxy candidates, and properly accounting for the survey volume. We report on new spectroscopic redshifts of multiply imaged lensed galaxies behind these clusters, which are used as constraints, and detail our strategy to reduce systematic uncertainties due to lack of spectroscopic information. In addition, we quantify the uncertainty on the lensing magnification due to statistical and systematic errors related to the lens modeling process, and find that in all but one cluster, the magnification is constrained to better than 20% in at least 80% of the field of view, including statistical and systematic uncertainties. The five clusters presented in this paper span the range of masses and redshifts of the clusters in the RELICS program. We find that they exhibit similar strong lensing efficiencies to the clusters targeted by the Hubble Frontier Fields within the WFC3/IR field of view. Outputs of the lens models are made available to the community through the Mikulski Archive for Space Telescopes.

  15. Tidal radii of the globular clusters M 5, M 12, M 13, M 15, M 53, NGC 5053 and NGC 5466 from automated star counts.

    NASA Astrophysics Data System (ADS)

    Lehmann, I.; Scholz, R.-D.

    1997-04-01

    We present new tidal radii for seven Galactic globular clusters using the method of automated star counts on Schmidt plates of the Tautenburg, Palomar and UK telescopes. The plates were fully scanned with the APM system in Cambridge (UK). Special attention was paid to reliable background subtraction and to the correction of crowding effects in the central cluster region. For the latter we used a new kind of crowding correction based on a statistical approach to the distribution of stellar images and the luminosity function of the cluster stars in the uncrowded area. The star counts were correlated with surface brightness profiles from different authors to obtain complete projected density profiles of the globular clusters. Fitting an empirical density law (King 1962) we derived the following structural parameters: tidal radius r_t_, core radius r_c_ and concentration parameter c. In the cases of NGC 5466, M 5, M 12, M 13 and M 15 we found an indication of a tidal tail around these objects (cf. Grillmair et al. 1995).
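    Fitting the empirical King (1962) law to a projected star-count profile is a standard nonlinear least-squares problem. A sketch on synthetic counts with assumed true parameters (r_c = 2, r_t = 25, in arbitrary units):

```python
import numpy as np
from scipy.optimize import curve_fit

def king_1962(r, k, r_c, r_t):
    """Empirical King (1962) surface-density law."""
    term = 1 / np.sqrt(1 + (r / r_c) ** 2) - 1 / np.sqrt(1 + (r_t / r_c) ** 2)
    return k * term ** 2

# synthetic star-count profile with assumed true values r_c = 2, r_t = 25
r = np.linspace(0.5, 24.0, 60)
rng = np.random.default_rng(3)
counts = king_1962(r, 100.0, 2.0, 25.0) + rng.normal(0, 0.2, r.size)

popt, pcov = curve_fit(king_1962, r, counts, p0=[50.0, 1.0, 20.0])
k_fit, rc_fit, rt_fit = popt
```

    The concentration parameter then follows as c = log10(r_t / r_c).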

  16. VizieR Online Data Catalog: Tidal radii of 7 globular clusters (Lehmann+ 1997)

    NASA Astrophysics Data System (ADS)

    Lehmann, I.; Scholz, R.-D.

    1998-02-01

    We present new tidal radii for seven Galactic globular clusters using the method of automated star counts on Schmidt plates of the Tautenburg, Palomar and UK telescopes. The plates were fully scanned with the APM system in Cambridge (UK). Special attention was paid to reliable background subtraction and to the correction of crowding effects in the central cluster region. For the latter we used a new kind of crowding correction based on a statistical approach to the distribution of stellar images and the luminosity function of the cluster stars in the uncrowded area. The star counts were correlated with surface brightness profiles from different authors to obtain complete projected density profiles of the globular clusters. Fitting an empirical density law (King 1962AJ.....67..471K) we derived the following structural parameters: tidal radius rt, core radius rc and concentration parameter c. In the cases of NGC 5466, M 5, M 12, M 13 and M 15 we found an indication of a tidal tail around these objects (cf. Grillmair et al., 1995AJ....109.2553G). (1 data file).

  17. Locating sources within a dense sensor array using graph clustering

    NASA Astrophysics Data System (ADS)

    Gerstoft, P.; Riahi, N.

    2017-12-01

    We develop a model-free technique to identify weak sources within dense sensor arrays using graph clustering. No knowledge about the propagation medium is needed except that signal strengths decay to insignificant levels within a scale that is shorter than the aperture. We then reinterpret the spatial coherence matrix of a wave field as a matrix whose support is the connectivity matrix of a graph with sensors as vertices. In a dense network, well-separated sources induce clusters in this graph. The geographic spread of these clusters can serve to localize the sources. The support of the covariance matrix is estimated from limited-time data using a hypothesis test with a robust phase-only coherence test statistic, combined with a physical distance criterion. The latter criterion ensures graph sparsity and thus prevents clusters from forming by chance. We verify the approach and quantify its reliability on a simulated dataset. The method is then applied to data from a dense 5200-element geophone array that blanketed the city of Long Beach, CA. The analysis exposes a helicopter traversing the array and oil production facilities.
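    The pipeline above (threshold a coherence matrix, then read source clusters off the resulting graph) can be sketched as follows. A plain correlation coefficient stands in for the robust phase-only coherence statistic, the distance criterion is omitted, and the array geometry and sources are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(4)
# hypothetical dense array: 10 sensors around each of two distant sources
pos = np.vstack([rng.normal([0.0, 0.0], 0.5, (10, 2)),
                 rng.normal([10.0, 10.0], 0.5, (10, 2))])
srcpos = np.array([[0.0, 0.0], [10.0, 10.0]])
t = np.arange(500)
src = np.vstack([np.sin(0.10 * t), np.sin(0.23 * t)])

# signal strength decays with distance, so each sensor mostly records the
# nearer source; independent sensor noise is added on top
gain = np.exp(-np.linalg.norm(pos[:, None, :] - srcpos[None], axis=2))
data = gain @ src + 0.05 * rng.normal(size=(20, 500))

# threshold a coherence proxy to obtain a graph adjacency matrix, then read
# source clusters off its connected components
coh = np.abs(np.corrcoef(data))
adj = (coh > 0.8) & ~np.eye(20, dtype=bool)
n_comp, comp = connected_components(csr_matrix(adj.astype(int)), directed=False)
```

    The geographic spread of each component's sensors then localizes its source.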

  18. Prediction of strontium bromide laser efficiency using cluster and decision tree analysis

    NASA Astrophysics Data System (ADS)

    Iliev, Iliycho; Gocheva-Ilieva, Snezhana; Kulin, Chavdar

    2018-01-01

    The subject of this investigation is a new high-powered strontium bromide (SrBr2) vapor laser emitting at multiple wavelengths. The laser is an alternative to atomic strontium lasers and free-electron lasers, especially at the 6.45 μm line, which is used in surgery for the medical processing of biological tissues and bones with minimal damage. In this paper the experimental data from measurements of the operational and output characteristics of the laser are statistically processed by means of cluster analysis and tree-based regression techniques. The aim is to extract from the available data the more important relationships and dependences that influence the increase of the overall laser efficiency. A set of cluster models is constructed and analyzed. It is shown, using different cluster methods, that the seven investigated operational characteristics (laser tube diameter, length, supplied electrical power, and others) and laser efficiency combine into two clusters. Regression tree models built with the Classification and Regression Trees (CART) technique yield dependences that predict efficiency values, and in particular the maximum efficiency, with over 95% accuracy.
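    A regression tree in the spirit of the CART analysis can be sketched as follows. The operational characteristics, their ranges, and the response surface are all hypothetical, chosen only to illustrate the fitting and prediction steps, not the laser's actual physics.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(5)
# hypothetical operational characteristics: tube diameter (mm), active
# length (cm) and supplied electrical power (kW)
X = rng.uniform([5.0, 50.0, 1.0], [20.0, 150.0, 5.0], size=(200, 3))
# assumed illustrative response: efficiency grows with diameter and power
y = 0.02 * X[:, 0] + 0.10 * X[:, 2] + rng.normal(0, 0.02, 200)

tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X, y)
pred = tree.predict([[18.0, 100.0, 4.5]])   # efficiency at a new operating point
```

    The tree's split structure doubles as an interpretable ranking of which characteristics drive efficiency.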

  19. Permutation Tests of Hierarchical Cluster Analyses of Carrion Communities and Their Potential Use in Forensic Entomology.

    PubMed

    van der Ham, Joris L

    2016-05-19

    Forensic entomologists can use carrion communities' ecological succession data to estimate the postmortem interval (PMI). Permutation tests of hierarchical cluster analyses of these data provide a conceptual method to estimate part of the PMI, the post-colonization interval (post-CI). This multivariate approach produces a baseline of statistically distinct clusters that reflect changes in the carrion community composition during the decomposition process. Carrion community samples of unknown post-CIs are compared with these baseline clusters to estimate the post-CI. In this short communication, I use data from previously published studies to demonstrate the conceptual feasibility of this multivariate approach. Analyses of these data produce a series of significantly distinct clusters, which represent carrion communities during 1- to 20-day periods of the decomposition process. For 33 carrion community samples, collected over an 11-day period, this approach correctly estimated the post-CI within an average range of 3.1 days. © The Authors 2016. Published by Oxford University Press on behalf of Entomological Society of America. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
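    The permutation-testing idea, comparing an observed between-group separation of community compositions against group-label shuffles, can be sketched as follows. The taxon counts are hypothetical, and a simple centroid distance stands in for the full hierarchical-cluster test used in the paper.

```python
import numpy as np

rng = np.random.default_rng(6)
# hypothetical carrion community counts (3 taxa) for 8 samples from an
# early and 8 samples from a late stage of decomposition
early = rng.poisson([20, 5, 1], (8, 3)).astype(float)
late = rng.poisson([2, 10, 15], (8, 3)).astype(float)
X = np.vstack([early, late])
groups = np.array([0] * 8 + [1] * 8)

def separation(X, g):
    """Distance between the two group centroids in community space."""
    return np.linalg.norm(X[g == 0].mean(0) - X[g == 1].mean(0))

obs = separation(X, groups)
# permutation null: shuffle group labels, recompute the separation
null = [separation(X, rng.permutation(groups)) for _ in range(999)]
p = (1 + sum(d >= obs for d in null)) / (1 + len(null))
```

    A small p-value indicates the two decomposition stages host statistically distinct communities, the property the post-CI baseline clusters rely on.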

  20. A particle swarm optimized kernel-based clustering method for crop mapping from multi-temporal polarimetric L-band SAR observations

    NASA Astrophysics Data System (ADS)

    Tamiminia, Haifa; Homayouni, Saeid; McNairn, Heather; Safari, Abdoreza

    2017-06-01

    Polarimetric Synthetic Aperture Radar (PolSAR) data, thanks to specific characteristics such as high resolution and weather and daylight independence, have become a valuable source of information for environmental monitoring and management. The discrimination capability of observations acquired by these sensors can be used for land cover classification and mapping. The aim of this paper is to propose an optimized kernel-based C-means clustering algorithm for agricultural crop mapping from multi-temporal PolSAR data. First, several polarimetric features are extracted from preprocessed data. These features are the linear polarization intensities and several statistical and physical decompositions, such as the Cloude-Pottier, Freeman-Durden and Yamaguchi techniques. Then, kernelized versions of the hard and fuzzy C-means clustering algorithms are applied to these polarimetric features in order to identify crop types. Unlike conventional partitioning clustering algorithms, the kernel function maps non-spherical and non-linearly separable patterns of the data structure into a space where they can be clustered easily. In addition, in order to enhance the results, a Particle Swarm Optimization (PSO) algorithm is used to tune the kernel parameters and cluster centers and to optimize feature selection. The efficiency of this method was evaluated using multi-temporal UAVSAR L-band images acquired over an agricultural area near Winnipeg, Manitoba, Canada, during June and July of 2012. The results demonstrate more accurate crop maps using the proposed method when compared to the classical approaches (e.g., a 12% improvement in general). In addition, when the optimization technique is used, a greater improvement of about 5% overall is observed in crop classification. Furthermore, a strong relationship is observed between the Freeman-Durden volume scattering component, which is related to canopy structure, and phenological growth stages.
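    A hard kernelized C-means step can be sketched as follows. Fuzzy memberships and the PSO tuning of kernel parameters, cluster centers and feature selection are omitted; the 2-D "polarimetric features" and the two seed pixels are hypothetical stand-ins.

```python
import numpy as np

def kernel_cmeans(K, k, anchors, iters=20):
    """Hard kernelized C-means on a precomputed Gram matrix K."""
    n = K.shape[0]
    labels = K[:, anchors].argmax(1)   # assign each point to its most similar seed
    diag = np.diag(K)
    for _ in range(iters):
        dist = np.empty((n, k))
        for c in range(k):
            idx = labels == c
            m = idx.sum()
            if m == 0:
                dist[:, c] = np.inf
                continue
            # squared feature-space distance to the implicit centroid of cluster c
            dist[:, c] = (diag - 2 * K[:, idx].sum(1) / m
                          + K[np.ix_(idx, idx)].sum() / m ** 2)
        labels = dist.argmin(1)
    return labels

# hypothetical 2-D polarimetric feature vectors for two crop types
rng = np.random.default_rng(8)
X = np.vstack([rng.normal([0.0, 0.0], 0.3, (50, 2)),
               rng.normal([3.0, 0.0], 0.3, (50, 2))])

d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-d2 / 2.0)                          # RBF (Gaussian) kernel
labels = kernel_cmeans(K, 2, anchors=[0, 50])  # one seed pixel per cluster (assumed)
```

    In the paper, PSO would replace the fixed seeds and the hand-picked kernel bandwidth with optimized values.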
