Clustering high dimensional data using RIA
NASA Astrophysics Data System (ADS)
Aziz, Nazrina
2015-05-01
Clustering may simply represent a convenient method for organizing a large data set so that it can easily be understood and information can efficiently be retrieved. However, identifying clusters in high-dimensional data sets is a difficult task because of the curse of dimensionality. Another challenge in clustering is that some traditional functions cannot capture the pattern dissimilarity among objects. In this article, we used an alternative dissimilarity measurement called Robust Influence Angle (RIA) in the partitioning method. RIA is developed using the eigenstructure of the covariance matrix and robust principal component scores. We find that it can obtain clusters easily and hence avoid the curse of dimensionality. It also manages to cluster large data sets with mixed numeric and categorical values.
NASA Technical Reports Server (NTRS)
Srivastava, Ashok, N.; Akella, Ram; Diev, Vesselin; Kumaresan, Sakthi Preethi; McIntosh, Dawn M.; Pontikakis, Emmanuel D.; Xu, Zuobing; Zhang, Yi
2006-01-01
This paper describes the results of a significant research and development effort conducted at NASA Ames Research Center to develop new text mining techniques to discover anomalies in free-text reports regarding system health and safety of two aerospace systems. We discuss two problems of significant importance in the aviation industry. The first problem is that of automatic anomaly discovery about an aerospace system through the analysis of tens of thousands of free-text problem reports that are written about the system. The second problem that we address is that of automatic discovery of recurring anomalies, i.e., anomalies that may be described in different ways by different authors, at varying times and under varying conditions, but that are truly about the same part of the system. The intent of recurring anomaly identification is to determine project or system weakness or high-risk issues. The discovery of recurring anomalies is a key goal in building safe, reliable, and cost-effective aerospace systems. We address the anomaly discovery problem on thousands of free-text reports using two strategies: (1) as an unsupervised learning problem where an algorithm takes free-text reports as input and automatically groups them into different bins, where each bin corresponds to a different unknown anomaly category; and (2) as a supervised learning problem where the algorithm classifies the free-text reports into one of a number of known anomaly categories. We then discuss the application of these methods to the problem of discovering recurring anomalies. In fact, the special nature of recurring anomalies (very small cluster sizes) requires incorporating new methods and measures to enhance the original approach for anomaly detection.
Adaptive dimension reduction for clustering high dimensional data
Ding, Chris; He, Xiaofeng; Zha, Hongyuan; Simon, Horst
2002-10-01
It is well known that for high dimensional data clustering, standard algorithms such as EM and K-means are often trapped in local minima. Many initialization methods were proposed to tackle this problem, but with only limited success. In this paper they propose a new approach to resolve this problem by repeated dimension reductions such that K-means or EM are performed only in very low dimensions. Cluster membership is utilized as a bridge between the reduced dimensional sub-space and the original space, providing flexibility and ease of implementation. Clustering analysis performed on highly overlapped Gaussians, DNA gene expression profiles and internet newsgroups demonstrates the effectiveness of the proposed algorithm.
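The alternation described in this abstract (cluster in a low-dimensional subspace, then carry the memberships back to the original space) can be illustrated with a small sketch. This is one plausible reading, not the authors' implementation: it assumes the reduced subspace is the span of the current centroids, uses plain Lloyd iterations for K-means, and every function name below is hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(vectors):
    """Gram-Schmidt; vectors that are (numerically) dependent are dropped."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > 1e-12:
            basis.append([wi / norm for wi in w])
    return basis

def mean(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def assign(points, centers):
    """Index of the nearest center (squared Euclidean distance) for each point."""
    return [min(range(len(centers)),
                key=lambda j: sum((x - y) ** 2 for x, y in zip(p, centers[j])))
            for p in points]

def adaptive_reduction_kmeans(data, init_centers, outer=5, inner=5):
    centers = [list(c) for c in init_centers]
    labels = assign(data, centers)
    for _ in range(outer):
        # Assumption: the low-dimensional subspace is spanned by the centroids.
        basis = orthonormal_basis(centers)
        low = [[dot(p, b) for b in basis] for p in data]
        low_centers = [[dot(c, b) for b in basis] for c in centers]
        # K-means (Lloyd iterations) runs only in the reduced space.
        for _ in range(inner):
            labels = assign(low, low_centers)
            for j in range(len(low_centers)):
                members = [q for q, l in zip(low, labels) if l == j]
                if members:
                    low_centers[j] = mean(members)
        # Membership is the bridge: recompute centroids in the original space.
        for j in range(len(centers)):
            members = [p for p, l in zip(data, labels) if l == j]
            if members:
                centers[j] = mean(members)
    return labels, centers
```

With k centroids the reduced space has at most k dimensions, so the inner loop stays cheap regardless of the original dimensionality.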
Dimensionality Reduction Particle Swarm Algorithm for High Dimensional Clustering
Cui, Xiaohui; ST Charles, Jesse Lee; Potok, Thomas E; Beaver, Justin M
2008-01-01
The Particle Swarm Optimization (PSO) clustering algorithm can generate more compact clustering results than the traditional K-means clustering algorithm. However, when clustering high dimensional datasets, the PSO clustering algorithm is notoriously slow because its computation cost increases exponentially with the size of the dataset dimension. Dimensionality reduction techniques offer solutions that both significantly improve the computation time, and yield reasonably accurate clustering results in high dimensional data analysis. In this paper, we introduce research that combines different dimensionality reduction techniques with the PSO clustering algorithm in order to reduce the complexity of high dimensional datasets and speed up the PSO clustering process. We report significant improvements in total runtime. Moreover, the clustering accuracy of the dimensionality reduction PSO clustering algorithm is comparable to the one that uses full dimension space.
Semi-supervised high-dimensional clustering by tight wavelet frames
NASA Astrophysics Data System (ADS)
Dong, Bin; Hao, Ning
2015-08-01
High-dimensional clustering arises frequently in many areas of the natural sciences, technical disciplines and social media. In this paper, we consider the problem of binary clustering of high-dimensional data, i.e. classification of a data set into 2 classes. We assume that the correct (or mostly correct) classification of a small portion of the given data is known. Based on such partial classification, we design optimization models that complete the clustering of the entire data set using the recently introduced tight wavelet frames on graphs. Numerical experiments of the proposed models applied to some real data sets are conducted. In particular, the performance of the models on some very high-dimensional data sets is examined, and combinations of the models with some existing dimension reduction techniques are also considered.
Modification of DIRECT for high-dimensional design problems
NASA Astrophysics Data System (ADS)
Tavassoli, Arash; Haji Hajikolaei, Kambiz; Sadeqi, Soheil; Wang, G. Gary; Kjeang, Erik
2014-06-01
DIviding RECTangles (DIRECT), as a well-known derivative-free global optimization method, has been found to be effective and efficient for low-dimensional problems. When facing high-dimensional black-box problems, however, DIRECT's performance deteriorates. This work proposes a series of modifications to DIRECT for high-dimensional problems (dimensionality d>10). The principal idea is to increase the convergence speed by breaking its single initialization-to-convergence approach into several more intricate steps. Specifically, starting with the entire feasible area, the search domain will shrink gradually and adaptively to the region enclosing the potential optimum. Several stopping criteria have been introduced to avoid premature convergence. A diversification subroutine has also been developed to prevent the algorithm from being trapped in local minima. The proposed approach is benchmarked using nine standard high-dimensional test functions and one black-box engineering problem. All these tests show a significant efficiency improvement over the original DIRECT for high-dimensional design problems.
High dimensional model representation (HDMR) with clustering for image retrieval
NASA Astrophysics Data System (ADS)
Karcılı, Ayşegül; Tunga, Burcu
2017-01-01
Image retrieval continues to hold an important place in today's extremely fast-growing technology. In this field, accurate image retrieval at high speed is critical. In this study, to address this important issue, we developed a novel method with the help of the High Dimensional Model Representation (HDMR) philosophy. HDMR is a decomposition method used to solve different scientific problems. To test the performance of the new method we used the Columbia Object Image Library (COIL100) and obtained encouraging results. These results are given in the findings section.
Visualization of high-dimensional clusters using nonlinear magnification
Keahey, T.A.
1998-12-31
This paper describes a cluster visualization system used for data-mining fraud detection. The system can simultaneously show 6 dimensions of data, and a unique technique of 3D nonlinear magnification allows individual clusters of data points to be magnified while still maintaining a view of the global context. The author first describes the fraud detection problem, along with the data which is to be visualized. Then he describes general characteristics of the visualization system, and shows how nonlinear magnification can be used in this system. Finally he concludes and describes options for further work.
Banerjee, Arindam; Ghosh, Joydeep
2004-05-01
Competitive learning mechanisms for clustering, in general, suffer from poor performance for very high-dimensional (>1000) data because of "curse of dimensionality" effects. In applications such as document clustering, it is customary to normalize the high-dimensional input vectors to unit length, and it is sometimes also desirable to obtain balanced clusters, i.e., clusters of comparable sizes. The spherical kmeans (spkmeans) algorithm, which normalizes the cluster centers as well as the inputs, has been successfully used to cluster normalized text documents in 2000+ dimensional space. Unfortunately, like regular kmeans and its soft expectation-maximization-based version, spkmeans tends to generate extremely imbalanced clusters in high-dimensional spaces when the desired number of clusters is large (tens or more). This paper first shows that the spkmeans algorithm can be derived from a certain maximum likelihood formulation using a mixture of von Mises-Fisher distributions as the generative model, and in fact, it can be considered as a batch-mode version of (normalized) competitive learning. The proposed generative model is then adapted in a principled way to yield three frequency-sensitive competitive learning variants that are applicable to static data and produce high-quality and well-balanced clusters for high-dimensional data. Like kmeans, each iteration is linear in the number of data points and in the number of clusters for all the three algorithms. A frequency-sensitive algorithm to cluster streaming data is also proposed. Experimental results on clustering of high-dimensional text data sets are provided to show the effectiveness and applicability of the proposed techniques. Index Terms: Balanced clustering, expectation maximization (EM), frequency-sensitive competitive learning (FSCL), high-dimensional clustering, kmeans, normalized data, scalable clustering, streaming data, text clustering.
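As an illustration of the baseline this abstract starts from, the following is a minimal sketch of plain spherical kmeans: inputs and centers are both normalized to unit length and points are assigned by cosine similarity. It does not include the frequency-sensitive balancing the paper proposes; the function names, the random seeding and the fixed iteration count are assumptions made for the example.

```python
import math
import random

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def spherical_kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    data = [normalize(p) for p in points]            # inputs on the unit sphere
    centers = [list(c) for c in rng.sample(data, k)]  # assumed initialization
    labels = [0] * len(data)
    for _ in range(iters):
        # Assignment step: pick the center with the largest cosine similarity.
        labels = [max(range(k), key=lambda j: dot(x, centers[j])) for x in data]
        # Update step: sum the members, then renormalize the center.
        for j in range(k):
            members = [x for x, l in zip(data, labels) if l == j]
            if members:
                centers[j] = normalize([sum(col) for col in zip(*members)])
    return labels, centers
```

Each iteration touches every point once per center, which matches the linear per-iteration cost mentioned in the abstract.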
High dimensional data clustering by partitioning the hypergraphs using dense subgraph partition
NASA Astrophysics Data System (ADS)
Sun, Xili; Tian, Shoucai; Lu, Yonggang
2015-12-01
Due to the curse of dimensionality, traditional clustering methods usually fail to produce meaningful results for high dimensional data. Hypergraph partition is believed to be a promising method for dealing with this challenge. In this paper, we first construct a graph G from the data by defining an adjacency relationship between the data points using Shared Reverse k Nearest Neighbors (SRNN). Then a hypergraph is created from the graph G by defining the hyperedges to be all the maximal cliques in the graph G. After the hypergraph is produced, a powerful hypergraph partitioning method called dense subgraph partition (DSP) combined with the k-medoids method is used to produce the final clustering results. The proposed method is evaluated on several real high-dimensional datasets, and the experimental results show that the proposed method can improve the clustering results of high dimensional data compared with applying the k-medoids method directly on the original data.
Comparison of clustering methods for high-dimensional single-cell flow and mass cytometry data.
Weber, Lukas M; Robinson, Mark D
2016-12-01
Recent technological developments in high-dimensional flow cytometry and mass cytometry (CyTOF) have made it possible to detect expression levels of dozens of protein markers in thousands of cells per second, allowing cell populations to be characterized in unprecedented detail. Traditional data analysis by "manual gating" can be inefficient and unreliable in these high-dimensional settings, which has led to the development of a large number of automated analysis methods. Methods designed for unsupervised analysis use specialized clustering algorithms to detect and define cell populations for further downstream analysis. Here, we have performed an up-to-date, extensible performance comparison of clustering methods for high-dimensional flow and mass cytometry data. We evaluated methods using several publicly available data sets from experiments in immunology, containing both major and rare cell populations, with cell population identities from expert manual gating as the reference standard. Several methods performed well, including FlowSOM, X-shift, PhenoGraph, Rclusterpp, and flowMeans. Among these, FlowSOM had extremely fast runtimes, making this method well-suited for interactive, exploratory analysis of large, high-dimensional data sets on a standard laptop or desktop computer. These results extend previously published comparisons by focusing on high-dimensional data and including new methods developed for CyTOF data. R scripts to reproduce all analyses are available from GitHub (https://github.com/lmweber/cytometry-clustering-comparison), and pre-processed data files are available from FlowRepository (FR-FCM-ZZPH), allowing our comparisons to be extended to include new clustering methods and reference data sets. © 2016 The Authors. Cytometry Part A published by Wiley Periodicals, Inc. on behalf of ISAC.
Variational Bayesian strategies for high-dimensional, stochastic design problems
Koutsourelakis, P.S.
2016-03-01
This paper is concerned with a lesser-studied problem in the context of model-based, uncertainty quantification (UQ), that of optimization/design/control under uncertainty. The solution of such problems is hindered not only by the usual difficulties encountered in UQ tasks (e.g. the high computational cost of each forward simulation, the large number of random variables) but also by the need to solve a nonlinear optimization problem involving large numbers of design variables and potentially constraints. We propose a framework that is suitable for a class of such problems and is based on the idea of recasting them as probabilistic inference tasks. To that end, we propose a Variational Bayesian (VB) formulation and an iterative VB–Expectation-Maximization scheme that is capable of identifying a local maximum as well as a low-dimensional set of directions in the design space, along which, the objective exhibits the largest sensitivity. We demonstrate the validity of the proposed approach in the context of two numerical examples involving thousands of random and design variables. In all cases considered the cost of the computations in terms of calls to the forward model was of the order of 100 or less. The accuracy of the approximations provided is assessed by information-theoretic metrics.
Nuclear Potential Clustering As a New Tool to Detect Patterns in High Dimensional Datasets
NASA Astrophysics Data System (ADS)
Tonkova, V.; Paulus, D.; Neeb, H.
2013-02-01
We present a new approach for the clustering of high dimensional data without prior assumptions about the structure of the underlying distribution. The proposed algorithm is based on a concept adapted from nuclear physics. To partition the data, we model the dynamic behaviour of nucleons interacting in an N-dimensional space. An adaptive nuclear potential, comprised of a short-range attractive (strong interaction) and a long-range repulsive term (Coulomb force) is assigned to each data point. By modelling the dynamics, nucleons that are densely distributed in space fuse to build nuclei (clusters) whereas single point clusters repel each other. The formation of clusters is completed when the system reaches the state of minimal potential energy. The data are then grouped according to the particles' final effective potential energy level. The performance of the algorithm is tested with several synthetic datasets showing that the proposed method can robustly identify clusters even when complex configurations are present. Furthermore, quantitative MRI data from 43 multiple sclerosis patients were analyzed, showing a reasonable splitting into subgroups according to the individual patients' disease grade. The good performance of the algorithm on such highly correlated non-spherical datasets, which are typical for MRI derived image features, shows that Nuclear Potential Clustering is a valuable tool for automated data analysis, not only in the MRI domain.
Clustering High-Dimensional Landmark-based Two-dimensional Shape Data
Huang, Chao; Styner, Martin; Zhu, Hongtu
2015-01-01
An important goal in image analysis is to cluster and recognize objects of interest according to the shapes of their boundaries. Clustering such objects faces at least four major challenges including a curved shape space, a high-dimensional feature space, a complex spatial correlation structure, and shape variation associated with some covariates (e.g., age or gender). The aim of this paper is to develop a penalized model-based clustering framework to cluster landmark-based planar shape data, while explicitly addressing these challenges. Specifically, a mixture of offset-normal shape factor analyzers (MOSFA) is proposed with mixing proportions defined through a regression model (e.g., logistic) and an offset-normal shape distribution in each component for data in the curved shape space. A latent factor analysis model is introduced to explicitly model the complex spatial correlation. A penalized likelihood approach with both adaptive pairwise fusion Lasso penalty function and L2 penalty function is used to automatically realize variable selection via thresholding and deliver a sparse solution. Our real data analysis has confirmed the excellent finite-sample performance of MOSFA in revealing meaningful clusters in the corpus callosum shape data obtained from the Attention Deficit Hyperactivity Disorder-200 (ADHD-200) study. PMID:26604425
NASA Astrophysics Data System (ADS)
Franck, I. M.; Koutsourelakis, P. S.
2017-01-01
This paper is concerned with the numerical solution of model-based, Bayesian inverse problems. We are particularly interested in cases where the cost of each likelihood evaluation (forward-model call) is expensive and the number of unknown (latent) variables is high. This is the setting in many problems in computational physics where forward models with nonlinear PDEs are used and the parameters to be calibrated involve spatio-temporally varying coefficients, which upon discretization give rise to a high-dimensional vector of unknowns. One of the consequences of the well-documented ill-posedness of inverse problems is the possibility of multiple solutions. While such information is contained in the posterior density in Bayesian formulations, the discovery of a single mode, let alone multiple, poses a formidable computational task. The goal of the present paper is two-fold. On one hand, we propose approximate, adaptive inference strategies using mixture densities to capture multi-modal posteriors. On the other, we extend our work in [1] with regard to effective dimensionality reduction techniques that reveal low-dimensional subspaces where the posterior variance is mostly concentrated. We validate the proposed model by employing Importance Sampling, which confirms that the bias introduced is small and can be efficiently corrected if the analyst wishes to do so. We demonstrate the performance of the proposed strategy in nonlinear elastography where the identification of the mechanical properties of biological materials can inform non-invasive, medical diagnosis. The discovery of multiple modes (solutions) in such problems is critical in achieving the diagnostic objectives.
NASA Astrophysics Data System (ADS)
Choo, Jaegul; Lee, Hanseung; Liu, Zhicheng; Stasko, John; Park, Haesun
2013-01-01
Many modern data sets such as text and image data can be represented in high-dimensional vector spaces and have benefited from advanced computational methods. Visual analytics approaches have contributed greatly to data understanding and analysis due to their capability of leveraging humans' ability for quick visual perception. However, visual analytics targeting large-scale data such as text and image data has been challenging due to the limited screen space in terms of both the numbers of data points and features to represent. Among various computational methods supporting visual analytics, dimension reduction and clustering have played essential roles by reducing these numbers in an intelligent way to visually manageable sizes. Given the numerous dimension reduction and clustering methods available, however, the decision on the choice of algorithms and their parameters becomes difficult. In this paper, we present an interactive visual testbed system for dimension reduction and clustering in large-scale high-dimensional data analysis. The testbed system enables users to apply various dimension reduction and clustering methods with different settings, visually compare the results from different algorithmic methods to obtain rich knowledge for the data and tasks at hand, and eventually choose the most appropriate path for a collection of algorithms and parameters. Using various data sets such as documents, images, and others that are already encoded in vectors, we demonstrate how the testbed system can support these tasks.
NASA Astrophysics Data System (ADS)
Schütze, Niels; Wöhling, Thomas; de Play, Michael
2010-05-01
Some real-world optimization problems in water resources have a high-dimensional space of decision variables and more than one objective function. In this work, we compare three general-purpose, multi-objective simulation optimization algorithms, namely NSGA-II, AMALGAM, and CMA-ES-MO when solving three real case Multi-objective Optimization Problems (MOPs): (i) a high-dimensional soil hydraulic parameter estimation problem; (ii) a multipurpose multi-reservoir operation problem; and (iii) a scheduling problem in deficit irrigation. We analyze the behaviour of the three algorithms on these test problems considering their formulations ranging from 40 up to 120 decision variables and 2 to 4 objectives. The computational effort required by each algorithm in order to reach the true Pareto front is also analyzed.
Semi-Supervised Clustering for High-Dimensional and Sparse Features
ERIC Educational Resources Information Center
Yan, Su
2010-01-01
Clustering is one of the most common data mining tasks, used frequently for data organization and analysis in various application domains. Traditional machine learning approaches to clustering are fully automated and unsupervised where class labels are unknown a priori. In real application domains, however, some "weak" form of side…
Mosmann, Tim R; Naim, Iftekhar; Rebhahn, Jonathan; Datta, Suprakash; Cavenaugh, James S; Weaver, Jason M; Sharma, Gaurav
2014-05-01
A multistage clustering and data processing method, SWIFT (detailed in a companion manuscript), has been developed to detect rare subpopulations in large, high-dimensional flow cytometry datasets. An iterative sampling procedure initially fits the data to multidimensional Gaussian distributions, then splitting and merging stages use a criterion of unimodality to optimize the detection of rare subpopulations, to converge on a consistent cluster number, and to describe non-Gaussian distributions. Probabilistic assignment of cells to clusters, visualization, and manipulation of clusters by their cluster medians, facilitate application of expert knowledge using standard flow cytometry programs. The dual problems of rigorously comparing similar complex samples, and enumerating absent or very rare cell subpopulations in negative controls, were solved by assigning cells in multiple samples to a cluster template derived from a single or combined sample. Comparison of antigen-stimulated and control human peripheral blood cell samples demonstrated that SWIFT could identify biologically significant subpopulations, such as rare cytokine-producing influenza-specific T cells. A sensitivity of better than one part per million was attained in very large samples. Results were highly consistent on biological replicates, yet the analysis was sensitive enough to show that multiple samples from the same subject were more similar than samples from different subjects. A companion manuscript (Part 1) details the algorithmic development of SWIFT.
Wang, Xueyi
2011-01-01
The k-nearest neighbors (k-NN) algorithm is a widely used machine learning method that finds nearest neighbors of a test object in a feature space. We present a new exact k-NN algorithm called kMkNN (k-Means for k-Nearest Neighbors) that uses k-means clustering and the triangle inequality to accelerate the search for nearest neighbors in a high dimensional space. The kMkNN algorithm has two stages. In the buildup stage, instead of using complex tree structures such as metric trees, kd-trees, or ball-trees, kMkNN uses a simple k-means clustering method to preprocess the training dataset. In the searching stage, given a query object, kMkNN finds nearest training objects starting from the nearest cluster to the query object and uses the triangle inequality to reduce the distance calculations. Experiments show that the performance of kMkNN is surprisingly good compared to the traditional k-NN algorithm and tree-based k-NN algorithms such as kd-trees and ball-trees. On a collection of 20 datasets with up to 10^6 records and 10^4 dimensions, kMkNN shows a 2- to 80-fold reduction of distance calculations and a 2- to 60-fold speedup over the traditional k-NN algorithm for 16 datasets. Furthermore, kMkNN performs significantly better than a kd-tree based k-NN algorithm for all datasets and performs better than a ball-tree based k-NN algorithm for most datasets. The results show that kMkNN is effective for searching nearest neighbors in high dimensional spaces. PMID:22247818
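The pruning idea in the searching stage can be sketched for the single-nearest-neighbor case. For a query q, a cluster center c and a stored point p, the triangle inequality gives d(q, p) >= |d(q, c) - d(p, c)|, so p can be skipped without computing d(q, p) whenever that bound is no smaller than the best distance found so far. The sketch below is an illustration under simplifying assumptions, not the published kMkNN code: the centers are passed in rather than computed by k-means, only one neighbor is returned, and all names are hypothetical.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_index(points, centers):
    """Buildup stage: attach each point to its nearest center and
    remember its distance to that center."""
    clusters = [[] for _ in centers]
    for p in points:
        j = min(range(len(centers)), key=lambda c: dist(p, centers[c]))
        clusters[j].append((dist(p, centers[j]), p))
    return clusters

def nearest(query, centers, clusters):
    """Searching stage: visit clusters from nearest to farthest center and
    skip points ruled out by the triangle-inequality lower bound."""
    best, best_d, computed = None, math.inf, 0
    order = sorted(range(len(centers)), key=lambda c: dist(query, centers[c]))
    for j in order:
        dq = dist(query, centers[j])
        for dp, p in clusters[j]:
            if abs(dq - dp) >= best_d:
                continue  # lower bound on d(query, p) cannot beat best_d
            computed += 1
            d = dist(query, p)
            if d < best_d:
                best, best_d = p, d
    return best, best_d, computed
```

The third return value counts the point-to-query distances that were actually evaluated, which is the quantity the abstract reports reductions for.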
Chen, Sui-Pi; Huang, Guan-Hua
2014-06-01
This paper uses a Bayesian formulation of a clustering procedure to identify gene-gene interactions under case-control studies, called the Algorithm via Bayesian Clustering to Detect Epistasis (ABCDE). The ABCDE uses Dirichlet process mixtures to model SNP marker partitions, and uses the Gibbs weighted Chinese restaurant sampling to simulate posterior distributions of these partitions. Unlike the representative Bayesian epistasis detection algorithm BEAM, which partitions markers into three groups, the ABCDE can be evaluated at any given partition, regardless of the number of groups. This study also develops permutation tests to validate the disease association for SNP subsets identified by the ABCDE, which can yield results that are more robust to model specification and prior assumptions. This study examines the performance of the ABCDE and compares it with the BEAM using various simulated data and a schizophrenia SNP dataset.
Haplotyping Problem, A Clustering Approach
Eslahchi, Changiz; Sadeghi, Mehdi; Pezeshk, Hamid; Kargar, Mehdi; Poormohammadi, Hadi
2007-09-06
Construction of two haplotypes from a set of Single Nucleotide Polymorphism (SNP) fragments is called the haplotype reconstruction problem. One of the most popular computational models for this problem is Minimum Error Correction (MEC). Since MEC is an NP-hard problem, here we propose a novel heuristic algorithm based on clustering analysis in data mining for the haplotype reconstruction problem. Based on Hamming distance and similarity between two fragments, our iterative algorithm produces two clusters of fragments; then, in each iteration, the algorithm assigns a fragment to one of the clusters. Our results suggest that the algorithm has a lower reconstruction error rate in comparison with other algorithms.
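To make the clustering view of haplotype reconstruction concrete, here is a generic two-cluster sketch. It is not the authors' heuristic, whose details the abstract does not give. The assumptions are: fragments are equal-length strings over '0', '1' and '-' (with '-' marking an uncovered SNP site), Hamming distance ignores uncovered sites, the two most distant fragments seed the clusters, and each haplotype is the per-site majority vote of its cluster.

```python
def hamming(a, b):
    """Count mismatches over sites covered by both fragments."""
    return sum(1 for x, y in zip(a, b) if x != '-' and y != '-' and x != y)

def consensus(frags, length):
    """Per-site majority vote; ties resolve to '0', uncovered sites stay '-'."""
    out = []
    for i in range(length):
        ones = sum(1 for f in frags if f[i] == '1')
        zeros = sum(1 for f in frags if f[i] == '0')
        out.append('-' if ones == zeros == 0 else ('1' if ones > zeros else '0'))
    return ''.join(out)

def two_cluster_haplotypes(frags, rounds=5):
    n = len(frags[0])
    # Assumed seeding: the pair of fragments with the largest Hamming distance.
    pairs = ((a, b) for a in range(len(frags)) for b in range(a + 1, len(frags)))
    i, j = max(pairs, key=lambda ab: hamming(frags[ab[0]], frags[ab[1]]))
    h1, h2 = frags[i], frags[j]
    for _ in range(rounds):
        # Assign every fragment to the closer haplotype, then re-vote.
        c1 = [f for f in frags if hamming(f, h1) <= hamming(f, h2)]
        c2 = [f for f in frags if hamming(f, h1) > hamming(f, h2)]
        h1, h2 = consensus(c1, n), consensus(c2, n)
    return h1, h2
```

The number of fragment entries that disagree with their assigned haplotype is the quantity that the MEC model mentioned in the abstract seeks to minimize.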
Graph Based Models for Unsupervised High Dimensional Data Clustering and Network Analysis
2015-01-01
4.2.1 Ginzburg-Landau functional ... 4.2.2 MBO scheme ... 4.4.1 Ginzburg-Landau relaxation of the discrete problem ... 4.4.2 MBO scheme, convex splitting, and spectral ... authors of [8] introduced a binary semi-supervised segmentation method based on minimizing the Ginzburg-Landau functional on a graph. Inspired by [8], a
Naim, Iftekhar; Datta, Suprakash; Rebhahn, Jonathan; Cavenaugh, James S; Mosmann, Tim R; Sharma, Gaurav
2014-05-01
We present a model-based clustering method, SWIFT (Scalable Weighted Iterative Flow-clustering Technique), for digesting high-dimensional large-sized datasets obtained via modern flow cytometry into more compact representations that are well-suited for further automated or manual analysis. Key attributes of the method include the following: (a) the analysis is conducted in the multidimensional space retaining the semantics of the data, (b) an iterative weighted sampling procedure is utilized to maintain modest computational complexity and to retain discrimination of extremely small subpopulations (hundreds of cells from datasets containing tens of millions), and (c) a splitting and merging procedure is incorporated in the algorithm to preserve distinguishability between biologically distinct populations, while still providing a significant compaction relative to the original data. This article presents a detailed algorithmic description of SWIFT, outlining the application-driven motivations for the different design choices, a discussion of computational complexity of the different steps, and results obtained with SWIFT for synthetic data and relatively simple experimental data that allow validation of the desirable attributes. A companion paper (Part 2) highlights the use of SWIFT, in combination with additional computational tools, for more challenging biological problems.
A numerical algorithm for optimal feedback gains in high dimensional LQR problems
NASA Technical Reports Server (NTRS)
Banks, H. T.; Ito, K.
1986-01-01
A hybrid method for computing the feedback gains in linear quadratic regulator problems is proposed. The method, which combines the use of a Chandrasekhar type system with an iteration of the Newton-Kleinman form with variable acceleration parameter Smith schemes, is formulated to compute the feedback gains directly and efficiently, rather than solutions of an associated Riccati equation. The hybrid method is particularly appropriate when used with large dimensional systems such as those arising in approximating infinite dimensional (distributed parameter) control systems (e.g., those governed by delay-differential and partial differential equations). Computational advantages of the proposed algorithm over the standard eigenvector (Potter, Laub-Schur) based techniques are discussed, and numerical evidence of the efficacy of our ideas is presented.
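As a point of reference for the Newton-Kleinman iteration named in this abstract, the following is a minimal sketch of the textbook form of that iteration only. It is not the hybrid Chandrasekhar/Smith scheme the authors propose, and unlike their method it passes through the Riccati solution at every step. The function names, the Kronecker-product Lyapunov solver (practical only for small systems), and the scalar test system are all assumptions made for illustration.

```python
import numpy as np

def solve_lyapunov(a_cl, w):
    """Solve a_cl.T @ p + p @ a_cl + w = 0 for p (dense Kronecker form, small n only)."""
    n = a_cl.shape[0]
    eye = np.eye(n)
    lhs = np.kron(a_cl.T, eye) + np.kron(eye, a_cl.T)
    p = np.linalg.solve(lhs, -w.reshape(-1))
    return p.reshape(n, n)

def newton_kleinman(a, b, q, r, k0, iterations=20):
    """Textbook Newton-Kleinman iteration; k0 must stabilize a - b @ k0."""
    k = k0
    for _ in range(iterations):
        a_cl = a - b @ k                      # closed-loop matrix for the current gain
        p = solve_lyapunov(a_cl, q + k.T @ r @ k)
        k = np.linalg.solve(r, b.T @ p)       # updated gain K = R^{-1} B^T P
    return k

# Hypothetical scalar system: dx/dt = x + u, cost integrand x^2 + u^2.
a = np.array([[1.0]])
b = np.array([[1.0]])
q = np.array([[1.0]])
r = np.array([[1.0]])
gain = newton_kleinman(a, b, q, r, k0=np.array([[2.0]]))
print(gain)
```

For this scalar case the Riccati equation reduces to 2p - p^2 + 1 = 0, so the iteration should approach the gain 1 + sqrt(2).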
Fournier, René; Orel, Slava
2013-12-21
We present a method for fitting high-dimensional potential energy surfaces that is almost fully automated, can be applied to systems with various chemical compositions, and involves no particular choice of function form. We tested it on four systems: Ag20, Sn6Pb6, Si10, and Li8. The cost for energy evaluation is smaller than the cost of a density functional theory (DFT) energy evaluation by a factor of 1500 for Li8, and 60,000 for Ag20. We achieved intermediate accuracy (errors of 0.4 to 0.8 eV on atomization energies, or, 1% to 3% on cohesive energies) with rather small datasets (between 240 and 1400 configurations). We demonstrate that this accuracy is sufficient to correctly screen the configurations with lowest DFT energy, making this function potentially very useful in a hybrid global optimization strategy. We show that, as expected, the accuracy of the function improves with an increase in the size of the fitting dataset.
Fournier, René; Orel, Slava
2013-12-21
We present a method for fitting high-dimensional potential energy surfaces that is almost fully automated, can be applied to systems with various chemical compositions, and involves no particular choice of function form. We tested it on four systems: Ag20, Sn6Pb6, Si10, and Li8. The cost for energy evaluation is smaller than the cost of a density functional theory (DFT) energy evaluation by a factor of 1500 for Li8, and 60,000 for Ag20. We achieved intermediate accuracy (errors of 0.4 to 0.8 eV on atomization energies, or, 1% to 3% on cohesive energies) with rather small datasets (between 240 and 1400 configurations). We demonstrate that this accuracy is sufficient to correctly screen the configurations with lowest DFT energy, making this function potentially very useful in a hybrid global optimization strategy. We show that, as expected, the accuracy of the function improves with an increase in the size of the fitting dataset.
NASA Astrophysics Data System (ADS)
Yao, Bing; Yang, Hui
2016-12-01
This paper presents a novel physics-driven spatiotemporal regularization (STRE) method for high-dimensional predictive modeling in complex healthcare systems. This model not only captures the physics-based interrelationship between time-varying explanatory and response variables that are distributed in the space, but also addresses the spatial and temporal regularizations to improve the prediction performance. The STRE model is implemented to predict the time-varying distribution of electric potentials on the heart surface based on the electrocardiogram (ECG) data from the distributed sensor network placed on the body surface. The model performance is evaluated and validated in both a simulated two-sphere geometry and a realistic torso-heart geometry. Experimental results show that the STRE model significantly outperforms other regularization models that are widely used in current practice such as Tikhonov zero-order, Tikhonov first-order and L1 first-order regularization methods.
Yao, Bing; Yang, Hui
2016-01-01
This paper presents a novel physics-driven spatiotemporal regularization (STRE) method for high-dimensional predictive modeling in complex healthcare systems. This model not only captures the physics-based interrelationship between time-varying explanatory and response variables that are distributed in the space, but also addresses the spatial and temporal regularizations to improve the prediction performance. The STRE model is implemented to predict the time-varying distribution of electric potentials on the heart surface based on the electrocardiogram (ECG) data from the distributed sensor network placed on the body surface. The model performance is evaluated and validated in both a simulated two-sphere geometry and a realistic torso-heart geometry. Experimental results show that the STRE model significantly outperforms other regularization models that are widely used in current practice such as Tikhonov zero-order, Tikhonov first-order and L1 first-order regularization methods. PMID:27966576
Clustering of solutions in hard satisfiability problems
NASA Astrophysics Data System (ADS)
Ardelius, John; Aurell, Erik; Krishnamurthy, Supriya
2007-10-01
We study numerically the solution space structure of random 3-SAT problems close to the SAT/UNSAT transition. This is done by considering chains of satisfiability problems, where clauses are added sequentially to a problem instance. Using the overlap measure of similarity between different solutions found on the same problem instance, we examine geometrical changes as a function of α. In each chain, the overlap distribution is first smooth, but then develops a tiered structure, indicating that the solutions are found in well separated clusters. On chains of not too large instances, all remaining solutions are eventually observed to be found in only one small cluster before vanishing. This condensation transition point is estimated by finite size scaling to be αc = 4.26 with an apparent critical exponent of about 1.7. The average overlap value is also observed to increase with α up to the transition, indicating a reduction in solution space size, in accordance with theoretical predictions. The solutions are generated by a local heuristic, ASAT, and compared to those found by the Survey Propagation algorithm up to αc.
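The abstract does not state how the overlap is computed. A common convention, assumed here, maps each Boolean variable to a ±1 spin and averages the product over variables, so that identical assignments give 1 and complementary ones give -1. The sketch below computes the pairwise overlaps for a small set of hypothetical assignments; the function names and the sample data are illustrative and not taken from the paper.

```python
from itertools import combinations

def overlap(a, b):
    """Normalized overlap of two Boolean assignments viewed as +/-1 spins."""
    if len(a) != len(b):
        raise ValueError("assignments must have the same length")
    # Each agreeing variable contributes +1, each disagreeing one -1.
    return sum(1 if x == y else -1 for x, y in zip(a, b)) / len(a)

def overlap_distribution(solutions):
    """Overlaps of all unordered pairs of assignments."""
    return [overlap(a, b) for a, b in combinations(solutions, 2)]

# Hypothetical assignments over four variables (not real 3-SAT solutions).
solutions = [
    (True, True, False, False),
    (True, True, False, True),
    (False, False, True, True),
]
print(overlap_distribution(solutions))  # [0.5, -1.0, -0.5]
```

A histogram of such values over many sampled solutions is the kind of overlap distribution whose tiered structure would indicate separated clusters.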
A local search for a graph clustering problem
NASA Astrophysics Data System (ADS)
Navrotskaya, Anna; Il'ev, Victor
2016-10-01
In clustering problems one has to partition a given set of objects (a data set) into subsets (called clusters), taking into consideration only the similarity of the objects. One of the most visual formalizations of clustering is graph clustering, that is, grouping the vertices of a graph into clusters based on the edge structure of the graph, whose vertices are objects and whose edges represent similarities between the objects. In the graph k-clustering problem the number of clusters does not exceed k, and the goal is to minimize the number of edges between clusters plus the number of missing edges within clusters. This problem is NP-hard for any k ≥ 2. We propose a polynomial time (2k-1)-approximation algorithm for graph k-clustering. We then apply a local search procedure to the feasible solution found by this algorithm and carry out an experimental study of the resulting heuristics.
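The objective described in this abstract can be evaluated directly for a given partition. The sketch below counts edges that join different clusters plus vertex pairs inside a cluster that lack an edge. It only scores a partition and is not the approximation algorithm or the local search from the paper; the function name and the example graph are assumptions.

```python
from itertools import combinations

def clustering_cost(n_vertices, edges, labels):
    """Edges between clusters plus missing edges within clusters."""
    edge_set = {frozenset(e) for e in edges}
    cost = 0
    for u, v in combinations(range(n_vertices), 2):
        same_cluster = labels[u] == labels[v]
        has_edge = frozenset((u, v)) in edge_set
        # A pair is penalized when adjacency and co-membership disagree.
        if same_cluster != has_edge:
            cost += 1
    return cost

# Hypothetical graph: a triangle {0, 1, 2} linked to the edge {3, 4} via (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
print(clustering_cost(5, edges, [0, 0, 0, 1, 1]))  # 1
print(clustering_cost(5, edges, [0, 0, 0, 0, 0]))  # 5
```

In the first partition only the bridging edge (2, 3) is penalized; putting everything in one cluster is penalized for the five absent edges.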
A facility for using cluster research to study environmental problems
Not Available
1991-11-01
This report begins by describing the general application of cluster based research to environmental chemistry and the development of a Cluster Structure and Dynamics Research Facility (CSDRF). Next, four important areas of cluster research are described in more detail, including how they can impact environmental problems. These are: surface-supported clusters, water and contaminant interactions, time-resolved dynamic studies in clusters, and cluster structures and reactions. The facilities and equipment required for each area of research are then presented. The appendices contain the workshop agenda and a listing of the researchers who participated in the workshop discussions that led to this report.
ICANP2: Isoenergetic cluster algorithm for NP-complete Problems
NASA Astrophysics Data System (ADS)
Zhu, Zheng; Fang, Chao; Katzgraber, Helmut G.
NP-complete optimization problems with Boolean variables are of fundamental importance in computer science, mathematics and physics. Most notably, the minimization of general spin-glass-like Hamiltonians remains a difficult numerical task. There has been a great interest in designing efficient heuristics to solve these computationally difficult problems. Inspired by the rejection-free isoenergetic cluster algorithm developed for Ising spin glasses, we present a generalized cluster update that can be applied to different NP-complete optimization problems with Boolean variables. The cluster updates allow for a wide-spread sampling of phase space, thus speeding up optimization. By carefully tuning the pseudo-temperature (needed to randomize the configurations) of the problem, we show that the method can efficiently tackle problems on topologies with a large site-percolation threshold. We illustrate the ICANP2 heuristic on paradigmatic optimization problems, such as the satisfiability problem and the vertex cover problem.
Clustered Self Organising Migrating Algorithm for the Quadratic Assignment Problem
NASA Astrophysics Data System (ADS)
Davendra, Donald; Zelinka, Ivan; Senkerik, Roman
2009-08-01
An approach of population dynamics and clustering for permutative problems is presented in this paper. Diversity indicators are created from solution ordering and its mapping is shown as an advantage for population control in metaheuristics. Self Organising Migrating Algorithm (SOMA) is modified using this approach and vetted with the Quadratic Assignment Problem (QAP). Extensive experimentation is conducted on benchmark problems in this area.
Multicluster solutions to a multinucleon problem and clustering phenomena
Gnilozub, I. A.; Kurgalin, S. D.; Tchuvil'sky, Yu. M.
2008-07-15
Various concepts of clustering phenomena are discussed. Precise multicluster solutions constructed by the present authors for an A-nucleon problem whose dynamical properties are described by a generalized Elliott Hamiltonian are used as a mathematical formalism of the theory of clustering phenomena in nuclei. It is shown that qualitative features of various clustering phenomena, such as the very fact of the existence of cluster states, their classification, and selectivity of reactions that populate them, are explained within the concept being discussed. The 2α + bineutron three-cluster states of the ¹⁰Be nucleus are classified, and their spectrum is calculated. It is demonstrated that the results of these calculations are in good agreement with experimental data.
Automated High-Dimensional Flow Cytometric Data Analysis
NASA Astrophysics Data System (ADS)
Pyne, Saumyadipta; Hu, Xinli; Wang, Kui; Rossin, Elizabeth; Lin, Tsung-I.; Maier, Lisa; Baecher-Allan, Clare; McLachlan, Geoffrey; Tamayo, Pablo; Hafler, David; de Jager, Philip; Mesirov, Jill
Flow cytometry is widely used for single cell interrogation of surface and intracellular protein expression by measuring fluorescence intensity of fluorophore-conjugated reagents. We focus on the recently developed procedure of Pyne et al. (2009, Proceedings of the National Academy of Sciences USA 106, 8519-8524) for automated high-dimensional flow cytometric analysis called FLAME (FLow analysis with Automated Multivariate Estimation). It introduced novel finite mixture models of heavy-tailed and asymmetric distributions to identify and model cell populations in a flow cytometric sample. This approach robustly addresses the complexities of flow data without the need for transformation or projection to lower dimensions. It also addresses the critical task of matching cell populations across samples that enables downstream analysis. It thus facilitates application of flow cytometry to new biological and clinical problems. To facilitate pipelining with standard bioinformatic applications such as high-dimensional visualization, subject classification or outcome prediction, FLAME has been incorporated with the GenePattern package of the Broad Institute. Analysis of flow data can thereby be approached in the same way as data from other genomic platforms. We also consider some new work that proposes a rigorous and robust solution to the registration problem by a multi-level approach that allows us to model and register cell populations simultaneously across a cohort of high-dimensional flow samples. This new approach is called JCM (Joint Clustering and Matching). It enables direct and rigorous comparisons across different time points or phenotypes in a complex biological study as well as for classification of new patient samples in a more clinical setting.
The Heterogeneous P-Median Problem for Categorization Based Clustering
ERIC Educational Resources Information Center
Blanchard, Simon J.; Aloise, Daniel; DeSarbo, Wayne S.
2012-01-01
The p-median offers an alternative to centroid-based clustering algorithms for identifying unobserved categories. However, existing p-median formulations typically require data aggregation into a single proximity matrix, resulting in masked respondent heterogeneity. A proposed three-way formulation of the p-median problem explicitly considers…
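For readers unfamiliar with the classical p-median objective mentioned here, the following sketch shows the standard single-matrix version: choose p of the objects as medians so that the total dissimilarity from each object to its nearest median is as small as possible. It is a brute-force illustration for tiny inputs, not the three-way heterogeneous formulation the abstract proposes; the function name and the toy data are assumptions.

```python
from itertools import combinations

def p_median(dist, p):
    """Exhaustively pick p medians minimizing total distance to the nearest median."""
    n = len(dist)
    best_medians, best_cost = None, None
    for medians in combinations(range(n), p):
        cost = sum(min(dist[i][j] for j in medians) for i in range(n))
        if best_cost is None or cost < best_cost:
            best_medians, best_cost = medians, cost
    return best_medians, best_cost

# Hypothetical objects placed on a line; dissimilarity is absolute difference.
positions = [0, 1, 2, 10, 11]
dist = [[abs(a - b) for b in positions] for a in positions]
medians, cost = p_median(dist, 2)
print(medians, cost)  # (1, 3) 3
```

Unlike a centroid, each median is an actual object from the data, which is why the method needs only a proximity matrix rather than coordinates.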
Scalable Nearest Neighbor Algorithms for High Dimensional Data.
Muja, Marius; Lowe, David G
2014-11-01
For many computer vision and machine learning problems, large training sets are key for good performance. However, the most computationally expensive part of many computer vision and machine learning algorithms consists of finding nearest neighbor matches to high dimensional vectors that represent the training data. We propose new algorithms for approximate nearest neighbor matching and evaluate and compare them with previous algorithms. For matching high dimensional features, we find two algorithms to be the most efficient: the randomized k-d forest and a new algorithm proposed in this paper, the priority search k-means tree. We also propose a new algorithm for matching binary features by searching multiple hierarchical clustering trees and show it outperforms methods typically used in the literature. We show that the optimal nearest neighbor algorithm and its parameters depend on the data set characteristics and describe an automated configuration procedure for finding the best algorithm to search a particular data set. In order to scale to very large data sets that would otherwise not fit in the memory of a single machine, we propose a distributed nearest neighbor matching framework that can be used with any of the algorithms described in the paper. All this research has been released as an open source library called fast library for approximate nearest neighbors (FLANN), which has been incorporated into OpenCV and is now one of the most popular libraries for nearest neighbor matching.
Distributed Computation of the knn Graph for Large High-Dimensional Point Sets
Plaku, Erion; Kavraki, Lydia E.
2009-01-01
High-dimensional problems arising from robot motion planning, biology, data mining, and geographic information systems often require the computation of k nearest neighbor (knn) graphs. The knn graph of a data set is obtained by connecting each point to its k closest points. As the research in the above-mentioned fields progressively addresses problems of unprecedented complexity, the demand for computing knn graphs based on arbitrary distance metrics and large high-dimensional data sets increases, exceeding resources available to a single machine. In this work we efficiently distribute the computation of knn graphs for clusters of processors with message passing. Extensions to our distributed framework include the computation of graphs based on other proximity queries, such as approximate knn or range queries. Our experiments show nearly linear speedup with over one hundred processors and indicate that similar speedup can be obtained with several hundred processors. PMID:19847318
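The definition of the knn graph given in this abstract (connect each point to its k closest points) can be written down directly. The sketch below is a single-machine brute-force version using Euclidean distance, intended only to make the definition concrete; it is not the distributed message-passing algorithm of the paper, and the function name and sample points are assumptions.

```python
import math

def knn_graph(points, k):
    """Map each point index to the indices of its k nearest other points."""
    if not 0 < k < len(points):
        raise ValueError("k must satisfy 0 < k < number of points")
    graph = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Sort by distance, breaking ties by index for determinism.
        others.sort(key=lambda j: (math.dist(p, points[j]), j))
        graph[i] = others[:k]
    return graph

# Hypothetical 2D points forming two loose groups.
points = [(0, 0), (1, 0), (0, 2), (5, 5), (6, 5)]
graph = knn_graph(points, 2)
print(graph)  # {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 2], 4: [3, 2]}
```

The result is a directed graph: point 2 is among the nearest neighbors of point 3, but not the other way around. The brute-force cost grows quadratically with the number of points, which is what motivates approximate and distributed approaches.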
High dimensional feature reduction via projection pursuit
NASA Technical Reports Server (NTRS)
Jimenez, Luis; Landgrebe, David
1994-01-01
The recent development of more sophisticated remote sensing systems enables the measurement of radiation in many more spectral intervals than previously possible. An example of that technology is the AVIRIS system, which collects image data in 220 bands. As a result of this, new algorithms must be developed in order to analyze the more complex data effectively. Data in a high dimensional space presents a substantial challenge, since intuitive concepts valid in a 2-3 dimensional space do not necessarily apply in higher dimensional spaces. For example, high dimensional space is mostly empty. This results from the concentration of data in the corners of hypercubes. Other examples may be cited. Such observations suggest the need to project data to a subspace of a much lower dimension on a problem specific basis in such a manner that information is not lost. Projection Pursuit is a technique that will accomplish such a goal. Since it processes data in lower dimensions, it should avoid many of the difficulties of high dimensional spaces. In this paper, we begin the investigation of some of the properties of Projection Pursuit for this purpose.
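One standard way to make the "corners of hypercubes" remark concrete is to compare the volume of the ball inscribed in a unit hypercube with the volume of the cube itself. The ball of radius 1/2 in d dimensions has volume π^(d/2) / Γ(d/2 + 1) · (1/2)^d, and the cube has volume 1, so this expression is the fraction of the cube that lies outside the corners. The sketch below evaluates it; the function name is an assumption, and this illustration is not taken from the paper.

```python
import math

def inscribed_ball_fraction(d):
    """Fraction of the unit hypercube's volume inside its inscribed ball."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) / 2 ** d

for d in (2, 3, 10, 20):
    print(d, inscribed_ball_fraction(d))
```

The fraction is about 0.785 in two dimensions and roughly 0.0025 in ten, so for uniformly spread data almost all of the volume, and hence almost all of the points, ends up in the corner regions as the dimension grows.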
Statistical Physics of High Dimensional Inference
NASA Astrophysics Data System (ADS)
Advani, Madhu; Ganguli, Surya
To model modern large-scale datasets, we need efficient algorithms to infer a set of P unknown model parameters from N noisy measurements. What are fundamental limits on the accuracy of parameter inference, given limited measurements, signal-to-noise ratios, prior information, and computational tractability requirements? How can we combine prior information with measurements to achieve these limits? Classical statistics gives incisive answers to these questions as the measurement density α = N/P → ∞. However, modern high-dimensional inference problems, in fields ranging from bio-informatics to economics, occur at finite α. We formulate and analyze high-dimensional inference analytically by applying the replica and cavity methods of statistical physics where data serves as quenched disorder and inferred parameters play the role of thermal degrees of freedom. Our analysis reveals that widely cherished Bayesian inference algorithms such as maximum likelihood and maximum a posteriori are suboptimal in the modern setting, and yields new tractable, optimal algorithms to replace them as well as novel bounds on the achievable accuracy of a large class of high-dimensional inference algorithms. Thanks to Stanford Graduate Fellowship and Mind Brain Computation IGERT grant for support.
Analysis of data separation and recovery problems using clustered sparsity
NASA Astrophysics Data System (ADS)
King, Emily J.; Kutyniok, Gitta; Zhuang, Xiaosheng
2011-09-01
Data often have two or more fundamental components, like cartoon-like and textured elements in images; point, filament, and sheet clusters in astronomical data; and tonal and transient layers in audio signals. For many applications, separating these components is of interest. Another issue in data analysis is that of incomplete data, for example a photograph with scratches or seismic data collected with fewer than necessary sensors. There exists a unified approach to solving these problems which is minimizing the l1 norm of the analysis coefficients with respect to particular frame(s). This approach using the concept of clustered sparsity leads to similar theoretical bounds and results, which are presented here. Furthermore, necessary conditions for the frames to lead to sufficiently good solutions are also shown.
Application of clustering global optimization to thin film design problems.
Lemarchand, Fabien
2014-03-10
Refinement techniques usually calculate an optimized local solution, which is strongly dependent on the initial formula used for the thin film design. In the present study, a clustering global optimization method is used which can iteratively change this initial formula, thereby progressing further than in the case of local optimization techniques. A wide panel of local solutions is found using this procedure, resulting in a large range of optical thicknesses. The efficiency of this technique is illustrated by two thin film design problems, in particular an infrared antireflection coating, and a solar-selective absorber coating.
Effects of Cluster Location on Human Performance on the Traveling Salesperson Problem
ERIC Educational Resources Information Center
MacGregor, James N.
2013-01-01
Most models of human performance on the traveling salesperson problem involve clustering of nodes, but few empirical studies have examined effects of clustering in the stimulus array. A recent exception varied degree of clustering and concluded that the more clustered a stimulus array, the easier a TSP is to solve (Dry, Preiss, & Wagemans,…
A Link-Based Approach to the Cluster Ensemble Problem.
Iam-On, Natthakan; Boongoen, Tossapon; Garrett, Simon; Price, Chris
2011-12-01
Cluster ensembles have recently emerged as a powerful alternative to standard cluster analysis, aggregating several input data clusterings to generate a single output clustering, with improved robustness and stability. From the early work, these techniques held great promise; however, most of them generate the final solution based on incomplete information of a cluster ensemble. The underlying ensemble-information matrix reflects only cluster-data point relations, while those among clusters are generally overlooked. This paper presents a new link-based approach to improve the conventional matrix. It achieves this using the similarity between clusters that are estimated from a link network model of the ensemble. In particular, three new link-based algorithms are proposed for the underlying similarity assessment. The final clustering result is generated from the refined matrix using two different consensus functions of feature-based and graph-based partitioning. This approach is the first to address and explicitly employ the relationship between input partitions, which has not been emphasized by recent studies of matrix refinement. The effectiveness of the link-based approach is empirically demonstrated over 10 data sets (synthetic and real) and three benchmark evaluation measures. The results suggest the new approach is able to efficiently extract information embedded in the input clusterings, and regularly illustrate higher clustering quality in comparison to several state-of-the-art techniques.
A facility for using cluster research to study environmental problems. Workshop proceedings
Not Available
1991-11-01
This report begins by describing the general application of cluster based research to environmental chemistry and the development of a Cluster Structure and Dynamics Research Facility (CSDRF). Next, four important areas of cluster research are described in more detail, including how they can impact environmental problems. These are: surface-supported clusters, water and contaminant interactions, time-resolved dynamic studies in clusters, and cluster structures and reactions. The facilities and equipment required for each area of research are then presented. The appendices contain the workshop agenda and a listing of the researchers who participated in the workshop discussions that led to this report.
A high-dimensional look at VIPERS galaxies
NASA Astrophysics Data System (ADS)
Granett, Benjamin R.
2014-05-01
We investigate how galaxies in VIPERS (the VIMOS Public Extragalactic Redshift Survey) inhabit the cosmological density field by examining the correlations across the observable parameter space of galaxy properties and clustering strength. The high-dimensional analysis is made manageable by the use of group-finding and regression tools. We find that the major trends in galaxy properties can be explained by a single parameter related to stellar mass. After subtracting this trend, residual correlations remain between galaxy properties and the local environment pointing to complex formation dependencies. As a specific application of this work we build subsamples of galaxies with specific clustering properties for use in cosmological tests.
Quantifying Photonic High-Dimensional Entanglement
NASA Astrophysics Data System (ADS)
Martin, Anthony; Guerreiro, Thiago; Tiranov, Alexey; Designolle, Sébastien; Fröwis, Florian; Brunner, Nicolas; Huber, Marcus; Gisin, Nicolas
2017-03-01
High-dimensional entanglement offers promising perspectives in quantum information science. In practice, however, the main challenge is to devise efficient methods to characterize high-dimensional entanglement, based on the available experimental data which is usually rather limited. Here we report the characterization and certification of high-dimensional entanglement in photon pairs, encoded in temporal modes. Building upon recently developed theoretical methods, we certify an entanglement of formation of 2.09(7) ebits in a time-bin implementation, and 4.1(1) ebits in an energy-time implementation. These results are based on very limited sets of local measurements, which illustrates the practical relevance of these methods.
Sparse High Dimensional Models in Economics
Fan, Jianqing; Lv, Jinchi; Qi, Lei
2010-01-01
This paper reviews the literature on sparse high dimensional models and discusses some applications in economics and finance. Recent developments of theory, methods, and implementations in penalized least squares and penalized likelihood methods are highlighted. These variable selection methods are proved to be effective in high dimensional sparse modeling. The limits of dimensionality that regularization methods can handle, the role of penalty functions, and their statistical properties are detailed. Some recent advances in ultra-high dimensional sparse modeling are also briefly discussed. PMID:22022635
Analyzing High-Dimensional Multispectral Data
NASA Technical Reports Server (NTRS)
Lee, Chulhee; Landgrebe, David A.
1993-01-01
In this paper, through a series of specific examples, we illustrate some characteristics encountered in analyzing high-dimensional multispectral data. The increased importance of the second-order statistics in analyzing high-dimensional data is illustrated, as is the shortcoming of classifiers such as the minimum distance classifier which rely on first-order variations alone. We also illustrate how inaccurate estimation of first- and second-order statistics, e.g., from use of training sets which are too small, affects the performance of a classifier. Recognizing the importance of second-order statistics on the one hand, but the increased difficulty in perceiving and comprehending information present in statistics derived from high-dimensional data on the other, we propose a method to aid visualization of high-dimensional statistics using a color coding scheme.
The Subspace Voyager: Exploring High-Dimensional Data along a Continuum of Salient 3D Subspaces.
Wang, Bing; Mueller, Klaus
2017-02-23
Analyzing high-dimensional data and finding hidden patterns is a difficult problem and has attracted numerous research efforts. Automated methods can be useful to some extent, but bringing the data analyst into the loop via interactive visual tools can help the discovery process tremendously. An inherent problem in this effort is that humans lack the mental capacity to truly understand spaces exceeding three spatial dimensions. To keep within this limitation, we describe a framework that decomposes a high-dimensional data space into a continuum of generalized 3D subspaces. Analysts can then explore these 3D subspaces individually via the familiar trackball interface while using additional facilities to smoothly transition to adjacent subspaces for expanded space comprehension. Since the number of such subspaces suffers from combinatorial explosion, we provide a set of data-driven subspace selection and navigation tools which can guide users to interesting subspaces and views. A subspace trail map allows users to manage the explored subspaces, keep their bearings, and return to interesting subspaces and views. The trackball and the trail map are each embedded into a word cloud of attribute labels, which aids navigation. We demonstrate our system via several use cases in a diverse set of application areas: cluster analysis and refinement, information discovery, and supervised training of classifiers. We also report on a user study that evaluates the usability of the various interactions our system provides.
Numerical methods for high-dimensional probability density function equations
NASA Astrophysics Data System (ADS)
Cho, H.; Venturi, D.; Karniadakis, G. E.
2016-01-01
In this paper we address the problem of computing the numerical solution to kinetic partial differential equations involving many phase variables. These types of equations arise naturally in many different areas of mathematical physics, e.g., in particle systems (Liouville and Boltzmann equations), stochastic dynamical systems (Fokker-Planck and Dostupov-Pugachev equations), random wave theory (Malakhov-Saichev equations) and coarse-grained stochastic systems (Mori-Zwanzig equations). We propose three different classes of new algorithms addressing high-dimensionality: The first one is based on separated series expansions resulting in a sequence of low-dimensional problems that can be solved recursively and in parallel by using alternating direction methods. The second class of algorithms relies on truncation of interaction in low-orders that resembles the Bogoliubov-Born-Green-Kirkwood-Yvon (BBGKY) framework of kinetic gas theory and it yields a hierarchy of coupled probability density function equations. The third class of algorithms is based on high-dimensional model representations, e.g., the ANOVA method and probabilistic collocation methods. A common feature of all these approaches is that they are reducible to the problem of computing the solution to high-dimensional equations via a sequence of low-dimensional problems. The effectiveness of the new algorithms is demonstrated in numerical examples involving nonlinear stochastic dynamical systems and partial differential equations, with up to 120 variables.
Fast Nonparametric Machine Learning Algorithms for High-Dimensional Massive Data and Applications
2006-03-01
Liu, Ting. Technical report CMU-CS-06-124, March 2006.
Clusters of primordial black holes and reionization problem
Belotsky, K. M.; Kirillov, A. A.; Rubin, S. G.
2015-05-15
Clusters of primordial black holes may cause the formation of quasars in the early Universe. In turn, radiation from these quasars may lead to the reionization of the Universe. However, the evaporation of primordial black holes via Hawking’s mechanism may also contribute to the ionization of matter. The possibility of matter ionization via the evaporation of primordial black holes with allowance for existing constraints on their density is discussed. The contribution to ionization from the evaporation of primordial black holes characterized by their preset mass spectrum can roughly be estimated at about $10^{-3}$.
Feature extraction and classification algorithms for high dimensional data
NASA Technical Reports Server (NTRS)
Lee, Chulhee; Landgrebe, David
1993-01-01
Feature extraction and classification algorithms for high dimensional data are investigated. Developments with regard to sensors for Earth observation are moving in the direction of providing much higher dimensional multispectral imagery than is now possible. In analyzing such high dimensional data, processing time becomes an important factor. With large increases in dimensionality and the number of classes, processing time will increase significantly. To address this problem, a multistage classification scheme is proposed which reduces the processing time substantially by eliminating unlikely classes from further consideration at each stage. Several truncation criteria are developed and the relationship between thresholds and the error caused by the truncation is investigated. Next an approach to feature extraction for classification is proposed based directly on the decision boundaries. It is shown that all the features needed for classification can be extracted from decision boundaries. A characteristic of the proposed method arises by noting that only a portion of the decision boundary is effective in discriminating between classes, and the concept of the effective decision boundary is introduced. The proposed feature extraction algorithm has several desirable properties: it predicts the minimum number of features necessary to achieve the same classification accuracy as in the original space for a given pattern recognition problem; and it finds the necessary feature vectors. The proposed algorithm does not deteriorate under the circumstances of equal means or equal covariances as some previous algorithms do. In addition, the decision boundary feature extraction algorithm can be used both for parametric and non-parametric classifiers. Finally, some problems encountered in analyzing high dimensional data are studied and possible solutions are proposed. First, the increased importance of the second order statistics in analyzing high dimensional data is recognized
HIGH DIMENSIONAL COVARIANCE MATRIX ESTIMATION IN APPROXIMATE FACTOR MODELS
Fan, Jianqing; Liao, Yuan; Mincheva, Martina
2012-01-01
The variance covariance matrix plays a central role in the inferential theories of high dimensional factor models in finance and economics. Popular regularization methods of directly exploiting sparsity are not directly applicable to many financial problems. Classical methods of estimating the covariance matrices are based on the strict factor models, assuming independent idiosyncratic components. This assumption, however, is restrictive in practical applications. By assuming sparse error covariance matrix, we allow the presence of the cross-sectional correlation even after taking out common factors, and it enables us to combine the merits of both methods. We estimate the sparse covariance using the adaptive thresholding technique as in Cai and Liu (2011), taking into account the fact that direct observations of the idiosyncratic components are unavailable. The impact of high dimensionality on the covariance matrix estimation based on the factor structure is then studied. PMID:22661790
An Extended Membrane System with Active Membranes to Solve Automatic Fuzzy Clustering Problems.
Peng, Hong; Wang, Jun; Shi, Peng; Pérez-Jiménez, Mario J; Riscos-Núñez, Agustín
2016-05-01
This paper focuses on automatic fuzzy clustering problem and proposes a novel automatic fuzzy clustering method that employs an extended membrane system with active membranes that has been designed as its computing framework. The extended membrane system has a dynamic membrane structure; since membranes can evolve, it is particularly suitable for processing the automatic fuzzy clustering problem. A modification of a differential evolution (DE) mechanism was developed as evolution rules for objects according to membrane structure and object communication mechanisms. Under the control of both the object's evolution-communication mechanism and the membrane evolution mechanism, the extended membrane system can effectively determine the most appropriate number of clusters as well as the corresponding optimal cluster centers. The proposed method was evaluated over 13 benchmark problems and was compared with four state-of-the-art automatic clustering methods, two recently developed clustering methods and six classification techniques. The comparison results demonstrate the superiority of the proposed method in terms of effectiveness and robustness.
Problem-Solving Environments (PSEs) to Support Innovation Clustering
NASA Technical Reports Server (NTRS)
Gill, Zann
1999-01-01
This paper argues that there is need for high level concepts to inform the development of Problem-Solving Environment (PSE) capability. A traditional approach to PSE implementation is to: (1) assemble a collection of tools; (2) integrate the tools; and (3) assume that collaborative work begins after the PSE is assembled. I argue for the need to start from the opposite premise, that promoting human collaboration and observing that process comes first, followed by the development of supporting tools, and finally evolution of PSE capability through input from collaborating project teams.
NASA Astrophysics Data System (ADS)
Masood, Tabasum
2016-07-01
The distribution of galaxies in the universe can be well understood through correlation function analysis. The lowest-order, two-point autocorrelation function remains a successful tool for understanding galaxy clustering. The two-point correlation function measures the probability of finding two galaxies in a given volume separated by a particular distance: given a random galaxy at some location, it describes the probability that another galaxy will be found within a given distance. The correlation function is important for theoretical models of physical cosmology because it provides a means of testing models that make different assumptions about the contents of the universe. It is one way to characterize the distribution of galaxies in space; it can be measured from observations and extracted from numerical N-body experiments. The correlation function is a natural quantity in the theoretical dynamical description of gravitating systems. These correlations can answer many interesting questions about the evolution and distribution of galaxies.
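As an illustration of the pair-counting idea behind the two-point correlation function, here is a minimal sketch. It is not taken from the paper: it assumes the simple "natural" estimator, which compares normalized pair counts in the data (DD) with those in an unclustered random catalogue (RR), and the function names `pair_counts` and `two_point_correlation` are hypothetical.

```python
import math

def pair_counts(points, edges):
    """Count pairs of points whose separation falls in each bin [edges[k], edges[k+1])."""
    counts = [0] * (len(edges) - 1)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            for k in range(len(counts)):
                if edges[k] <= d < edges[k + 1]:
                    counts[k] += 1
                    break
    return counts

def two_point_correlation(data, randoms, edges):
    """Natural estimator (an assumption of this sketch): xi = (DD/N_DD) / (RR/N_RR) - 1.

    Returns None for bins in which the random catalogue has no pairs.
    """
    dd = pair_counts(data, edges)
    rr = pair_counts(randoms, edges)
    n_dd = len(data) * (len(data) - 1) / 2        # total number of data pairs
    n_rr = len(randoms) * (len(randoms) - 1) / 2  # total number of random pairs
    return [(d / n_dd) / (r / n_rr) - 1.0 if r > 0 else None
            for d, r in zip(dd, rr)]
```

A positive value in a bin means more pairs at that separation than an unclustered distribution would give; a value of 0 means no excess. The double loop is quadratic in the number of points, so real surveys rely on tree-based pair counting instead.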
Fast Gibbs sampling for high-dimensional Bayesian inversion
NASA Astrophysics Data System (ADS)
Lucka, Felix
2016-11-01
Solving ill-posed inverse problems by Bayesian inference has recently attracted considerable attention. Compared to deterministic approaches, the probabilistic representation of the solution by the posterior distribution can be exploited to explore and quantify its uncertainties. In applications where the inverse solution is subject to further analysis procedures, this can be a significant advantage. Alongside theoretical progress, various new computational techniques allow us to sample very high dimensional posterior distributions: in (Lucka 2012 Inverse Problems 28 125012), a Markov chain Monte Carlo posterior sampler was developed for linear inverse problems with $\ell_1$-type priors. In this article, we extend this single component (SC) Gibbs-type sampler to a wide range of priors used in Bayesian inversion, such as general $\ell_p^q$ priors with additional hard constraints. In addition to a fast computation of the conditional SC densities in an explicit, parameterized form, a fast, robust and exact sampling from these one-dimensional densities is key to obtaining an efficient algorithm. We demonstrate that a generalization of slice sampling can utilize their specific structure for this task and illustrate the performance of the resulting slice-within-Gibbs samplers by different computed examples. These new samplers allow us to perform sample-based Bayesian inference in high-dimensional scenarios with certain priors for the first time, including the inversion of computed tomography data with the popular isotropic total variation prior.
Locating landmarks on high-dimensional free energy surfaces.
Chen, Ming; Yu, Tang-Qing; Tuckerman, Mark E
2015-03-17
Coarse graining of complex systems possessing many degrees of freedom can often be a useful approach for analyzing and understanding key features of these systems in terms of just a few variables. The relevant energy landscape in a coarse-grained description is the free energy surface as a function of the coarse-grained variables, which, despite the dimensional reduction, can still be an object of high dimension. Consequently, navigating and exploring this high-dimensional free energy surface is a nontrivial task. In this paper, we use techniques from multiscale modeling, stochastic optimization, and machine learning to devise a strategy for locating minima and saddle points (termed "landmarks") on a high-dimensional free energy surface "on the fly" and without requiring prior knowledge of or an explicit form for the surface. In addition, we propose a compact graph representation of the landmarks and connections between them, and we show that the graph nodes can be subsequently analyzed and clustered based on key attributes that elucidate important properties of the system. Finally, we show that knowledge of landmark locations allows for the efficient determination of their relative free energies via enhanced sampling techniques.
Exhaustive enumeration unveils clustering and freezing in the random 3-satisfiability problem
NASA Astrophysics Data System (ADS)
Ardelius, John; Zdeborová, Lenka
2008-10-01
We study geometrical properties of the complete set of solutions of the random 3-satisfiability problem. We show that even for moderate system sizes the number of clusters corresponds surprisingly well with the theoretic asymptotic prediction. We locate the freezing transition in the space of solutions, which has been conjectured to be relevant in explaining the onset of computational hardness in random constraint satisfaction problems.
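The idea of enumerating the complete solution set and grouping it into clusters can be sketched for very small instances as follows. This is an illustration under stated assumptions, not the authors' code: clusters are taken here to be connected components of the graph in which two solutions are adjacent when they differ in exactly one variable, and the helper names are hypothetical.

```python
from itertools import product

def satisfies(assignment, clauses):
    """Literal k > 0 means variable k is True; k < 0 means variable |k| is False."""
    return all(any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
               for clause in clauses)

def all_solutions(n_vars, clauses):
    """Exhaustively enumerate every satisfying assignment (2**n_vars candidates)."""
    return [a for a in product((False, True), repeat=n_vars)
            if satisfies(a, clauses)]

def clusters(solutions):
    """Group solutions into connected components under single-variable flips."""
    remaining = set(solutions)
    result = []
    while remaining:
        start = remaining.pop()
        component, stack = {start}, [start]
        while stack:
            s = stack.pop()
            for i in range(len(s)):
                neighbour = s[:i] + (not s[i],) + s[i + 1:]
                if neighbour in remaining:
                    remaining.remove(neighbour)
                    component.add(neighbour)
                    stack.append(neighbour)
        result.append(component)
    return result
```

Under this definition, a variable that takes the same value in every solution of a cluster can be read as frozen in that cluster. The enumeration is exponential in the number of variables, so the sketch is only usable for toy formulas.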
ERIC Educational Resources Information Center
Brusco, Michael J.; Kohn, Hans-Friedrich
2009-01-01
The clique partitioning problem (CPP) requires the establishment of an equivalence relation for the vertices of a graph such that the sum of the edge costs associated with the relation is minimized. The CPP has important applications for the social sciences because it provides a framework for clustering objects measured on a collection of nominal…
ERIC Educational Resources Information Center
Raver, C. Cybele; Jones, Stephanie M.; Li-Grining, Christine; Zhai, Fuhua; Metzger, Molly W.; Solomon, Bonnie
2009-01-01
The present study evaluated the efficacy of a multicomponent, classroom-based intervention in reducing preschoolers' behavior problems. The Chicago School Readiness Project model was implemented in 35 Head Start classrooms using a clustered-randomized controlled trial design. Results indicate significant treatment effects (ds = 0.53-0.89) for…
Technical Report: Scalable Parallel Algorithms for High Dimensional Numerical Integration
Masalma, Yahya; Jiao, Yu
2010-10-01
We implemented a scalable parallel quasi-Monte Carlo method for high-dimensional numerical integration over tera-scale data points. The implemented algorithm uses Sobol's quasi-random sequences to generate samples. Sobol's sequence was used to avoid clustering effects in the generated samples and to produce low-discrepancy samples which cover the entire integration domain. The performance of the algorithm was tested. The results obtained demonstrate the scalability and accuracy of the implemented algorithms. The implemented algorithm could be used in different applications where a huge data volume is generated and numerical integration is required. We suggest using the hybrid MPI and OpenMP programming model to improve the performance of the algorithms. If the mixed model is used, attention should be paid to the scalability and accuracy.
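The principle of quasi-Monte Carlo integration can be shown in a few lines. Note the substitution: the report uses Sobol sequences, whereas this sketch uses the simpler Halton sequence (another low-discrepancy sequence) so that it fits in the standard library. It is serial, and all function names are hypothetical.

```python
def van_der_corput(n, base):
    """Radical inverse of the integer n in the given base, a value in [0, 1)."""
    q, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        q += digit / denom
    return q

def halton_point(index, bases):
    """One point of the Halton sequence; use a distinct prime base per dimension."""
    return [van_der_corput(index, b) for b in bases]

def qmc_integrate(f, bases, n_points):
    """Estimate the integral of f over the unit hypercube as a plain sample mean."""
    total = 0.0
    for i in range(1, n_points + 1):  # index 0 would give the origin, so start at 1
        total += f(halton_point(i, bases))
    return total / n_points

# Example: the integral of x*y over the unit square is exactly 0.25.
estimate = qmc_integrate(lambda p: p[0] * p[1], bases=(2, 3), n_points=4096)
```

Because each sample depends only on its index, the loop can be split across workers by index range, which is what makes this family of methods attractive for parallel implementations.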
A novel approach to the problem of non-uniqueness of the solution in hierarchical clustering.
Cattinelli, Isabella; Valentini, Giorgio; Paulesu, Eraldo; Borghese, Nunzio Alberto
2013-07-01
The existence of multiple solutions in clustering, and in hierarchical clustering in particular, is often ignored in practical applications. However, this is a non-trivial problem, as different data orderings can result in different cluster sets that, in turn, may lead to different interpretations of the same data. The method presented here offers a solution to this issue. It is based on the definition of an equivalence relation over dendrograms that allows developing all and only the significantly different dendrograms for the same dataset, thus reducing the computational complexity to polynomial from the exponential obtained when all possible dendrograms are considered. Experimental results in the neuroimaging and bioinformatics domains show the effectiveness of the proposed method.
Spatially Weighted Principal Component Regression for High-dimensional Prediction
Shen, Dan; Zhu, Hongtu
2015-01-01
We consider the problem of using high dimensional data residing on graphs to predict a low-dimensional outcome variable, such as disease status. Examples of data include time series and genetic data measured on linear graphs and imaging data measured on triangulated graphs (or lattices), among many others. Many of these data have two key features including spatial smoothness and intrinsically low dimensional structure. We propose a simple solution based on a general statistical framework, called spatially weighted principal component regression (SWPCR). In SWPCR, we introduce two sets of weights including importance score weights for the selection of individual features at each node and spatial weights for the incorporation of the neighboring pattern on the graph. We integrate the importance score weights with the spatial weights in order to recover the low dimensional structure of high dimensional data. We demonstrate the utility of our methods through extensive simulations and a real data analysis based on Alzheimer’s disease neuroimaging initiative data. PMID:26213452
NASA Astrophysics Data System (ADS)
Konno, Yohko; Suzuki, Keiji
This paper describes an approach to developing a general-purpose solution algorithm for large-scale problems, using “Local Clustering Organization (LCO)” as a new method for the job-shop scheduling problem (JSP). Building on the effective performance of conventional LCO on large-scale scheduling, we examine how to solve the JSP so that better solutions are obtained stably. To improve solution performance for the JSP, the optimization process of LCO is examined, and the scheduling solution structure is extended to a new solution structure based on machine division. A solving method that introduces effective local clustering for this solution structure is proposed as an extended LCO. The extended LCO includes an algorithm that improves the schedule evaluation efficiently through clustering in a parallel search that extends over multiple machines. Applying the extended LCO to problems of various scales confirmed that it contributes to minimizing the makespan and to improving the stability of performance.
NASA Astrophysics Data System (ADS)
Chen, L. X.; Wu, Q. P.
2012-10-01
Recently, Dada et al. reported on the experimental entanglement concentration and violation of generalized Bell inequalities with orbital angular momentum (OAM) [Nat. Phys. 7, 677 (2011)]. Here we demonstrate that the high-dimensional entanglement concentration can be performed in arbitrary OAM subspaces with selectivity. Instead of violating the generalized Bell inequalities, the working principle of present entanglement concentration is visualized by the biphoton OAM Klyshko picture, and its good performance is confirmed and quantified through the experimental Shannon dimensionalities after concentration.
A model-based cluster analysis approach to adolescent problem behaviors and young adult outcomes.
Mun, Eun Young; Windle, Michael; Schainker, Lisa M
2008-01-01
Data from a community-based sample of 1,126 10th- and 11th-grade adolescents were analyzed using a model-based cluster analysis approach to empirically identify heterogeneous adolescent subpopulations from the person-oriented and pattern-oriented perspectives. The model-based cluster analysis is a new clustering procedure to investigate population heterogeneity utilizing finite mixture multivariate normal densities and accordingly to classify subpopulations using more rigorous statistical procedures for the comparison of alternative models. Four cluster groups were identified and labeled multiproblem high-risk, smoking high-risk, normative, and low-risk groups. The multiproblem high risk exhibited a constellation of high levels of problem behaviors, including delinquent and sexual behaviors, multiple illicit substance use, and depressive symptoms at age 16. They had risky temperamental attributes and lower academic functioning and educational expectations at age 15.5 and, subsequently, at age 24 completed fewer years of education, and reported lower levels of physical health and higher levels of continued involvement in substance use and abuse. The smoking high-risk group was also found to be at risk for poorer functioning in young adulthood, compared to the low-risk group. The normative and the low risk groups were, by and large, similar in their adolescent and young adult functioning. The continuity and comorbidity path from middle adolescence to young adulthood may be aided and abetted by chronic as well as episodic substance use by adolescents.
Classification of high dimensional multispectral image data
NASA Technical Reports Server (NTRS)
Hoffbeck, Joseph P.; Landgrebe, David A.
1993-01-01
A method for classifying high dimensional remote sensing data is described. The technique uses a radiometric adjustment to allow a human operator to identify and label training pixels by visually comparing the remotely sensed spectra to laboratory reflectance spectra. Training pixels for material without obvious spectral features are identified by traditional means. Features which are effective for discriminating between the classes are then derived from the original radiance data and used to classify the scene. This technique is applied to Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) data taken over Cuprite, Nevada in 1992, and the results are compared to an existing geologic map. This technique performed well even with noisy data and the fact that some of the materials in the scene lack absorption features. No adjustment for the atmosphere or other scene variables was made to the data classified. While the experimental results compare favorably with an existing geologic map, the primary purpose of this research was to demonstrate the classification method, as compared to the geology of the Cuprite scene.
Graphics Processing Units and High-Dimensional Optimization
Zhou, Hua; Lange, Kenneth; Suchard, Marc A.
2011-01-01
This paper discusses the potential of graphics processing units (GPUs) in high-dimensional optimization problems. A single GPU card with hundreds of arithmetic cores can be inserted in a personal computer and dramatically accelerates many statistical algorithms. To exploit these devices fully, optimization algorithms should reduce to multiple parallel tasks, each accessing a limited amount of data. These criteria favor EM and MM algorithms that separate parameters and data. To a lesser extent block relaxation and coordinate descent and ascent also qualify. We demonstrate the utility of GPUs in nonnegative matrix factorization, PET image reconstruction, and multidimensional scaling. Speedups of 100 fold can easily be attained. Over the next decade, GPUs will fundamentally alter the landscape of computational statistics. It is time for more statisticians to get on-board. PMID:21847315
Clustering analysis of the ground-state structure of the vertex-cover problem
NASA Astrophysics Data System (ADS)
Barthel, Wolfgang; Hartmann, Alexander K.
2004-12-01
Vertex cover is one of the classical NP-complete problems in theoretical computer science. A vertex cover of a graph is a subset of vertices such that for each edge at least one of the two endpoints is contained in the subset. When studied on Erdös-Rényi random graphs (with connectivity c ) one observes a threshold behavior: In the thermodynamic limit the size of the minimal vertex cover is independent of the specific graph. Recent analytical studies show that on the phase boundary, for small connectivities c
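The definition of a vertex cover given above translates directly into code. The following is a minimal sketch with hypothetical function names; the minimum cover is found by brute force over subsets of increasing size, which is only feasible for very small graphs because the problem is NP-complete.

```python
from itertools import combinations

def is_vertex_cover(edges, subset):
    """True if every edge has at least one endpoint in the subset."""
    chosen = set(subset)
    return all(u in chosen or v in chosen for u, v in edges)

def minimum_vertex_cover(n_vertices, edges):
    """Return one smallest vertex cover, trying subsets of increasing size."""
    for size in range(n_vertices + 1):
        for subset in combinations(range(n_vertices), size):
            if is_vertex_cover(edges, subset):
                return set(subset)

# Path 0-1-2: the middle vertex alone covers both edges.
path_cover = minimum_vertex_cover(3, [(0, 1), (1, 2)])
# Triangle: any two vertices are needed.
triangle_cover = minimum_vertex_cover(3, [(0, 1), (1, 2), (0, 2)])
```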
High dimensional decision dilemmas in climate models
NASA Astrophysics Data System (ADS)
Bracco, A.; Neelin, J. D.; Luo, H.; McWilliams, J. C.; Meyerson, J. E.
2013-10-01
An important source of uncertainty in climate models is linked to the calibration of model parameters. Interest in systematic and automated parameter optimization procedures stems from the desire to improve the model climatology and to quantify the average sensitivity associated with potential changes in the climate system. Building upon the smoothness of the response of an atmospheric circulation model (AGCM) to changes of four adjustable parameters, Neelin et al. (2010) used a quadratic metamodel to objectively calibrate the AGCM. The metamodel accurately estimates global spatial averages of common fields of climatic interest, from precipitation, to low and high level winds, from temperature at various levels to sea level pressure and geopotential height, while providing a computationally cheap strategy to explore the influence of parameter settings. Here, guided by the metamodel, the ambiguities or dilemmas related to the decision making process in relation to model sensitivity and optimization are examined. Simulations of current climate are subject to considerable regional-scale biases. Those biases may vary substantially depending on the climate variable considered, and/or on the performance metric adopted. Common dilemmas are associated with model revisions yielding improvement in one field or regional pattern or season, but degradation in another, or improvement in the model climatology but degradation in the interannual variability representation. Challenges are posed to the modeler by the high dimensionality of the model output fields and by the large number of adjustable parameters. The use of the metamodel in the optimization strategy helps visualize trade-offs at a regional level, e.g., how mismatches between sensitivity and error spatial fields yield regional errors under minimization of global objective functions.
NASA Astrophysics Data System (ADS)
Heggie, D.; Hut, P.
2003-10-01
focus on $N = 10^6$ for two main reasons: first, direct numerical integrations of N-body systems are beginning to approach this threshold, and second, globular star clusters provide remarkably accurate physical instantiations of the idealized N-body problem with $N = 10^5$-$10^6$. The authors are distinguished contributors to the study of star-cluster dynamics and the gravitational N-body problem. The book contains lucid and concise descriptions of most of the important tools in the subject, with only a modest bias towards the authors' own interests. These tools include the two-body relaxation approximation, the Vlasov and Fokker-Planck equations, regularization of close encounters, conducting fluid models, Hill's approximation, Heggie's law for binary star evolution, symplectic integration algorithms, Liapunov exponents, and so on. The book also provides an up-to-date description of the principal processes that drive the evolution of idealized N-body systems - two-body relaxation, mass segregation, escape, core collapse and core bounce, binary star hardening, gravothermal oscillations - as well as additional processes such as stellar collisions and tidal shocks that affect real star clusters but not idealized N-body systems. In a relatively short (300 pages plus appendices) book such as this, many topics have to be omitted. The reader who is hoping to learn about the phenomenology of star clusters will be disappointed, as the description of their properties is limited to only a page of text; there is also almost no discussion of other, equally interesting N-body systems such as galaxies ($N \approx 10^6$-$10^{12}$), open clusters ($N \simeq 10^2$-$10^4$), planetary systems, or the star clusters surrounding black holes that are found in the centres of most galaxies. All of these omissions are defensible decisions.
Less defensible is the uneven set of references in the text; for example, nowhere is the reader informed that the classic predecessor to this work was Spitzer's 1987 monograph
Anomaly Detection in Large Sets of High-Dimensional Symbol Sequences
NASA Technical Reports Server (NTRS)
Budalakoti, Suratna; Srivastava, Ashok N.; Akella, Ram; Turkov, Eugene
2006-01-01
This paper addresses the problem of detecting and describing anomalies in large sets of high-dimensional symbol sequences. The approach taken uses unsupervised clustering of sequences using the normalized longest common subsequence (LCS) as a similarity measure, followed by detailed analysis of outliers to detect anomalies. As the LCS measure is expensive to compute, the first part of the paper discusses existing algorithms, such as the Hunt-Szymanski algorithm, that have low time-complexity. We then discuss why these algorithms often do not work well in practice and present a new hybrid algorithm for computing the LCS that, in our tests, outperforms the Hunt-Szymanski algorithm by a factor of five. The second part of the paper presents new algorithms for outlier analysis that provide comprehensible indicators as to why a particular sequence was deemed to be an outlier. The algorithms provide a coherent description to an analyst of the anomalies in the sequence, compared to more normal sequences. The algorithms we present are general and domain-independent, so we discuss applications in related areas such as anomaly detection.
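For readers unfamiliar with the similarity measure, here is a minimal sketch of a normalized longest common subsequence. It uses the textbook dynamic-programming algorithm, not the Hunt-Szymanski or hybrid algorithms discussed in the paper, and the normalization by the geometric mean of the two sequence lengths is an assumption of this sketch.

```python
import math

def lcs_length(a, b):
    """Length of the longest common subsequence, O(len(a) * len(b)) time."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]

def normalized_lcs(a, b):
    """Similarity in [0, 1]: LCS length divided by sqrt(len(a) * len(b))."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / math.sqrt(len(a) * len(b))
```

Both functions accept any sequences of comparable symbols (strings, lists, tuples). A distance for clustering can then be taken as one minus the similarity.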
Matton, Annelies; Goossens, Lien; Braet, Caroline; Vervaet, Myriam
2013-05-01
Little is known about the role of sensitivity to punishment (SP) and reward (SR) in eating problems during adolescence. Therefore, the aim of the present study was to examine the naturally occurring clusters of high and low SP and SR among nonclinical adolescents and the between-cluster differences in various eating problems and weight. A total of 579 adolescents (14-19 years, 39.8% boys) completed the Sensitivity to Punishment and Sensitivity to Reward Questionnaire (SPSRQ), the Behavioural Inhibition System and Behavioural Activation System scales (BIS/BAS scales), the Dutch Eating Behaviour Questionnaire and the Child Eating Disorder Examination Questionnaire and were weighed and measured. On the basis of the SPSRQ, four clusters were established, interpreted as lowSP × lowSR, lowSP × highSR, highSP × highSR and highSP × lowSR. These were associated with eating problems but not with adjusted body mass index. It seemed that specifically the highSP × highSR cluster outscored the other clusters on eating problems. These results were partly replicated with the BIS/BAS scales, although less significant relations between the clusters and eating problems were found. The implications of the findings in terms of possible risk and protective clusters are discussed.
Parallel computations using a cluster of workstations to simulate elasticity problems
NASA Astrophysics Data System (ADS)
Darmawan, J. B. B.; Mungkasi, S.
2016-11-01
Computational physics has played important roles in real world problems. This paper is within the applied computational physics area. The aim of this study is to observe the performance of parallel computations using a cluster of workstations (COW) to simulate elasticity problems. Parallel computations with the COW configuration are conducted using the Message Passing Interface (MPI) standard. In parallel computations with COW, we consider five scenarios with twenty simulations. In addition to the execution time, efficiency is used to evaluate programming algorithm scenarios. Sequential and parallel programming performances are evaluated based on their execution time and efficiency. Results show that the one-dimensional elasticity equations are not appropriate to be solved in parallel with MPI_Send and MPI_Recv technique in the MPI standard, because the total amount of time to exchange data is considered more dominant compared with the total amount of time to conduct the basic elasticity computation.
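The two evaluation quantities mentioned above, execution time and efficiency, are related by the standard definitions of speedup and parallel efficiency. The sketch below uses those definitions together with a deliberately crude cost model (compute time divided evenly across processes plus a fixed communication time); the model and the function names are assumptions for illustration, not the authors' measurements.

```python
def speedup(t_sequential, t_parallel):
    """Speedup S = T_sequential / T_parallel."""
    return t_sequential / t_parallel

def efficiency(t_sequential, t_parallel, n_processes):
    """Parallel efficiency E = S / p; a value of 1.0 means ideal scaling."""
    return speedup(t_sequential, t_parallel) / n_processes

def parallel_time(t_compute, t_communication, n_processes):
    """Toy model (assumed): perfectly divisible compute plus fixed data-exchange time."""
    return t_compute / n_processes + t_communication

# With 8 s of computation and 3 s of data exchange on 4 processes,
# communication dominates and efficiency is low.
t_par = parallel_time(8.0, 3.0, 4)
eff = efficiency(8.0, t_par, 4)
```

In this toy model, efficiency falls as the communication term grows relative to the per-process compute time, which is the kind of effect the abstract reports for the one-dimensional elasticity equations.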
Optimal control problem for the three-sector economic model of a cluster
NASA Astrophysics Data System (ADS)
Murzabekov, Zainel; Aipanov, Shamshi; Usubalieva, Saltanat
2016-08-01
The problem of optimal control for the three-sector economic model of a cluster is considered. The task is to determine the optimal distribution of investment and manpower in moving the system from a given initial state to a desired final state. To solve the optimal control problem with finite-horizon planning, in the case of fixed ends of trajectories and box constraints, the method of Lagrange multipliers of a special type is used. This approach allows the desired control to be represented in the form of a synthesis control that depends on the state of the system and the current time. The results of numerical calculations for an instance of the three-sector model of the economy show the effectiveness of the proposed method.
Bias-Corrected Diagonal Discriminant Rules for High-Dimensional Classification
Huang, Song; Tong, Tiejun; Zhao, Hongyu
2011-01-01
Diagonal discriminant rules have been successfully used for high-dimensional classification problems, but suffer from the serious drawback of biased discriminant scores. In this paper, we propose improved diagonal discriminant rules with bias-corrected discriminant scores for high-dimensional classification. We show that the proposed discriminant scores dominate the standard ones under the quadratic loss function. Analytical results on why the bias-corrected rules can potentially improve the prediction accuracy are also provided. Finally, we demonstrate the improvement of the proposed rules over the original ones through extensive simulation studies and real case studies. PMID:20222939
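For orientation, a minimal sketch of the standard (uncorrected) diagonal linear discriminant rule is given below. It assumes the usual textbook form, in which off-diagonal covariances are ignored and a sample is assigned to the class with the smallest score; the bias correction proposed in the paper is not reproduced, and all function names are hypothetical.

```python
import math


def fit_diagonal_lda(X, y):
    """Estimate class means, pooled per-feature variances and class priors."""
    n, p = len(X), len(X[0])
    classes = sorted(set(y))
    means, priors = {}, {}
    for k in classes:
        rows = [x for x, label in zip(X, y) if label == k]
        means[k] = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
        priors[k] = len(rows) / n
    # Pooled within-class variance of each feature (off-diagonals are ignored).
    pooled = [
        sum((x[j] - means[label][j]) ** 2 for x, label in zip(X, y)) / (n - len(classes))
        for j in range(p)
    ]
    return means, pooled, priors


def discriminant_score(x, mean, pooled, prior):
    """Standard diagonal score; a smaller value means a better match."""
    dist = sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, mean, pooled))
    return dist - 2.0 * math.log(prior)


def predict(x, means, pooled, priors):
    """Assign x to the class with the smallest discriminant score."""
    return min(means, key=lambda k: discriminant_score(x, means[k], pooled, priors[k]))


if __name__ == "__main__":
    X = [[0, 0], [1, 1], [0, 1], [1, 0], [5, 5], [6, 6], [5, 6], [6, 5]]
    y = [0, 0, 0, 0, 1, 1, 1, 1]
    means, pooled, priors = fit_diagonal_lda(X, y)
    print(predict([0.2, 0.3], means, pooled, priors))
    print(predict([5.8, 5.1], means, pooled, priors))
```

The score depends on estimated means and variances, so with few samples and many features it is a noisy quantity; that is the setting in which the abstract argues a bias correction helps.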
DD-HDS: A method for visualization and exploration of high-dimensional data.
Lespinats, Sylvain; Verleysen, Michel; Giron, Alain; Fertil, Bernard
2007-09-01
Mapping high-dimensional data in a low-dimensional space, for example, for visualization, is a problem of increasingly major concern in data analysis. This paper presents data-driven high-dimensional scaling (DD-HDS), a nonlinear mapping method that follows the line of multidimensional scaling (MDS) approach, based on the preservation of distances between pairs of data. It improves the performance of existing competitors with respect to the representation of high-dimensional data, in two ways. It introduces (1) a specific weighting of distances between data taking into account the concentration of measure phenomenon and (2) a symmetric handling of short distances in the original and output spaces, avoiding false neighbor representations while still allowing some necessary tears in the original distribution. More precisely, the weighting is set according to the effective distribution of distances in the data set, with the exception of a single user-defined parameter setting the tradeoff between local neighborhood preservation and global mapping. The optimization of the stress criterion designed for the mapping is realized by "force-directed placement" (FDP). The mappings of low- and high-dimensional data sets are presented as illustrations of the features and advantages of the proposed algorithm. The weighting function specific to high-dimensional data and the symmetric handling of short distances can be easily incorporated in most distance preservation-based nonlinear dimensionality reduction methods.
Engineering two-photon high-dimensional states through quantum interference.
Zhang, Yingwen; Roux, Filippus S; Konrad, Thomas; Agnew, Megan; Leach, Jonathan; Forbes, Andrew
2016-02-01
Many protocols in quantum science, for example, linear optical quantum computing, require access to large-scale entangled quantum states. Such systems can be realized through many-particle qubits, but this approach often suffers from scalability problems. An alternative strategy is to consider a lesser number of particles that exist in high-dimensional states. The spatial modes of light are one such candidate that provides access to high-dimensional quantum states, and thus they increase the storage and processing potential of quantum information systems. We demonstrate the controlled engineering of two-photon high-dimensional states entangled in their orbital angular momentum through Hong-Ou-Mandel interference. We prepare a large range of high-dimensional entangled states and implement precise quantum state filtering. We characterize the full quantum state before and after the filter, and are thus able to determine that only the antisymmetric component of the initial state remains. This work paves the way for high-dimensional processing and communication of multiphoton quantum states, for example, in teleportation beyond qubits.
Detection of Subtle Context-Dependent Model Inaccuracies in High-Dimensional Robot Domains.
Mendoza, Juan Pablo; Simmons, Reid; Veloso, Manuela
2016-12-01
Autonomous robots often rely on models of their sensing and actions for intelligent decision making. However, when operating in unconstrained environments, the complexity of the world makes it infeasible to create models that are accurate in every situation. This article addresses the problem of using potentially large and high-dimensional sets of robot execution data to detect situations in which a robot model is inaccurate, that is, detecting context-dependent model inaccuracies in a high-dimensional context space. To find inaccuracies tractably, the robot conducts an informed search through low-dimensional projections of execution data to find parametric Regions of Inaccurate Modeling (RIMs). Empirical evidence from two robot domains shows that this approach significantly enhances the detection power of existing RIM-detection algorithms in high-dimensional spaces.
NASA Astrophysics Data System (ADS)
Li, Weixuan; Lin, Guang; Li, Bing
2016-09-01
Many uncertainty quantification (UQ) approaches suffer from the curse of dimensionality, that is, their computational costs become intractable for problems involving a large number of uncertainty parameters. In these situations, the classic Monte Carlo often remains the preferred method of choice because its convergence rate $O(n^{-1/2})$, where n is the required number of model simulations, does not depend on the dimension of the problem. However, many high-dimensional UQ problems are intrinsically low-dimensional, because the variation of the quantity of interest (QoI) is often caused by only a few latent parameters varying within a low-dimensional subspace, known as the sufficient dimension reduction (SDR) subspace in the statistics literature. Motivated by this observation, we propose two inverse regression-based UQ algorithms (IRUQ) for high-dimensional problems. Both algorithms use inverse regression to convert the original high-dimensional problem to a low-dimensional one, which is then efficiently solved by building a response surface for the reduced model, for example via the polynomial chaos expansion. The first algorithm, which is for the situations where an exact SDR subspace exists, is proved to converge at rate $O(n^{-1})$, hence much faster than MC. The second algorithm, which doesn't require an exact SDR, employs the reduced model as a control variate to reduce the error of the MC estimate. The accuracy gain could still be significant, depending on how well the reduced model approximates the original high-dimensional one. IRUQ also provides several additional practical advantages: it is non-intrusive; it does not require computing the high-dimensional gradient of the QoI; and it reports an error bar so the user knows how reliable the result is.
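The control-variate idea used by the second algorithm can be sketched on a toy problem. The example below is not IRUQ: it assumes a one-dimensional integrand exp(x) on the unit interval and uses the linear function g(x) = x, whose mean 0.5 is known exactly, as a stand-in for the reduced model. The function names are hypothetical.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def control_variate_estimate(f, g, g_mean, xs):
    """Estimate E[f(X)] using g, whose exact mean g_mean is known, as a control variate.

    Returns (plain MC estimate, control-variate estimate,
             per-sample variance of f, per-sample variance after adjustment).
    """
    fs = [f(x) for x in xs]
    gs = [g(x) for x in xs]
    f_bar, g_bar = mean(fs), mean(gs)
    cov = sum((a - f_bar) * (b - g_bar) for a, b in zip(fs, gs)) / (len(xs) - 1)
    beta = cov / sample_variance(gs)  # coefficient estimated from the same sample
    adjusted = [a - beta * (b - g_mean) for a, b in zip(fs, gs)]
    return f_bar, mean(adjusted), sample_variance(fs), sample_variance(adjusted)


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.random() for _ in range(2000)]
    plain, cv, var_plain, var_cv = control_variate_estimate(
        math.exp, lambda x: x, 0.5, xs
    )
    print("exact:", math.e - 1)
    print("plain MC:", plain, "per-sample variance:", var_plain)
    print("control variate:", cv, "per-sample variance:", var_cv)
```

Both estimators still converge at the Monte Carlo rate; the control variate only shrinks the constant in front of it, and the gain depends on how strongly the surrogate correlates with the quantity being estimated.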
Hyper-spectral image segmentation using spectral clustering with covariance descriptors
NASA Astrophysics Data System (ADS)
Kursun, Olcay; Karabiber, Fethullah; Koc, Cemalettin; Bal, Abdullah
2009-02-01
Image segmentation is an important and difficult computer vision problem. Hyper-spectral images pose even more difficulty due to their high dimensionality. Spectral clustering (SC) is a recently popular clustering/segmentation algorithm. In general, SC lifts the data to a high-dimensional space (also known as the kernel trick), then derives eigenvectors in this new space, and finally uses these new dimensions to partition the data into clusters. We demonstrate that SC works efficiently when combined with covariance descriptors that can be used to assess pixelwise similarities rather than in the high-dimensional Euclidean space. We present the formulations and some preliminary results of the proposed hybrid image segmentation method for hyper-spectral images.
Li, Weixuan; Lin, Guang; Li, Bing
2016-09-01
A well-known challenge in uncertainty quantification (UQ) is the "curse of dimensionality". However, many high-dimensional UQ problems are essentially low-dimensional, because the randomness of the quantity of interest (QoI) is caused only by uncertain parameters varying within a low-dimensional subspace, known as the sufficient dimension reduction (SDR) subspace. Motivated by this observation, we propose and demonstrate in this paper an inverse regression-based UQ approach (IRUQ) for high-dimensional problems. Specifically, we use an inverse regression procedure to estimate the SDR subspace and then convert the original problem to a low-dimensional one, which can be efficiently solved by building a response surface model such as a polynomial chaos expansion. The novelty and advantages of the proposed approach are seen in its computational efficiency and practicality. Compared with Monte Carlo, the traditionally preferred approach for high-dimensional UQ, IRUQ with a comparable cost generally gives much more accurate solutions even for high-dimensional problems, and even when the dimension reduction is not exactly sufficient. Theoretically, IRUQ is proved to converge twice as fast as the approach it uses to seek the SDR subspace. For example, while a sliced inverse regression method converges to the SDR subspace at the rate of $O(n^{-1/2})$, the corresponding IRUQ converges at $O(n^{-1})$. IRUQ also provides several desired conveniences in practice. It is non-intrusive, requiring only a simulator to generate realizations of the QoI, and there is no need to compute the high-dimensional gradient of the QoI. Finally, error bars can be derived for the estimation results reported by IRUQ.
Lee, Jenny Hyunjung; McDonnell, Kevin T; Zelenyuk, Alla; Imre, Dan; Mueller, Klaus
2014-03-01
Although the Euclidean distance does well in measuring data distances within high-dimensional clusters, it does poorly when it comes to gauging intercluster distances. This significantly impacts the quality of global, low-dimensional space embedding procedures such as the popular multidimensional scaling (MDS) where one can often observe nonintuitive layouts. We were inspired by the perceptual processes evoked in the method of parallel coordinates which enables users to visually aggregate the data by the patterns the polylines exhibit across the dimension axes. We call the path of such a polyline its structure and suggest a metric that captures this structure directly in high-dimensional space. This allows us to better gauge the distances of spatially distant data constellations and so achieve data aggregations in MDS plots that are more cognizant of existing high-dimensional structure similarities. Our biscale framework distinguishes far-distances from near-distances. The coarser scale uses the structural similarity metric to separate data aggregates obtained by prior classification or clustering, while the finer scale employs the appropriate Euclidean distance.
Statistical mechanics of complex neural systems and high dimensional data
NASA Astrophysics Data System (ADS)
Advani, Madhu; Lahiri, Subhaneil; Ganguli, Surya
2013-03-01
Recent experimental advances in neuroscience have opened new vistas into the immense complexity of neuronal networks. This proliferation of data challenges us on two parallel fronts. First, how can we form adequate theoretical frameworks for understanding how dynamical network processes cooperate across widely disparate spatiotemporal scales to solve important computational problems? Second, how can we extract meaningful models of neuronal systems from high dimensional datasets? To aid in these challenges, we give a pedagogical review of a collection of ideas and theoretical methods arising at the intersection of statistical physics, computer science and neurobiology. We introduce the interrelated replica and cavity methods, which originated in statistical physics as powerful ways to quantitatively analyze large highly heterogeneous systems of many interacting degrees of freedom. We also introduce the closely related notion of message passing in graphical models, which originated in computer science as a distributed algorithm capable of solving large inference and optimization problems involving many coupled variables. We then show how both the statistical physics and computer science perspectives can be applied in a wide diversity of contexts to problems arising in theoretical neuroscience and data analysis. Along the way we discuss spin glasses, learning theory, illusions of structure in noise, random matrices, dimensionality reduction and compressed sensing, all within the unified formalism of the replica method. Moreover, we review recent conceptual connections between message passing in graphical models, and neural computation and learning. Overall, these ideas illustrate how statistical physics and computer science might provide a lens through which we can uncover emergent computational functions buried deep within the dynamical complexities of neuronal networks.
Inference for High-dimensional Differential Correlation Matrices
Cai, T. Tony; Zhang, Anru
2015-01-01
Motivated by differential co-expression analysis in genomics, we consider in this paper estimation and testing of high-dimensional differential correlation matrices. An adaptive thresholding procedure is introduced and theoretical guarantees are given. Minimax rate of convergence is established and the proposed estimator is shown to be adaptively rate-optimal over collections of paired correlation matrices with approximately sparse differences. Simulation results show that the procedure significantly outperforms two other natural methods that are based on separate estimation of the individual correlation matrices. The procedure is also illustrated through an analysis of a breast cancer dataset, which provides evidence at the gene co-expression level that several genes, of which a subset has been previously verified, are associated with the breast cancer. Hypothesis testing on the differential correlation matrices is also considered. A test, which is particularly well suited for testing against sparse alternatives, is introduced. In addition, other related problems, including estimation of a single sparse correlation matrix, estimation of the differential covariance matrices, and estimation of the differential cross-correlation matrices, are also discussed. PMID:26500380
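The general idea of thresholding a differential correlation matrix can be sketched as follows. This is a deliberately simplified stand-in: it applies one global cutoff of the assumed form tau * sqrt(log(p) * (1/n1 + 1/n2)) to every entry, whereas the procedure in the abstract is adaptive. The function names and the default `tau` are hypothetical, and constant columns are not handled.

```python
import math


def correlation_matrix(rows):
    """Sample correlation matrix; rows are observations, columns are variables."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    sd = [math.sqrt(cov[j][j]) for j in range(p)]
    return [[cov[a][b] / (sd[a] * sd[b]) for b in range(p)] for a in range(p)]


def thresholded_difference(rows1, rows2, tau=2.0):
    """Difference of two correlation matrices with small entries set to zero.

    Uses a single global cutoff (an assumed, non-adaptive simplification),
    not an entry-wise adaptive threshold.
    """
    r1, r2 = correlation_matrix(rows1), correlation_matrix(rows2)
    p = len(r1)
    cutoff = tau * math.sqrt(math.log(p) * (1.0 / len(rows1) + 1.0 / len(rows2)))
    return [[(r1[a][b] - r2[a][b]) if abs(r1[a][b] - r2[a][b]) > cutoff else 0.0
             for b in range(p)] for a in range(p)]


if __name__ == "__main__":
    group1 = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]  # positively correlated
    group2 = [[1.0, 4.0], [2.0, 3.0], [3.0, 2.0], [4.0, 1.0]]  # negatively correlated
    print(thresholded_difference(group1, group2))
```

Estimating the difference directly and thresholding it exploits sparsity of the difference itself, which is the contrast the abstract draws with estimating each correlation matrix separately.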
Inference for High-dimensional Differential Correlation Matrices.
Cai, T Tony; Zhang, Anru
2016-01-01
Motivated by differential co-expression analysis in genomics, we consider in this paper estimation and testing of high-dimensional differential correlation matrices. An adaptive thresholding procedure is introduced and theoretical guarantees are given. Minimax rate of convergence is established and the proposed estimator is shown to be adaptively rate-optimal over collections of paired correlation matrices with approximately sparse differences. Simulation results show that the procedure significantly outperforms two other natural methods that are based on separate estimation of the individual correlation matrices. The procedure is also illustrated through an analysis of a breast cancer dataset, which provides evidence at the gene co-expression level that several genes, of which a subset has been previously verified, are associated with the breast cancer. Hypothesis testing on the differential correlation matrices is also considered. A test, which is particularly well suited for testing against sparse alternatives, is introduced. In addition, other related problems, including estimation of a single sparse correlation matrix, estimation of the differential covariance matrices, and estimation of the differential cross-correlation matrices, are also discussed.
Fast Covariance Estimation for High-dimensional Functional Data.
Xiao, Luo; Zipunnikov, Vadim; Ruppert, David; Crainiceanu, Ciprian
2016-01-01
We propose two fast covariance smoothing methods and associated software that scale up linearly with the number of observations per function. Most available methods and software cannot smooth covariance matrices of dimension J > 500; a recently introduced sandwich smoother is an exception but is not adapted to smooth covariance matrices of large dimensions, such as J = 10,000. We introduce two new methods that circumvent those problems: 1) a fast implementation of the sandwich smoother for covariance smoothing; and 2) a two-step procedure that first obtains the singular value decomposition of the data matrix and then smoothes the eigenvectors. These new approaches are at least an order of magnitude faster in high dimensions and drastically reduce computer memory requirements. The new approaches provide instantaneous (a few seconds) smoothing for matrices of dimension J = 10,000 and very fast (< 10 minutes) smoothing for J = 100,000. R functions, simulations, and data analysis provide ready to use, reproducible, and scalable tools for practical data analysis of noisy high-dimensional functional data.
Hyperspherical Sparse Approximation Techniques for High-Dimensional Discontinuity Detection
Zhang, Guannan; Webster, Clayton G.; Gunzburger, Max; Burkardt, John
2016-08-04
This work proposes a hyperspherical sparse approximation framework for detecting jump discontinuities in functions in high-dimensional spaces. The need for a novel approach results from the theoretical and computational inefficiencies of well-known approaches, such as adaptive sparse grids, for discontinuity detection. Our approach constructs the hyperspherical coordinate representation of the discontinuity surface of a function. Then sparse approximations of the transformed function are built in the hyperspherical coordinate system, with values at each point estimated by solving a one-dimensional discontinuity detection problem. Due to the smoothness of the hypersurface, the new technique can identify jump discontinuities with significantly reduced computational cost, compared to existing methods. Several approaches are used to approximate the transformed discontinuity surface in the hyperspherical system, including adaptive sparse grid and radial basis function interpolation, discrete least squares projection, and compressed sensing approximation. Moreover, hierarchical acceleration techniques are also incorporated to further reduce the overall complexity. In conclusion, rigorous complexity analyses of the new methods are provided, as are several numerical examples that illustrate the effectiveness of our approach.
HYPOTHESIS TESTING FOR HIGH-DIMENSIONAL SPARSE BINARY REGRESSION
Mukherjee, Rajarshi; Pillai, Natesh S.; Lin, Xihong
2015-01-01
In this paper, we study the detection boundary for minimax hypothesis testing in the context of high-dimensional, sparse binary regression models. Motivated by genetic sequencing association studies for rare variant effects, we investigate the complexity of the hypothesis testing problem when the design matrix is sparse. We observe a new phenomenon in the behavior of detection boundary which does not occur in the case of Gaussian linear regression. We derive the detection boundary as a function of two components: a design matrix sparsity index and signal strength, each of which is a function of the sparsity of the alternative. For any alternative, if the design matrix sparsity index is too high, any test is asymptotically powerless irrespective of the magnitude of signal strength. For binary design matrices with the sparsity index that is not too high, our results are parallel to those in the Gaussian case. In this context, we derive detection boundaries for both dense and sparse regimes. For the dense regime, we show that the generalized likelihood ratio is rate optimal; for the sparse regime, we propose an extended Higher Criticism Test and show it is rate optimal and sharp. We illustrate the finite sample properties of the theoretical results using simulation studies. PMID:26246645
Hyperspherical Sparse Approximation Techniques for High-Dimensional Discontinuity Detection
Zhang, Guannan; Webster, Clayton G.; Gunzburger, Max; ...
2016-08-04
This work proposes a hyperspherical sparse approximation framework for detecting jump discontinuities in functions in high-dimensional spaces. The need for a novel approach results from the theoretical and computational inefficiencies of well-known approaches, such as adaptive sparse grids, for discontinuity detection. Our approach constructs the hyperspherical coordinate representation of the discontinuity surface of a function. Then sparse approximations of the transformed function are built in the hyperspherical coordinate system, with values at each point estimated by solving a one-dimensional discontinuity detection problem. Due to the smoothness of the hypersurface, the new technique can identify jump discontinuities with significantly reduced computational cost, compared to existing methods. Several approaches are used to approximate the transformed discontinuity surface in the hyperspherical system, including adaptive sparse grid and radial basis function interpolation, discrete least squares projection, and compressed sensing approximation. Moreover, hierarchical acceleration techniques are also incorporated to further reduce the overall complexity. In conclusion, rigorous complexity analyses of the new methods are provided, as are several numerical examples that illustrate the effectiveness of our approach.
Visual exploration of high-dimensional data through subspace analysis and dynamic projections
Liu, S.; Wang, B.; Thiagarajan, J. J.; Bremer, P. -T.; Pascucci, V.
2015-06-01
Here, we introduce a novel interactive framework for visualizing and exploring high-dimensional datasets based on subspace analysis and dynamic projections. We assume the high-dimensional dataset can be represented by a mixture of low-dimensional linear subspaces with mixed dimensions, and provide a method to reliably estimate the intrinsic dimension and linear basis of each subspace extracted from the subspace clustering. Subsequently, we use these bases to define unique 2D linear projections as viewpoints from which to visualize the data. To understand the relationships among the different projections and to discover hidden patterns, we connect these projections through dynamic projections that create smooth animated transitions between pairs of projections. We introduce the view transition graph, which provides flexible navigation among these projections to facilitate an intuitive exploration. Finally, we provide detailed comparisons with related systems, and use real-world examples to demonstrate the novelty and usability of our proposed framework.
Visual exploration of high-dimensional data through subspace analysis and dynamic projections
Liu, S.; Wang, B.; Thiagarajan, J. J.; ...
2015-06-01
Here, we introduce a novel interactive framework for visualizing and exploring high-dimensional datasets based on subspace analysis and dynamic projections. We assume the high-dimensional dataset can be represented by a mixture of low-dimensional linear subspaces with mixed dimensions, and provide a method to reliably estimate the intrinsic dimension and linear basis of each subspace extracted from the subspace clustering. Subsequently, we use these bases to define unique 2D linear projections as viewpoints from which to visualize the data. To understand the relationships among the different projections and to discover hidden patterns, we connect these projections through dynamic projections that create smooth animated transitions between pairs of projections. We introduce the view transition graph, which provides flexible navigation among these projections to facilitate an intuitive exploration. Finally, we provide detailed comparisons with related systems, and use real-world examples to demonstrate the novelty and usability of our proposed framework.
Visual Exploration of High-Dimensional Data through Subspace Analysis and Dynamic Projections
Liu, S.; Wang, B.; Thiagarajan, Jayaraman J.; Bremer, Peer -Timo; Pascucci, Valerio
2015-06-01
We introduce a novel interactive framework for visualizing and exploring high-dimensional datasets based on subspace analysis and dynamic projections. We assume the high-dimensional dataset can be represented by a mixture of low-dimensional linear subspaces with mixed dimensions, and provide a method to reliably estimate the intrinsic dimension and linear basis of each subspace extracted from the subspace clustering. Subsequently, we use these bases to define unique 2D linear projections as viewpoints from which to visualize the data. To understand the relationships among the different projections and to discover hidden patterns, we connect these projections through dynamic projections that create smooth animated transitions between pairs of projections. We introduce the view transition graph, which provides flexible navigation among these projections to facilitate an intuitive exploration. Finally, we provide detailed comparisons with related systems, and use real-world examples to demonstrate the novelty and usability of our proposed framework.
Blöchliger, Nicolas; Caflisch, Amedeo; Vitalis, Andreas
2015-11-10
Data mining techniques depend strongly on how the data are represented and how distance between samples is measured. High-dimensional data often contain a large number of irrelevant dimensions (features) for a given query. These features act as noise and obfuscate relevant information. Unsupervised approaches to mine such data require distance measures that can account for feature relevance. Molecular dynamics simulations produce high-dimensional data sets describing molecules observed in time. Here, we propose to globally or locally weight simulation features based on effective rates. This emphasizes, in a data-driven manner, slow degrees of freedom that often report on the metastable states sampled by the molecular system. We couple this idea to several unsupervised learning protocols. Our approach unmasks slow side chain dynamics within the native state of a miniprotein and reveals additional metastable conformations of a protein. The approach can be combined with most algorithms for clustering or dimensionality reduction.
Classification of sparse high-dimensional vectors.
Ingster, Yuri I; Pouet, Christophe; Tsybakov, Alexandre B
2009-11-13
We study the problem of classification of d-dimensional vectors into two classes (one of which is 'pure noise') based on a training sample of size m. The main specific feature is that the dimension d can be very large. We suppose that the difference between the distribution of the population and that of the noise is only in a shift, which is a sparse vector. For Gaussian noise, fixed sample size m, and dimension d that tends to infinity, we obtain the sharp classification boundary, i.e. the necessary and sufficient conditions for the possibility of successful classification. We propose classifiers attaining this boundary. We also give extensions of the result to the case where the sample size m depends on d and satisfies the condition (log m)/log d --> gamma, 0
Duke Workshop on High-Dimensional Data Sensing and Analysis
2015-05-06
...analysis and acquisition of high-dimensional data, including compressive sensing (CS). The meeting focused on new theory, algorithms and application. In addition to having talks from many of the leading researchers from academia, there were talks from the members of...
Towards robust particle filters for high-dimensional systems
NASA Astrophysics Data System (ADS)
van Leeuwen, Peter Jan
2015-04-01
In recent years particle filters have matured and several variants are now available that are not degenerate for high-dimensional systems. Often they are based on ad-hoc combinations with Ensemble Kalman Filters. Unfortunately it is unclear what approximations are made when these hybrids are used. The proper way to derive particle filters for high-dimensional systems is to explore the freedom in the proposal density. It is well known that using an Ensemble Kalman Filter as proposal density (the so-called Weighted Ensemble Kalman Filter) does not work for high-dimensional systems. However, much better results are obtained when weak-constraint 4Dvar is used as proposal, leading to the implicit particle filter. Still, this filter is degenerate when the number of independent observations is large. The Equivalent-Weights Particle Filter is a filter that works well in systems of arbitrary dimensions, but it contains a few tuning parameters that have to be chosen well to avoid biases. In this paper we discuss ways to derive more robust particle filters for high-dimensional systems. Using ideas from large-deviation theory and optimal transportation, particle filters will be generated that are robust and work well in these systems. It will be shown that all successful filters can be derived from one general framework. Also, the performance of the filters will be tested on simple but high-dimensional systems, and, if time permits, on a high-dimensional highly nonlinear barotropic vorticity equation model.
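The degeneracy referred to here can be made concrete with the effective sample size of the particle weights. The sketch below is not any of the filters named in the abstract: it assumes a toy setup in which particles are drawn from a standard normal prior, the observation is zero in every component, and observation errors are independent unit Gaussians. The function names are hypothetical.

```python
import math
import random


def normalized_weights(log_weights):
    """Convert log-weights to weights that sum to 1 (max subtracted for stability)."""
    top = max(log_weights)
    unnorm = [math.exp(lw - top) for lw in log_weights]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def effective_sample_size(weights):
    """Equals N for uniform weights and 1 when a single particle carries all weight."""
    return 1.0 / sum(w * w for w in weights)


def ess_for_dimension(n_particles, n_obs, rng):
    """Toy setup (assumed): prior N(0, I), observation y = 0, unit Gaussian errors."""
    log_w = []
    for _ in range(n_particles):
        state = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        log_w.append(-0.5 * sum(s * s for s in state))
    return effective_sample_size(normalized_weights(log_w))


if __name__ == "__main__":
    rng = random.Random(1)
    for n_obs in (1, 10, 100):
        print(n_obs, round(ess_for_dimension(500, n_obs, rng), 1))
```

With one observation most particles keep a useful weight, while with a hundred independent observations nearly all of the weight lands on a handful of particles, which is why the choice of proposal density matters in high dimensions.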
Evangelista, Francesco A
2011-06-14
We report a general implementation of alternative formulations of single-reference coupled cluster theory (extended, unitary, and variational) with arbitrary-order truncation of the cluster operator. These methods are applied to compute the energy of Ne and the equilibrium properties of HF and C2. Potential energy curves for the dissociation of HF and the BeH2 model computed with the extended, variational, and unitary coupled cluster approaches are compared to those obtained from the multireference coupled cluster approach of Mukherjee et al. [J. Chem. Phys. 110, 6171 (1999)] and the internally contracted multireference coupled cluster approach [F. A. Evangelista and J. Gauss, J. Chem. Phys. 134, 114102 (2011)]. In the case of Ne, HF, and C2, the alternative coupled cluster approaches yield almost identical bond length, harmonic vibrational frequency, and anharmonic constant, which are more accurate than those from traditional coupled cluster theory. For potential energy curves, the alternative coupled cluster methods are found to be more accurate than traditional coupled cluster theory, but are three to ten times less accurate than multireference coupled cluster approaches. The most challenging benchmark, the BeH2 model, highlights the strong dependence of the alternative coupled cluster theories on the choice of the Fermi vacuum. When evaluated by the accuracy to cost ratio, the alternative coupled cluster methods are not competitive with respect to traditional CC theory; in other words, the simplest theory is found to be the most effective one.
Convex Clustering: An Attractive Alternative to Hierarchical Clustering
Chen, Gary K.; Chi, Eric C.; Ranola, John Michael O.; Lange, Kenneth
2015-01-01
The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering operates on multiple scales simultaneously. This is essential, for instance, in transcriptome data, where one may be interested in making qualitative inferences about how lower-order relationships like gene modules lead to higher-order relationships like pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. The current paper derives and tests a novel proximal distance algorithm for minimizing the objective function of convex clustering. The algorithm separates parameters, accommodates missing data, and supports prior information on relationships. Our program CONVEXCLUSTER incorporating the algorithm is implemented on ATI and nVidia graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems. CONVEXCLUSTER can be freely downloaded from the UCLA Human Genetics web site at http://www.genetics.ucla.edu/software/ PMID:25965340
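The objective function that convex clustering minimizes can be written down compactly. The sketch below assumes the commonly used form, a squared-error fit term plus a weighted sum of Euclidean distances between centroids, and only evaluates it; it is not the proximal distance algorithm of the paper or the CONVEXCLUSTER program, and the function name and toy weights are hypothetical.

```python
import math
from itertools import combinations


def convex_clustering_objective(X, U, weights, gamma):
    """0.5 * sum_i ||x_i - u_i||^2 + gamma * sum_{i<j} w_ij * ||u_i - u_j||.

    X holds the data points, U one centroid per point, and weights maps each
    index pair (i, j) with i < j to a nonnegative weight.
    """
    fit = 0.5 * sum(
        sum((xk - uk) ** 2 for xk, uk in zip(x, u)) for x, u in zip(X, U)
    )
    penalty = sum(
        weights[(i, j)] * math.dist(U[i], U[j])
        for i, j in combinations(range(len(U)), 2)
    )
    return fit + gamma * penalty


if __name__ == "__main__":
    X = [[0.0, 0.0], [0.0, 2.0], [4.0, 0.0]]
    w = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}
    fused = [[1.0, 1.0]] * 3  # every point shares one centroid
    print(convex_clustering_objective(X, X, w, gamma=1.0))      # no fusion
    print(convex_clustering_objective(X, fused, w, gamma=1.0))  # full fusion
```

As gamma grows, the penalty makes it cheaper for centroids to coincide, so minimizers move from one centroid per point toward a single shared centroid; following the minimizer over gamma gives the solution path mentioned in the abstract.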
ClusterSculptor: Software for Expert-Steered Classification of Single Particle Mass Spectra
Zelenyuk, Alla; Imre, Dan G.; Nam, Eun Ju; Han, Yiping; Mueller, Klaus
2008-08-01
Taking full advantage of the vast amount of highly detailed data acquired by single particle mass spectrometers requires that the data be organized according to some rules that have the potential to be insightful. Most commonly, statistical tools are used to cluster the individual particle mass spectra on the basis of their similarity. Cluster analysis is a powerful strategy for the exploration of high-dimensional data in the absence of a priori hypotheses or data classification models, and the results of cluster analysis can then be used to form such models. More often than not, when examining the data clustering results we find that many clusters contain particles of different types and that many particles of one type end up in a number of separate clusters. Our experience with cluster analysis shows that we have a vast amount of non-compiled knowledge and intuition that should be brought to bear in this effort. We present new software, called ClusterSculptor, that provides a comprehensive and intuitive framework to aid scientists in data classification. ClusterSculptor uses k-means as the overall clustering engine, but allows its parameters to be tuned interactively, based on a non-distorted, compact visual presentation of the inherent characteristics of the data in high-dimensional space. ClusterSculptor provides all the tools necessary for a high-dimensional activity we call cluster sculpting. ClusterSculptor is designed to be coupled to SpectraMiner, our data mining and visualization software package. The data are first visualized with SpectraMiner and identified problems are exported to ClusterSculptor, where the user steers the reclassification and recombination of clusters of tens of thousands of particle mass spectra in real time. The resulting sculpted clusters can then be imported back into SpectraMiner. Here we demonstrate greatly improved single particle chemical speciation in an example of application of this new tool to a number of particle types of atmospheric
Lee, Hyun Jung; McDonnell, Kevin T.; Zelenyuk, Alla; Imre, D.; Mueller, Klaus
2014-03-01
Although the Euclidean distance does well in measuring data distances within high-dimensional clusters, it does poorly when it comes to gauging inter-cluster distances. This significantly impacts the quality of global, low-dimensional space embedding procedures such as the popular multi-dimensional scaling (MDS) where one can often observe non-intuitive layouts. We were inspired by the perceptual processes evoked in the method of parallel coordinates which enables users to visually aggregate the data by the patterns the polylines exhibit across the dimension axes. We call the path of such a polyline its structure and suggest a metric that captures this structure directly in high-dimensional space. This allows us to better gauge the distances of spatially distant data constellations and so achieve data aggregations in MDS plots that are more cognizant of existing high-dimensional structure similarities. Our MDS plots also exhibit similar visual relationships as the method of parallel coordinates which is often used alongside to visualize the high-dimensional data in raw form. We then cast our metric into a bi-scale framework which distinguishes far-distances from near-distances. The coarser scale uses the structural similarity metric to separate data aggregates obtained by prior classification or clustering, while the finer scale employs the appropriate Euclidean distance.
Choosing ℓp norms in high-dimensional spaces based on hub analysis
Flexer, Arthur; Schnitzer, Dominik
2015-01-01
The hubness phenomenon is a recently discovered aspect of the curse of dimensionality. Hub objects have a small distance to an exceptionally large number of data points while anti-hubs lie far from all other data points. A closely related problem is the concentration of distances in high-dimensional spaces. Previous work has already advocated the use of fractional ℓp norms instead of the ubiquitous Euclidean norm to avoid the negative effects of distance concentration. However, which exact fractional norm to use is a largely unsolved problem. The contribution of this work is an empirical analysis of the relation of different ℓp norms and hubness. We propose an unsupervised approach for choosing an ℓp norm which minimizes hubs while simultaneously maximizing nearest neighbor classification. Our approach is evaluated on seven high-dimensional data sets and compared to three approaches that re-scale distances to avoid hubness. PMID:26640321
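The relation between ℓp norms and hubness described above can be explored with a small experiment. The sketch below counts how often each point appears among the k nearest neighbors of other points (its k-occurrence) under a given ℓp norm, and uses the skewness of those counts as a hubness indicator. The skewness measure, the synthetic Gaussian data, and all function names are assumptions for illustration. It is not the selection procedure proposed by the authors.

```python
import random

def lp_distance(x, y, p):
    # For 0 < p < 1 this is a fractional "norm" (it is not a true metric).
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def k_occurrences(points, k, p):
    # counts[j] = number of points that have j among their k nearest neighbors.
    n = len(points)
    counts = [0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: lp_distance(points[i], points[j], p))
        for j in others[:k]:
            counts[j] += 1
    return counts

def skewness(values):
    # Third standardized moment; larger positive values suggest stronger hubs.
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0

rng = random.Random(0)
data = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(100)]
for p in (0.5, 1.0, 2.0):
    print(p, skewness(k_occurrences(data, k=5, p=p)))
```

An unsupervised choice of p could then, for example, pick the value with the smallest skewness, though the criterion actually used in the paper may differ.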
An Overview of Air Pollution Problem in Megacities and City Clusters in China
NASA Astrophysics Data System (ADS)
Tang, X.
2007-05-01
China has experienced rapid economic growth over the last twenty years. City clusters, which consist of one or several megacities in close vicinity and many satellite cities and towns, are playing a leading role in Chinese economic growth, owing to their collective economic capacity and interdependency. However, along with the economic boom, population growth and increased energy consumption, air quality has been degrading over the past two decades. Air pollution in those areas is characterized by the concurrent occurrence of high concentrations of multiple primary pollutants, leading to a complex secondary pollution problem. After decades-long efforts to control air pollution, both the government and the scientific community have realized that controlling regional-scale air pollution requires regional efforts. Field experiments covering regions such as the Pearl River Delta and Beijing City with its surrounding areas are critical to understanding the chemical and physical processes leading to the formation of regional-scale air pollution. In order to formulate policy suggestions for air quality attainment during the 2008 Beijing Olympic Games and to propose objectives for air quality attainment in Beijing in 2010, CAREBEIJING (Campaigns of Air Quality Research in Beijing and Surrounding Region) was organized by Peking University in 2006 to learn the current air pollution situation of the region and to identify the transport and transformation processes through which the surrounding area affects air quality in Beijing. With the same purpose of understanding the chemical and physical processes occurring at the regional scale, the fall and summer campaigns in 2004 and 2006 were carried out in the Pearl River Delta. More than 16 domestic and foreign institutions were involved in these campaigns. The background, current status, problems, and some results of these campaigns will be introduced in this presentation.
Resolving the timing problem of the globular clusters orbiting the Fornax dwarf galaxy
NASA Astrophysics Data System (ADS)
Angus, G. W.; Diaferio, A.
2009-06-01
We re-investigate the old problem of the survival of the five globular clusters (GCs) orbiting the Fornax dwarf galaxy in both standard and modified Newtonian dynamics (MOND). For the first time in the history of the topic, we use accurate mass models for the Fornax dwarf, obtained through Jeans modelling of the recently published line-of-sight (LOS) velocity dispersion data, and we are also not resigned to circular orbits for the GCs. Previously conceived problems stem from fixing the starting distances of the globulars to be less than half the tidal radius. We relax this constraint since there is absolutely no evidence for it and show that the dark matter (DM) paradigm, with either cusped or cored DM profiles, has no trouble sustaining the orbits of the two least massive GCs for a Hubble time almost regardless of their initial distance from Fornax. The three most massive globulars can remain in orbit as long as their starting distances are marginally outside the tidal radius. The outlook for MOND is also not nearly as bleak as previously reported. Although dynamical friction (DF) inside the tidal radius is far stronger in MOND, outside DF is negligible due to the absence of stars. This allows highly radial orbits to survive, but more importantly circular orbits at distances more than 85 per cent of Fornax's tidal radius to survive indefinitely. The probability of the GCs being on circular orbits at this distance compared with their current projected distances is discussed and shown to be plausible. Finally, if we ignore the presence of the most massive globular (giving it a large LOS distance), we demonstrate that the remaining four globulars can survive within the tidal radius for the Hubble time with perfectly sensible orbits.
Autonomous mental development in high dimensional context and action spaces.
Joshi, Ameet; Weng, Juyang
2003-01-01
Autonomous Mental Development (AMD) of robots opened a new paradigm for developing machine intelligence, using neural network type of techniques and it fundamentally changed the way an intelligent machine is developed from manual to autonomous. The work presented here is a part of SAIL (Self-Organizing Autonomous Incremental Learner) project which deals with autonomous development of humanoid robot with vision, audition, manipulation and locomotion. The major issue addressed here is the challenge of high dimensional action space (5-10) in addition to the high dimensional context space (hundreds to thousands and beyond), typically required by an AMD machine. This is the first work that studies a high dimensional (numeric) action space in conjunction with a high dimensional perception (context state) space, under the AMD mode. Two new learning algorithms, Direct Update on Direction Cosines (DUDC) and High-Dimensional Conjugate Gradient Search (HCGS), are developed, implemented and tested. The convergence properties of both the algorithms and their targeted applications are discussed. Autonomous learning of speech production under reinforcement learning is studied as an example.
Harnessing high-dimensional hyperentanglement through a biphoton frequency comb
NASA Astrophysics Data System (ADS)
Xie, Zhenda; Zhong, Tian; Shrestha, Sajan; Xu, Xinan; Liang, Junlin; Gong, Yan-Xiao; Bienfang, Joshua C.; Restelli, Alessandro; Shapiro, Jeffrey H.; Wong, Franco N. C.; Wei Wong, Chee
2015-08-01
Quantum entanglement is a fundamental resource for secure information processing and communications, and hyperentanglement or high-dimensional entanglement has been separately proposed for its high data capacity and error resilience. The continuous-variable nature of the energy-time entanglement makes it an ideal candidate for efficient high-dimensional coding with minimal limitations. Here, we demonstrate the first simultaneous high-dimensional hyperentanglement using a biphoton frequency comb to harness the full potential in both the energy and time domain. Long-postulated Hong-Ou-Mandel quantum revival is exhibited, with up to 19 time-bins and 96.5% visibilities. We further witness the high-dimensional energy-time entanglement through Franson revivals, observed periodically at integer time-bins, with 97.8% visibility. This qudit state is observed to simultaneously violate the generalized Bell inequality by up to 10.95 standard deviations while observing recurrent Clauser-Horne-Shimony-Holt S-parameters up to 2.76. Our biphoton frequency comb provides a platform for photon-efficient quantum communications towards the ultimate channel capacity through energy-time-polarization high-dimensional encoding.
Optimally splitting cases for training and testing high dimensional classifiers
2011-01-01
Background: We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set. How does this proportion impact the mean squared error (MSE) of the prediction accuracy estimate? Results: We develop a non-parametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study for the purpose of better understanding the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive component parts. Conclusions: By applying these approaches to a number of synthetic and real microarray datasets we show that for linear classifiers the optimal proportion depends on the overall number of samples available and the degree of differential expression between the classes. The optimal proportion was found to depend on the full dataset size (n) and classification accuracy, with higher accuracy and smaller n resulting in more cases assigned to the training set. The commonly used strategy of allocating 2/3 of cases for training was close to optimal for reasonably sized datasets (n ≥ 100) with strong signals (i.e. 85% or greater full dataset accuracy). In general, we recommend use of our nonparametric resampling approach for determining the optimal split. This approach can be applied to any dataset, using any predictor development method, to determine the best split. PMID:21477282
An Effective Parameter Screening Strategy for High Dimensional Watershed Models
NASA Astrophysics Data System (ADS)
Khare, Y. P.; Martinez, C. J.; Munoz-Carpena, R.
2014-12-01
Watershed simulation models can assess the impacts of natural and anthropogenic disturbances on natural systems. These models have become important tools for tackling a range of water resources problems through their implementation in the formulation and evaluation of Best Management Practices, Total Maximum Daily Loads, and Basin Management Action Plans. For accurate applications of watershed models they need to be thoroughly evaluated through global uncertainty and sensitivity analyses (UA/SA). However, due to the high dimensionality of these models such evaluation becomes extremely time- and resource-consuming. Parameter screening, the qualitative separation of important parameters, has been suggested as an essential step before applying rigorous evaluation techniques such as the Sobol' and Fourier Amplitude Sensitivity Test (FAST) methods in the UA/SA framework. The method of elementary effects (EE) (Morris, 1991) is one of the most widely used screening methodologies. Some of the common parameter sampling strategies for EE, e.g. Optimized Trajectories [OT] (Campolongo et al., 2007) and Modified Optimized Trajectories [MOT] (Ruano et al., 2012), suffer from inconsistencies in the generated parameter distributions, infeasible sample generation time, etc. In this work, we have formulated a new parameter sampling strategy - Sampling for Uniformity (SU) - for parameter screening which is based on the principles of the uniformity of the generated parameter distributions and the spread of the parameter sample. A rigorous multi-criteria evaluation (time, distribution, spread and screening efficiency) of OT, MOT, and SU indicated that SU is superior to other sampling strategies. Comparison of the EE-based parameter importance rankings with those of Sobol' helped to quantify the qualitativeness of the EE parameter screening approach, reinforcing the fact that one should use EE only to reduce the resource burden required by FAST/Sobol' analyses but not to replace it.
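To make the elementary effects idea concrete, the sketch below perturbs one parameter at a time from random base points in the unit hypercube and averages the absolute finite-difference effects per parameter. The plain random base-point sampling, the step size `delta`, the mean-absolute-effect summary, and the toy model are all assumptions for illustration. The code does not implement the OT, MOT, or SU sampling strategies compared in the abstract.

```python
import random

def elementary_effects(f, dim, n_base_points=10, delta=0.25, seed=0):
    """Return the mean absolute elementary effect of each of `dim` parameters.

    f: model taking a list of `dim` inputs scaled to [0, 1].
    An elementary effect of parameter i at x is (f(x + delta*e_i) - f(x)) / delta.
    """
    rng = random.Random(seed)
    sums = [0.0] * dim
    for _ in range(n_base_points):
        # Keep base points in [0, 1 - delta] so the perturbed point stays in the unit cube.
        x = [rng.uniform(0.0, 1.0 - delta) for _ in range(dim)]
        fx = f(x)
        for i in range(dim):
            x_step = list(x)
            x_step[i] += delta
            sums[i] += abs((f(x_step) - fx) / delta)
    return [s / n_base_points for s in sums]

def toy_model(x):
    # Hypothetical model: parameter 1 has no influence at all.
    return 2.0 * x[0] + 5.0 * x[2]

mu_star = elementary_effects(toy_model, dim=3)
ranking = sorted(range(3), key=lambda i: mu_star[i], reverse=True)
print(mu_star, ranking)
```

In a screening step, parameters with near-zero mean absolute effect (here parameter 1) would be fixed before running a more expensive Sobol' or FAST analysis on the remaining ones.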
Understanding 3D human torso shape via manifold clustering
NASA Astrophysics Data System (ADS)
Li, Sheng; Li, Peng; Fu, Yun
2013-05-01
Discovering the variations in human torso shape plays a key role in many design-oriented applications, such as suit design. With recent advances in 3D surface imaging technologies, people can obtain 3D human torso data that provide more information than traditional measurements. However, how to find different human shapes from 3D torso data is still an open problem. In this paper, we propose to use a spectral clustering approach on the torso manifold to address this problem. We first represent high-dimensional torso data in a low-dimensional space using a manifold learning algorithm. Then the spectral clustering method is performed to get several disjoint clusters. Experimental results show that the clusters discovered by our approach can describe the discrepancies in both genders and human shapes, and our approach achieves better performance than the compared clustering method.
Cluster headache - a symptom of different problems or a primary form? A case report.
Domitrz, Izabela; Gaweł, Małgorzata; Maj, Edyta
2013-01-01
Headache with severe, strictly one-sided unilateral attacks of pain in orbital, supraorbital, temporal localisation lasting 15-180 minutes occurring from once every two days to 8 times daily, typically with one or more autonomic symptoms, is recognized as cluster headache (CH). Headache with normal neurological examination and abnormal neuroimaging studies, mimicking cluster headache, is reported by several authors. We present an elderly woman with a cluster-like headache probably associated with other comorbidities. We differentiate between primary, but 'atypical' CH and symptomatic cluster headache due to frontal sinusitis, pontine venous angioma or vascular compression of the trigeminal nerve root. This headache is not so rare in the general population and its secondary causes must be ruled out before the diagnosis of a primary headache as cluster headache is made.
Hypergraph-based anomaly detection of high-dimensional co-occurrences.
Silva, Jorge; Willett, Rebecca
2009-03-01
This paper addresses the problem of detecting anomalous multivariate co-occurrences using a limited number of unlabeled training observations. A novel method based on using a hypergraph representation of the data is proposed to deal with this very high-dimensional problem. Hypergraphs constitute an important extension of graphs which allow edges to connect more than two vertices simultaneously. A variational Expectation-Maximization algorithm for detecting anomalies directly on the hypergraph domain without any feature selection or dimensionality reduction is presented. The resulting estimate can be used to calculate a measure of anomalousness based on the False Discovery Rate. The algorithm has O(np) computational complexity, where n is the number of training observations and p is the number of potential participants in each co-occurrence event. This efficiency makes the method ideally suited for very high-dimensional settings, and requires no tuning, bandwidth or regularization parameters. The proposed approach is validated on both high-dimensional synthetic data and the Enron email database, where p > 75,000, and it is shown that it can outperform other state-of-the-art methods.
ERIC Educational Resources Information Center
Dry, Matthew J.; Preiss, Kym; Wagemans, Johan
2012-01-01
We investigated human performance on the Euclidean Traveling Salesperson Problem (TSP) and Euclidean Minimum Spanning Tree Problem (MST-P) in regards to a factor that has previously received little attention within the literature: the spatial distributions of TSP and MST-P stimuli. First, we describe a method for quantifying the relative degree of…
Minimax Rate-optimal Estimation of High-dimensional Covariance Matrices with Incomplete Data.
Cai, T Tony; Zhang, Anru
2016-09-01
Missing data occur frequently in a wide range of applications. In this paper, we consider estimation of high-dimensional covariance matrices in the presence of missing observations under a general missing completely at random model in the sense that the missingness is not dependent on the values of the data. Based on incomplete data, estimators for bandable and sparse covariance matrices are proposed and their theoretical and numerical properties are investigated. Minimax rates of convergence are established under the spectral norm loss and the proposed estimators are shown to be rate-optimal under mild regularity conditions. Simulation studies demonstrate that the estimators perform well numerically. The methods are also illustrated through an application to data from four ovarian cancer studies. The key technical tools developed in this paper are of independent interest and potentially useful for a range of related problems in high-dimensional statistical inference with missing data.
Querying Patterns in High-Dimensional Heterogenous Datasets
ERIC Educational Resources Information Center
Singh, Vishwakarma
2012-01-01
The recent technological advancements have led to the availability of a plethora of heterogenous datasets, e.g., images tagged with geo-location and descriptive keywords. An object in these datasets is described by a set of high-dimensional feature vectors. For example, a keyword-tagged image is represented by a color-histogram and a…
A High-Dimensional Nonparametric Multivariate Test for Mean Vector
Wang, Lan; Peng, Bo; Li, Runze
2015-01-01
This work is concerned with testing the population mean vector of nonnormal high-dimensional multivariate data. Several tests for high-dimensional mean vector, based on modifying the classical Hotelling T2 test, have been proposed in the literature. Despite their usefulness, they tend to have unsatisfactory power performance for heavy-tailed multivariate data, which frequently arise in genomics and quantitative finance. This paper proposes a novel high-dimensional nonparametric test for the population mean vector for a general class of multivariate distributions. With the aid of new tools in modern probability theory, we proved that the limiting null distribution of the proposed test is normal under mild conditions when p is substantially larger than n. We further study the local power of the proposed test and compare its relative efficiency with a modified Hotelling T2 test for high-dimensional data. An interesting finding is that the newly proposed test can have even more substantial power gain with large p than the traditional nonparametric multivariate test does with finite fixed p. We study the finite sample performance of the proposed test via Monte Carlo simulations. We further illustrate its application by an empirical analysis of a genomics data set. PMID:26848205
High-Dimensional Statistical Learning: Roots, Justifications, and Potential Machineries
Zollanvari, Amin
2015-01-01
High-dimensional data generally refer to data in which the number of variables is larger than the sample size. Analyzing such datasets poses great challenges for classical statistical learning because the finite-sample performance of methods developed within classical statistical learning does not live up to classical asymptotic premises in which the sample size unboundedly grows for a fixed dimensionality of observations. Much work has been done in developing mathematical–statistical techniques for analyzing high-dimensional data. Despite remarkable progress in this field, many practitioners still utilize classical methods for analyzing such datasets. This state of affairs can be attributed, in part, to a lack of knowledge and, in part, to the ready-to-use computational and statistical software packages that are well developed for classical techniques. Moreover, many scientists working in a specific field of high-dimensional statistical learning are either not aware of other existing machineries in the field or are not willing to try them out. The primary goal in this work is to bring together various machineries of high-dimensional analysis, give an overview of the important results, and present the operating conditions upon which they are grounded. When appropriate, readers are referred to relevant review articles for more information on a specific subject. PMID:27081307
Semi-supervised clustering algorithm for haplotype assembly problem based on MEC model.
Xu, Xin-Shun; Li, Ying-Xin
2012-01-01
Haplotype assembly is to infer a pair of haplotypes from localized polymorphism data. In this paper, a semi-supervised clustering algorithm-SSK (semi-supervised K-means) is proposed for it, which, to our knowledge, is the first semi-supervised clustering method for it. In SSK, some positive information is firstly extracted. The information is then used to help k-means to cluster all SNP fragments into two sets from which two haplotypes can be reconstructed. The performance of SSK is tested on both real data and simulated data. The results show that it outperforms several state-of-the-art algorithms on minimum error correction (MEC) model.
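The minimum error correction (MEC) criterion named above can be illustrated with a short sketch. Each SNP fragment is assigned to whichever of the two candidate haplotypes it disagrees with least, and the MEC score is the total number of remaining disagreements. The string encoding with `-` for uncovered positions, the function names, and the toy fragments are assumptions for illustration. This only evaluates the objective and does not reproduce the SSK algorithm.

```python
def mismatches(fragment, haplotype):
    # Count positions where the fragment covers a SNP and disagrees with the haplotype.
    return sum(1 for f, h in zip(fragment, haplotype) if f != "-" and f != h)

def mec_score(fragments, h1, h2):
    # Each fragment is charged the corrections needed to match its closer haplotype.
    return sum(min(mismatches(f, h1), mismatches(f, h2)) for f in fragments)

def assign(fragments, h1, h2):
    # 0 means the fragment is grouped with h1, 1 means it is grouped with h2.
    return [0 if mismatches(f, h1) <= mismatches(f, h2) else 1 for f in fragments]

fragments = ["01-0", "10-1", "0110", "0111"]
h1, h2 = "0110", "1001"
score = mec_score(fragments, h1, h2)
groups = assign(fragments, h1, h2)
print(score, groups)
```

A clustering method for this problem searches for the two-way partition of fragments, and the haplotype pair derived from it, that makes this score as small as possible.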
Janes, C R; Ames, G M
1992-10-01
We examine the clustering of attendance, illness, and accidental injury problems in a large unionized manufacturing plant using both quantitative and qualitative methods. We find that the distribution of workers into problem groups is related to 1) conflicts over seniority, 2) physical stressors and their influence on perceived desirability of certain kinds of jobs, and 3) organizational conditions and environments congenial to the development of distinct occupational "subcultures." We suggest that the case study approach we apply in this paper is critical to the design of programs of preventive intervention and complements the more commonly applied multiple-site and individually focused, survey approaches.
ERIC Educational Resources Information Center
Brusco, Michael J.
2007-01-01
The study of human performance on discrete optimization problems has a considerable history that spans various disciplines. The two most widely studied problems are the Euclidean traveling salesperson problem and the quadratic assignment problem. The purpose of this paper is to outline a program of study for the measurement of human performance on…
Partially supervised speaker clustering.
Tang, Hao; Chu, Stephen Mingyu; Hasegawa-Johnson, Mark; Huang, Thomas S
2012-05-01
Content-based multimedia indexing, retrieval, and processing as well as multimedia databases demand the structuring of the media content (image, audio, video, text, etc.), one significant goal being to associate the identity of the content to the individual segments of the signals. In this paper, we specifically address the problem of speaker clustering, the task of assigning every speech utterance in an audio stream to its speaker. We offer a complete treatment to the idea of partially supervised speaker clustering, which refers to the use of our prior knowledge of speakers in general to assist the unsupervised speaker clustering process. By means of an independent training data set, we encode the prior knowledge at the various stages of the speaker clustering pipeline via 1) learning a speaker-discriminative acoustic feature transformation, 2) learning a universal speaker prior model, and 3) learning a discriminative speaker subspace, or equivalently, a speaker-discriminative distance metric. We study the directional scattering property of the Gaussian mixture model (GMM) mean supervector representation of utterances in the high-dimensional space, and advocate exploiting this property by using the cosine distance metric instead of the euclidean distance metric for speaker clustering in the GMM mean supervector space. We propose to perform discriminant analysis based on the cosine distance metric, which leads to a novel distance metric learning algorithm—linear spherical discriminant analysis (LSDA). We show that the proposed LSDA formulation can be systematically solved within the elegant graph embedding general dimensionality reduction framework. Our speaker clustering experiments on the GALE database clearly indicate that 1) our speaker clustering methods based on the GMM mean supervector representation and vector-based distance metrics outperform traditional speaker clustering methods based on the “bag of acoustic features” representation and statistical
Lin, Wei; Feng, Rui; Li, Hongzhe
2014-01-01
In genetical genomics studies, it is important to jointly analyze gene expression data and genetic variants in exploring their associations with complex traits, where the dimensionality of gene expressions and genetic variants can both be much larger than the sample size. Motivated by such modern applications, we consider the problem of variable selection and estimation in high-dimensional sparse instrumental variables models. To overcome the difficulty of high dimensionality and unknown optimal instruments, we propose a two-stage regularization framework for identifying and estimating important covariate effects while selecting and estimating optimal instruments. The methodology extends the classical two-stage least squares estimator to high dimensions by exploiting sparsity using sparsity-inducing penalty functions in both stages. The resulting procedure is efficiently implemented by coordinate descent optimization. For the representative L1 regularization and a class of concave regularization methods, we establish estimation, prediction, and model selection properties of the two-stage regularized estimators in the high-dimensional setting where the dimensionalities of covariates and instruments are both allowed to grow exponentially with the sample size. The practical performance of the proposed method is evaluated by simulation studies and its usefulness is illustrated by an analysis of mouse obesity data. Supplementary materials for this article are available online. PMID:26392642
Ma Xiang; Zabaras, Nicholas
2010-05-20
A computational methodology is developed to address the solution of high-dimensional stochastic problems. It utilizes the high-dimensional model representation (HDMR) technique in the stochastic space to represent the model output as a finite hierarchical correlated function expansion in terms of the stochastic inputs, starting from lower-order to higher-order component functions. HDMR is efficient at capturing the high-dimensional input-output relationship such that the behavior of many physical systems can be modeled to good accuracy by only the first few lower-order terms. An adaptive version of HDMR is also developed to automatically detect the important dimensions and construct higher-order terms using only the important dimensions. The newly developed adaptive sparse grid collocation (ASGC) method is incorporated into HDMR to solve the resulting sub-problems. By integrating HDMR and ASGC, it is computationally possible to construct a low-dimensional stochastic reduced-order model of the high-dimensional stochastic problem and easily perform various statistical analyses on the output. Several numerical examples involving elementary mathematical functions and fluid mechanics problems are considered to illustrate the proposed method. The cases examined show that the method provides accurate results for stochastic dimensionality as high as 500 even with large input variability. The efficiency of the proposed method is examined by comparing with Monte Carlo (MC) simulation.
Lee, Jenny Hyunjung; McDonnell, Kevin T; Zelenyuk, Alla; Imre, Dan; Mueller, Klaus
2013-07-11
Although the Euclidean distance does well in measuring data distances within high-dimensional clusters, it does poorly when it comes to gauging inter-cluster distances. This significantly impacts the quality of global, low-dimensional space embedding procedures such as the popular multi-dimensional scaling (MDS) where one can often observe non-intuitive layouts. We were inspired by the perceptual processes evoked in the method of parallel coordinates which enables users to visually aggregate the data by the patterns the polylines exhibit across the dimension axes. We call the path of such a polyline its structure and suggest a metric that captures this structure directly in high-dimensional space. This allows us to better gauge the distances of spatially distant data constellations and so achieve data aggregations in MDS plots that are more cognizant of existing high-dimensional structure similarities. Our bi-scale framework distinguishes far-distances from near-distances. The coarser scale uses the structural similarity metric to separate data aggregates obtained by prior classification or clustering, while the finer scale employs the appropriate Euclidean distance.
High-dimensional statistical inference: From vector to matrix
NASA Astrophysics Data System (ADS)
Zhang, Anru
Statistical inference for sparse signals or low-rank matrices in high-dimensional settings is of significant interest in a range of contemporary applications. It has attracted significant recent attention in many fields including statistics, applied mathematics and electrical engineering. In this thesis, we consider several problems including sparse signal recovery (compressed sensing under restricted isometry) and low-rank matrix recovery (matrix recovery via rank-one projections and structured matrix completion). The first part of the thesis discusses compressed sensing and affine rank minimization in both noiseless and noisy cases and establishes sharp restricted isometry conditions for sparse signal and low-rank matrix recovery. The analysis relies on a key technical tool which represents points in a polytope by convex combinations of sparse vectors. The technique is elementary yet leads to sharp results. It is shown that, in compressed sensing, $\delta_k^A < 1/3$, $\delta_k^A + \theta_{k,k}^A < 1$, or $\delta_{tk}^A < \sqrt{(t-1)/t}$ for any given constant $t \ge 4/3$ guarantee the exact recovery of all $k$-sparse signals in the noiseless case through the constrained $\ell_1$ minimization, and similarly in affine rank minimization $\delta_r^M < 1/3$, $\delta_r^M + \theta_{r,r}^M < 1$, or $\delta_{tr}^M < \sqrt{(t-1)/t}$ ensure the exact reconstruction of all matrices with rank at most $r$ in the noiseless case via the constrained nuclear norm minimization. Moreover, for any $\epsilon > 0$, $\delta_k^A < 1/3 + \epsilon$, $\delta_k^A + \theta_{k,k}^A < 1 + \epsilon$, or $\delta_{tk}^A < \sqrt{(t-1)/t} + \epsilon$ are not sufficient to guarantee the exact recovery of all $k$-sparse signals for large $k$. Similar results also hold for matrix recovery. In addition, the conditions $\delta_k^A < 1/3$, $\delta_k^A + \theta_{k,k}^A < 1$, $\delta_{tk}^A < \sqrt{(t-1)/t}$ and $\delta_r^M < 1/3$, $\delta_r^M + \theta_{r,r}^M < 1$, $\delta_{tr}^M < \sqrt{(t-1)/t}$ are also shown to be sufficient, respectively, for stable recovery of approximately sparse signals and low-rank matrices in the noisy case.
Reinforcement learning on slow features of high-dimensional input streams.
Legenstein, Robert; Wilbert, Niko; Wiskott, Laurenz
2010-08-19
Humans and animals are able to learn complex behaviors based on a massive stream of sensory information from different modalities. Early animal studies have identified learning mechanisms that are based on reward and punishment such that animals tend to avoid actions that lead to punishment whereas rewarded actions are reinforced. However, most algorithms for reward-based learning are only applicable if the dimensionality of the state-space is sufficiently small or its structure is sufficiently simple. Therefore, the question arises how the problem of learning on high-dimensional data is solved in the brain. In this article, we propose a biologically plausible generic two-stage learning system that can directly be applied to raw high-dimensional input streams. The system is composed of a hierarchical slow feature analysis (SFA) network for preprocessing and a simple neural network on top that is trained based on rewards. We demonstrate by computer simulations that this generic architecture is able to learn quite demanding reinforcement learning tasks on high-dimensional visual input streams in a time that is comparable to the time needed when an explicit highly informative low-dimensional state-space representation is given instead of the high-dimensional visual input. The learning speed of the proposed architecture in a task similar to the Morris water maze task is comparable to that found in experimental studies with rats. This study thus supports the hypothesis that slowness learning is one important unsupervised learning principle utilized in the brain to form efficient state representations for behavioral learning.
REVIEWS OF TOPICAL PROBLEMS: Hadron clusters and half-dressed particles in quantum field theory
NASA Astrophysics Data System (ADS)
Feĭnberg, E. L.
1980-10-01
Accelerator experiments show that multiple production of hadrons in high-energy collisions of particles involves the formation of unstable intermediate entities, which subsequently decay into the final hadrons. These entities are apparently not only the comparatively light resonances with which we are already familiar but also heavy nonresonant clusters (with a mass above 2-5 GeV). The cluster concept was introduced previously in cosmic-ray physics, under the name "fireballs". To determine what these clusters are from the standpoint of quantum field theory, a detailed and thorough analysis is made of some analogous processes in quantum electrodynamics which are amenable to calculation. The QED analogs of the nonresonant clusters are "half-dressed" electrons and heavy photons. The half-dressed electrons decay into photons and electrons and are completely observable entities, whose interaction properties distinguish them from dressed electrons. In other words, the nonresonant particles are generally off-shell particles (the excursion from the mass shell is in the timelike direction). The assumption that hadron clusters are only resonances would be equivalent to a very specialized assumption regarding the nature of the spectral function of the hadron propagator; it would be different from that in electrodynamics, where the spectral function can be calculated. Nonresonant hadron clusters thus fit naturally into hadron field theory and are nonequilibrium hadrons far from the mass shell in the timelike direction. (In certain cases, their structural distortion is of the same nature as that of a half-dressed electron, so that this term can be conventionally applied to them as well.)
Dimensionality reduction for registration of high-dimensional data sets.
Xu, Min; Chen, Hao; Varshney, Pramod K
2013-08-01
Registration of two high-dimensional data sets often involves dimensionality reduction to yield a single-band image from each data set followed by pairwise image registration. We develop a new application-specific algorithm for dimensionality reduction of high-dimensional data sets such that the weighted harmonic mean of Cramér-Rao lower bounds for the estimation of the transformation parameters for registration is minimized. The performance of the proposed dimensionality reduction algorithm is evaluated using three remote sensing data sets. The experimental results using a mutual information-based pairwise registration technique demonstrate that our proposed dimensionality reduction algorithm combines the original data sets to obtain an image pair with more texture, resulting in improved image registration.
Structural analysis of high-dimensional basins of attraction.
Martiniani, Stefano; Schrenk, K Julian; Stevenson, Jacob D; Wales, David J; Frenkel, Daan
2016-09-01
We propose an efficient Monte Carlo method for the computation of the volumes of high-dimensional bodies with arbitrary shape. We start with a region of known volume within the interior of the manifold and then use the multistate Bennett acceptance-ratio method to compute the dimensionless free-energy difference between a series of equilibrium simulations performed within this object. The method produces results that are in excellent agreement with thermodynamic integration, as well as a direct estimate of the associated statistical uncertainties. The histogram method also allows us to directly obtain an estimate of the interior radial probability density profile, thus yielding useful insight into the structural properties of such a high-dimensional body. We illustrate the method by analyzing the effect of structural disorder on the basins of attraction of mechanically stable packings of soft repulsive spheres.
Quantum Teleportation of High-dimensional Atomic Momenta State
NASA Astrophysics Data System (ADS)
Qurban, Misbah; Abbas, Tasawar; Rameez-ul-Islam; Ikram, Manzoor
2016-06-01
Atomic momenta states of the neutral atoms are known to be decoherence resistant and therefore present a viable solution for most of the quantum information tasks including the quantum teleportation. We present a systematic protocol for the teleportation of high-dimensional quantized momenta atomic states to the field state inside the cavities by applying standard cavity QED techniques. The proposal can be executed under prevailing experimental scenario.
High-dimensional quantum cloning and applications to quantum hacking
Bouchard, Frédéric; Fickler, Robert; Boyd, Robert W.; Karimi, Ebrahim
2017-01-01
Attempts at cloning a quantum system result in the introduction of imperfections in the state of the copies. This is a consequence of the no-cloning theorem, which is a fundamental law of quantum physics and the backbone of security for quantum communications. Although perfect copies are prohibited, a quantum state may be copied with maximal accuracy via various optimal cloning schemes. Optimal quantum cloning, which lies at the border of the physical limit imposed by the no-signaling theorem and the Heisenberg uncertainty principle, has been experimentally realized for low-dimensional photonic states. However, an increase in the dimensionality of quantum systems is greatly beneficial to quantum computation and communication protocols. Nonetheless, no experimental demonstration of optimal cloning machines has hitherto been shown for high-dimensional quantum systems. We perform optimal cloning of high-dimensional photonic states by means of the symmetrization method. We show the universality of our technique by conducting cloning of numerous arbitrary input states and fully characterize our cloning machine by performing quantum state tomography on cloned photons. In addition, a cloning attack on a Bennett and Brassard (BB84) quantum key distribution protocol is experimentally demonstrated to reveal the robustness of high-dimensional states in quantum cryptography. PMID:28168219
Cell Fate Decision as High-Dimensional Critical State Transition
Zhou, Joseph; Castaño, Ivan G.; Leong-Quong, Rebecca Y. Y.; Chang, Hannah; Trachana, Kalliopi; Giuliani, Alessandro; Huang, Sui
2016-01-01
Cell fate choice and commitment of multipotent progenitor cells to a differentiated lineage requires broad changes of their gene expression profile. But how progenitor cells overcome the stability of their gene expression configuration (attractor) to exit the attractor in one direction remains elusive. Here we show that commitment of blood progenitor cells to the erythroid or myeloid lineage is preceded by the destabilization of their high-dimensional attractor state, such that differentiating cells undergo a critical state transition. Single-cell resolution analysis of gene expression in populations of differentiating cells affords a new quantitative index for predicting critical transitions in a high-dimensional state space based on decrease of correlation between cells and concomitant increase of correlation between genes as cells approach a tipping point. The detection of “rebellious cells” that enter the fate opposite to the one intended corroborates the model of preceding destabilization of a progenitor attractor. Thus, early warning signals associated with critical transitions can be detected in statistical ensembles of high-dimensional systems, offering a formal theory-based approach for analyzing single-cell molecular profiles that goes beyond current computational pattern recognition, does not require knowledge of specific pathways, and could be used to predict impending major shifts in development and disease. PMID:28027308
High-dimensional quantum cloning and applications to quantum hacking.
Bouchard, Frédéric; Fickler, Robert; Boyd, Robert W; Karimi, Ebrahim
2017-02-01
Attempts at cloning a quantum system result in the introduction of imperfections in the state of the copies. This is a consequence of the no-cloning theorem, which is a fundamental law of quantum physics and the backbone of security for quantum communications. Although perfect copies are prohibited, a quantum state may be copied with maximal accuracy via various optimal cloning schemes. Optimal quantum cloning, which lies at the border of the physical limit imposed by the no-signaling theorem and the Heisenberg uncertainty principle, has been experimentally realized for low-dimensional photonic states. However, an increase in the dimensionality of quantum systems is greatly beneficial to quantum computation and communication protocols. Nonetheless, no experimental demonstration of optimal cloning machines has hitherto been shown for high-dimensional quantum systems. We perform optimal cloning of high-dimensional photonic states by means of the symmetrization method. We show the universality of our technique by conducting cloning of numerous arbitrary input states and fully characterize our cloning machine by performing quantum state tomography on cloned photons. In addition, a cloning attack on a Bennett and Brassard (BB84) quantum key distribution protocol is experimentally demonstrated to reveal the robustness of high-dimensional states in quantum cryptography.
NASA Astrophysics Data System (ADS)
Arca-Sedda, Manuel; Capuzzo-Dolcetta, Roberto
2017-01-01
One of the leading scenarios for the formation of nuclear star clusters in galaxies is related to the orbital decay of globular clusters (GCs) and their subsequent merging, though alternative theories are currently debated. The availability of high-quality data for structural and orbital parameters of GCs allows us to test different nuclear star cluster formation scenarios. The Fornax dwarf spheroidal (dSph) galaxy is the heaviest satellite of the Milky Way and it is the only known dSph hosting five GCs, whereas there are no clear signatures for the presence of a central massive black hole. For this reason, it represents a suitable place to study the orbital decay process in dwarf galaxies. In this paper, we model the future evolution of the Fornax GCs, simulating them and the host galaxy by means of direct N-body simulations. Our simulations also take into account the gravitational field generated by the Milky Way. We found that if the Fornax galaxy is embedded in a standard cold dark matter halo, the nuclear cluster formation would be significantly hampered by the high central galactic mass density. In this context, we discuss the possibility that infalling GCs drive the flattening of the galactic density profile, giving a possible alternative explanation to the so-called cusp/core problem. Moreover, we briefly discuss the link between the GC infall process and the absence of massive black holes in the centre of dSphs.
High-dimensional entropy estimation for finite accuracy data: R-NN entropy estimator.
Kybic, Jan
2007-01-01
We address the problem of entropy estimation for high-dimensional finite-accuracy data. Our main application is evaluating high-order mutual information image similarity criteria for multimodal image registration. The basis of our method is an estimator based on k-th nearest neighbor (NN) distances, modified so that only distances greater than some constant R are evaluated. This modification requires a correction which is found numerically in a preprocessing step using quadratic programming. We compare experimentally our new method with k-NN and histogram estimators on synthetic data as well as for evaluation of mutual information for image similarity.
Some Unsolved Problems, Questions, and Applications of the Brightsen Nucleon Cluster Model
NASA Astrophysics Data System (ADS)
Smarandache, Florentin
2010-10-01
The Brightsen Model is opposite to the Standard Model, and it was built on John Wheeler's Resonating Group Structure Model and on Linus Pauling's Close-Packed Spheron Model. Among the Brightsen Model's predictions and applications we cite the fact that it derives the average number of prompt neutrons per fission event; it provides a theoretical way for understanding the low temperature / low energy reactions and for approaching the artificially induced fission; it predicts that forces within nucleon clusters are stronger than forces between such clusters within isotopes; it predicts the unmatter entities inside nuclei that result from stable and neutral union of matter and antimatter, and so on. But these predictions have to be tested in the future at the new CERN laboratory.
Improving clustering by imposing network information
Gerber, Susanne; Horenko, Illia
2015-01-01
Cluster analysis is one of the most popular data analysis tools in a wide range of applied disciplines. We propose and justify a computationally efficient and straightforward-to-implement way of imposing the available information from networks/graphs (a priori available in many application areas) on a broad family of clustering methods. The introduced approach is illustrated on the problem of a noninvasive unsupervised brain signal classification. This task is faced with several challenging difficulties such as nonstationary noisy signals and a small sample size, combined with a high-dimensional feature space and huge noise-to-signal ratios. Applying this approach results in an exact unsupervised classification of very short signals, opening new possibilities for clustering methods in the area of a noninvasive brain-computer interface. PMID:26601225
Analog computation through high-dimensional physical chaotic neuro-dynamics
NASA Astrophysics Data System (ADS)
Horio, Yoshihiko; Aihara, Kazuyuki
2008-07-01
Conventional von Neumann computers have difficulty in solving complex and ill-posed real-world problems. However, living organisms often face such problems in real life, and must quickly obtain suitable solutions through physical, dynamical, and collective computations involving vast assemblies of neurons. These highly parallel computations through high-dimensional dynamics (computation through dynamics) are completely different from the numerical computations on von Neumann computers (computation through algorithms). In this paper, we explore a novel computational mechanism with high-dimensional physical chaotic neuro-dynamics. We physically constructed two hardware prototypes using analog chaotic-neuron integrated circuits. These systems combine analog computations with chaotic neuro-dynamics and digital computation through algorithms. We used quadratic assignment problems (QAPs) as benchmarks. The first prototype utilizes an analog chaotic neural network with 800-dimensional dynamics. An external algorithm constructs a solution for a QAP using the internal dynamics of the network. In the second system, 300-dimensional analog chaotic neuro-dynamics drive a tabu-search algorithm. We demonstrate experimentally that both systems efficiently solve QAPs through physical chaotic dynamics. We also qualitatively analyze the underlying mechanism of the highly parallel and collective analog computations by observing global and local dynamics. Furthermore, we introduce spatial and temporal mutual information to quantitatively evaluate the system dynamics. The experimental results confirm the validity and efficiency of the proposed computational paradigm with the physical analog chaotic neuro-dynamics.
Hawking radiation of a high-dimensional rotating black hole
NASA Astrophysics Data System (ADS)
Ren, Zhao; Lichun, Zhang; Huaifan, Li; Yueqin, Wu
2010-01-01
We extend the classical Damour-Ruffini method and discuss the Hawking radiation spectrum of a high-dimensional rotating black hole using a tortoise coordinate transformation defined by taking the reaction of the radiation on the spacetime into consideration. Under the condition that energy and angular momentum are conserved, and taking the self-gravitation action into account, we derive Hawking radiation spectra that satisfy the unitarity principle of quantum mechanics. It is shown that the process by which the black hole radiates particles with energy ω is a continuous tunneling process. We provide a theoretical basis for further studying the physical mechanism of black-hole radiation.
Grellmann, Claudia; Neumann, Jane; Bitzer, Sebastian; Kovacs, Peter; Tönjes, Anke; Westlye, Lars T.; Andreassen, Ole A.; Stumvoll, Michael; Villringer, Arno; Horstmann, Annette
2016-01-01
In recent years, the advent of great technological advances has produced a wealth of very high-dimensional data, and combining high-dimensional information from multiple sources is becoming increasingly important in an expanding range of scientific disciplines. Partial Least Squares Correlation (PLSC) is a frequently used method for multivariate multimodal data integration. It is, however, computationally expensive in applications involving large numbers of variables, as required, for example, in genetic neuroimaging. To handle high-dimensional problems, dimension reduction might be implemented as a pre-processing step. We propose a new approach that incorporates Random Projection (RP) for dimensionality reduction into PLSC to efficiently solve high-dimensional multimodal problems like genotype-phenotype associations. We name our new method PLSC-RP. Using simulated and experimental data sets containing whole genome SNP measures as genotypes and whole brain neuroimaging measures as phenotypes, we demonstrate that PLSC-RP is drastically faster than traditional PLSC while providing statistically equivalent results. We also provide evidence that dimensionality reduction using RP is data type independent. Therefore, PLSC-RP opens up a wide range of possible applications. It can be used for any integrative analysis that combines information from multiple sources. PMID:27375677
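The random projection step mentioned in this abstract can be sketched in isolation as follows. This is only an illustration of Gaussian random projection, not the authors' PLSC-RP pipeline: the matrix sizes, the Gaussian construction, and all function names are assumptions chosen for the example, and the PLSC step itself (a singular value decomposition of the cross-covariance matrix) is omitted.

```python
import math
import random


def gaussian_projection_matrix(n_features, n_components, rng):
    # Entries drawn from N(0, 1/n_components), so squared lengths are
    # preserved in expectation after projection.
    scale = 1.0 / math.sqrt(n_components)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_components)]
            for _ in range(n_features)]


def project(row, matrix):
    # Multiply one data row (length n_features) by the projection matrix.
    out = [0.0] * len(matrix[0])
    for value, weights in zip(row, matrix):
        for j, w in enumerate(weights):
            out[j] += value * w
    return out


rng = random.Random(0)
n_samples, n_features, n_components = 4, 1000, 200

# Hypothetical high-dimensional data: 4 samples with 1000 variables each.
data = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
matrix = gaussian_projection_matrix(n_features, n_components, rng)
reduced = [project(row, matrix) for row in data]

# Ratio of pairwise distances after and before projection (ideally close to 1).
ratios = [math.dist(reduced[i], reduced[j]) / math.dist(data[i], data[j])
          for i in range(n_samples) for j in range(i + 1, n_samples)]
print(ratios)
```

Because pairwise distances are approximately preserved, a downstream analysis can work on 200 columns instead of 1000 at a fraction of the cost, which is the general motivation for reducing dimensionality before an expensive decomposition.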
Rupp, Matthias; Schneider, Petra; Schneider, Gisbert
2009-11-15
Measuring the (dis)similarity of molecules is important for many cheminformatics applications like compound ranking, clustering, and property prediction. In this work, we focus on real-valued vector representations of molecules (as opposed to the binary spaces of fingerprints). We demonstrate the influence which the choice of (dis)similarity measure can have on results, and provide recommendations for such choices. We review the mathematical concepts used to measure (dis)similarity in vector spaces, namely norms, metrics, inner products, and similarity coefficients, as well as the relationships between them, employing (dis)similarity measures commonly used in cheminformatics as examples. We present several phenomena (empty space phenomenon, sphere volume related phenomena, distance concentration) in high-dimensional descriptor spaces which are not encountered in two and three dimensions. These phenomena are theoretically characterized and illustrated on both artificial and real (bioactivity) data.
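Distance concentration, one of the phenomena named in this abstract, can be observed with a short experiment. The sketch below is an assumption-laden illustration rather than the authors' analysis: it uses uniformly distributed artificial points, the Euclidean distance, and a hypothetical helper `relative_contrast` that compares the farthest and nearest neighbor of a query point.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(d_max - d_min) / d_min for one random query in the unit hypercube."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, p) for p in points]
    return (max(dists) - min(dists)) / min(dists)


contrast_low = relative_contrast(2)     # two-dimensional descriptor space
contrast_high = relative_contrast(500)  # 500-dimensional descriptor space
print(contrast_low, contrast_high)
```

In two dimensions the farthest point is many times farther away than the nearest one, whereas in 500 dimensions all distances fall within a narrow band around a common value, so nearest-neighbor rankings become much less discriminative.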
Exploring High-Dimensional Data Space: Identifying Optimal Process Conditions in Photovoltaics
Suh, C.; Biagioni, D.; Glynn, S.; Scharf, J.; Contreras, M. A.; Noufi, R.; Jones, W. B.
2011-01-01
We demonstrate how advanced exploratory data analysis coupled to data-mining techniques can be used to scrutinize the high-dimensional data space of photovoltaics in the context of thin films of Al-doped ZnO (AZO), which are essential materials as a transparent conducting oxide (TCO) layer in CuIn$_x$Ga$_{1-x}$Se$_2$ (CIGS) solar cells. AZO data space, wherein each sample is synthesized from a different process history and assessed with various characterizations, is transformed, reorganized, and visualized in order to extract optimal process conditions. The data-analysis methods used include parallel coordinates, diffusion maps, and hierarchical agglomerative clustering algorithms combined with diffusion map embedding.
Suh, C.; Glynn, S.; Scharf, J.; Contreras, M. A.; Noufi, R.; Jones, W. B.; Biagioni, D.
2011-07-01
We demonstrate how advanced exploratory data analysis coupled to data-mining techniques can be used to scrutinize the high-dimensional data space of photovoltaics in the context of thin films of Al-doped ZnO (AZO), which are essential materials as a transparent conducting oxide (TCO) layer in CuIn$_x$Ga$_{1-x}$Se$_2$ (CIGS) solar cells. AZO data space, wherein each sample is synthesized from a different process history and assessed with various characterizations, is transformed, reorganized, and visualized in order to extract optimal process conditions. The data-analysis methods used include parallel coordinates, diffusion maps, and hierarchical agglomerative clustering algorithms combined with diffusion map embedding.
An Adaptive ANOVA-based PCKF for High-Dimensional Nonlinear Inverse Modeling
LI, Weixuan; Lin, Guang; Zhang, Dongxiao
2014-02-01
The probabilistic collocation-based Kalman filter (PCKF) is a recently developed approach for solving inverse problems. It resembles the ensemble Kalman filter (EnKF) in every aspect—except that it represents and propagates model uncertainty by polynomial chaos expansion (PCE) instead of an ensemble of model realizations. Previous studies have shown PCKF is a more efficient alternative to EnKF for many data assimilation problems. However, the accuracy and efficiency of PCKF depends on an appropriate truncation of the PCE series. Having more polynomial chaos bases in the expansion helps to capture uncertainty more accurately but increases computational cost. Bases selection is particularly important for high-dimensional stochastic problems because the number of polynomial chaos bases required to represent model uncertainty grows dramatically as the number of input parameters (random dimensions) increases. In classic PCKF algorithms, the PCE bases are pre-set based on users’ experience. Also, for sequential data assimilation problems, the bases kept in PCE expression remain unchanged in different Kalman filter loops, which could limit the accuracy and computational efficiency of classic PCKF algorithms. To address this issue, we present a new algorithm that adaptively selects PCE bases for different problems and automatically adjusts the number of bases in different Kalman filter loops. The algorithm is based on adaptive functional ANOVA (analysis of variance) decomposition, which approximates a high-dimensional function with the summation of a set of low-dimensional functions. Thus, instead of expanding the original model into PCE, we implement the PCE expansion on these low-dimensional functions, which is much less costly. We also propose a new adaptive criterion for ANOVA that is more suited for solving inverse problems. The new algorithm is tested with different examples and demonstrated great effectiveness in comparison with non-adaptive PCKF and En
An adaptive ANOVA-based PCKF for high-dimensional nonlinear inverse modeling
Li, Weixuan; Lin, Guang; Zhang, Dongxiao
2014-02-01
The probabilistic collocation-based Kalman filter (PCKF) is a recently developed approach for solving inverse problems. It resembles the ensemble Kalman filter (EnKF) in every aspect—except that it represents and propagates model uncertainty by polynomial chaos expansion (PCE) instead of an ensemble of model realizations. Previous studies have shown PCKF is a more efficient alternative to EnKF for many data assimilation problems. However, the accuracy and efficiency of PCKF depends on an appropriate truncation of the PCE series. Having more polynomial chaos basis functions in the expansion helps to capture uncertainty more accurately but increases computational cost. Selection of basis functions is particularly important for high-dimensional stochastic problems because the number of polynomial chaos basis functions required to represent model uncertainty grows dramatically as the number of input parameters (random dimensions) increases. In classic PCKF algorithms, the PCE basis functions are pre-set based on users' experience. Also, for sequential data assimilation problems, the basis functions kept in PCE expression remain unchanged in different Kalman filter loops, which could limit the accuracy and computational efficiency of classic PCKF algorithms. To address this issue, we present a new algorithm that adaptively selects PCE basis functions for different problems and automatically adjusts the number of basis functions in different Kalman filter loops. The algorithm is based on adaptive functional ANOVA (analysis of variance) decomposition, which approximates a high-dimensional function with the summation of a set of low-dimensional functions. Thus, instead of expanding the original model into PCE, we implement the PCE expansion on these low-dimensional functions, which is much less costly. We also propose a new adaptive criterion for ANOVA that is more suited for solving inverse problems. The new algorithm was tested with different examples and demonstrated
Node Detection Using High-Dimensional Fuzzy Parcellation Applied to the Insular Cortex
Vercelli, Ugo; Diano, Matteo; Costa, Tommaso; Nani, Andrea; Duca, Sergio; Geminiani, Giuliano; Vercelli, Alessandro; Cauda, Franco
2016-01-01
Several functional connectivity approaches require the definition of a set of regions of interest (ROIs) that act as network nodes. Different methods have been developed to define these nodes and to derive their functional and effective connections, most of which are rather complex. Here we aim to propose a relatively simple “one-step” border detection and ROI estimation procedure employing the fuzzy c-mean clustering algorithm. To test this procedure and to explore insular connectivity beyond the two/three-region model currently proposed in the literature, we parcellated the insular cortex of 20 healthy right-handed volunteers scanned in a resting state. By employing a high-dimensional functional connectivity-based clustering process, we confirmed the two patterns of connectivity previously described. This method revealed a complex pattern of functional connectivity where the two previously detected insular clusters are subdivided into several other networks, some of which are not commonly associated with the insular cortex, such as the default mode network and parts of the dorsal attentional network. Furthermore, the detection of nodes was reliable, as demonstrated by the confirmative analysis performed on a replication group of subjects. PMID:26881093
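The fuzzy c-means algorithm named in this abstract alternates between updating soft memberships and updating cluster centers. The following is a minimal one-dimensional sketch under stated assumptions: the fuzzifier `m = 2`, the fixed iteration count, the initial centers, and the toy data are all hypothetical, whereas the study clusters high-dimensional connectivity profiles of insular voxels.

```python
def memberships_for(points, centers, m):
    """Membership of each point in each cluster (each row sums to 1)."""
    rows = []
    for x in points:
        # Clamp distances to avoid division by zero when a point sits on a center.
        dists = [max(abs(x - c), 1e-12) for c in centers]
        inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
        total = sum(inv)
        rows.append([v / total for v in inv])
    return rows


def fuzzy_c_means(points, centers, m=2.0, iterations=50):
    """Plain fuzzy c-means on 1-D data; returns (centers, memberships)."""
    centers = list(centers)
    for _ in range(iterations):
        u = memberships_for(points, centers, m)
        # Each center is the mean of all points weighted by membership**m.
        centers = [
            sum(row[j] ** m * x for row, x in zip(u, points))
            / sum(row[j] ** m for row in u)
            for j in range(len(centers))
        ]
    return centers, memberships_for(points, centers, m)


points = [0.0, 0.2, -0.1, 0.1, 10.0, 9.8, 10.1, 10.2]
centers, memberships = fuzzy_c_means(points, centers=[1.0, 8.0])
print(centers)
print(memberships[0])
```

Unlike hard clustering, every point keeps a graded membership in every cluster, which is what allows borders between neighboring regions to be read off from the membership values in a single step.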
A reduced-order model from high-dimensional frictional hysteresis.
Biswas, Saurabh; Chatterjee, Anindya
2014-06-08
Hysteresis in material behaviour includes both signum nonlinearities as well as high dimensionality. Available models for component-level hysteretic behaviour are empirical. Here, we derive a low-order model for rate-independent hysteresis from a high-dimensional massless frictional system. The original system, being given in terms of signs of velocities, is first solved incrementally using a linear complementarity problem formulation. From this numerical solution, to develop a reduced-order model, basis vectors are chosen using the singular value decomposition. The slip direction in generalized coordinates is identified as the minimizer of a dissipation-related function. That function includes terms for frictional dissipation through signum nonlinearities at many friction sites. Luckily, it allows a convenient analytical approximation. Upon solution of the approximated minimization problem, the slip direction is found. A final evolution equation for a few states is then obtained that gives a good match with the full solution. The model obtained here may lead to new insights into hysteresis as well as better empirical modelling thereof.
Arif, Muhammad
2012-06-01
In pattern classification problems, feature extraction is an important step. The quality of features in discriminating different classes plays an important role in pattern classification problems. In real life, pattern classification may require a high-dimensional feature space, and it is impossible to visualize the feature space if its dimension is greater than four. In this paper, we have proposed a Similarity-Dissimilarity plot which can project a high-dimensional space to a two-dimensional space while retaining important characteristics required to assess the discrimination quality of the features. The Similarity-Dissimilarity plot can reveal information about the amount of overlap of features of different classes. Separable data points of different classes will also be visible on the plot, which can be classified correctly using an appropriate classifier. Hence, approximate classification accuracy can be predicted. Moreover, it is possible to know with which class the misclassified data points will be confused by the classifier. Outlier data points can also be located on the Similarity-Dissimilarity plot. Various examples of synthetic data are used to highlight important characteristics of the proposed plot. Some real-life examples from biomedical data are also used for the analysis. The proposed plot is independent of the number of dimensions of the feature space.
A reduced-order model from high-dimensional frictional hysteresis
Biswas, Saurabh; Chatterjee, Anindya
2014-01-01
Hysteresis in material behaviour includes both signum nonlinearities as well as high dimensionality. Available models for component-level hysteretic behaviour are empirical. Here, we derive a low-order model for rate-independent hysteresis from a high-dimensional massless frictional system. The original system, being given in terms of signs of velocities, is first solved incrementally using a linear complementarity problem formulation. From this numerical solution, to develop a reduced-order model, basis vectors are chosen using the singular value decomposition. The slip direction in generalized coordinates is identified as the minimizer of a dissipation-related function. That function includes terms for frictional dissipation through signum nonlinearities at many friction sites. Luckily, it allows a convenient analytical approximation. Upon solution of the approximated minimization problem, the slip direction is found. A final evolution equation for a few states is then obtained that gives a good match with the full solution. The model obtained here may lead to new insights into hysteresis as well as better empirical modelling thereof. PMID:24910522
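As a rough illustration of the basis-selection step mentioned in the abstract above, the following sketch picks reduced-order basis vectors from a snapshot matrix using the singular value decomposition. The synthetic snapshot data, the function name `pod_basis`, and the 99.9% energy threshold are assumptions made for illustration; they are not taken from the paper, and the frictional model itself is not reproduced here.

```python
import numpy as np

def pod_basis(snapshots, energy=0.999):
    """Return the leading left singular vectors that capture the requested
    fraction of the summed squared singular values (hypothetical helper)."""
    U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
    cumulative = np.cumsum(s**2) / np.sum(s**2)
    # Smallest k whose cumulative energy reaches the threshold.
    k = int(np.searchsorted(cumulative, energy) + 1)
    return U[:, :k], s

# Synthetic stand-in for a full numerical solution: 200 degrees of freedom,
# 500 time steps, generated from only 3 underlying modes.
rng = np.random.default_rng(0)
n_dof, n_steps, true_rank = 200, 500, 3
modes = rng.standard_normal((n_dof, true_rank))
amplitudes = rng.standard_normal((true_rank, n_steps))
snapshots = modes @ amplitudes

basis, singular_values = pod_basis(snapshots)

# Project onto the reduced basis and measure the relative reconstruction error.
reduced = basis.T @ snapshots
reconstruction_error = (
    np.linalg.norm(snapshots - basis @ reduced) / np.linalg.norm(snapshots)
)
```

The reduced coordinates in `reduced` play the role of the "few states" for which a low-order evolution equation would then be sought.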
Centroid estimation in discrete high-dimensional spaces with applications in biology.
Carvalho, Luis E; Lawrence, Charles E
2008-03-04
Maximum likelihood estimators and other direct optimization-based estimators dominated statistical estimation and prediction for decades. Yet, the principled foundations supporting their dominance do not apply to the discrete high-dimensional inference problems of the 21st century. As it is well known, statistical decision theory shows that maximum likelihood and related estimators use data only to identify the single most probable solution. Accordingly, unless this one solution so dominates the immense ensemble of all solutions that its probability is near one, there is no principled reason to expect such an estimator to be representative of the posterior-weighted ensemble of solutions, and thus represent inferences drawn from the data. We employ statistical decision theory to find more representative estimators, centroid estimators, in a general high-dimensional discrete setting by using a family of loss functions with penalties that increase with the number of differences in components. We show that centroid estimates are obtained by maximizing the marginal probabilities of the solution components for unconstrained ensembles and for an important class of problems, including sequence alignment and the prediction of RNA secondary structure, whose ensembles contain exclusivity constraints. Three genomics examples are described that show that these estimators substantially improve predictions of ground-truth reference sets.
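The contrast drawn above between the single most probable solution and a centroid estimate can be sketched for a small unconstrained ensemble of binary solutions. The toy ensemble and its probabilities are invented for illustration, and the loss used is the plain Hamming loss (number of differing components), which is one member of the family of losses the abstract describes.

```python
def map_estimate(ensemble):
    """Single most probable solution."""
    return max(ensemble, key=ensemble.get)

def centroid_estimate(ensemble):
    """Component-wise estimate: set a component to 1 when its marginal
    probability exceeds 1/2 (unconstrained ensemble, Hamming loss)."""
    n = len(next(iter(ensemble)))
    marginals = [
        sum(p for sol, p in ensemble.items() if sol[i] == 1) for i in range(n)
    ]
    return tuple(1 if m > 0.5 else 0 for m in marginals), marginals

def expected_hamming_loss(estimate, ensemble):
    """Posterior-weighted number of components on which the estimate differs."""
    return sum(
        p * sum(a != b for a, b in zip(estimate, sol))
        for sol, p in ensemble.items()
    )

# Hypothetical posterior over four binary solutions (probabilities sum to 1).
ensemble = {
    (0, 0, 0): 0.28,
    (1, 1, 0): 0.26,
    (1, 0, 1): 0.24,
    (1, 1, 1): 0.22,
}

map_solution = map_estimate(ensemble)
centroid_solution, marginals = centroid_estimate(ensemble)
map_loss = expected_hamming_loss(map_solution, ensemble)
centroid_loss = expected_hamming_loss(centroid_solution, ensemble)
```

In this toy case the most probable solution has probability 0.28 only, and the centroid differs from it because the first component equals 1 in most of the posterior mass.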
New data assimilation system DNDAS for high-dimensional models
NASA Astrophysics Data System (ADS)
Qun-bo, Huang; Xiao-qun, Cao; Meng-bin, Zhu; Wei-min, Zhang; Bai-nian, Liu
2016-05-01
The tangent linear (TL) models and adjoint (AD) models have brought great difficulties for the development of variational data assimilation systems. It might be impossible to develop them perfectly without great effort, either by hand or by automatic differentiation tools. In order to break these limitations, a new data assimilation system, the dual-number data assimilation system (DNDAS), is designed based on dual-number automatic differentiation principles. We investigate the performance of DNDAS with two different optimization schemes and subsequently discuss whether DNDAS is appropriate for high-dimensional forecast models. The new data assimilation system can avoid the complicated reverse integration of the adjoint model, and it only needs the forward integration in the dual-number space to obtain the cost function and its gradient vector concurrently. To verify the correctness and effectiveness of DNDAS, we implemented DNDAS on a simple ordinary differential model and the Lorenz-63 model with different optimization methods. We then concentrate on the adaptability of DNDAS to the Lorenz-96 model with high-dimensional state variables. The results indicate that whether the system is simple or nonlinear, DNDAS can accurately reconstruct the initial condition for the forecast model and has a strong anti-noise characteristic. Given adequate computing resources, the quasi-Newton optimization method performs better than the conjugate gradient method in DNDAS. Project supported by the National Natural Science Foundation of China (Grant Nos. 41475094 and 41375113).
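The idea of obtaining a cost function and its gradient from a single forward integration in dual-number space can be sketched as follows. This is a minimal illustration, not DNDAS itself: the `Dual` class, the scalar decay model, the forward-Euler integrator, and the observation value are all assumptions chosen to keep the example short.

```python
class Dual:
    """Dual number value + deriv*eps, with eps**2 == 0."""

    def __init__(self, value, deriv=0.0):
        self.value = float(value)
        self.deriv = float(deriv)

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __sub__(self, other):
        other = Dual.lift(other)
        return Dual(self.value - other.value, self.deriv - other.deriv)

    def __rsub__(self, other):
        return Dual.lift(other) - self

    def __mul__(self, other):
        other = Dual.lift(other)
        return Dual(
            self.value * other.value,
            self.value * other.deriv + self.deriv * other.value,
        )

    __rmul__ = __mul__


def forecast(x0, rate=0.5, dt=0.01, steps=100):
    """Forward-Euler integration of dx/dt = -rate * x (toy forecast model)."""
    x = x0
    for _ in range(steps):
        x = x + dt * (-rate * x)
    return x


def cost(x0, observation=1.0):
    """Squared misfit between the forecast and a hypothetical observation."""
    misfit = forecast(x0) - observation
    return misfit * misfit


# Seed the initial condition with derivative 1: one forward run then yields
# both the cost and d(cost)/d(x0), with no adjoint (reverse) integration.
result = cost(Dual(2.0, 1.0))
cost_value, gradient = result.value, result.deriv
```

For a state vector of dimension n, this forward-mode scheme needs one seeded run per component (or vector-valued dual parts), which is the scaling question the abstract raises for high-dimensional models.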
Power Enhancement in High Dimensional Cross-Sectional Tests
Fan, Jianqing; Liao, Yuan; Yao, Jiawei
2016-01-01
We propose a novel technique to boost the power of testing a high-dimensional vector H : θ = 0 against sparse alternatives where the null hypothesis is violated only by a couple of components. Existing tests based on quadratic forms such as the Wald statistic often suffer from low powers due to the accumulation of errors in estimating high-dimensional parameters. More powerful tests for sparse alternatives such as thresholding and extreme-value tests, on the other hand, require either stringent conditions or bootstrap to derive the null distribution and often suffer from size distortions due to the slow convergence. Based on a screening technique, we introduce a “power enhancement component”, which is zero under the null hypothesis with high probability, but diverges quickly under sparse alternatives. The proposed test statistic combines the power enhancement component with an asymptotically pivotal statistic, and strengthens the power under sparse alternatives. The null distribution does not require stringent regularity conditions, and is completely determined by that of the pivotal statistic. As specific applications, the proposed methods are applied to testing the factor pricing models and validating the cross-sectional independence in panel data models. PMID:26778846
High-dimensional camera shake removal with given depth map.
Yue, Tao; Suo, Jinli; Dai, Qionghai
2014-06-01
Camera motion blur is drastically nonuniform for large depth-range scenes; the nonuniformity caused by camera translation is depth dependent, whereas that caused by camera rotation is not. To restore the blurry images of large-depth-range scenes deteriorated by arbitrary camera motion, we build an image blur model considering 6 degrees of freedom (DoF) of camera motion with a given scene depth map. To make this 6D depth-aware model tractable, we propose a novel parametrization strategy to reduce the number of variables and an effective method to estimate high-dimensional camera motion as well. The number of variables is reduced by a temporal sampling motion function, which describes the 6-DoF camera motion by sampling the camera trajectory uniformly in the time domain. To effectively estimate the high-dimensional camera motion parameters, we construct the probabilistic motion density function (PMDF) to describe the probability distribution of camera poses during exposure, and apply it as a unified constraint to guide the convergence of the iterative deblurring algorithm. Specifically, PMDF is computed through a back projection from 2D local blur kernels to 6D camera motion parameter space and robust voting. We conduct a series of experiments on both synthetic and real captured data, and validate that our method achieves better performance than existing uniform methods and nonuniform methods on large-depth-range scenes.
Nam, Julia EunJu; Mueller, Klaus
2013-02-01
Gaining a true appreciation of high-dimensional space remains difficult since all of the existing high-dimensional space exploration techniques serialize the space travel in some way. This is not so foreign to us since we, when traveling, also experience the world in a serial fashion. But we typically have access to a map to help with positioning, orientation, navigation, and trip planning. Here, we propose a multivariate data exploration tool that compares high-dimensional space navigation with a sightseeing trip. It decomposes this activity into five major tasks: 1) Identify the sights: use a map to identify the sights of interest and their location; 2) Plan the trip: connect the sights of interest along a specifiable path; 3) Go on the trip: travel along the route; 4) Hop off the bus: experience the location, look around, zoom into detail; and 5) Orient and localize: regain bearings in the map. We describe intuitive and interactive tools for all of these tasks, both global navigation within the map and local exploration of the data distributions. For the latter, we describe a polygonal touchpad interface which enables users to smoothly tilt the projection plane in high-dimensional space to produce multivariate scatterplots that best convey the data relationships under investigation. Motion parallax and illustrative motion trails aid in the perception of these transient patterns. We describe the use of our system within two applications: 1) the exploratory discovery of data configurations that best fit a personal preference in the presence of tradeoffs and 2) interactive cluster analysis via cluster sculpting in N-D.
Nakano, Takashi; Otsuka, Makoto; Yoshimoto, Junichiro; Doya, Kenji
2015-01-01
A theoretical framework of reinforcement learning plays an important role in understanding action selection in animals. Spiking neural networks provide a theoretically grounded means to test computational hypotheses on neurally plausible algorithms of reinforcement learning through numerical simulation. However, most of these models cannot handle observations which are noisy, or occurred in the past, even though these are inevitable and constraining features of learning in real environments. This class of problem is formally known as partially observable reinforcement learning (PORL) problems. It provides a generalization of reinforcement learning to partially observable domains. In addition, observations in the real world tend to be rich and high-dimensional. In this work, we use a spiking neural network model to approximate the free energy of a restricted Boltzmann machine and apply it to the solution of PORL problems with high-dimensional observations. Our spiking network model solves maze tasks with perceptually ambiguous high-dimensional observations without knowledge of the true environment. An extended model with working memory also solves history-dependent tasks. The way spiking neural networks handle PORL problems may provide a glimpse into the underlying laws of neural information processing which can only be discovered through such a top-down approach.
Nakano, Takashi; Otsuka, Makoto; Yoshimoto, Junichiro; Doya, Kenji
2015-01-01
A theoretical framework of reinforcement learning plays an important role in understanding action selection in animals. Spiking neural networks provide a theoretically grounded means to test computational hypotheses on neurally plausible algorithms of reinforcement learning through numerical simulation. However, most of these models cannot handle observations which are noisy, or occurred in the past, even though these are inevitable and constraining features of learning in real environments. This class of problem is formally known as partially observable reinforcement learning (PORL) problems. It provides a generalization of reinforcement learning to partially observable domains. In addition, observations in the real world tend to be rich and high-dimensional. In this work, we use a spiking neural network model to approximate the free energy of a restricted Boltzmann machine and apply it to the solution of PORL problems with high-dimensional observations. Our spiking network model solves maze tasks with perceptually ambiguous high-dimensional observations without knowledge of the true environment. An extended model with working memory also solves history-dependent tasks. The way spiking neural networks handle PORL problems may provide a glimpse into the underlying laws of neural information processing which can only be discovered through such a top-down approach. PMID:25734662
GX-Means: A model-based divide and merge algorithm for geospatial image clustering
Vatsavai, Raju; Symons, Christopher T; Chandola, Varun; Jun, Goo
2011-01-01
One of the practical issues in clustering is the specification of the appropriate number of clusters, which is not obvious when analyzing geospatial datasets, partly because they are huge (both in size and spatial extent) and high dimensional. In this paper we present a computationally efficient model-based split and merge clustering algorithm that incrementally finds model parameters and the number of clusters. Additionally, we attempt to provide insights into this problem and other data mining challenges that are encountered when clustering geospatial data. The basic algorithm we present is similar to the G-means and X-means algorithms; however, our proposed approach avoids certain limitations of these well-known clustering algorithms that are pertinent when dealing with geospatial data. We compare the performance of our approach with the G-means and X-means algorithms. Experimental evaluation on simulated data and on multispectral and hyperspectral remotely sensed image data demonstrates the effectiveness of our algorithm.
Additivity Principle in High-Dimensional Deterministic Systems
NASA Astrophysics Data System (ADS)
Saito, Keiji; Dhar, Abhishek
2011-12-01
The additivity principle (AP), conjectured by Bodineau and Derrida [Phys. Rev. Lett. 92, 180601 (2004)], is discussed for the case of heat conduction in three-dimensional disordered harmonic lattices to consider the effects of deterministic dynamics, higher dimensionality, and different transport regimes, i.e., ballistic, diffusive, and anomalous transport. The cumulant generating function (CGF) for heat transfer is accurately calculated and compared with the one given by the AP. In the diffusive regime, we find a clear agreement with the conjecture even if the system is high dimensional. Surprisingly, even in the anomalous regime the CGF is also well fitted by the AP. Lower-dimensional systems are also studied and the importance of three dimensionality for the validity is stressed.
Additivity principle in high-dimensional deterministic systems.
Saito, Keiji; Dhar, Abhishek
2011-12-16
The additivity principle (AP), conjectured by Bodineau and Derrida [Phys. Rev. Lett. 92, 180601 (2004)], is discussed for the case of heat conduction in three-dimensional disordered harmonic lattices to consider the effects of deterministic dynamics, higher dimensionality, and different transport regimes, i.e., ballistic, diffusive, and anomalous transport. The cumulant generating function (CGF) for heat transfer is accurately calculated and compared with the one given by the AP. In the diffusive regime, we find a clear agreement with the conjecture even if the system is high dimensional. Surprisingly, even in the anomalous regime the CGF is also well fitted by the AP. Lower-dimensional systems are also studied and the importance of three dimensionality for the validity is stressed.
High dimensional reflectance analysis of soil organic matter
NASA Technical Reports Server (NTRS)
Henderson, T. L.; Baumgardner, M. F.; Franzmeier, D. P.; Stott, D. E.; Coster, D. C.
1992-01-01
Recent breakthroughs in remote-sensing technology have led to the development of high spectral resolution imaging sensors for observation of earth surface features. This research was conducted to evaluate the effects of organic matter content and composition on narrowband soil reflectance across the visible and reflective infrared spectral ranges. Organic matter from four Indiana agricultural soils, ranging in organic C content from 0.99 to 1.72 percent, was extracted, fractionated, and purified. Six components of each soil were isolated and prepared for spectral analysis. Reflectance was measured in 210 narrow bands in the 400- to 2500-nm wavelength range. Statistical analysis of reflectance values indicated the potential of high dimensional reflectance data in specific visible, near-infrared, and middle-infrared bands to provide information about soil organic C content, but not organic matter composition. These bands also responded significantly to Fe- and Mn-oxide content.
Modeling for Process Control: High-Dimensional Systems
Lev S. Tsimring
2008-09-15
Most other technologically important systems (among them, powders and other granular systems) are intrinsically nonlinear. This project is focused on building dynamical models for granular systems as a prototype for nonlinear high-dimensional systems exhibiting complex non-equilibrium phenomena. Granular materials present a unique opportunity to study these issues in a technologically important and yet fundamentally interesting setting. Granular systems exhibit a rich variety of regimes from gas-like to solid-like depending on the external excitation. Based on a combination of rigorous asymptotic analysis, available experimental data, and nonlinear signal processing tools, we developed a multi-scale approach to the modeling of granular systems, from a detailed description of grain-grain interaction on a micro-scale to continuous modeling of large-scale granular flows with important geophysical applications.
A quantum router for high-dimensional entanglement
NASA Astrophysics Data System (ADS)
Erhard, Manuel; Malik, Mehul; Zeilinger, Anton
2017-03-01
In addition to being a workhorse for modern quantum technologies, entanglement plays a key role in fundamental tests of quantum mechanics. The entanglement of photons in multiple levels, or dimensions, explores the limits of how large an entangled state can be, while also greatly expanding its applications in quantum information. Here we show how a high-dimensional quantum state of two photons entangled in their orbital angular momentum can be split into two entangled states with a smaller dimensionality structure. Our work demonstrates that entanglement is a quantum property that can be subdivided into spatially separated parts. In addition, our technique has vast potential applications in quantum as well as classical communication systems.
Future of High-Dimensional Data-Driven Exoplanet Science
NASA Astrophysics Data System (ADS)
Ford, Eric B.
2016-03-01
The detection and characterization of exoplanets has come a long way since the 1990’s. For example, instruments specifically designed for Doppler planet surveys feature environmental controls to minimize instrumental effects and advanced calibration systems. Combining these instruments with powerful telescopes, astronomers have detected thousands of exoplanets. The application of Bayesian algorithms has improved the quality and reliability with which astronomers characterize the mass and orbits of exoplanets. Thanks to continued improvements in instrumentation, now the detection of extrasolar low-mass planets is limited primarily by stellar activity, rather than observational uncertainties. This presents a new set of challenges which will require cross-disciplinary research to combine improved statistical algorithms with an astrophysical understanding of stellar activity and the details of astronomical instrumentation. I describe these challenges and outline the roles of parameter estimation over high-dimensional parameter spaces, marginalizing over uncertainties in stellar astrophysics and machine learning for the next generation of Doppler planet searches.
The detection of globular clusters in galaxies as a data mining problem
NASA Astrophysics Data System (ADS)
Brescia, Massimo; Cavuoti, Stefano; Paolillo, Maurizio; Longo, Giuseppe; Puzia, Thomas
2012-04-01
We present an application of self-adaptive supervised learning classifiers derived from the machine learning paradigm to the identification of candidate globular clusters in deep, wide-field, single-band Hubble Space Telescope (HST) images. Several methods provided by the DAta Mining and Exploration (DAME) web application were tested and compared on the NGC 1399 HST data described by Paolillo and collaborators in a companion paper. The best results were obtained using a multilayer perceptron with a quasi-Newton learning rule, which achieved a classification accuracy of 98.3 per cent, with a completeness of 97.8 per cent and contamination of 1.6 per cent. An extensive set of experiments revealed that the use of accurate structural parameters (effective radius, central surface brightness) does improve the final result, but only by ~5 per cent. It is also shown that the method is capable of retrieving extreme sources (for instance, very extended objects) which are missed by more traditional approaches.
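The three figures of merit quoted above can be computed from a confusion matrix as in the sketch below. The definitions used are the ones common in astronomical classification work (completeness as the fraction of true objects recovered, contamination as the fraction of selected objects that are false positives); the paper's exact definitions may differ, and the counts are invented for illustration rather than taken from the NGC 1399 data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, completeness, and contamination from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        # Fraction of all objects assigned to the correct class.
        "accuracy": (tp + tn) / total,
        # Fraction of true globular clusters that the classifier recovers.
        "completeness": tp / (tp + fn),
        # Fraction of selected candidates that are not globular clusters.
        "contamination": fp / (tp + fp),
    }

# Hypothetical counts for a catalogue of 1000 sources.
metrics = classification_metrics(tp=90, fp=10, tn=880, fn=20)
```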
Spectral feature design in high dimensional multispectral data
NASA Technical Reports Server (NTRS)
Chen, Chih-Chien Thomas; Landgrebe, David A.
1988-01-01
The High resolution Imaging Spectrometer (HIRIS) is designed to acquire images simultaneously in 192 spectral bands in the 0.4 to 2.5 micrometers wavelength region. It will make possible the collection of essentially continuous reflectance spectra at a spectral resolution sufficient to extract significantly enhanced amounts of information from return signals as compared to existing systems. The advantages of such high dimensional data come at a cost of increased system and data complexity. For example, since the finer the spectral resolution, the higher the data rate, it becomes impractical to design the sensor to be operated continuously. It is essential to find new ways to preprocess the data which reduce the data rate while at the same time maintaining the information content of the high dimensional signal produced. Four spectral feature design techniques are developed from the Weighted Karhunen-Loeve Transforms: (1) non-overlapping band feature selection algorithm; (2) overlapping band feature selection algorithm; (3) Walsh function approach; and (4) infinite clipped optimal function approach. The infinite clipped optimal function approach is chosen since the features are easiest to find and their classification performance is the best. After the preprocessed data has been received at the ground station, canonical analysis is further used to find the best set of features under the criterion that maximal class separability is achieved. Both 100 dimensional vegetation data and 200 dimensional soil data were used to test the spectral feature design system. It was shown that the infinite clipped versions of the first 16 optimal features had excellent classification performance. The overall probability of correct classification is over 90 percent while providing for a reduced downlink data rate by a factor of 10.
Shaffer, Patrick; Valsson, Omar; Parrinello, Michele
2016-01-01
The capabilities of molecular simulations have been greatly extended by a number of widely used enhanced sampling methods that facilitate escaping from metastable states and crossing large barriers. Despite these developments, many problems remain out of reach for these methods, which has led to a vigorous effort in this area. One of the most important problems that remains unsolved is sampling high-dimensional free-energy landscapes and systems that are not easily described by a small number of collective variables. In this work we demonstrate a new way to compute free-energy landscapes of high dimensionality based on the previously introduced variationally enhanced sampling, and we apply it to the miniprotein chignolin. PMID:26787868
NASA Technical Reports Server (NTRS)
Pinsonneault, Marc H.; Stauffer, John; Soderblom, David R.; King, Jeremy R.; Hanson, Robert B.
1998-01-01
Parallax data from the Hipparcos mission allow the direct distance to open clusters to be compared with the distance inferred from main-sequence (MS) fitting. There are surprising differences between the two distance measurements, indicating either the need for changes in the cluster compositions or reddening, underlying problems with the technique of MS fitting, or systematic errors in the Hipparcos parallaxes at the 1 mas level. We examine the different possibilities, focusing on MS fitting in both metallicity-sensitive B-V and metallicity-insensitive V-I for five well-studied systems (the Hyades, Pleiades, alpha Per, Praesepe, and Coma Ber). The Hipparcos distances to the Hyades and alpha Per are within 1 sigma of the MS-fitting distance in B-V and V-I, while the Hipparcos distances to Coma Ber and the Pleiades are in disagreement with the MS-fitting distance at more than the 3 sigma level. There are two Hipparcos measurements of the distance to Praesepe; one is in good agreement with the MS-fitting distance and the other disagrees at the 2 sigma level. The distance estimates from the different colors are in conflict with one another for Coma but in agreement for the Pleiades. Changes in the relative cluster metal abundances, age related effects, helium, and reddening are shown to be unlikely to explain the puzzling behavior of the Pleiades. We present evidence for spatially dependent systematic errors at the 1 mas level in the parallaxes of Pleiades stars. The implications of this result are discussed.
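To see why a systematic parallax error at the 1 mas level matters for a nearby cluster, the standard relations d [pc] = 1000 / p [mas] and distance modulus = 5 log10(d) - 5 can be evaluated directly. The parallax of 8.0 mas used below is a hypothetical round number chosen for illustration, not a value taken from the paper.

```python
import math

def distance_pc(parallax_mas):
    """Distance in parsecs from a parallax in milliarcseconds."""
    return 1000.0 / parallax_mas

def distance_modulus(distance):
    """Distance modulus m - M for a distance in parsecs."""
    return 5.0 * math.log10(distance) - 5.0

nominal_parallax = 8.0   # mas, hypothetical cluster parallax
systematic_error = 1.0   # mas, the error level discussed in the abstract

nominal_distance = distance_pc(nominal_parallax)
far_distance = distance_pc(nominal_parallax - systematic_error)
near_distance = distance_pc(nominal_parallax + systematic_error)

# Shift in distance modulus if the true parallax were 1 mas smaller.
modulus_shift = distance_modulus(far_distance) - distance_modulus(nominal_distance)
```

For this hypothetical cluster, a 1 mas offset moves the inferred distance by more than ten percent in either direction, which corresponds to a shift of a few tenths of a magnitude in distance modulus.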
Franklin, Jessica M; Eddings, Wesley; Glynn, Robert J; Schneeweiss, Sebastian
2015-10-01
Selection and measurement of confounders is critical for successful adjustment in nonrandomized studies. Although the principles behind confounder selection are now well established, variable selection for confounder adjustment remains a difficult problem in practice, particularly in secondary analyses of databases. We present a simulation study that compares the high-dimensional propensity score algorithm for variable selection with approaches that utilize direct adjustment for all potential confounders via regularized regression, including ridge regression and lasso regression. Simulations were based on 2 previously published pharmacoepidemiologic cohorts and used the plasmode simulation framework to create realistic simulated data sets with thousands of potential confounders. Performance of methods was evaluated with respect to bias and mean squared error of the estimated effects of a binary treatment. Simulation scenarios varied the true underlying outcome model, treatment effect, prevalence of exposure and outcome, and presence of unmeasured confounding. Across scenarios, high-dimensional propensity score approaches generally performed better than regularized regression approaches. However, including the variables selected by lasso regression in a regular propensity score model also performed well and may provide a promising alternative variable selection method.
Yu, Hualong; Ni, Jun
2014-01-01
Training classifiers on skewed data can be technically challenging, and the task becomes more difficult still when the data are also high-dimensional. Skewed data often appear in the biomedical field. In this study, we try to deal with this problem by combining the asymmetric bagging ensemble classifier (asBagging) presented in previous work with an improved random subspace (RS) generation strategy called feature subspace (FSS). Specifically, FSS is a novel method to promote the balance between accuracy and diversity of base classifiers in asBagging. In view of the strong generalization capability of the support vector machine (SVM), we adopt it as the base classifier. Extensive experiments on four benchmark biomedicine data sets indicate that the proposed ensemble learning method outperforms many baseline approaches in terms of Accuracy, F-measure, G-mean and AUC evaluation criteria; thus it can be regarded as an effective and efficient tool to deal with high-dimensional and imbalanced biomedical data.
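The sampling side of such an ensemble can be sketched as follows, under two stated assumptions: asymmetric bagging is taken to mean that every bag keeps all minority-class samples and draws an equally sized random subset of the majority class, and the feature subsets are drawn uniformly at random (plain random subspace), since the abstract does not specify how FSS chooses them. The function name and the index layout are hypothetical, and no base classifier is trained here.

```python
import random

def asymmetric_bags(minority_idx, majority_idx, n_features, n_bags,
                    subspace_size, rng):
    """Build balanced training bags, each restricted to a random feature subset."""
    bags = []
    for _ in range(n_bags):
        # Undersample only the majority class; keep every minority sample.
        sampled_majority = rng.sample(majority_idx, len(minority_idx))
        # Plain random subspace: a uniformly random subset of the features.
        features = sorted(rng.sample(range(n_features), subspace_size))
        bags.append({
            "rows": sorted(minority_idx + sampled_majority),
            "features": features,
        })
    return bags

# Hypothetical imbalanced data set: 10 minority rows, 90 majority rows,
# 2000 features; each base classifier would see 20 rows and 50 features.
rng = random.Random(42)
minority_idx = list(range(10))
majority_idx = list(range(10, 100))
bags = asymmetric_bags(minority_idx, majority_idx, n_features=2000,
                       n_bags=5, subspace_size=50, rng=rng)
```

Each entry of `bags` would then be used to train one base classifier (an SVM in the study), and the ensemble prediction would combine their votes.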
Using High-Dimensional Image Models to Perform Highly Undetectable Steganography
NASA Astrophysics Data System (ADS)
Pevný, Tomáš; Filler, Tomáš; Bas, Patrick
This paper presents a complete methodology for designing practical and highly undetectable stegosystems for real digital media. The main design principle is to minimize a suitably defined distortion by means of an efficient coding algorithm. The distortion is defined as a weighted difference of extended state-of-the-art feature vectors already used in steganalysis. This allows us to "preserve" the model used by the steganalyst and thus be undetectable even for large payloads. This framework can be efficiently implemented even when the dimensionality of the feature set used by the embedder is larger than 10^7. The high-dimensional model is necessary to avoid known security weaknesses. Although high-dimensional models might be a problem in steganalysis, we explain why they are acceptable in steganography. As an example, we introduce HUGO, a new embedding algorithm for spatial-domain digital images, and we contrast its performance with LSB matching. On the BOWS2 image database and in contrast with LSB matching, HUGO allows the embedder to hide a 7× longer message at the same security level.
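For context, LSB matching, the baseline against which HUGO is compared above, can be sketched in a few lines: when a pixel's least significant bit already equals the message bit the pixel is left alone, and otherwise the pixel value is changed by +1 or -1 at random. The sketch below is a generic illustration of that baseline only; it is not HUGO, it uses no distortion function or coding, and the pixel values and message are invented.

```python
import random

def lsb_match_embed(pixels, bits, rng):
    """Embed one bit per pixel by randomly adding or subtracting 1
    whenever the pixel's least significant bit differs from the bit."""
    stego = list(pixels)
    for i, bit in enumerate(bits):
        if (stego[i] & 1) != bit:
            if stego[i] == 0:        # cannot go below the 8-bit range
                stego[i] = 1
            elif stego[i] == 255:    # cannot go above the 8-bit range
                stego[i] = 254
            else:
                stego[i] += rng.choice((-1, 1))
    return stego

def lsb_extract(stego, n_bits):
    """Read the message back from the least significant bits."""
    return [p & 1 for p in stego[:n_bits]]

# Hypothetical 8-bit grey-scale pixels and an 8-bit message.
cover = [0, 255, 100, 101, 37, 200, 13, 54]
message = [1, 0, 1, 1, 0, 0, 1, 1]
stego = lsb_match_embed(cover, message, random.Random(7))
recovered = lsb_extract(stego, len(message))
```

Because every pixel is treated alike, this baseline ignores where changes are easiest to detect, which is the gap a distortion-minimizing scheme aims to close.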
NASA Astrophysics Data System (ADS)
Hu, Jiang; Bai, ZhiDong
2016-12-01
In this paper, we introduce the so-called naive tests and give a brief review of recent developments. Naive testing methods are easy to understand and perform robustly, especially when the dimension is large. We mainly focus on reviewing some naive testing methods for the mean vectors and covariance matrices of high-dimensional populations, and we believe this naive test idea can be widely used in many other testing problems.
2012-01-01
Background Externalising and internalising problems affect one in seven school-aged children and are the single strongest predictor of mental health problems into early adolescence. As the burden of mental health problems persists globally, childhood prevention of mental health problems is paramount. Prevention can be offered to all children (universal) or to children at risk of developing mental health problems (targeted). The relative effectiveness and costs of a targeted only versus combined universal and targeted approach are unknown. This study aims to determine the effectiveness, costs and uptake of two approaches to early childhood prevention of mental health problems, i.e., a Combined universal-targeted approach versus a Targeted only approach, in comparison to current primary care services (Usual care). Methods/design Three-armed, population-level cluster randomised trial (2010–2014) within the universal, well child Maternal Child Health system, attended by more than 80% of families in Victoria, Australia at infant age eight months. Participants were families of eight-month-old children from nine participating local government areas. Families were randomised to one of three groups: Combined, Targeted or Usual care. The interventions comprise (a) the Combined universal and targeted program, where all families are offered the universal Toddlers Without Tears group parenting program followed by the targeted Family Check-Up one-on-one program, or (b) the Targeted Family Check-Up program. The Family Check-Up program is only offered to children at risk of behavioural problems. Participants will be analysed according to the trial arm to which they were randomised, using logistic and linear regression models to compare primary and secondary outcomes. An economic evaluation (cost consequences analysis) will compare incremental costs to all incremental outcomes from a societal perspective. Discussion This trial will inform public health policy by making recommendations about the
Krause, Josua; Dasgupta, Aritra; Fekete, Jean-Daniel; Bertini, Enrico
2016-10-23
Dealing with the curse of dimensionality is a key challenge in high-dimensional data visualization. We present SeekAView to address three main gaps in the existing research literature. First, automated methods like dimensionality reduction or clustering suffer from a lack of transparency in letting analysts interact with their outputs in real-time to suit their exploration strategies. The results often suffer from a lack of interpretability, especially for domain experts not trained in statistics and machine learning. Second, exploratory visualization techniques like scatter plots or parallel coordinates suffer from a lack of visual scalability: it is difficult to present a coherent overview of interesting combinations of dimensions. Third, the existing techniques do not provide a flexible workflow that allows for multiple perspectives into the analysis process by automatically detecting and suggesting potentially interesting subspaces. In SeekAView we address these issues using suggestion based visual exploration of interesting patterns for building and refining multidimensional subspaces. Compared to the state-of-the-art in subspace search and visualization methods, we achieve higher transparency in showing not only the results of the algorithms, but also interesting dimensions calibrated against different metrics. We integrate a visually scalable design space with an iterative workflow guiding the analysts by choosing the starting points and letting them slice and dice through the data to find interesting subspaces and detect correlations, clusters, and outliers. We present two usage scenarios for demonstrating how SeekAView can be applied in real-world data analysis scenarios.
Visualization of High-Dimensional Point Clouds Using Their Density Distribution's Topology.
Oesterling, P; Heine, C; Janicke, H; Scheuermann, G; Heyer, G
2011-11-01
We present a novel method to visualize multidimensional point clouds. While conventional visualization techniques, like scatterplot matrices or parallel coordinates, have issues with either overplotting of entities or handling many dimensions, we abstract the data using topological methods before presenting it. We assume the input points to be samples of a random variable with a high-dimensional probability distribution which we approximate using kernel density estimates on a suitably reconstructed mesh. From the resulting scalar field we extract the join tree and present it as a topological landscape, a visualization metaphor that utilizes the human capability of understanding natural terrains. In this landscape, dense clusters of points show up as hills. The nesting of hills indicates the nesting of clusters. We augment the landscape with the data points to allow selection and inspection of single points and point sets. We also present optimizations to make our algorithm applicable to large data sets and to allow interactive adaption of our visualization to the kernel window width used in the density estimation.
Williams, Kristine; Herman, Ruth; Bontempo, Daniel
2014-01-01
Purpose of the study Assisted living (AL) residents are at risk for cognitive and functional declines that eventually reduce their ability to care for themselves, thereby triggering nursing home placement. In developing a method to slow this decline, the efficacy of Reasoning Exercises in Assisted Living (REAL), a cognitive training intervention that teaches everyday reasoning and problem-solving skills to AL residents, was tested. Design and methods At thirteen randomized Midwestern facilities, AL residents whose Mini Mental State Examination scores ranged from 19–29 either were trained in REAL or a vitamin education attention control program or received no treatment at all. For 3 weeks, treated groups received personal training in their respective programs. Results Scores on the Every Day Problems Test for Cognitively Challenged Elders (EPCCE) and on the Direct Assessment of Functional Status (DAFS) showed significant increases only for the REAL group. For EPCCE, change from baseline immediately postintervention was +3.10 (P<0.01), and there was significant retention at the 3-month follow-up (d=2.71; P<0.01). For DAFS, change from baseline immediately postintervention was +3.52 (P<0.001), although retention was not as strong. Neither the attention nor the no-treatment control groups had significant gains immediately postintervention or at follow-up assessments. Post hoc across-group comparison of baseline change also highlights the benefits of REAL training. For EPCCE, the magnitude of gain was significantly larger in the REAL group than in the no-treatment control group immediately postintervention (d=3.82; P<0.01) and at the 3-month follow-up (d=3.80; P<0.01). For DAFS, gain magnitude immediately postintervention was significantly greater for REAL than for the attention control group (d=4.73; P<0.01). Implications REAL improves skills in everyday problem solving, which may allow AL residents to maintain self-care and extend AL residency. This benefit
A qualitative numerical study of high dimensional dynamical systems
NASA Astrophysics Data System (ADS)
Albers, David James
Since Poincaré, the father of modern mathematical dynamical systems, much effort has been exerted to achieve a qualitative understanding of the physical world via a qualitative understanding of the functions we use to model the physical world. In this thesis, we construct a numerical framework suitable for a qualitative, statistical study of dynamical systems using the space of artificial neural networks. We analyze the dynamics along intervals in parameter space, separating the set of neural networks into roughly four regions: the fixed point to the first bifurcation; the route to chaos; the chaotic region; and a transition region between chaos and finite-state neural networks. The study is primarily with respect to high-dimensional dynamical systems. We make the following general conclusions as the dimension of the dynamical system is increased: the probability of the first bifurcation being of type Neimark-Sacker is greater than ninety percent; the most probable route to chaos is via a cascade of bifurcations of high-period periodic orbits, quasi-periodic orbits, and 2-tori; there exists an interval of parameter space such that hyperbolicity is violated on a countable, Lebesgue measure 0, "increasingly dense" subset; chaos is much more likely to persist with respect to parameter perturbation in the chaotic region of parameter space as the dimension is increased; moreover, as the number of positive Lyapunov exponents is increased, the likelihood that any significant portion of these positive exponents can be perturbed away decreases with increasing dimension. The maximum Kaplan-Yorke dimension and the maximum number of positive Lyapunov exponents increase linearly with dimension. The probability of a dynamical system being chaotic increases exponentially with dimension. The results with respect to the first bifurcation and the route to chaos comment on previous results of Newhouse, Ruelle, Takens, Broer, Chenciner, and Iooss. Moreover, results regarding the high-dimensional
HASE: Framework for efficient high-dimensional association analyses
Roshchupkin, G. V.; Adams, H. H. H.; Vernooij, M. W.; Hofman, A.; Van Duijn, C. M.; Ikram, M. A.; Niessen, W. J.
2016-01-01
High-throughput technology can now provide rich information on a person’s biological makeup and environmental surroundings. Important discoveries have been made by relating these data to various health outcomes in fields such as genomics, proteomics, and medical imaging. However, cross-investigations between several high-throughput technologies remain impractical due to demanding computational requirements (hundreds of years of computing resources) and unsuitability for collaborative settings (terabytes of data to share). Here we introduce the HASE framework that overcomes both of these issues. Our approach dramatically reduces computational time from years to only hours and requires only several gigabytes to be exchanged between collaborators. We implemented a novel meta-analytical method that yields the same power as pooled analyses without the need to share individual participant data. The efficiency of the framework is illustrated by associating 9 million genetic variants with 1.5 million brain imaging voxels in three cohorts (total N = 4,034) followed by meta-analysis, on a standard computational infrastructure. These experiments indicate that HASE facilitates high-dimensional association studies, enabling large multicenter association studies for future discoveries. PMID:27782180
Mapping the High-Dimensional ISM with Kinetic Tomography
NASA Astrophysics Data System (ADS)
Zasowski, Gail; Peek, Joshua Eli Goldston; Tchernyshyov, Kirill
2017-01-01
The interstellar medium (ISM) of a galaxy plays a critical role in its chemical evolution, via flows of enriched material into and out of star-forming molecular clouds, and even more expansive flows on kiloparsec scales through the disk and halo. The Milky Way is the only large galaxy in which we can resolve these motions at the level of individual molecular clouds, measure the kinematics of interstellar dust, and map the full three-dimensional velocity field of multiple ISM components; all of these are necessary to understand the evolution of spiral arms and molecular clouds, along with the redistribution of heavy elements throughout the Galaxy. I will present early results from a novel technique called "kinetic tomography", in which we combine stellar reddening (from Pan-STARRS), interstellar emission (CO and HI), and interstellar absorption (from APOGEE) data into high-dimensional datasets, and then extract distance-resolved kinematic information on multiple phases of the ISM. These methods are providing new views on the evolution of molecular clouds and chemical mixing in the ISM.
Multigroup Equivalence Analysis for High-Dimensional Expression Data
Yang, Celeste; Bartolucci, Alfred A.; Cui, Xiangqin
2015-01-01
Hypothesis tests of equivalence are typically known for their application in bioequivalence studies and acceptance sampling. Their application to gene expression data, in particular high-dimensional gene expression data, has only recently been studied. In this paper, we examine how two multigroup equivalence tests, the F-test and the range test, perform when applied to microarray expression data. We adapted these tests to a well-known equivalence criterion, the difference ratio. Our simulation results showed that both tests can achieve moderate power while controlling the type I error at nominal level for typical expression microarray studies with the benefit of easy-to-interpret equivalence limits. For the range of parameters simulated in this paper, the F-test is more powerful than the range test. However, for comparing three groups, their powers are similar. Finally, the two multigroup tests were applied to a prostate cancer microarray dataset to identify genes whose expression follows a prespecified trajectory across five prostate cancer stages. PMID:26628859
The flare Package for High Dimensional Linear Regression and Precision Matrix Estimation in R
Li, Xingguo; Zhao, Tuo; Yuan, Xiaoming; Liu, Han
2016-01-01
This paper describes an R package named flare, which implements a family of new high dimensional regression methods (LAD Lasso, SQRT Lasso, ℓq Lasso, and Dantzig selector) and their extensions to sparse precision matrix estimation (TIGER and CLIME). These methods exploit different nonsmooth loss functions to gain modeling flexibility, estimation robustness, and tuning insensitiveness. The developed solver is based on the alternating direction method of multipliers (ADMM), which is further accelerated by the multistage screening approach. The package flare is coded in double precision C, and called from R by a user-friendly interface. The memory usage is optimized by using the sparse matrix output. The experiments show that flare is efficient and can scale up to large problems.
Feature Selection Based on High Dimensional Model Representation for Hyperspectral Images.
Taskin Kaya, Gulsen; Kaya, Huseyin; Bruzzone, Lorenzo
2017-03-24
In hyperspectral image analysis, the classification task has generally been addressed jointly with dimensionality reduction due to both the high correlation between the spectral features and the noise present in spectral bands, which might significantly degrade classification performance. In supervised classification, limited training instances in proportion to the number of spectral features have negative impacts on the classification accuracy, which is known as the Hughes effect or the curse of dimensionality in the literature. In this paper, we focus on the dimensionality reduction problem and propose a novel feature-selection algorithm which is based on the method called High Dimensional Model Representation. The proposed algorithm is tested on some toy examples and hyperspectral datasets in comparison to conventional feature-selection algorithms in terms of classification accuracy, stability of the selected features and computational time. The results showed that the proposed approach provides both high classification accuracy and robust features with a satisfactory computational time.
A Hyperspherical Adaptive Sparse-Grid Method for High-Dimensional Discontinuity Detection
Zhang, Guannan; Webster, Clayton G.; Gunzburger, Max D.; Burkardt, John V.
2015-06-24
This study proposes and analyzes a hyperspherical adaptive hierarchical sparse-grid method for detecting jump discontinuities of functions in high-dimensional spaces. The method is motivated by the theoretical and computational inefficiencies of well-known adaptive sparse-grid methods for discontinuity detection. Our novel approach constructs a function representation of the discontinuity hypersurface of an N-dimensional discontinuous quantity of interest, by virtue of a hyperspherical transformation. Then, a sparse-grid approximation of the transformed function is built in the hyperspherical coordinate system, whose value at each point is estimated by solving a one-dimensional discontinuity detection problem. Due to the smoothness of the hypersurface, the new technique can identify jump discontinuities with significantly reduced computational cost, compared to existing methods. In addition, hierarchical acceleration techniques are also incorporated to further reduce the overall complexity. Rigorous complexity analyses of the new method are provided as are several numerical examples that illustrate the effectiveness of the approach.
High dimensional linear regression models under long memory dependence and measurement error
NASA Astrophysics Data System (ADS)
Kaul, Abhishek
This dissertation consists of three chapters. The first chapter introduces the models under consideration and motivates problems of interest. A brief literature review is also provided in this chapter. The second chapter investigates the properties of Lasso under long range dependent model errors. Lasso is a computationally efficient approach to model selection and estimation, and its properties are well studied when the regression errors are independent and identically distributed. We study the case where the regression errors form a long memory moving average process. We establish a finite sample oracle inequality for the Lasso solution. We then show the asymptotic sign consistency in this setup. These results are established in the high dimensional setup (p > n) where p can be increasing exponentially with n. Finally, we show the consistency, n^(1/2-d)-consistency of Lasso, along with the oracle property of adaptive Lasso, in the case where p is fixed. Here d is the memory parameter of the stationary error sequence. The performance of Lasso is also analysed in the present setup with a simulation study. The third chapter proposes and investigates the properties of a penalized quantile based estimator for measurement error models. Standard formulations of prediction problems in high dimension regression models assume the availability of fully observed covariates and sub-Gaussian and homogeneous model errors. This makes these methods inapplicable to measurement error models where covariates are unobservable and observations are possibly non sub-Gaussian and heterogeneous. We propose weighted penalized corrected quantile estimators for the regression parameter vector in linear regression models with additive measurement errors, where unobservable covariates are nonrandom. The proposed estimators forgo the need for the above mentioned model assumptions. We study these estimators in both the fixed dimension and high dimensional sparse setups, in the latter setup, the
Representing potential energy surfaces by high-dimensional neural network potentials.
Behler, J
2014-05-07
The development of interatomic potentials employing artificial neural networks has seen tremendous progress in recent years. While until recently the applicability of neural network potentials (NNPs) has been restricted to low-dimensional systems, this limitation has now been overcome and high-dimensional NNPs can be used in large-scale molecular dynamics simulations of thousands of atoms. NNPs are constructed by adjusting a set of parameters using data from electronic structure calculations, and in many cases energies and forces can be obtained with very high accuracy. Therefore, NNP-based simulation results are often very close to those gained by a direct application of first-principles methods. In this review, the basic methodology of high-dimensional NNPs will be presented with a special focus on the scope and the remaining limitations of this approach. The development of NNPs requires substantial computational effort as typically thousands of reference calculations are required. Still, if the problem to be studied involves very large systems or long simulation times this overhead is regained quickly. Further, the method is still limited to systems containing about three or four chemical elements due to the rapidly increasing complexity of the configuration space, although many atoms of each species can be present. Due to the ability of NNPs to describe even extremely complex atomic configurations with excellent accuracy irrespective of the nature of the atomic interactions, they represent a general and therefore widely applicable technique, e.g. for addressing problems in materials science, for investigating properties of interfaces, and for studying solvation processes.
Chen, Yi; Jakeman, John; Gittelson, Claude; Xiu, Dongbin
2015-01-08
In this paper we present a localized polynomial chaos expansion for partial differential equations (PDE) with random inputs. In particular, we focus on time independent linear stochastic problems with high dimensional random inputs, where the traditional polynomial chaos methods, and most of the existing methods, incur prohibitively high simulation cost. Furthermore, the local polynomial chaos method employs a domain decomposition technique to approximate the stochastic solution locally. In each subdomain, a subdomain problem is solved independently and, more importantly, in a much lower dimensional random space. In a postprocessing stage, accurate samples of the original stochastic problems are obtained from the samples of the local solutions by enforcing the correct stochastic structure of the random inputs and the coupling conditions at the interfaces of the subdomains. Overall, the method is able to solve stochastic PDEs in very large dimensions by solving a collection of low dimensional local problems and can be highly efficient. In our paper we present the general mathematical framework of the methodology and use numerical examples to demonstrate the properties of the method.
High dimensional spatial modeling of extremes with applications to United States Rainfalls
NASA Astrophysics Data System (ADS)
Zhou, Jie
2007-12-01
Spatial statistical models are used to predict unobserved variables based on observed variables and to estimate unknown model parameters. Extreme value theory (EVT) is used to study large or small observations from a random phenomenon. Both spatial statistics and extreme value theory have been studied in many areas such as agriculture, finance, industry and environmental science. This dissertation proposes two spatial statistical models which concentrate on non-Gaussian probability densities with general spatial covariance structures. The two models are also applied in analyzing United States rainfall and, especially, rainfall extremes. When the data set is not too large, the first model is used. The model constructs a generalized linear mixed model (GLMM) which can be considered as an extension of Diggle's model-based geostatistical approach (Diggle et al. 1998). The approach improves conventional kriging with a form of generalized linear mixed structure. As for high dimensional problems, two different methods are established to improve the computational efficiency of Markov Chain Monte Carlo (MCMC) implementation. The first method is based on spectral representation of spatial dependence structures which provides good approximations on each MCMC iteration. The other method embeds high dimensional covariance matrices in matrices with block circulant structures. The eigenvalues and eigenvectors of block circulant matrices can be calculated exactly by Fast Fourier Transforms (FFT). The computational efficiency is gained by transforming the posterior matrices into lower dimensional matrices. This method gives us an exact update on each MCMC iteration. Future predictions are also made by keeping spatial dependence structures fixed and using the relationship between present days and future days provided by some Global Climate Model (GCM). The predictions are refined by sampling techniques. Both ways of handling high dimensional covariance matrices are novel to analyze large
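The circulant-embedding idea mentioned in the abstract above rests on a standard fact: the eigenvalues of a circulant matrix are the discrete Fourier transform of its first column, and the eigenvectors are Fourier vectors. The following is a minimal sketch of that fact only, assuming a plain (non-block) circulant matrix and a naive O(n^2) DFT in place of an FFT; the function names are hypothetical and nothing here reproduces the dissertation's MCMC scheme.

```python
import cmath

def circulant(first_col):
    """Build the circulant matrix C with C[i][j] = first_col[(i - j) mod n]."""
    n = len(first_col)
    return [[first_col[(i - j) % n] for j in range(n)] for i in range(n)]

def dft(values):
    """Naive discrete Fourier transform (an FFT would return the same numbers faster)."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * m / n) for m, v in enumerate(values))
        for k in range(n)
    ]

def fourier_vector(n, k):
    """k-th Fourier vector, which is an eigenvector of every n x n circulant matrix."""
    return [cmath.exp(2j * cmath.pi * k * m / n) for m in range(n)]

def matvec(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]
```

Because the eigenvalues come straight from a Fourier transform of one column, no dense eigendecomposition of the full matrix is needed, which is where the computational saving comes from.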
NASA Astrophysics Data System (ADS)
Taşkin Kaya, Gülşen
2013-10-01
-output relationships in high-dimensional systems for many problems in science and engineering. The HDMR method was developed to improve the efficiency of deducing high dimensional behaviors. The method is formed by a particular organization of low dimensional component functions, in which each function is the contribution of one or more input variables to the output variables.
Fast and accurate probability density estimation in large high dimensional astronomical datasets
NASA Astrophysics Data System (ADS)
Gupta, Pramod; Connolly, Andrew J.; Gardner, Jeffrey P.
2015-01-01
Astronomical surveys will generate measurements of hundreds of attributes (e.g. color, size, shape) on hundreds of millions of sources. Analyzing these large, high dimensional data sets will require efficient algorithms for data analysis. An example of this is probability density estimation, which is at the heart of many classification problems such as the separation of stars and quasars based on their colors. Popular density estimation techniques use binning or kernel density estimation. Kernel density estimation has a small memory footprint but often requires large computational resources. Binning has small computational requirements, but it is usually implemented with multi-dimensional arrays, which leads to memory requirements that scale exponentially with the number of dimensions. Hence neither technique scales well to large data sets in high dimensions. We present an alternative approach of binning implemented with hash tables (BASH tables). This approach uses the sparseness of data in the high dimensional space to ensure that the memory requirements are small. However, hashing requires some extra computation, so a priori it is not clear if the reduction in memory requirements will lead to increased computational requirements. Through an implementation of BASH tables in C++ we show that the additional computational requirements of hashing are negligible. Hence this approach has small memory and computational requirements. We apply our density estimation technique to photometric selection of quasars using non-parametric Bayesian classification and show that the accuracy of the classification is the same as the accuracy of earlier approaches. Since the BASH table approach is one to three orders of magnitude faster than the earlier approaches, it may be useful in various other applications of density estimation in astrostatistics.
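The core idea of binning with a hash table can be sketched in a few lines. This is an illustrative sketch, not the authors' C++ implementation: it assumes equal-width bins in every dimension, uses a Python dict as the hash table, and the names `build_bash_table` and `density` are hypothetical. Only occupied bins are stored, so memory grows with the number of non-empty bins and not with the full grid size.

```python
import math
from collections import defaultdict

def bin_key(point, bin_width):
    """Map a point to the integer index tuple of the bin that contains it."""
    return tuple(math.floor(x / bin_width) for x in point)

def build_bash_table(points, bin_width):
    """Count points per occupied bin; empty bins are never stored."""
    table = defaultdict(int)
    for point in points:
        table[bin_key(point, bin_width)] += 1
    return table

def density(table, point, bin_width, n_points):
    """Histogram density estimate: bin count / (total points * bin volume)."""
    count = table.get(bin_key(point, bin_width), 0)  # .get does not create new keys
    return count / (n_points * bin_width ** len(point))
```

With a dense multi-dimensional array the same histogram would need one cell per grid position, which is the exponential memory cost the abstract describes.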
Unfold High-Dimensional Clouds for Exhaustive Gating of Flow Cytometry Data.
Qiu, Peng
2014-01-01
Flow cytometry is able to measure the expressions of multiple proteins simultaneously at the single-cell level. A flow cytometry experiment on one biological sample provides measurements of several protein markers on or inside a large number of individual cells in that sample. Analysis of such data often aims to identify subpopulations of cells with distinct phenotypes. Currently, the most widely used analytical approach in the flow cytometry community is manual gating on a sequence of nested biaxial plots, which is highly subjective, labor intensive, and not exhaustive. To address those issues, a number of methods have been developed to automate the gating analysis by clustering algorithms. However, completely removing the subjectivity can be quite challenging. This paper describes an alternative approach. Instead of automating the analysis, we develop novel visualizations to facilitate manual gating. The proposed method views single-cell data of one biological sample as a high-dimensional point cloud of cells, derives the skeleton of the cloud, and unfolds the skeleton to generate 2D visualizations. We demonstrate the utility of the proposed visualization using real data, and provide quantitative comparison to visualizations generated from principal component analysis and multidimensional scaling.
Gude, Tore; Hoffart, Asle
2008-04-01
The aim was to study whether patients with panic disorder with agoraphobia and co-occurring Cluster C traits would respond differently regarding change in interpersonal problems as part of their personality functioning when receiving two different treatment modalities. Two cohorts of patients were followed through three months' in-patient treatment programs and assessed at follow-up one year after end of treatment. The one cohort comprised 18 patients treated with "treatment as usual" according to psychodynamic principles, the second comprised 24 patients treated in a cognitive agoraphobia and schema-focused therapy program. Patients in the cognitive condition showed greater improvement in interpersonal problems than patients in the treatment as usual condition. Although this quasi-experimental study has serious limitations, the results may indicate that agoraphobic patients with Cluster C traits should be treated in cognitive agoraphobia and schema-focused programs rather than in psychodynamic treatment as usual programs in order to reduce their level of interpersonal problems.
Adaptive Virtual Support Vector Machine for the Reliability Analysis of High-Dimensional Problems
2011-08-01
Jeeves (H-J), Levenberg-Marquardt (L-M), genetic algorithm (GA) and PatternSearch (PS) methods [24, 25] have been applied. Among them, the PS method...each test starts with different initial profiles. Parameters σ, ε1, and ε2 are 3, 0.8, and 0.3, respectively, for both EDSD and VSVM. To compare the...strategy, and thus all final profiles are different except the 10 initial samples. According to Table 1, which provides averaged values of 20 test cases
NASA Astrophysics Data System (ADS)
Friedenberg, David
2010-10-01
the rate of falsely detected active regions. Additionally we examine the more general field of clustering and develop a framework for clustering algorithms based around diffusion maps. Diffusion maps can be used to project high-dimensional data into a lower dimensional space while preserving much of the structure in the data. We demonstrate how diffusion maps can be used to solve clustering problems and examine the influence of tuning parameters on the results. We introduce two novel methods, the self-tuning diffusion map which replaces the global scaling parameter in the typical diffusion map framework with a local scaling parameter and an algorithm for automatically selecting tuning parameters based on a cross-validation style score called prediction strength. The methods are tested on several example datasets.
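A local scaling parameter of the kind mentioned above can be illustrated with a small sketch. The details are assumptions, not the thesis's definitions: each point's scale is taken to be the distance to its k-th nearest neighbour, the kernel is exp(-d^2 / (sigma_i * sigma_j)), and the "diffusion distance" is the plain Euclidean distance between rows of the t-step transition matrix (the usual definition also weights by the stationary distribution). Function names are hypothetical, and the points are assumed to be distinct so that no scale is zero.

```python
import math

def local_scales(points, k):
    """Distance from each point to its k-th nearest neighbour (index 0 is the point itself)."""
    return [sorted(math.dist(p, q) for q in points)[k] for p in points]

def markov_matrix(points, k=2):
    """Row-normalised kernel matrix built with a per-point (local) scale."""
    sigma = local_scales(points, k)
    n = len(points)
    kernel = [
        [math.exp(-math.dist(points[i], points[j]) ** 2 / (sigma[i] * sigma[j]))
         for j in range(n)]
        for i in range(n)
    ]
    return [[value / sum(row) for value in row] for row in kernel]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def diffuse(p, t):
    """t-step transition matrix P^t for t >= 1."""
    out = p
    for _ in range(t - 1):
        out = matmul(out, p)
    return out

def diffusion_distance(pt, i, j):
    """Simplified (unweighted) distance between rows i and j of P^t."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pt[i], pt[j])))
```

Points in the same dense group reach similar neighbourhoods after a few steps, so their rows of P^t are close, while points in well-separated groups keep nearly disjoint rows; a clustering step can then operate on these distances.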
Arif, Muhammad; Basalamah, Saleh
2013-06-01
In real life biomedical classification applications, it is difficult to visualize the feature space due to its high dimensionality. In this paper, we propose a 3D similarity-dissimilarity plot that projects the high dimensional space to a three dimensional space in which important information about the feature space can be extracted in the context of pattern classification. In this plot it is possible to visualize good data points (data points nearer to their own class than to other classes), bad data points (data points far away from their own class) and outlier points (data points away from both their own class and other classes). Hence separation of classes can easily be visualized. The density of data points near each other can provide some useful information about the compactness of the clusters within a certain class. Moreover, an index called the percentage of data points above the similarity-dissimilarity line (PAS) is proposed, which is the fraction of data points above the similarity-dissimilarity line. Several synthetic and real life biomedical datasets are used to show the effectiveness of the proposed 3D similarity-dissimilarity plot.
Statistical Machine Learning for Structured and High Dimensional Data
2014-09-17
which generalizes Stein’s unbiased risk estimate (SURE) to Wishart distributions. The resulting estimator is free of any tuning parameters, and enjoys...theory. We have analyzed the case of the normal means within a Sobolev ellipsoid, which is a standard setup in nonparametric regression. Our results ...data analysis problems. In particular, we have been working with data from the Kepler telescope for finding exoplanets orbiting distant stars. The
The problem of the structure (state of helium) in small He_N-CO clusters
Potapov, A. V.; Panfilov, V. A.; Surin, L. A.; Dumesh, B. S.
2010-11-15
A second-order perturbation theory, developed for calculating the energy levels of the He-CO binary complex, is applied to small He_N-CO clusters with N = 2-4, the helium atoms being considered as a single bound object. The interaction potential between the CO molecule and He_N is represented as a linear expansion in Legendre polynomials, in which the free rotation limit is chosen as the zero approximation and the angular dependence of the interaction is considered as a small perturbation. By fitting calculated rotational transitions to experimental values it was possible to determine the optimal parameters of the potential and to achieve good agreement (to within less than 1%) between calculated and experimental energy levels. As a result, the shape of the angular anisotropy of the interaction potential is obtained for various clusters. It turns out that the minimum of the potential energy is smoothly shifted from an angle between the axes of the CO molecule and the cluster of θ = 100° in He-CO to θ = 180° (the oxygen end) in He_3-CO and He_4-CO clusters. Under the assumption that the distribution of helium atoms with respect to the cluster axis is cylindrically symmetric, the structure of the cluster can be represented as a pyramid with the CO molecule at the vertex.
Finite-key analysis of a practical decoy-state high-dimensional quantum key distribution
NASA Astrophysics Data System (ADS)
Bao, Haize; Bao, Wansu; Wang, Yang; Zhou, Chun; Chen, Ruike
2016-05-01
Compared with two-level quantum key distribution (QKD), high-dimensional QKD enables two distant parties to share a secret key at a higher rate. We provide a finite-key security analysis for the recently proposed practical high-dimensional decoy-state QKD protocol based on time-energy entanglement. We employ two methods to estimate the statistical fluctuation of the postselection probability and give a tighter bound on the secure-key capacity. By numerical evaluation, we show the finite-key effect on the secure-key capacity in different conditions. Moreover, our approach could be used to optimize parameters in practical implementations of high-dimensional QKD.
ERIC Educational Resources Information Center
Jitendra, Asha K.; Harwell, Michael R.; Dupuis, Danielle N.; Karl, Stacy R.; Lein, Amy E.; Simonson, Gregory; Slater, Susan C.
2015-01-01
This experimental study evaluated the effectiveness of a research-based intervention, schema-based instruction (SBI), on students' proportional problem solving. SBI emphasizes the underlying mathematical structure of problems, uses schematic diagrams to represent information in the problem text, provides explicit problem-solving and metacognitive…
NASA Technical Reports Server (NTRS)
Soderblom, David R.; King, Jeremy R.; Hanson, Robert B.; Jones, Burton F.; Fischer, Debra; Stauffer, John R.; Pinsonneault, Marc H.
1998-01-01
This paper examines the discrepancy between distances to nearby open clusters as determined by parallaxes from Hipparcos compared to traditional main-sequence fitting. The biggest difference is seen for the Pleiades, and our hypothesis is that if the Hipparcos distance to the Pleiades is correct, then similar subluminous zero-age main-sequence (ZAMS) stars should exist elsewhere, including in the immediate solar neighborhood. We examine a color-magnitude diagram of very young and nearby solar-type stars and show that none of them lie below the traditional ZAMS, despite the fact that the Hipparcos Pleiades parallax would place its members 0.3 mag below that ZAMS. We also present analyses and observations of solar-type stars that do lie below the ZAMS, and we show that they are subluminous because of low metallicity and that they have the kinematics of old stars.
Wang, Ying; Fan, Yong; Bhatt, Priyanka; Davatzikos, Christos
2010-05-01
This paper presents a general methodology for high-dimensional pattern regression on medical images via machine learning techniques. Compared with pattern classification studies, pattern regression considers the problem of estimating continuous rather than categorical variables, and can be more challenging. It is also clinically important, since it can be used to estimate disease stage and predict clinical progression from images. In this work, an adaptive regional feature extraction approach is used along with other common feature extraction methods, and a feature selection technique is adopted to produce a small number of discriminative features for optimal regression performance. Then the Relevance Vector Machine (RVM) is used to build regression models based on the selected features. To get stable regression models from limited training samples, a bagging framework is adopted to build ensemble basis regressors derived from multiple bootstrap training samples, and thus to alleviate the effects of outliers as well as facilitate the optimal model parameter selection. Finally, this regression scheme is tested on simulated data and real data via cross-validation. Experimental results demonstrate that this regression scheme achieves higher estimation accuracy and better generalization ability than Support Vector Regression (SVR).
Defining and evaluating classification algorithm for high-dimensional data based on latent topics.
Luo, Le; Li, Li
2014-01-01
Automatic text categorization is one of the key techniques in information retrieval and the data mining field. The classification is usually time-consuming when the training dataset is large and high-dimensional. Many methods have been proposed to solve this problem, but few can achieve satisfactory efficiency. In this paper, we present a method which combines the Latent Dirichlet Allocation (LDA) algorithm and the Support Vector Machine (SVM). LDA is first used to generate reduced dimensional representation of topics as feature in VSM. It is able to reduce features dramatically but keeps the necessary semantic information. The Support Vector Machine (SVM) is then employed to classify the data based on the generated features. We evaluate the algorithm on 20 Newsgroups and Reuters-21578 datasets, respectively. The experimental results show that the classification based on our proposed LDA+SVM model achieves high performance in terms of precision, recall and F1 measure. Further, it can achieve this within a much shorter time-frame. Our process improves greatly upon the previous work in this field and displays strong potential to achieve a streamlined classification process for a wide range of applications.
The sparse matrix transform for covariance estimation and analysis of high dimensional signals.
Cao, Guangzhi; Bachega, Leonardo R; Bouman, Charles A
2011-03-01
Covariance estimation for high dimensional signals is a classically difficult problem in statistical signal analysis and machine learning. In this paper, we propose a maximum likelihood (ML) approach to covariance estimation, which employs a novel non-linear sparsity constraint. More specifically, the covariance is constrained to have an eigen decomposition which can be represented as a sparse matrix transform (SMT). The SMT is formed by a product of pairwise coordinate rotations known as Givens rotations. Using this framework, the covariance can be efficiently estimated using greedy optimization of the log-likelihood function, and the number of Givens rotations can be efficiently computed using a cross-validation procedure. The resulting estimator is generally positive definite and well-conditioned, even when the sample size is limited. Experiments on a combination of simulated data, standard hyperspectral data, and face image sets show that the SMT-based covariance estimates are consistently more accurate than both traditional shrinkage estimates and recently proposed graphical lasso estimates for a variety of different classes and sample sizes. An important property of the new covariance estimate is that it naturally yields a fast implementation of the estimated eigen-transformation using the SMT representation. In fact, the SMT can be viewed as a generalization of the classical fast Fourier transform (FFT) in that it uses "butterflies" to represent an orthonormal transform. However, unlike the FFT, the SMT can be used for fast eigen-signal analysis of general non-stationary signals.
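The structural idea in the abstract above, an orthonormal transform built as a product of Givens rotations, can be illustrated with a minimal sketch. It does not reproduce the authors' greedy maximum-likelihood selection of rotations; the rotation list, the helper names `givens`, `matmul`, and `transpose`, and the dimension are all hypothetical choices made for illustration.

```python
import math

def givens(n, i, j, theta):
    """Return the n x n Givens rotation acting on coordinates i and j."""
    g = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
    c, s = math.cos(theta), math.sin(theta)
    g[i][i], g[j][j] = c, c
    g[i][j], g[j][i] = -s, s
    return g

def matmul(a, b):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(a[r][k] * b[k][c] for k in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

# Hypothetical rotation list: (coordinate i, coordinate j, angle in radians).
rotations = [(0, 1, 0.3), (2, 3, -1.1), (1, 2, 0.7)]
n = 4

# Accumulate the product of rotations, starting from the identity.
E = [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]
for i, j, theta in rotations:
    E = matmul(E, givens(n, i, j, theta))

# Any product of Givens rotations is orthonormal, so E^T E should be the identity.
gram = matmul(transpose(E), E)
max_dev = max(abs(gram[r][c] - (1.0 if r == c else 0.0))
              for r in range(n) for c in range(n))
print(max_dev)
```

Because each rotation touches only two coordinates, applying the transform to a vector one rotation at a time costs a constant amount of work per rotation, which is presumably the sense in which the abstract compares the SMT to FFT "butterflies".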
A Hyperspherical Adaptive Sparse-Grid Method for High-Dimensional Discontinuity Detection
Zhang, Guannan; Webster, Clayton G.; Gunzburger, Max D.; ...
2015-06-24
This study proposes and analyzes a hyperspherical adaptive hierarchical sparse-grid method for detecting jump discontinuities of functions in high-dimensional spaces. The method is motivated by the theoretical and computational inefficiencies of well-known adaptive sparse-grid methods for discontinuity detection. Our novel approach constructs a function representation of the discontinuity hypersurface of an N-dimensional discontinuous quantity of interest, by virtue of a hyperspherical transformation. Then, a sparse-grid approximation of the transformed function is built in the hyperspherical coordinate system, whose value at each point is estimated by solving a one-dimensional discontinuity detection problem. Due to the smoothness of the hypersurface, the new technique can identify jump discontinuities with significantly reduced computational cost, compared to existing methods. In addition, hierarchical acceleration techniques are also incorporated to further reduce the overall complexity. Rigorous complexity analyses of the new method are provided as are several numerical examples that illustrate the effectiveness of the approach.
Modeling of stochastic dynamics of time-dependent flows under high-dimensional random forcing
NASA Astrophysics Data System (ADS)
Babaee, Hessam; Karniadakis, George
2016-11-01
In this numerical study the effect of high-dimensional stochastic forcing in time-dependent flows is investigated. To efficiently quantify the evolution of stochasticity in such a system, the dynamically orthogonal method is used. In this methodology, the solution is approximated by a generalized Karhunen-Loeve (KL) expansion in the form of $u(x,t;\omega) = \bar{u}(x,t) + \sum_{i=1}^{N} y_i(t;\omega)\,u_i(x,t)$, in which $\bar{u}(x,t)$ is the stochastic mean, the set of $u_i(x,t)$ is a deterministic orthogonal basis, and the $y_i(t;\omega)$ are the stochastic coefficients. Explicit evolution equations for $\bar{u}$, $u_i$, and $y_i$ are formulated. The basis elements $u_i(x,t)$ remain orthogonal for all times and they evolve according to the system dynamics to capture the energetically dominant stochastic subspace. We consider two classical fluid dynamics problems: (1) flow over a cylinder, and (2) flow over an airfoil under up to one-hundred dimensional random forcing. We explore the interaction of intrinsic with extrinsic stochasticity in these flows. DARPA N66001-15-2-4055, Office of Naval Research N00014-14-1-0166.
A decision-theory approach to interpretable set analysis for high-dimensional data.
Boca, Simina M; Bravo, Héctor Corrada; Caffo, Brian; Leek, Jeffrey T; Parmigiani, Giovanni
2013-09-01
A key problem in high-dimensional significance analysis is to find pre-defined sets that show enrichment for a statistical signal of interest; the classic example is the enrichment of gene sets for differentially expressed genes. Here, we propose a new decision-theory approach to the analysis of gene sets which focuses on estimating the fraction of non-null variables in a set. We introduce the idea of "atoms," non-overlapping sets based on the original pre-defined set annotations. Our approach focuses on finding the union of atoms that minimizes a weighted average of the number of false discoveries and missed discoveries. We introduce a new false discovery rate for sets, called the atomic false discovery rate (afdr), and prove that the optimal estimator in our decision-theory framework is to threshold the afdr. These results provide a coherent and interpretable framework for the analysis of sets that addresses the key issues of overlapping annotations and difficulty in interpreting p values in both competitive and self-contained tests. We illustrate our method and compare it to a popular existing method using simulated examples, as well as gene-set and brain ROI data analyses.
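The abstract above describes "atoms" only as non-overlapping sets derived from the original annotations. One plausible reading, which is an assumption here and may differ from the paper's exact construction, is to group elements by the combination of pre-defined sets they belong to. The gene names, set names, and the function `atoms_from_sets` below are hypothetical.

```python
from collections import defaultdict

# Hypothetical overlapping annotations (set name -> member genes).
gene_sets = {
    "set_A": {"g1", "g2", "g3"},
    "set_B": {"g2", "g3", "g4"},
    "set_C": {"g5"},
}

def atoms_from_sets(sets):
    """Group elements by their membership signature (assumed atom definition)."""
    membership = defaultdict(set)
    for name, members in sets.items():
        for member in members:
            membership[member].add(name)
    atoms = defaultdict(set)
    for member, names in membership.items():
        atoms[frozenset(names)].add(member)
    return dict(atoms)

atoms = atoms_from_sets(gene_sets)
for signature, members in atoms.items():
    print(sorted(signature), sorted(members))
```

Under this reading every element lands in exactly one atom, so the atoms partition the annotated elements even though the original sets overlap.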
A hyper-spherical adaptive sparse-grid method for high-dimensional discontinuity detection
Zhang, Guannan; Webster, Clayton G; Gunzburger, Max D; Burkardt, John V
2014-03-01
This work proposes and analyzes a hyper-spherical adaptive hierarchical sparse-grid method for detecting jump discontinuities of functions in high-dimensional spaces. The method is motivated by the theoretical and computational inefficiencies of well-known adaptive sparse-grid methods for discontinuity detection. Our novel approach constructs a function representation of the discontinuity hyper-surface of an N-dimensional discontinuous quantity of interest, by virtue of a hyper-spherical transformation. Then, a sparse-grid approximation of the transformed function is built in the hyper-spherical coordinate system, whose value at each point is estimated by solving a one-dimensional discontinuity detection problem. Due to the smoothness of the hyper-surface, the new technique can identify jump discontinuities with significantly reduced computational cost, compared to existing methods. Moreover, hierarchical acceleration techniques are also incorporated to further reduce the overall complexity. Rigorous error estimates and complexity analyses of the new method are provided as are several numerical examples that illustrate the effectiveness of the approach.
Defining and Evaluating Classification Algorithm for High-Dimensional Data Based on Latent Topics
Luo, Le; Li, Li
2014-01-01
Automatic text categorization is one of the key techniques in information retrieval and the data mining field. The classification is usually time-consuming when the training dataset is large and high-dimensional. Many methods have been proposed to solve this problem, but few can achieve satisfactory efficiency. In this paper, we present a method which combines the Latent Dirichlet Allocation (LDA) algorithm and the Support Vector Machine (SVM). LDA is first used to generate reduced dimensional representation of topics as feature in VSM. It is able to reduce features dramatically but keeps the necessary semantic information. The Support Vector Machine (SVM) is then employed to classify the data based on the generated features. We evaluate the algorithm on 20 Newsgroups and Reuters-21578 datasets, respectively. The experimental results show that the classification based on our proposed LDA+SVM model achieves high performance in terms of precision, recall and F1 measure. Further, it can achieve this within a much shorter time-frame. Our process improves greatly upon the previous work in this field and displays strong potential to achieve a streamlined classification process for a wide range of applications. PMID:24416136
A simple new filter for nonlinear high-dimensional data assimilation
NASA Astrophysics Data System (ADS)
Tödter, Julian; Kirchgessner, Paul; Ahrens, Bodo
2015-04-01
performance with a realistic ensemble size. The results confirm that, in principle, it can be applied successfully and as simply as the ETKF in high-dimensional problems without further modifications of the algorithm, even though it is only based on the particle weights. This proves that the suggested method constitutes a useful filter for nonlinear, high-dimensional data assimilation, and is able to overcome the curse of dimensionality even in deterministic systems.
High-dimensional Cox models: the choice of penalty as part of the model building process.
Benner, Axel; Zucknick, Manuela; Hielscher, Thomas; Ittrich, Carina; Mansmann, Ulrich
2010-02-01
The Cox proportional hazards regression model is the most popular approach to model covariate information for survival times. In this context, the development of high-dimensional models where the number of covariates is much larger than the number of observations (p>n) is an ongoing challenge. A practicable approach is to use ridge penalized Cox regression in such situations. Besides focusing on finding the best prediction rule, one is often interested in determining a subset of covariates that are the most important ones for prognosis. This could be a gene set in the biostatistical analysis of microarray data. Covariate selection can then, for example, be done by L(1)-penalized Cox regression using the lasso (Tibshirani (1997). Statistics in Medicine 16, 385-395). Several approaches beyond the lasso that incorporate covariate selection have been developed in recent years. These include modifications of the lasso as well as nonconvex variants such as smoothly clipped absolute deviation (SCAD) (Fan and Li (2001). Journal of the American Statistical Association 96, 1348-1360; Fan and Li (2002). The Annals of Statistics 30, 74-99). The purpose of this article is to implement them practically into the model building process when analyzing high-dimensional data with the Cox proportional hazards model. To evaluate penalized regression models beyond the lasso, we included SCAD variants and the adaptive lasso (Zou (2006). Journal of the American Statistical Association 101, 1418-1429). We compare them with "standard" applications such as ridge regression, the lasso, and the elastic net. Predictive accuracy, features of variable selection, and estimation bias will be studied to assess the practical use of these methods. We observed that the performance of SCAD and adaptive lasso is highly dependent on nontrivial preselection procedures. A practical solution to this problem does not yet exist. Since there is high risk of missing relevant covariates when using SCAD or
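The penalties named in the record above differ mainly in how they treat large coefficients, which a small sketch of the per-coefficient penalty functions makes concrete. This only evaluates the penalties; it does not fit a Cox model. The ridge scaling (no factor of 1/2) is one common convention and an assumption here, and `a = 3.7` is the default usually quoted for SCAD rather than something stated in the abstract.

```python
def ridge_penalty(beta, lam):
    """Quadratic penalty; shrinks but never sets a coefficient exactly to zero."""
    return lam * beta * beta

def lasso_penalty(beta, lam):
    """L1 penalty; grows linearly without bound."""
    return lam * abs(beta)

def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty: lasso-like near zero, constant for large coefficients."""
    b = abs(beta)
    if b <= lam:
        return lam * b
    if b <= a * lam:
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    return lam * lam * (a + 1) / 2

for beta in (0.5, 2.0, 10.0):
    print(beta, ridge_penalty(beta, 1.0), lasso_penalty(beta, 1.0),
          scad_penalty(beta, 1.0))
```

The SCAD penalty levels off once a coefficient exceeds `a * lam`, so large effects are penalized no further, whereas the lasso keeps penalizing them in proportion to their size.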
Approximating high-dimensional dynamics by barycentric coordinates with linear programming.
Hirata, Yoshito; Shiro, Masanori; Takahashi, Nozomu; Aihara, Kazuyuki; Suzuki, Hideyuki; Mas, Paloma
2015-01-01
The increasing development of novel methods and techniques facilitates the measurement of high-dimensional time series but challenges our ability for accurate modeling and predictions. The use of a general mathematical model requires the inclusion of many parameters, which are difficult to be fitted for relatively short high-dimensional time series observed. Here, we propose a novel method to accurately model a high-dimensional time series. Our method extends the barycentric coordinates to high-dimensional phase space by employing linear programming, and allowing the approximation errors explicitly. The extension helps to produce free-running time-series predictions that preserve typical topological, dynamical, and/or geometric characteristics of the underlying attractors more accurately than the radial basis function model that is widely used. The method can be broadly applied, from helping to improve weather forecasting, to creating electronic instruments that sound more natural, and to comprehensively understanding complex biological data.
Approximating high-dimensional dynamics by barycentric coordinates with linear programming
Hirata, Yoshito; Aihara, Kazuyuki; Suzuki, Hideyuki; Shiro, Masanori; Takahashi, Nozomu; Mas, Paloma
2015-01-15
The increasing development of novel methods and techniques facilitates the measurement of high-dimensional time series but challenges our ability for accurate modeling and predictions. The use of a general mathematical model requires the inclusion of many parameters, which are difficult to be fitted for relatively short high-dimensional time series observed. Here, we propose a novel method to accurately model a high-dimensional time series. Our method extends the barycentric coordinates to high-dimensional phase space by employing linear programming, and allowing the approximation errors explicitly. The extension helps to produce free-running time-series predictions that preserve typical topological, dynamical, and/or geometric characteristics of the underlying attractors more accurately than the radial basis function model that is widely used. The method can be broadly applied, from helping to improve weather forecasting, to creating electronic instruments that sound more natural, and to comprehensively understanding complex biological data.
Hagen, Nathan; Kester, Robert T.; Gao, Liang; Tkaczyk, Tomasz S.
2012-01-01
The snapshot advantage is a large increase in light collection efficiency available to high-dimensional measurement systems that avoid filtering and scanning. After discussing this advantage in the context of imaging spectrometry, where the greatest effort towards developing snapshot systems has been made, we describe the types of measurements where it is applicable. We then generalize it to the larger context of high-dimensional measurements, where the advantage increases geometrically with measurement dimensionality. PMID:22791926
EPA released a problem formulation for TBBPA and related chemicals used as a flame retardants in plastics/printed circuit boards for electronics. The goal of this problem formulation was to identify scenarios where further risk analysis may be necessary.
NASA Astrophysics Data System (ADS)
Piotto, G.; Zoccali, M.; King, I. R.; Djorgovski, S. G.; Sosin, C.; Rich, R. M.; Meylan, G.
1999-10-01
We present observations of the center of the Galactic globular cluster NGC 6273, obtained with the Hubble Space Telescope Wide Field Planetary Camera 2 as part of the snapshot program GO-7470. A BV color-magnitude diagram (CMD) for ~28,000 stars is presented and discussed. The most prominent feature of the CMD, identified for the first time in this paper, is the extended horizontal-branch blue tail (EBT) with a clear double-peaked distribution and a significant gap. The EBT of NGC 6273 is compared with the EBTs of seven other globular clusters for which we have a CMD in the same photometric system. From this comparison, we conclude that all the globular clusters in our sample with an EBT show at least one gap along the horizontal branch, which could have similar origins. A comparison with theoretical models suggests that at least some of these gaps may be occurring at a particular value of the stellar mass, common to a number of different clusters. From the CMD of NGC 6273 we obtain a distance modulus (m-M)_V=16.27+/-0.20. We also estimate an average reddening E(B-V)=0.47+/-0.03, though the CMD is strongly affected by differential reddening, with the relative reddening spanning a ΔE(B-V)~0.2 mag in the WFPC2 field. A luminosity function for the evolved stars in NGC 6273 is also presented and compared with the most recent evolutionary models.
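As a worked example of how the quoted numbers relate to a distance, the sketch below converts the apparent distance modulus and mean reddening into a distance in parsecs. The ratio of total to selective extinction, R_V = 3.1, is a conventional value assumed here and not taken from the abstract, and the strong differential reddening noted above means the result is only indicative.

```python
apparent_modulus = 16.27  # (m-M)_V from the abstract
reddening = 0.47          # average E(B-V) from the abstract
R_V = 3.1                 # assumed standard extinction ratio, not from the abstract

# Visual extinction and extinction-corrected (true) distance modulus.
A_V = R_V * reddening
true_modulus = apparent_modulus - A_V

# Distance modulus definition: m - M = 5 * log10(d / 10 pc).
distance_pc = 10 ** (true_modulus / 5 + 1)
print(round(A_V, 3), round(true_modulus, 3), round(distance_pc))
```

With these inputs the distance comes out on the order of 9 kpc.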
Nadeau, Robert Michael
1995-10-01
This document contains information about the characterization and application of microearthquake clusters and fault zone dynamics. Topics discussed include: Seismological studies; fault-zone dynamics; periodic recurrence; scaling of microearthquakes to large earthquakes; implications of fault mechanics and seismic hazards; and wave propagation and temporal changes.
Tripathy, Rohit; Bilionis, Ilias; Gonzalez, Marcial
2016-09-15
Uncertainty quantification (UQ) tasks, such as model calibration, uncertainty propagation, and optimization under uncertainty, typically require several thousand evaluations of the underlying computer codes. To cope with the cost of simulations, one replaces the real response surface with a cheap surrogate based, e.g., on polynomial chaos expansions, neural networks, support vector machines, or Gaussian processes (GP). However, the number of simulations required to learn a generic multivariate response grows exponentially as the input dimension increases. This curse of dimensionality can only be addressed if the response exhibits some special structure that can be discovered and exploited. A wide range of physical responses exhibit a special structure known as an active subspace (AS). An AS is a linear manifold of the stochastic space characterized by maximal response variation. The idea is that one should first identify this low-dimensional manifold, project the high-dimensional input onto it, and then link the projection to the output. If the dimensionality of the AS is low enough, then learning the link function is a much easier problem than the original problem of learning a high-dimensional function. The classic approach to discovering the AS requires gradient information, a fact that severely limits its applicability. Furthermore, and partly because of its reliance on gradients, it is not able to handle noisy observations. The latter is an essential trait if one wants to be able to propagate uncertainty through stochastic simulators, e.g., through molecular dynamics codes. In this work, we develop a probabilistic version of AS which is gradient-free and robust to observational noise. Our approach relies on a novel Gaussian process regression with built-in dimensionality reduction. In particular, the AS is represented as an orthogonal projection matrix that serves as yet another covariance function hyper-parameter to be estimated from the data. To train the
NASA Astrophysics Data System (ADS)
Tripathy, Rohit; Bilionis, Ilias; Gonzalez, Marcial
2016-09-01
Uncertainty quantification (UQ) tasks, such as model calibration, uncertainty propagation, and optimization under uncertainty, typically require several thousand evaluations of the underlying computer codes. To cope with the cost of simulations, one replaces the real response surface with a cheap surrogate based, e.g., on polynomial chaos expansions, neural networks, support vector machines, or Gaussian processes (GP). However, the number of simulations required to learn a generic multivariate response grows exponentially as the input dimension increases. This curse of dimensionality can only be addressed if the response exhibits some special structure that can be discovered and exploited. A wide range of physical responses exhibit a special structure known as an active subspace (AS). An AS is a linear manifold of the stochastic space characterized by maximal response variation. The idea is that one should first identify this low-dimensional manifold, project the high-dimensional input onto it, and then link the projection to the output. If the dimensionality of the AS is low enough, then learning the link function is a much easier problem than the original problem of learning a high-dimensional function. The classic approach to discovering the AS requires gradient information, a fact that severely limits its applicability. Furthermore, and partly because of its reliance on gradients, it is not able to handle noisy observations. The latter is an essential trait if one wants to be able to propagate uncertainty through stochastic simulators, e.g., through molecular dynamics codes. In this work, we develop a probabilistic version of AS which is gradient-free and robust to observational noise. Our approach relies on a novel Gaussian process regression with built-in dimensionality reduction. In particular, the AS is represented as an orthogonal projection matrix that serves as yet another covariance function hyper-parameter to be estimated from the data. To train the
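The "classic approach" mentioned in the abstract above, which relies on gradients, can be sketched for a one-dimensional active subspace: average the outer products of sampled gradients and take the dominant eigenvector. This illustrates the gradient-based baseline only, not the gradient-free Gaussian process method the authors propose. The test function, the direction `w_true`, the uniform sampling, and the use of power iteration are all assumptions made for this example.

```python
import math
import random

random.seed(0)
dim = 5
w_true = [1.0, -2.0, 0.5, 0.0, 1.5]  # hypothetical active direction

def grad_f(x):
    """Gradient of the toy response f(x) = sin(w_true . x)."""
    t = sum(wi * xi for wi, xi in zip(w_true, x))
    return [math.cos(t) * wi for wi in w_true]

# Monte Carlo estimate of C = E[grad f(x) grad f(x)^T] over sampled inputs.
samples = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(200)]
grads = [grad_f(x) for x in samples]
C = [[sum(g[r] * g[c] for g in grads) / len(grads) for c in range(dim)]
     for r in range(dim)]

# Power iteration for the dominant eigenvector of C.
v = [1.0] * dim
for _ in range(100):
    v = [sum(C[r][c] * v[c] for c in range(dim)) for r in range(dim)]
    norm = math.sqrt(sum(vi * vi for vi in v))
    v = [vi / norm for vi in v]

# Cosine of the angle between the recovered and the true direction.
w_norm = math.sqrt(sum(wi * wi for wi in w_true))
alignment = abs(sum(vi * wi for vi, wi in zip(v, w_true))) / w_norm
print(alignment)
```

Since the toy response varies only along `w_true`, the estimated matrix has rank one and the recovered direction lines up with it; with noisy or unavailable gradients this estimate is no longer usable, which is the limitation the abstract points to.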
NASA Astrophysics Data System (ADS)
Wagstaff, Kiri L.
2012-03-01
particular application involves considerations of the kind of data being analyzed, algorithm runtime efficiency, and how much prior knowledge is available about the problem domain, which can dictate the nature of clusters sought. Fundamentally, the clustering method and its representations of clusters carries with it a definition of what a cluster is, and it is important that this be aligned with the analysis goals for the problem at hand. In this chapter, I emphasize this point by identifying for each algorithm the cluster representation as a model, m_j , even for algorithms that are not typically thought of as creating a “model.” This chapter surveys a basic collection of clustering methods useful to any practitioner who is interested in applying clustering to a new data set. The algorithms include k-means (Section 25.2), EM (Section 25.3), agglomerative (Section 25.4), and spectral (Section 25.5) clustering, with side mentions of variants such as kernel k-means and divisive clustering. The chapter also discusses each algorithm’s strengths and limitations and provides pointers to additional in-depth reading for each subject. Section 25.6 discusses methods for incorporating domain knowledge into the clustering process. This chapter concludes with a brief survey of interesting applications of clustering methods to astronomy data (Section 25.7). The chapter begins with k-means because it is both generally accessible and so widely used that understanding it can be considered a necessary prerequisite for further work in the field. EM can be viewed as a more sophisticated version of k-means that uses a generative model for each cluster and probabilistic item assignments. Agglomerative clustering is the most basic form of hierarchical clustering and provides a basis for further exploration of algorithms in that vein. Spectral clustering permits a departure from feature-vector-based clustering and can operate on data sets instead represented as affinity, or similarity
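Since the chapter described above starts from k-means, a minimal sketch of the standard assign-then-update loop may help. Here the cluster representation, the "model" for each cluster, is simply its mean vector. The data points, the function name `kmeans`, and the random initialization from the data are choices made for this illustration, not details from the chapter.

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Lloyd-style k-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initialize centers from the data
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k),
                key=lambda j: sum((p - c) ** 2 for p, c in zip(pt, centers[j])))
            for pt in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, k=2)
print(centers, labels)
```

The result depends on the initial centers in general; on well-separated data like this toy set the two groups are recovered regardless of the starting pair.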
Awate, Suyash P; Yushkevich, Paul; Song, Zhuang; Licht, Daniel; Gee, James C
2009-01-01
The paper presents a novel statistical framework for cortical folding pattern analysis that relies on a rich multivariate descriptor of folding patterns in a region of interest (ROI). The ROI-based approach avoids problems faced by spatial-normalization-based approaches stemming from the severe deficiency of homologous features between typical human cerebral cortices. Unlike typical ROI-based methods that summarize folding complexity or shape by a single number, the proposed descriptor unifies complexity and shape of the surface in a high-dimensional space. In this way, the proposed framework couples the reliability of ROI-based analysis with the richness of the novel cortical folding pattern descriptor. Furthermore, the descriptor can easily incorporate additional variables, e.g. cortical thickness. The paper proposes a novel application of a nonparametric permutation-based approach for statistical hypothesis testing for any multivariate high-dimensional descriptor. While the proposed framework has a rigorous theoretical underpinning, it is straightforward to implement. The framework is validated via simulated and clinical data. The paper is the first to quantitatively evaluate cortical folding in neonates with complex congenital heart disease.
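A permutation test of the general kind mentioned above can be sketched for two groups of multivariate descriptors. The test statistic used here, the Euclidean distance between group mean vectors, and the synthetic data are assumptions for illustration; the abstract does not specify the statistic the authors use.

```python
import math
import random

def mean_vec(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def stat(a, b):
    """Assumed test statistic: distance between the two group means."""
    return math.dist(mean_vec(a), mean_vec(b))

def permutation_pvalue(a, b, n_perm=2000, seed=0):
    rng = random.Random(seed)
    observed = stat(a, b)
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling of subjects
        if stat(pooled[:len(a)], pooled[len(a):]) >= observed:
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (count + 1) / (n_perm + 1)

# Hypothetical descriptors for two groups of eight subjects each.
group_a = [(0.1 * i, 0.0, 0.2 * i) for i in range(8)]
group_b = [(5 + 0.1 * i, 5.0, 5 + 0.2 * i) for i in range(8)]
p_value = permutation_pvalue(group_a, group_b)
print(p_value)
```

Because the null distribution is built by relabeling the observed data, no parametric assumption about the descriptor's distribution is required, which is what makes the approach attractive for high-dimensional descriptors.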
PenPC: A Two-step Approach to Estimate the Skeletons of High Dimensional Directed Acyclic Graphs
Ha, Min Jin; Sun, Wei; Xie, Jichun
2015-01-01
Summary: Estimation of the skeleton of a directed acyclic graph (DAG) is of great importance for understanding the underlying DAG, and causal effects can be assessed from the skeleton when the DAG is not identifiable. We propose a novel method named PenPC to estimate the skeleton of a high-dimensional DAG by a two-step approach. We first estimate the non-zero entries of a concentration matrix using penalized regression, and then fix the difference between the concentration matrix and the skeleton by evaluating a set of conditional independence hypotheses. For high dimensional problems where the number of vertices p is in polynomial or exponential scale of sample size n, we study the asymptotic property of PenPC on two types of graphs: traditional random graphs where all the vertices have the same expected number of neighbors, and scale-free graphs where a few vertices may have a large number of neighbors. As illustrated by extensive simulations and applications on gene expression data of cancer patients, PenPC has higher sensitivity and specificity than the state-of-the-art method, the PC-stable algorithm. PMID:26406114
Chapman, Benjamin P.; Weiss, Alexander; Duberstein, Paul
2016-01-01
Statistical learning theory (SLT) is the statistical formulation of machine learning theory, a body of analytic methods common in “big data” problems. Regression-based SLT algorithms seek to maximize predictive accuracy for some outcome, given a large pool of potential predictors, without overfitting the sample. Research goals in psychology may sometimes call for high dimensional regression. One example is criterion-keyed scale construction, where a scale with maximal predictive validity must be built from a large item pool. Using this as a working example, we first introduce a core principle of SLT methods: minimization of expected prediction error (EPE). Minimizing EPE is fundamentally different than maximizing the within-sample likelihood, and hinges on building a predictive model of sufficient complexity to predict the outcome well, without undue complexity leading to overfitting. We describe how such models are built and refined via cross-validation. We then illustrate how three common SLT algorithms (Supervised Principal Components, Regularization, and Boosting) can be used to construct a criterion-keyed scale predicting all-cause mortality, using a large personality item pool within a population cohort. Each algorithm illustrates a different approach to minimizing EPE. Finally, we consider broader applications of SLT predictive algorithms, both as supportive analytic tools for conventional methods, and as primary analytic tools in discovery phase research. We conclude that despite their differences from the classic null-hypothesis testing approach, or perhaps because of them, SLT methods may hold value as a statistically rigorous approach to exploratory regression. PMID:27454257
Efficient characterization of high-dimensional parameter spaces for systems biology
2011-01-01
Background A biological system's robustness to mutations and its evolution are influenced by the structure of its viable space, the region of its space of biochemical parameters where it can exert its function. In systems with a large number of biochemical parameters, viable regions with potentially complex geometries fill a tiny fraction of the whole parameter space. This hampers explorations of the viable space based on "brute force" or Gaussian sampling. Results We here propose a novel algorithm to characterize viable spaces efficiently. The algorithm combines global and local explorations of a parameter space. The global exploration involves an out-of-equilibrium adaptive Metropolis Monte Carlo method aimed at identifying poorly connected viable regions. The local exploration then samples these regions in detail by a method we call multiple ellipsoid-based sampling. Our algorithm explores efficiently nonconvex and poorly connected viable regions of different test-problems. Most importantly, its computational effort scales linearly with the number of dimensions, in contrast to "brute force" sampling that shows an exponential dependence on the number of dimensions. We also apply this algorithm to a simplified model of a biochemical oscillator with positive and negative feedback loops. A detailed characterization of the model's viable space captures well known structural properties of circadian oscillators. Concretely, we find that model topologies with an essential negative feedback loop and a nonessential positive feedback loop provide the most robust fixed period oscillations. Moreover, the connectedness of the model's viable space suggests that biochemical oscillators with varying topologies can evolve from one another. Conclusions Our algorithm permits an efficient analysis of high-dimensional, nonconvex, and poorly connected viable spaces characteristic of complex biological circuitry. It allows a systematic use of robustness as a tool for model
Chapman, Benjamin P; Weiss, Alexander; Duberstein, Paul R
2016-12-01
Statistical learning theory (SLT) is the statistical formulation of machine learning theory, a body of analytic methods common in "big data" problems. Regression-based SLT algorithms seek to maximize predictive accuracy for some outcome, given a large pool of potential predictors, without overfitting the sample. Research goals in psychology may sometimes call for high dimensional regression. One example is criterion-keyed scale construction, where a scale with maximal predictive validity must be built from a large item pool. Using this as a working example, we first introduce a core principle of SLT methods: minimization of expected prediction error (EPE). Minimizing EPE is fundamentally different from maximizing the within-sample likelihood, and hinges on building a predictive model of sufficient complexity to predict the outcome well, without undue complexity leading to overfitting. We describe how such models are built and refined via cross-validation. We then illustrate how 3 common SLT algorithms (supervised principal components, regularization, and boosting) can be used to construct a criterion-keyed scale predicting all-cause mortality, using a large personality item pool within a population cohort. Each algorithm illustrates a different approach to minimizing EPE. Finally, we consider broader applications of SLT predictive algorithms, both as supportive analytic tools for conventional methods, and as primary analytic tools in discovery phase research. We conclude that despite their differences from the classic null-hypothesis testing approach (or perhaps because of them), SLT methods may hold value as a statistically rigorous approach to exploratory regression.
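As a rough illustration of choosing model complexity by cross-validation, the sketch below estimates prediction error for ridge regression over a grid of penalties. It assumes NumPy and uses synthetic data; the helper names `ridge_fit` and `cv_error`, the penalty grid, and the data-generating model are hypothetical and are not taken from the paper.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Closed-form ridge estimate: (X'X + lam*I)^(-1) X'y (no intercept).
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_error(X, y, lam, k=5):
    # k-fold cross-validated mean squared prediction error for one penalty.
    n = X.shape[0]
    folds = np.array_split(np.arange(n), k)
    errors = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(n), held_out)
        beta = ridge_fit(X[train], y[train], lam)
        resid = y[held_out] - X[held_out] @ beta
        errors.append(np.mean(resid ** 2))
    return float(np.mean(errors))

# Synthetic "large item pool": 40 predictors, only 3 of which matter.
rng = np.random.default_rng(0)
n, p = 60, 40
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + rng.normal(scale=1.0, size=n)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv = {lam: cv_error(X, y, lam) for lam in lambdas}
best_lam = min(cv, key=cv.get)
print("CV error by penalty:", cv)
print("selected penalty:", best_lam)
```

The selected penalty is the one with the lowest held-out error, not the best in-sample fit, which is the distinction between minimizing EPE and maximizing within-sample likelihood.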
Machine learning etudes in astrophysics: selection functions for mock cluster catalogs
Hajian, Amir; Alvarez, Marcelo A.; Bond, J. Richard
2015-01-01
Making mock simulated catalogs is an important component of astrophysical data analysis. Selection criteria for observed astronomical objects are often too complicated to be derived from first principles. However, the existence of an observed group of objects is a well-suited problem for machine learning classification. In this paper we use one-class classifiers to learn the properties of an observed catalog of clusters of galaxies from ROSAT and to pick clusters from mock simulations that resemble the observed ROSAT catalog. We show how this method can be used to study the cross-correlations of thermal Sunyaev-Zel'dovich signals with number density maps of X-ray selected cluster catalogs. The method reduces the bias due to hand-tuning the selection function and is readily scalable to large catalogs with a high-dimensional space of astrophysical features.
A rough set based rational clustering framework for determining correlated genes.
Jeyaswamidoss, Jeba Emilyn; Thangaraj, Kesavan; Ramar, Kadarkarai; Chitra, Muthusamy
2016-06-01
Cluster analysis plays a foremost role in identifying groups of genes that show similar behavior under a set of experimental conditions. Several clustering algorithms have been proposed for identifying gene behaviors and to understand their significance. The principal aim of this work is to develop an intelligent rough clustering technique, which will efficiently remove the irrelevant dimensions in a high-dimensional space and obtain appropriate meaningful clusters. This paper proposes a novel biclustering technique that is based on rough set theory. The proposed algorithm uses correlation coefficient as a similarity measure to simultaneously cluster both the rows and columns of a gene expression data matrix and mean squared residue to generate the initial biclusters. Furthermore, the biclusters are refined to form the lower and upper boundaries by determining the membership of the genes in the clusters using mean squared residue. The algorithm is illustrated with yeast gene expression data and the experiment proves the effectiveness of the method. The main advantage is that it overcomes the problem of selection of initial clusters and also the restriction of one object belonging to only one cluster by allowing overlapping of biclusters.
Metamodel-based global optimization using fuzzy clustering for design space reduction
NASA Astrophysics Data System (ADS)
Li, Yulin; Liu, Li; Long, Teng; Dong, Weili
2013-09-01
High-fidelity analyses are utilized in modern engineering design optimization problems which involve expensive black-box models. For computation-intensive engineering design problems, efficient global optimization methods must be developed to relieve the computational burden. A new metamodel-based global optimization method using fuzzy clustering for design space reduction (MGO-FCR) is presented. The uniformly distributed initial sample points are generated by Latin hypercube design to construct the radial basis function metamodel, whose accuracy improves gradually as the number of sample points increases. The fuzzy c-means method and the Gath-Geva clustering method are applied to divide the design space into several small interesting cluster spaces for low and high dimensional problems, respectively. Modeling efficiency and accuracy are directly related to the design space, so unconcerned spaces are eliminated by the proposed reduction principle and two pseudo reduction algorithms. The reduction principle is developed to determine whether the current design space should be reduced and which space is eliminated. The first pseudo reduction algorithm improves the speed of clustering, while the second pseudo reduction algorithm ensures that the design space is reduced. Through several numerical benchmark functions, comparative studies with the adaptive response surface method, the approximated unimodal region elimination method and mode-pursuing sampling are carried out. The optimization results reveal that this method captures the real global optimum for all the numerical benchmark functions, and the number of function evaluations shows that the efficiency of this method is favorable, especially for high dimensional problems. Based on this global design optimization method, a design optimization of a lifting surface in high speed flow is carried out and this method saves about 10 h compared with genetic algorithms. This method possesses favorable performance on efficiency, robustness
Sciberras, E; Mulraney, M; Heussler, H; Rinehart, N; Schuster, T; Gold, L; Hayes, N; Hiscock, H
2017-01-01
Introduction Up to 70% of children with attention-deficit/hyperactivity disorder (ADHD) experience sleep problems. We have demonstrated the efficacy of a brief behavioural intervention for children with ADHD in a large randomised controlled trial (RCT) and now aim to examine whether this intervention is effective in real-life clinical settings when delivered by paediatricians or psychologists. We will also assess the cost-effectiveness of the intervention. Methods and analysis Children aged 5–12 years with ADHD (n=320) are being recruited for this translational cluster RCT through paediatrician practices in Victoria and Queensland, Australia. Children are eligible if they meet criteria for ADHD, have a moderate/severe sleep problem and meet American Academy of Sleep Medicine criteria for either chronic insomnia disorder or delayed sleep–wake phase disorder; or are experiencing sleep-related anxiety. Clinicians are randomly allocated at the level of the paediatrician to either receive the sleep training or not. The behavioural intervention comprises 2 consultations covering sleep hygiene and standardised behavioural strategies. The primary outcome is change in the proportion of children with moderate/severe sleep problems from moderate/severe to no/mild by parent report at 3 months postintervention. Secondary outcomes include a range of child (eg, sleep severity, ADHD symptoms, quality of life, behaviour, working memory, executive functioning, learning, academic achievement) and primary caregiver (mental health, parenting, work attendance) measures. Analyses will address clustering at the level of the paediatrician using linear mixed effect models adjusting for potential a priori confounding variables. Ethics and dissemination Ethics approval has been granted. Findings will determine whether the benefits of an efficacy trial can be realised more broadly at the population level and will inform the development of clinical guidelines for managing sleep problems
Reduced basis ANOVA methods for partial differential equations with high-dimensional random inputs
NASA Astrophysics Data System (ADS)
Liao, Qifeng; Lin, Guang
2016-07-01
In this paper we present a reduced basis ANOVA approach for partial differential equations (PDEs) with random inputs. The ANOVA method combined with stochastic collocation methods provides model reduction in high-dimensional parameter space through decomposing high-dimensional inputs into unions of low-dimensional inputs. In this work, to further reduce the computational cost, we investigate spatial low-rank structures in the ANOVA-collocation method, and develop efficient spatial model reduction techniques using hierarchically generated reduced bases. We present a general mathematical framework of the methodology, validate its accuracy and demonstrate its efficiency with numerical experiments.
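For orientation, the decomposition of a high-dimensional input into low-dimensional pieces is usually written as the standard ANOVA expansion below. The notation is an assumption made here, not taken from the paper: $u$ is the quantity of interest, $\xi = (\xi_1, \ldots, \xi_N)$ are the random inputs, and each term depends only on the inputs named in its subscript.

```latex
u(\xi) = u_0
  + \sum_{i=1}^{N} u_i(\xi_i)
  + \sum_{1 \le i < j \le N} u_{ij}(\xi_i, \xi_j)
  + \cdots
  + u_{1 2 \cdots N}(\xi_1, \ldots, \xi_N)
```

Truncating the expansion after the low-order terms leaves a sum of functions of one or two inputs each, which is what allows collocation to be carried out on low-dimensional grids.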
Reduced basis ANOVA methods for partial differential equations with high-dimensional random inputs
Liao, Qifeng; Lin, Guang
2016-07-15
In this paper we present a reduced basis ANOVA approach for partial differential equations (PDEs) with random inputs. The ANOVA method combined with stochastic collocation methods provides model reduction in high-dimensional parameter space through decomposing high-dimensional inputs into unions of low-dimensional inputs. In this work, to further reduce the computational cost, we investigate spatial low-rank structures in the ANOVA-collocation method, and develop efficient spatial model reduction techniques using hierarchically generated reduced bases. We present a general mathematical framework of the methodology, validate its accuracy and demonstrate its efficiency with numerical experiments.
NASA Astrophysics Data System (ADS)
Ikeda, K.
1982-08-01
The radius of convergence of the cluster series (expressing the equation of state) is discussed in connection with the distribution of zeros of the grand partition function on the complex $z$ (= activity) plane, by giving various examples of circular distribution. Anomalous phase transitions and phase transitions of third order are considered by showing some examples of circular distribution of zeros. For the ideal Fermi-Dirac gas, the distribution function of zeros, lying on the part of the negative real axis from $-\lambda^{-3}$ to $-\infty$ [where $\lambda = h(2\pi mkT)^{-1/2}$], is calculated, and the function-theoretical structure of the equation of state is investigated. The distribution of zeros for this gas is compared with that for Tonks' gas (having purely repulsive interparticle forces). The two-dimensional and one-dimensional Fermi-Dirac gases are dealt with from the point of view of the distribution of zeros.
Compressively Characterizing High-Dimensional Entangled States with Complementary, Random Filtering
2016-06-30
22 December 2015; published 12 May 2016) The resources needed to conventionally characterize a quantum system are overwhelmingly large for high...dimensional systems. This obstacle may be overcome by abandoning traditional cornerstones of quantum measurement, such as general quantum states, strong...entanglement without a density matrix. Our method represents the sea change unfolding in quantum measurement, where methods influenced by the
High-Dimensional Exploratory Item Factor Analysis by a Metropolis-Hastings Robbins-Monro Algorithm
ERIC Educational Resources Information Center
Cai, Li
2010-01-01
A Metropolis-Hastings Robbins-Monro (MH-RM) algorithm for high-dimensional maximum marginal likelihood exploratory item factor analysis is proposed. The sequence of estimates from the MH-RM algorithm converges with probability one to the maximum likelihood solution. Details on the computer implementation of this algorithm are provided. The…
A practical scheme of the sigma-point Kalman filter for high-dimensional systems
NASA Astrophysics Data System (ADS)
Tang, Youmin; Deng, Ziwang; Manoj, K. K.; Chen, Dake
2014-03-01
When applying a sigma-point Kalman filter (SPKF) to a high-dimensional system such as the oceanic general circulation model (OGCM), a major challenge is to reduce its heavy burden of storage memory and costly computation. In this study, we propose a new scheme for SPKF to address these issues. First, a reduced rank SPKF was introduced on the high-dimensional model state space using the truncated singular value decomposition (TSVD) method (T-SPKF). Second, the relationship of SVDs between the model state space and a low-dimensional ensemble space is used to construct sigma points on the ensemble space (ET-SPKF). As such, this new scheme greatly reduces the demand of memory storage and computational cost and makes the SPKF method applicable to high-dimensional systems. Two numerical models are used to test and validate the ET-SPKF algorithm. The first model is the 40-variable Lorenz model, which has been a test bed of new assimilation algorithms. The second model is a realistic OGCM for the assimilation of actual observations, including Argo and in situ observations over the Pacific Ocean. The experiments show that ET-SPKF is computationally feasible for high-dimensional systems and capable of precise analyses. In particular, for realistic oceanic assimilations, the ET-SPKF algorithm can significantly improve oceanic analysis and improve ENSO prediction. A comparison between the ET-SPKF algorithm and EnKF (ensemble Kalman filter) is also conducted using the OGCM and actual observations.
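The two linear-algebra facts this scheme leans on can be checked numerically. The sketch below is not the ET-SPKF algorithm; it assumes NumPy, and the matrix sizes and the synthetic "anomaly" matrix `A` (state dimension by ensemble size) are hypothetical. It shows a rank-truncated SVD, and that the squared singular values of `A` equal the leading eigenvalues of the small ensemble-space matrix `A.T @ A`.

```python
import numpy as np

def truncated_svd(A, rank):
    # Keep only the leading `rank` singular triplets of A.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :rank], s[:rank], Vt[:rank]

rng = np.random.default_rng(1)
n_state, n_members, rank = 500, 20, 5

# Hypothetical anomaly matrix: approximately rank 5 plus small noise.
A = rng.normal(size=(n_state, rank)) @ rng.normal(size=(rank, n_members))
A += 1e-3 * rng.normal(size=A.shape)

U, s, Vt = truncated_svd(A, rank)
A_r = (U * s) @ Vt  # rank-5 reconstruction
rel_err = np.linalg.norm(A - A_r) / np.linalg.norm(A)

# Ensemble-space route: eigenvalues of the 20 x 20 matrix A'A,
# sorted in descending order, match the squared singular values of A.
evals = np.linalg.eigvalsh(A.T @ A)[::-1][:rank]

print("relative reconstruction error:", rel_err)
print("max mismatch with s**2:", np.max(np.abs(evals - s ** 2)))
```

Working with the 20 x 20 matrix instead of a 500 x 500 state covariance is the kind of saving that makes the low-dimensional ensemble space attractive.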
Kandrup, H.E.; Morrison, P.J. (Inst. for Fusion Studies)
1992-11-01
The Hamiltonian formulation of the Vlasov-Einstein system, which is appropriate for collisionless, self-gravitating systems like clusters of stars that are so dense that gravity must be described by the Einstein equation, is presented. In particular, it is demonstrated explicitly in the context of a 3 + 1 splitting that, for spherically symmetric configurations, the Vlasov-Einstein system can be viewed as a Hamiltonian system, where the dynamics is generated by a noncanonical Poisson bracket, with the Hamiltonian generating the evolution of the distribution function $f$ (a noncanonical variable) being the conserved ADM mass-energy $H_{\mathrm{ADM}}$. An explicit expression is derived for the energy $\delta^{(2)}H_{\mathrm{ADM}}$ associated with an arbitrary phase space preserving perturbation of an arbitrary spherical equilibrium, and it is shown that the equilibrium must be linearly stable if $\delta^{(2)}H_{\mathrm{ADM}}$ is positive semi-definite. Insight into the Hamiltonian reformulation is provided by a description of general finite degree of freedom systems.
Kandrup, H.E.; Morrison, P.J.
1992-11-01
The Hamiltonian formulation of the Vlasov-Einstein system, which is appropriate for collisionless, self-gravitating systems like clusters of stars that are so dense that gravity must be described by the Einstein equation, is presented. In particular, it is demonstrated explicitly in the context of a 3 + 1 splitting that, for spherically symmetric configurations, the Vlasov-Einstein system can be viewed as a Hamiltonian system, where the dynamics is generated by a noncanonical Poisson bracket, with the Hamiltonian generating the evolution of the distribution function $f$ (a noncanonical variable) being the conserved ADM mass-energy $H_{\mathrm{ADM}}$. An explicit expression is derived for the energy $\delta^{(2)}H_{\mathrm{ADM}}$ associated with an arbitrary phase space preserving perturbation of an arbitrary spherical equilibrium, and it is shown that the equilibrium must be linearly stable if $\delta^{(2)}H_{\mathrm{ADM}}$ is positive semi-definite. Insight into the Hamiltonian reformulation is provided by a description of general finite degree of freedom systems.
van der Leij, Aryan; Blok, Henk; de Jong, Peter F.
2010-01-01
This study investigated the role of speech perception accuracy and speed in fluent word decoding of reading disabled (RD) children. A same-different phoneme discrimination task with natural speech tested the perception of single consonants and consonant clusters by young but persistent RD children. RD children were slower than chronological age (CA) controls in recognizing identical sounds, suggesting less distinct phonemic categories. In addition, after controlling for phonetic similarity Tallal’s (Brain Lang 9:182–198, 1980) fast transitions account of RD children’s speech perception problems was contrasted with Studdert-Kennedy’s (Read Writ Interdiscip J 15:5–14, 2002) similarity explanation. Results showed no specific RD deficit in perceiving fast transitions. Both phonetic similarity and fast transitions influenced accurate speech perception for RD children as well as CA controls. PMID:20652455
NASA Astrophysics Data System (ADS)
Denis, Pablo A.
2014-04-01
By means of coupled cluster theory and correlation consistent basis sets we investigated the thermochemistry of dimethyl sulphide (DMS), dimethyl disulphide (DMDS) and four closely related sulphur-containing molecules: CH3SS, CH3S, CH3SH and CH3CH2SH. For the four closed-shell molecules studied, their enthalpies of formation (EOFs) were derived using bomb calorimetry. We found that the deviation of the EOF with respect to experiment was 0.96, 0.65, 1.24 and 1.29 kcal/mol, for CH3SH, CH3CH2SH, DMS and DMDS, respectively, when ΔHf,0 = 65.6 kcal/mol was utilised (JANAF value). However, if the recently proposed ΔHf,0 = 66.2 kcal/mol was used to estimate EOF, the errors dropped to 0.36, 0.05, 0.64 and 0.09 kcal/mol, respectively. In contrast, for the CH3SS radical, a better agreement with experiment was obtained if the 65.6 kcal/mol value was used. To compare with experiment avoiding the problem of the ΔHf,0 (S), we determined the CH3-S and CH3-SS bond dissociation energies (BDEs) in CH3S and CH3SS. At the coupled cluster with singles doubles and perturbative triples correction level of theory, these values are 48.0 and 71.4 kcal/mol, respectively. The latter BDEs are 1.5 and 1.2 kcal/mol larger than the experimental values. The agreement can be considered to be acceptable if we take into consideration that these two radicals present important challenges when determining their EOFs. It is our hope that this work stimulates new studies which help elucidate the problem of the EOF of atomic sulphur.
Efficient uncertainty quantification methodologies for high-dimensional climate land models
Sargsyan, Khachik; Safta, Cosmin; Berry, Robert Dan; Ray, Jaideep; Debusschere, Bert J.; Najm, Habib N.
2011-11-01
In this report, we proposed, examined and implemented approaches for performing efficient uncertainty quantification (UQ) in climate land models. Specifically, we applied a Bayesian compressive sensing framework to polynomial chaos spectral expansions, enhanced it with an iterative algorithm of basis reduction, and investigated the results on test models as well as on the community land model (CLM). Furthermore, we discussed construction of efficient quadrature rules for forward propagation of uncertainties from high-dimensional, constrained input space to output quantities of interest. The work lays grounds for efficient forward UQ for high-dimensional, strongly non-linear and computationally costly climate models. Moreover, to investigate parameter inference approaches, we have applied two variants of the Markov chain Monte Carlo (MCMC) method to a soil moisture dynamics submodel of the CLM. The evaluation of these algorithms gave us a good foundation for further building out the Bayesian calibration framework towards the goal of robust component-wise calibration.
A Shell Multi-dimensional Hierarchical Cubing Approach for High-Dimensional Cube
NASA Astrophysics Data System (ADS)
Zou, Shuzhi; Zhao, Li; Hu, Kongfa
The pre-computation of data cubes is critical for improving the response time of OLAP systems and accelerating data mining tasks in large data warehouses. However, as the sizes of data warehouses grow, the time it takes to perform this pre-computation becomes a significant performance bottleneck. In a high dimensional data warehouse, it might not be practical to build all these cuboids and their indices. In this paper, we propose a shell multi-dimensional hierarchical cubing algorithm, based on an extension of the previous minimal cubing approach. This method partitions the high dimensional data cube into low multi-dimensional hierarchical cubes. Experimental results show that the proposed method is significantly more efficient than other existing cubing methods.
Su, Yapeng; Shi, Qihui; Wei, Wei
2017-02-01
New insights on cellular heterogeneity in the last decade provoke the development of a variety of single cell omics tools at a lightning pace. The resultant high-dimensional single cell data generated by these tools require new theoretical approaches and analytical algorithms for effective visualization and interpretation. In this review, we briefly survey the state-of-the-art single cell proteomic tools with a particular focus on data acquisition and quantification, followed by an elaboration of a number of statistical and computational approaches developed to date for dissecting the high-dimensional single cell data. The underlying assumptions, unique features, and limitations of the analytical methods with the designated biological questions they seek to answer will be discussed. Particular attention will be given to those information theoretical approaches that are anchored in a set of first principles of physics and can yield detailed (and often surprising) predictions.
Luan, Xiaoli; Chen, Qiang; Liu, Fei
2014-09-01
This article presents a new scheme to design a full matrix controller for high dimensional multivariable processes based on the equivalent transfer function (ETF). Unlike the existing ETF method, the proposed ETF is derived directly by exploiting the relationship between the equivalent closed-loop transfer function and the inverse of the open-loop transfer function. Based on the obtained ETF, the full matrix controller is designed utilizing the existing PI tuning rules. The newly proposed ETF model can more accurately represent the original processes. Furthermore, the full matrix centralized controller design method proposed in this paper is applicable to high dimensional multivariable systems with satisfactory performance. Comparison with other multivariable controllers shows that the designed ETF based controller is superior with respect to design complexity and obtained performance.
On landmark selection and sampling in high-dimensional data analysis
Belabbas, Mohamed-Ali; Wolfe, Patrick J.
2009-01-01
In recent years, the spectral analysis of appropriately defined kernel matrices has emerged as a principled way to extract the low-dimensional structure often prevalent in high-dimensional data. Here, we provide an introduction to spectral methods for linear and nonlinear dimension reduction, emphasizing ways to overcome the computational limitations currently faced by practitioners with massive datasets. In particular, a data subsampling or landmark selection process is often employed to construct a kernel based on partial information, followed by an approximate spectral analysis termed the Nyström extension. We provide a quantitative framework to analyse this procedure, and use it to demonstrate algorithmic performance bounds on a range of practical approaches designed to optimize the landmark selection process. We compare the practical implications of these bounds by way of real-world examples drawn from the field of computer vision, whereby low-dimensional manifold structure is shown to emerge from high-dimensional video data streams. PMID:19805446
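A minimal sketch of the Nyström extension with uniformly random landmarks follows. It assumes NumPy and a Gaussian (RBF) kernel; the function names `rbf_kernel` and `nystrom`, the data, the bandwidth `gamma`, and the number of landmarks are hypothetical. Uniform sampling is only the simplest landmark rule and is not claimed to be one of the optimized schemes the paper analyses.

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    # Gaussian kernel matrix between the rows of X and the rows of Y.
    sq = np.sum(X ** 2, 1)[:, None] + np.sum(Y ** 2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom(X, landmarks, gamma):
    # Nystrom approximation K ~ C W^+ C', built from landmark columns only.
    C = rbf_kernel(X, X[landmarks], gamma)  # n x m block
    W = C[landmarks]                        # m x m landmark block
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
K = rbf_kernel(X, X, gamma=0.5)  # full kernel, computed here only for comparison

landmarks = rng.choice(200, size=40, replace=False)
K_approx = nystrom(X, landmarks, gamma=0.5)
rel_err = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
print("relative Frobenius error:", rel_err)
```

Only the 200 x 40 block `C` has to be evaluated to form the approximation, and the landmark rows and columns of the kernel are reproduced exactly; the error is concentrated in the entries between non-landmark points, which is why the choice of landmarks matters.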
Compressively Characterizing High-Dimensional Entangled States with Complementary, Random Filtering
NASA Astrophysics Data System (ADS)
Howland, Gregory A.; Knarr, Samuel H.; Schneeloch, James; Lum, Daniel J.; Howell, John C.
2016-04-01
The resources needed to conventionally characterize a quantum system are overwhelmingly large for high-dimensional systems. This obstacle may be overcome by abandoning traditional cornerstones of quantum measurement, such as general quantum states, strong projective measurement, and assumption-free characterization. Following this reasoning, we demonstrate an efficient technique for characterizing high-dimensional, spatial entanglement with one set of measurements. We recover sharp distributions with local, random filtering of the same ensemble in momentum followed by position—something the uncertainty principle forbids for projective measurements. Exploiting the expectation that entangled signals are highly correlated, we use fewer than 5000 measurements to characterize a 65,536-dimensional state. Finally, we use entropic inequalities to witness entanglement without a density matrix. Our method represents the sea change unfolding in quantum measurement, where methods influenced by the information theory and signal-processing communities replace unscalable, brute-force techniques—a progression previously followed by classical sensing.
Multivariate Boosting for Integrative Analysis of High-Dimensional Cancer Genomic Data.
Xiong, Lie; Kuan, Pei-Fen; Tian, Jianan; Keles, Sunduz; Wang, Sijian
2015-01-01
In this paper, we propose a novel multivariate component-wise boosting method for fitting multivariate response regression models under the high-dimension, low sample size setting. Our method is motivated by modeling the association among different biological molecules based on multiple types of high-dimensional genomic data. Particularly, we are interested in two applications: studying the influence of DNA copy number alterations on RNA transcript levels and investigating the association between DNA methylation and gene expression. For this purpose, we model the dependence of the RNA expression levels on DNA copy number alterations and the dependence of gene expression on DNA methylation through multivariate regression models and utilize a boosting-type method to handle the high dimensionality as well as model the possible nonlinear associations. The performance of the proposed method is demonstrated through simulation studies. Finally, our multivariate boosting method is applied to two breast cancer studies.
Plurigon: three dimensional visualization and classification of high-dimensionality data.
Martin, Bronwen; Chen, Hongyu; Daimon, Caitlin M; Chadwick, Wayne; Siddiqui, Sana; Maudsley, Stuart
2013-01-01
High-dimensionality data is rapidly becoming the norm for biomedical sciences and many other analytical disciplines. Not only is the collection and processing time for such data becoming problematic, but it has become increasingly difficult to form a comprehensive appreciation of high-dimensionality data. Though data analysis methods for coping with multivariate data are well-documented in technical fields such as computer science, little effort is currently being expended to condense data vectors that exist beyond the realm of physical space into an easily interpretable and aesthetic form. To address this important need, we have developed Plurigon, a data visualization and classification tool for the integration of high-dimensionality visualization algorithms with a user-friendly, interactive graphical interface. Unlike existing data visualization methods, which are focused on an ensemble of data points, Plurigon places a strong emphasis upon the visualization of a single data point and its determining characteristics. Multivariate data vectors are represented in the form of a deformed sphere with a distinct topology of hills, valleys, plateaus, peaks, and crevices. The gestalt structure of the resultant Plurigon object generates an easily-appreciable model. User interaction with the Plurigon is extensive; zoom, rotation, axial and vector display, feature extraction, and anaglyph stereoscopy are currently supported. With Plurigon and its ability to analyze high-complexity data, we hope to see a unification of biomedical and computational sciences as well as practical applications in a wide array of scientific disciplines. Increased accessibility to the analysis of high-dimensionality data may increase the number of new discoveries and breakthroughs, ranging from drug screening to disease diagnosis to medical literature mining.
Hirata, Yoshito; Aihara, Kazuyuki
2012-06-01
We introduce a low-dimensional description for a high-dimensional system, which is a piecewise affine model whose state space is divided by permutations. We show that the proposed model tends to predict wind speeds and photovoltaic outputs for the time scales from seconds to 100 s better than by global affine models. In addition, computations using the piecewise affine model are much faster than those of usual nonlinear models such as radial basis function models.
Controlling chaos in low and high dimensional systems with periodic parametric perturbations
Mirus, K.A.; Sprott, J.C.
1998-06-01
The effect of applying a periodic perturbation to an accessible parameter of various chaotic systems is examined. Numerical results indicate that perturbation frequencies near the natural frequencies of the unstable periodic orbits of the chaotic systems can result in limit cycles for relatively small perturbations. Such perturbations can also control or significantly reduce the dimension of high-dimensional systems. Initial application to the control of fluctuations in a prototypical magnetic fusion plasma device will be reviewed.
Prediction of high-dimensional states subject to respiratory motion: a manifold learning approach.
Liu, Wenyang; Sawant, Amit; Ruan, Dan
2016-07-07
The development of high-dimensional imaging systems in image-guided radiotherapy provides important pathways to the ultimate goal of real-time full volumetric motion monitoring. Effective motion management during radiation treatment usually requires prediction to account for system latency and extra signal/image processing time. It is challenging to predict high-dimensional respiratory motion due to the complexity of the motion pattern combined with the curse of dimensionality. Linear dimension reduction methods such as PCA have been used to construct a linear subspace from the high-dimensional data, followed by efficient predictions on the lower-dimensional subspace. In this study, we extend such rationale to a more general manifold and propose a framework for high-dimensional motion prediction with manifold learning, which allows one to learn more descriptive features compared to linear methods with comparable dimensions. Specifically, a kernel PCA is used to construct a proper low-dimensional feature manifold, where accurate and efficient prediction can be performed. A fixed-point iterative pre-image estimation method is used to recover the predicted value in the original state space. We evaluated and compared the proposed method with a PCA-based approach on level-set surfaces reconstructed from point clouds captured by a 3D photogrammetry system. The prediction accuracy was evaluated in terms of root-mean-squared-error. Our proposed method achieved consistently higher prediction accuracy (sub-millimeter) for both 200 ms and 600 ms lookahead lengths compared to the PCA-based approach, and the performance gain was statistically significant.
Prediction of high-dimensional states subject to respiratory motion: a manifold learning approach
NASA Astrophysics Data System (ADS)
Liu, Wenyang; Sawant, Amit; Ruan, Dan
2016-07-01
The development of high-dimensional imaging systems in image-guided radiotherapy provides important pathways to the ultimate goal of real-time full volumetric motion monitoring. Effective motion management during radiation treatment usually requires prediction to account for system latency and extra signal/image processing time. It is challenging to predict high-dimensional respiratory motion due to the complexity of the motion pattern combined with the curse of dimensionality. Linear dimension reduction methods such as PCA have been used to construct a linear subspace from the high-dimensional data, followed by efficient predictions on the lower-dimensional subspace. In this study, we extend such rationale to a more general manifold and propose a framework for high-dimensional motion prediction with manifold learning, which allows one to learn more descriptive features compared to linear methods with comparable dimensions. Specifically, a kernel PCA is used to construct a proper low-dimensional feature manifold, where accurate and efficient prediction can be performed. A fixed-point iterative pre-image estimation method is used to recover the predicted value in the original state space. We evaluated and compared the proposed method with a PCA-based approach on level-set surfaces reconstructed from point clouds captured by a 3D photogrammetry system. The prediction accuracy was evaluated in terms of root-mean-squared-error. Our proposed method achieved consistently higher prediction accuracy (sub-millimeter) for both 200 ms and 600 ms lookahead lengths compared to the PCA-based approach, and the performance gain was statistically significant.
Humoral fingerprinting of immune responses: “super-resolution”, high-dimensional serology
Lau, William W.; Tsang, John S.
2016-01-01
In a recent study, Chung et al. report the development of a high-dimensional approach to assess humoral responses to immune perturbation that goes beyond antibody neutralization and titers. This approach enables the identification of potentially novel correlates and mechanisms of protective immunity to HIV vaccination, thus offering a glimpse of how dense phenotyping of serological responses coupled with bioinformatics analysis could lead to much-sought-after markers of protective vaccination responses. PMID:26830541
Implementation of High Dimensional Feature Map for Segmentation of MR Images
He, Renjie; Sajja, Balasrinivasa Rao; Narayana, Ponnada A.
2005-01-01
A method that considerably reduces the computational and memory complexities associated with the generation of high dimensional (≥3) feature maps for image segmentation is described. The method is based on the K-nearest neighbor (KNN) classification and consists of two parts: preprocessing of feature space and fast KNN. This technique is implemented on a PC and applied for generating three-and four-dimensional feature maps for segmenting MR brain images of multiple sclerosis patients. PMID:16240091
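As a point of reference for the abstract above, the following is a minimal sketch of plain brute-force K-nearest neighbor classification on feature vectors. It is not the authors' preprocessing or fast KNN scheme, whose details the abstract does not give; the function name `knn_classify` and the default `k=3` are assumptions made only for illustration.

```python
import numpy as np

def knn_classify(train_x, train_y, query_x, k=3):
    """Label each query point by majority vote among its k nearest training points."""
    train_x = np.asarray(train_x, dtype=float)
    train_y = np.asarray(train_y)
    query_x = np.asarray(query_x, dtype=float)
    # Pairwise squared Euclidean distances, shape (n_query, n_train).
    d2 = ((query_x[:, None, :] - train_x[None, :, :]) ** 2).sum(axis=2)
    # Indices of the k closest training points for each query point.
    nearest = np.argsort(d2, axis=1)[:, :k]
    labels = []
    for row in nearest:
        # Majority vote; ties go to the smallest label value.
        values, counts = np.unique(train_y[row], return_counts=True)
        labels.append(values[np.argmax(counts)])
    return np.array(labels)
```

The brute-force distance matrix costs memory proportional to the number of queries times the number of training points, which is the kind of cost a fast KNN variant is meant to reduce.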
NASA Astrophysics Data System (ADS)
Mandrà, Salvatore; Zhu, Zheng; Wang, Wenlong; Perdomo-Ortiz, Alejandro; Katzgraber, Helmut G.
2016-08-01
To date, a conclusive detection of quantum speedup remains elusive. Recently, a team by Google Inc. [V. S. Denchev et al., Phys. Rev. X 6, 031015 (2016), 10.1103/PhysRevX.6.031015] proposed a weak-strong cluster model tailored to have tall and narrow energy barriers separating local minima, with the aim to highlight the value of finite-range tunneling. More precisely, results from quantum Monte Carlo simulations as well as the D-Wave 2X quantum annealer scale considerably better than state-of-the-art simulated annealing simulations. Moreover, the D-Wave 2X quantum annealer is ~10^8 times faster than simulated annealing on conventional computer hardware for problems with approximately 10^3 variables. Here, an overview of different sequential, nontailored, as well as specialized tailored algorithms on the Google instances is given. We show that the quantum speedup is limited to sequential approaches and study the typical complexity of the benchmark problems using insights from the study of spin glasses.
High-Dimensional Function Approximation With Neural Networks for Large Volumes of Data.
Andras, Peter
2017-01-25
Approximation of high-dimensional functions is a challenge for neural networks due to the curse of dimensionality. Often the data for which the approximated function is defined resides on a low-dimensional manifold and in principle the approximation of the function over this manifold should improve the approximation performance. It has been shown that projecting the data manifold into a lower dimensional space, followed by the neural network approximation of the function over this space, provides a more precise approximation of the function than the approximation of the function with neural networks in the original data space. However, if the data volume is very large, the projection into the low-dimensional space has to be based on a limited sample of the data. Here, we investigate the nature of the approximation error of neural networks trained over the projection space. We show that such neural networks should have better approximation performance than neural networks trained on high-dimensional data even if the projection is based on a relatively sparse sample of the data manifold. We also find that it is preferable to use a uniformly distributed sparse sample of the data for the purpose of the generation of the low-dimensional projection. We illustrate these results considering the practical neural network approximation of a set of functions defined on high-dimensional data including real world data as well.
Lessons learned in the analysis of high-dimensional data in vaccinomics
Oberg, Ann L.; McKinney, Brett A.; Schaid, Daniel J.; Pankratz, V. Shane; Kennedy, Richard B.; Poland, Gregory A.
2015-01-01
The field of vaccinology is increasingly moving toward the generation, analysis, and modeling of extremely large and complex high-dimensional datasets. We have used data such as these in the development and advancement of the field of vaccinomics to enable prediction of vaccine responses and to develop new vaccine candidates. However, the application of systems biology to what has been termed “big data,” or “high-dimensional data,” is not without significant challenges—chief among them a paucity of gold standard analysis and modeling paradigms with which to interpret the data. In this article, we relate some of the lessons we have learned over the last decade of working with high-dimensional, high-throughput data as applied to the field of vaccinomics. The value of such efforts, however, is ultimately to better understand the immune mechanisms by which protective and non-protective responses to vaccines are generated, and to use this information to support a personalized vaccinology approach in creating better, and safer, vaccines for the public health. PMID:25957070
Towards reliable multi-pathogen biosensors using high-dimensional encoding and decoding techniques
NASA Astrophysics Data System (ADS)
Chakrabartty, Shantanu; Liu, Yang
2008-08-01
Advances in micro-nano-biosensor fabrication are enabling technology that can integrate a large number of biological recognition elements within a single package. As a result, hundreds to millions of tests can be performed simultaneously and can facilitate rapid detection of multiple pathogens in a given sample. However, it is an open question as to how to exploit the high-dimensional nature of the multi-pathogen testing for improving the detection reliability of a typical biosensor system. In this paper, we discuss two complementary high-dimensional encoding/decoding methods for improving the reliability of multi-pathogen detection. The first method uses a support vector machine (SVM) to learn the non-linear detection boundaries in the high-dimensional measurement space. The second method uses a forward error correcting (FEC) technique to synthetically introduce redundant patterns on the biosensor which can then be efficiently decoded. In this paper, experimental and simulation studies are based on a model conductimetric lateral flow immunoassay that uses antigen-antibody interaction in conjunction with a polyaniline transducer to detect presence or absence of pathogen in a given sample. Our results show that both SVM and FEC techniques can improve the detection performance by exploiting cross-reaction amongst multiple recognition sites on the biosensor. This is contrary to many existing methods used in pathogen detection technology where the main emphasis has been reducing the effects of cross-reaction and coupling instead of exploiting them as side information.
Algamal, Zakariya Yahya; Lee, Muhammad Hisyam
2015-12-01
Cancer classification and gene selection in high-dimensional data have been popular research topics in genetics and molecular biology. Recently, adaptive regularized logistic regression using the elastic net regularization, which is called the adaptive elastic net, has been successfully applied in high-dimensional cancer classification to tackle both estimating the gene coefficients and performing gene selection simultaneously. The adaptive elastic net originally used elastic net estimates as the initial weight; however, using this weight may not be preferable for two reasons. First, the elastic net estimator is biased in selecting genes. Second, it does not perform well when the pairwise correlations between variables are not high. Adjusted adaptive regularized logistic regression (AAElastic) is proposed to address these issues and encourage grouping effects simultaneously. The real data results indicate that AAElastic is significantly consistent in selecting genes compared to the other three competitor regularization methods. Additionally, the classification performance of AAElastic is comparable to the adaptive elastic net and better than other regularization methods. Thus, we can conclude that AAElastic is a reliable adaptive regularized logistic regression method in the field of high-dimensional cancer classification.
Runcie, Daniel E.; Mukherjee, Sayan
2013-01-01
Quantitative genetic studies that model complex, multivariate phenotypes are important for both evolutionary prediction and artificial selection. For example, changes in gene expression can provide insight into developmental and physiological mechanisms that link genotype and phenotype. However, classical analytical techniques are poorly suited to quantitative genetic studies of gene expression where the number of traits assayed per individual can reach many thousand. Here, we derive a Bayesian genetic sparse factor model for estimating the genetic covariance matrix (G-matrix) of high-dimensional traits, such as gene expression, in a mixed-effects model. The key idea of our model is that we need consider only G-matrices that are biologically plausible. An organism’s entire phenotype is the result of processes that are modular and have limited complexity. This implies that the G-matrix will be highly structured. In particular, we assume that a limited number of intermediate traits (or factors, e.g., variations in development or physiology) control the variation in the high-dimensional phenotype, and that each of these intermediate traits is sparse – affecting only a few observed traits. The advantages of this approach are twofold. First, sparse factors are interpretable and provide biological insight into mechanisms underlying the genetic architecture. Second, enforcing sparsity helps prevent sampling errors from swamping out the true signal in high-dimensional data. We demonstrate the advantages of our model on simulated data and in an analysis of a published Drosophila melanogaster gene expression data set. PMID:23636737
NASA Astrophysics Data System (ADS)
Regis, Rommel G.; Shoemaker, Christine A.
2013-05-01
This article presents the DYCORS (DYnamic COordinate search using Response Surface models) framework for surrogate-based optimization of HEB (High-dimensional, Expensive, and Black-box) functions that incorporates an idea from the DDS (Dynamically Dimensioned Search) algorithm. The iterate is selected from random trial solutions obtained by perturbing only a subset of the coordinates of the current best solution. Moreover, the probability of perturbing a coordinate decreases as the algorithm reaches the computational budget. Two DYCORS algorithms that use RBF (Radial Basis Function) surrogates are developed: DYCORS-LMSRBF is a modification of the LMSRBF algorithm while DYCORS-DDSRBF is an RBF-assisted DDS. Numerical results on a 14-D watershed calibration problem and on eleven 30-D and 200-D test problems show that DYCORS algorithms are generally better than EGO, DDS, LMSRBF, MADS with kriging, SQP, an RBF-assisted evolution strategy, and a genetic algorithm. Hence, DYCORS is a promising approach for watershed calibration and for HEB optimization.
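The trial-point step described in the abstract above (perturb only a random subset of coordinates of the current best solution, with a perturbation probability that shrinks as the budget is used up) can be sketched roughly as follows. This is an illustrative sketch, not the published algorithm: the logarithmic decay in `perturb_probability` follows the form commonly associated with DDS and is an assumption here, as are the Gaussian perturbation, the clipping to bounds, and all function and parameter names.

```python
import math
import numpy as np

def perturb_probability(n, n_max, p0=1.0):
    """Assumed schedule: probability of perturbing a coordinate at evaluation n.

    Requires 1 <= n <= n_max and n_max > 1; decays from p0 to 0 as n approaches n_max.
    """
    return p0 * (1.0 - math.log(n) / math.log(n_max))

def trial_points(best, n, n_max, sigma, lower, upper, n_trials, rng):
    """Generate trial solutions by perturbing a random subset of coordinates of `best`."""
    best = np.asarray(best, dtype=float)
    d = best.size
    p = perturb_probability(n, n_max)
    trials = np.tile(best, (n_trials, 1))
    for t in range(n_trials):
        mask = rng.random(d) < p
        if not mask.any():
            # Always perturb at least one coordinate.
            mask[rng.integers(d)] = True
        trials[t, mask] += sigma * rng.standard_normal(int(mask.sum()))
    # Simplification: clip to the box instead of any other boundary handling.
    return np.clip(trials, lower, upper)
```

In a surrogate-based loop, the next iterate would be picked from these trial points using the response surface model; that selection step is omitted here because the abstract does not specify it.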
NASA Astrophysics Data System (ADS)
Gavrishchaka, Valeriy; Ganguli, Supriya
2001-10-01
Predictive capabilities of the data-driven models of the systems with complex multi-scale dynamics depend on the quality and amount of the available data and on the algorithms used to extract generalized mappings. Availability of the real-time high-resolution data constantly increases in many fields of practical interest. However, the majority of advanced nonlinear algorithms, including neural networks (NN), can encounter a set of problems called "dimensionality curse" when applied to high-dimensional data. Nonstationarity of the system can also impose significant limitations on the size of training set which leads to poor generalization ability of the model. A very promising algorithm that combines the power of the best nonlinear techniques and tolerance to high-dimensional and incomplete data is support vector machine (SVM). We have summarized and demonstrated advantages of the SVM by applying it to two important and challenging problems: substorm forecasting from solar wind data and volatility forecasting from multi-scale stock and exchange market data. We have shown that performance of the SVM model for substorm prediction can be comparable to or be superior to that of the best existing models including NNs. The advantages of the SVM-based techniques are expected to be much more pronounced in future space-weather forecasting models, which will incorporate many types of high-dimensional, multi-scale input data once real-time availability of this information becomes technologically feasible. We have also demonstrated encouraging performance of the SVM in application to volatility prediction using S&P 500 stock index and USD-DM exchange rate data. Future applications of the SVM in the emerging field of high-frequency finance and its relation to existing models are also discussed.
ERIC Educational Resources Information Center
Dishion, Thomas J.; Ha, Thao; Veronneau, Marie-Helene
2012-01-01
The authors propose that peer relationships should be included in a life history perspective on adolescent problem behavior. Longitudinal analyses were used to examine deviant peer clustering as the mediating link between attenuated family ties, peer marginalization, and social disadvantage in early adolescence and sexual promiscuity in middle…
Matlab Cluster Ensemble Toolbox
Sapio, Vincent De; Kegelmeyer, Philip
2009-04-27
This is a Matlab toolbox for investigating the application of cluster ensembles to data classification, with the objective of improving the accuracy and/or speed of clustering. The toolbox divides the cluster ensemble problem into four areas, providing functionality for each. These include, (1) synthetic data generation, (2) clustering to generate individual data partitions and similarity matrices, (3) consensus function generation and final clustering to generate ensemble data partitioning, and (4) implementation of accuracy metrics. With regard to data generation, Gaussian data of arbitrary dimension can be generated. The kcenters algorithm can then be used to generate individual data partitions by either, (a) subsampling the data and clustering each subsample, or by (b) randomly initializing the algorithm and generating a clustering for each initialization. In either case an overall similarity matrix can be computed using a consensus function operating on the individual similarity matrices. A final clustering can be performed and performance metrics are provided for evaluation purposes.
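One simple way to combine individual partitions into an overall similarity matrix, as the description above mentions, is a co-association matrix: the fraction of partitions in which two points share a cluster. The sketch below is in Python rather than Matlab and does not reproduce the toolbox's own consensus functions, which are not detailed here; the name `co_association` and the choice of plain averaging are assumptions for illustration.

```python
import numpy as np

def co_association(partitions):
    """Average, over partitions, of the indicator that two points share a cluster.

    `partitions` has shape (n_runs, n_points); each row holds cluster labels
    from one clustering run. Label values need not match across runs.
    """
    partitions = np.asarray(partitions)
    n_runs, n_points = partitions.shape
    sim = np.zeros((n_points, n_points))
    for labels in partitions:
        # Individual similarity matrix: 1 where two points have the same label.
        sim += (labels[:, None] == labels[None, :])
    return sim / n_runs
```

A final clustering could then be run on this matrix (for example, by treating `1 - sim` as a distance), which corresponds to the "final clustering" step in the toolbox description.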
Optimal cellular preservation for high dimensional flow cytometric analysis of multicentre trials.
Ng, Amanda A P; Lee, Bernett T K; Teo, Timothy S Y; Poidinger, Michael; Connolly, John E
2012-11-30
High dimensional flow cytometry is best served by centralized facilities. However, the difficulties around sample processing, storage and shipment make large scale international studies impractical. We therefore sought to identify optimized fixation procedures which fully leverage the analytical capability of high dimensional flow cytometry without the need for complex cell processing or a sustained cold chain. A whole blood staining procedure was employed to investigate the applicability of fixatives including Cyto-Chex® Blood Collection tube (Streck), Transfix® (Cytomark), 1% and 4% paraformaldehyde to centralized analysis of field trial samples. Samples were subjected to environmental conditions which mimic field studies, without refrigerated shipment and analyzed across 10 days, based on cell count and marker expression. This study showed that Cyto-Chex® demonstrated the least variability in absolute cell count relative to samples analyzed directly from donors in the absence of fixation. Transfix® was better at preserving the marker expression among all fixatives. However, Transfix® caused markedly increased cell membrane permeabilization and was detrimental to intracellular marker identification. Paraformaldehyde fixation, at either 1% or 4% concentrations, was unfavorable for cell preservation under the conditions tested and thus not recommended. Using these data, we have created an online interactive tool which enables researchers to evaluate the impact of different fixatives on their panel of interest. In this study, we have identified Cyto-Chex® as the optimal cellular preservative for high dimensional flow cytometry in large scale studies for shipped whole blood samples, even in the absence of a sustained cold chain.
Fast time-series prediction using high-dimensional data: evaluating confidence interval credibility.
Hirata, Yoshito
2014-05-01
I propose an index for evaluating the credibility of confidence intervals for future observables predicted from high-dimensional time-series data. The index evaluates the distance from the current state to the data manifold. I demonstrate the index with artificial datasets generated from the Lorenz'96 II model [Lorenz, in Proceedings of the Seminar on Predictability, Vol. 1 (ECMWF, Reading, UK, 1996), p. 1], the Lorenz'96 I model [Hansen and Smith, J. Atmos. Sci. 57, 2859 (2000)].
Scale-Invariant Sparse PCA on High Dimensional Meta-elliptical Data.
Han, Fang; Liu, Han
2014-01-01
We propose a semiparametric method for conducting scale-invariant sparse principal component analysis (PCA) on high dimensional non-Gaussian data. Compared with sparse PCA, our method has weaker modeling assumptions and is more robust to possible data contamination. Theoretically, the proposed method achieves a parametric rate of convergence in estimating the parameter of interest under a flexible semiparametric distribution family; computationally, the proposed method exploits a rank-based procedure and is as efficient as sparse PCA; empirically, our method outperforms most competing methods on both synthetic and real-world datasets.
NASA Astrophysics Data System (ADS)
Chen, Peng; Quarteroni, Alfio
2015-10-01
In this work we develop an adaptive and reduced computational algorithm based on dimension-adaptive sparse grid approximation and reduced basis methods for solving high-dimensional uncertainty quantification (UQ) problems. In order to tackle the computational challenge of "curse of dimensionality" commonly faced by these problems, we employ a dimension-adaptive tensor-product algorithm [16] and propose a verified version to enable effective removal of the stagnation phenomenon besides automatically detecting the importance and interaction of different dimensions. To reduce the heavy computational cost of UQ problems modelled by partial differential equations (PDE), we adopt a weighted reduced basis method [7] and develop an adaptive greedy algorithm in combination with the previous verified algorithm for efficient construction of an accurate reduced basis approximation. The efficiency and accuracy of the proposed algorithm are demonstrated by several numerical experiments.
Cluster headache Overview By Mayo Clinic Staff Cluster headaches, which occur in cyclical patterns or clusters, are one of the most painful types of headache. A cluster headache commonly awakens you ...
Validi, AbdoulAhad
2014-03-01
This study introduces a non-intrusive approach in the context of low-rank separated representation to construct a surrogate of high-dimensional stochastic functions, e.g., PDEs/ODEs, in order to decrease the computational cost of Markov Chain Monte Carlo simulations in Bayesian inference. The surrogate model is constructed via a regularized alternative least-square regression with Tikhonov regularization using a roughening matrix computing the gradient of the solution, in conjunction with a perturbation-based error indicator to detect optimal model complexities. The model approximates a vector of a continuous solution at discrete values of a physical variable. The required number of random realizations to achieve a successful approximation linearly depends on the function dimensionality. The computational cost of the model construction is quadratic in the number of random inputs, which potentially tackles the curse of dimensionality in high-dimensional stochastic functions. Furthermore, this vector-valued separated representation-based model, in comparison to the available scalar-valued case, leads to a significant reduction in the cost of approximation by an order of magnitude equal to the vector size. The performance of the method is studied through its application to three numerical examples including a 41-dimensional elliptic PDE and a 21-dimensional cavity flow.
A Robust Supervised Variable Selection for Noisy High-Dimensional Data
Kalina, Jan; Schlenker, Anna
2015-01-01
The Minimum Redundancy Maximum Relevance (MRMR) approach to supervised variable selection represents a successful methodology for dimensionality reduction, which is suitable for high-dimensional data observed in two or more different groups. Various available versions of the MRMR approach have been designed to search for variables with the largest relevance for a classification task while controlling for redundancy of the selected set of variables. However, usual relevance and redundancy criteria have the disadvantages of being too sensitive to the presence of outlying measurements and/or being inefficient. We propose a novel approach called Minimum Regularized Redundancy Maximum Robust Relevance (MRRMRR), suitable for noisy high-dimensional data observed in two groups. It combines principles of regularization and robust statistics. Particularly, redundancy is measured by a new regularized version of the coefficient of multiple correlation and relevance is measured by a highly robust correlation coefficient based on the least weighted squares regression with data-adaptive weights. We compare various dimensionality reduction methods on three real data sets. To investigate the influence of noise or outliers on the data, we perform the computations also for data artificially contaminated by severe noise of various forms. The experimental results confirm the robustness of the method with respect to outliers. PMID:26137474
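For orientation, the basic (non-robust) MRMR idea referred to in the abstract above can be sketched as a greedy search: start with the most relevant variable, then repeatedly add the variable with the best trade-off between relevance to the group label and redundancy with the variables already chosen. The sketch below uses absolute Pearson correlation for both criteria and a simple difference score; these are assumptions for illustration and are not the regularized and robust measures proposed in MRRMRR. The name `mrmr_select` is hypothetical.

```python
import numpy as np

def mrmr_select(x, y, n_select):
    """Greedy MRMR-style selection using absolute Pearson correlation (illustrative).

    x: (n_samples, n_variables) data matrix; y: numeric group labels (e.g. 0/1).
    Returns the indices of the selected variables in order of selection.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    p = x.shape[1]
    # Relevance: |correlation| between each variable and the label.
    relevance = np.array([abs(np.corrcoef(x[:, j], y)[0, 1]) for j in range(p)])
    # Redundancy: |correlation| between pairs of variables.
    corr = np.abs(np.corrcoef(x, rowvar=False))
    selected = [int(np.argmax(relevance))]
    while len(selected) < n_select:
        remaining = [j for j in range(p) if j not in selected]
        # Score = relevance minus mean redundancy with the already selected set.
        scores = [relevance[j] - corr[j, selected].mean() for j in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected
```

Because Pearson correlation is sensitive to outlying measurements, this plain version illustrates exactly the weakness that the abstract says the robust variant is designed to address.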
Discrimination and synthesis of recursive quantum states in high-dimensional Hilbert spaces
NASA Astrophysics Data System (ADS)
Simon, David S.; Fitzpatrick, Casey A.; Sergienko, Alexander V.
2015-04-01
We propose an interferometric method for statistically discriminating between nonorthogonal states in high-dimensional Hilbert spaces for use in quantum information processing. The method is illustrated for the case of photon orbital angular momentum (OAM) states. These states belong to pairs of bases that are mutually unbiased on a sequence of two-dimensional subspaces of the full Hilbert space, but the vectors within the same basis are not necessarily orthogonal to each other. Over multiple trials, this method allows distinguishing OAM eigenstates from superpositions of multiple such eigenstates. Variations of the same method are then shown to be capable of preparing and detecting arbitrary linear combinations of states in Hilbert space. One further variation allows the construction of chains of states obeying recurrence relations on the Hilbert space itself, opening a new range of possibilities for more abstract information-coding algorithms to be carried out experimentally in a simple manner. Among other applications, we show that this approach provides a simplified means of switching between pairs of high-dimensional mutually unbiased OAM bases.
Multiple Group Testing Procedures for Analysis of High-Dimensional Genomic Data
Ko, Hyoseok; Kim, Kipoong
2016-01-01
In genetic association studies with high-dimensional genomic data, multiple group testing procedures are often required in order to identify disease/trait-related genes or genetic regions, where multiple genetic sites or variants are located within the same gene or genetic region. However, statistical testing procedures based on an individual test suffer from multiple testing issues such as the control of family-wise error rate and dependent tests. Moreover, detecting only a few genes associated with a phenotype outcome among tens of thousands of genes is of main interest in genetic association studies. For this reason, regularization procedures, where a phenotype outcome is regressed on all genomic markers and the regression coefficients are then estimated based on a penalized likelihood, have been considered a good alternative approach to the analysis of high-dimensional genomic data. However, the selection performance of regularization procedures has rarely been compared with that of statistical group testing procedures. In this article, we performed extensive simulation studies where commonly used group testing procedures such as principal component analysis, Hotelling's T^2 test, and permutation test are compared with group lasso (least absolute shrinkage and selection operator) in terms of true positive selection. Also, we applied all methods considered in simulation studies to identify genes associated with ovarian cancer from over 20,000 genetic sites generated from Illumina Infinium HumanMethylation27K Beadchip. We found a large discrepancy in the selected genes between multiple group testing procedures and group lasso. PMID:28154510
Quantum secret sharing based on modulated high-dimensional time-bin entanglement
Takesue, Hiroki; Inoue, Kyo
2006-07-15
We propose a scheme for quantum secret sharing (QSS) that uses a modulated high-dimensional time-bin entanglement. By modulating the relative phase randomly by {0, π}, a sender with the entanglement source can randomly change the sign of the correlation of the measurement outcomes obtained by two distant recipients. The two recipients must cooperate if they are to obtain the sign of the correlation, which is used as a secret key. We show that our scheme is secure against intercept-and-resend (IR) and beam splitting attacks by an outside eavesdropper thanks to the nonorthogonality of high-dimensional time-bin entangled states. We also show that a cheating attempt based on an IR attack by one of the recipients can be detected by changing the dimension of the time-bin entanglement randomly and inserting two 'vacant' slots between the packets. Then, cheating attempts can be detected by monitoring the count rate in the vacant slots. The proposed scheme has better experimental feasibility than previously proposed entanglement-based QSS schemes.
Finite-key analysis for time-energy high-dimensional quantum key distribution
NASA Astrophysics Data System (ADS)
Niu, Murphy Yuezhen; Xu, Feihu; Shapiro, Jeffrey H.; Furrer, Fabian
2016-11-01
Time-energy high-dimensional quantum key distribution (HD-QKD) leverages the high-dimensional nature of time-energy entangled biphotons and the loss tolerance of single-photon detection to achieve long-distance key distribution with high photon information efficiency. To date, the general-attack security of HD-QKD has only been proven in the asymptotic regime, while HD-QKD's finite-key security has only been established for a limited set of attacks. Here we fill this gap by providing a rigorous HD-QKD security proof for general attacks in the finite-key regime. Our proof relies on an entropic uncertainty relation that we derive for time and conjugate-time measurements that use dispersive optics, and our analysis includes an efficient decoy-state protocol in its parameter estimation. We present numerically evaluated secret-key rates illustrating the feasibility of secure and composable HD-QKD over metropolitan-area distances when the system is subjected to the most powerful eavesdropping attack.
A Joint Modeling Approach for Right Censored High Dimensional Multivariate Longitudinal Data
Jaffa, Miran A.; Gebregziabher, Mulugeta; Jaffa, Ayad A
2015-01-01
Analysis of multivariate longitudinal data becomes complicated when the outcomes are of high dimension and informative right censoring is prevailing. Here, we propose a likelihood based approach for high dimensional outcomes wherein we jointly model the censoring process along with the slopes of the multivariate outcomes in the same likelihood function. We utilized a pseudo-likelihood function to generate parameter estimates for the population slopes and Empirical Bayes estimates for the individual slopes. The proposed approach was applied to jointly model longitudinal measures of blood urea nitrogen, plasma creatinine, and estimated glomerular filtration rate which are key markers of kidney function in a cohort of renal transplant patients followed from kidney transplant to kidney failure. Feasibility of the proposed joint model for high dimensional multivariate outcomes was successfully demonstrated and its performance was compared to that of a pairwise bivariate model. Our simulation study results suggested that there was a significant reduction in bias and mean squared errors associated with the joint model compared to the pairwise bivariate model. PMID:25688330
Does the Cerebral Cortex Exploit High-Dimensional, Non-linear Dynamics for Information Processing?
Singer, Wolf; Lazar, Andreea
2016-01-01
The discovery of stimulus induced synchronization in the visual cortex suggested the possibility that the relations among low-level stimulus features are encoded by the temporal relationship between neuronal discharges. In this framework, temporal coherence is considered a signature of perceptual grouping. This insight triggered a large number of experimental studies which sought to investigate the relationship between temporal coordination and cognitive functions. While some core predictions derived from the initial hypothesis were confirmed, these studies also revealed a rich dynamical landscape beyond simple coherence whose role in signal processing is still poorly understood. In this paper, a framework is presented which establishes links between the various manifestations of cortical dynamics by assigning specific coding functions to low-dimensional dynamic features such as synchronized oscillations and phase shifts on the one hand and high-dimensional non-linear, non-stationary dynamics on the other. The data serving as basis for this synthetic approach have been obtained with chronic multisite recordings from the visual cortex of anesthetized cats and from monkeys trained to solve cognitive tasks. It is proposed that the low-dimensional dynamics characterized by synchronized oscillations and large-scale correlations are substates that represent the results of computations performed in the high-dimensional state-space provided by recurrently coupled networks. PMID:27713697
Pang, Herbert; Jung, Sin-Ho
2013-04-01
A variety of prediction methods are used to relate high-dimensional genome data with a clinical outcome using a prediction model. Once a prediction model is developed from a data set, it should be validated using a resampling method or an independent data set. Although the existing prediction methods have been intensively evaluated by many investigators, there has not been a comprehensive study investigating the performance of the validation methods, especially with a survival clinical outcome. Understanding the properties of the various validation methods can allow researchers to perform more powerful validations while controlling for type I error. In addition, sample size calculation strategy based on these validation methods is lacking. We conduct extensive simulations to examine the statistical properties of these validation strategies. In both simulations and a real data example, we have found that 10-fold cross-validation with permutation gave the best power while controlling type I error close to the nominal level. Based on this, we have also developed a sample size calculation method that will be used to design a validation study with a user-chosen combination of prediction. Microarray and genome-wide association studies data are used as illustrations. The power calculation method in this presentation can be used for the design of any biomedical studies involving high-dimensional data and survival outcomes.
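The general shape of a cross-validation test with permutation, as named in the abstract above, can be sketched as follows. This is a simplified illustration under several assumptions: it uses a continuous outcome and ordinary least squares instead of a survival outcome and a high-dimensional prediction method, takes the correlation between cross-validated predictions and the outcome as the test statistic, and uses hypothetical function names. It does not reproduce the authors' procedure or their sample size calculation.

```python
import numpy as np

def cv_predictions(x, y, n_folds, rng):
    """Out-of-fold predictions from a least-squares model with intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    order = rng.permutation(n)
    preds = np.empty(n)
    for fold in np.array_split(order, n_folds):
        train = np.setdiff1d(order, fold)
        design = np.column_stack([np.ones(train.size), x[train]])
        coef, *_ = np.linalg.lstsq(design, y[train], rcond=None)
        preds[fold] = np.column_stack([np.ones(fold.size), x[fold]]) @ coef
    return preds

def cv_permutation_pvalue(x, y, n_folds=10, n_perm=200, seed=0):
    """Permutation p-value for the cross-validated prediction statistic."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    observed = np.corrcoef(cv_predictions(x, y, n_folds, rng), y)[0, 1]
    exceed = 0
    for _ in range(n_perm):
        # Break the link between predictors and outcome, then repeat the whole CV.
        y_perm = rng.permutation(y)
        stat = np.corrcoef(cv_predictions(x, y_perm, n_folds, rng), y_perm)[0, 1]
        exceed += int(stat >= observed)
    return observed, (1 + exceed) / (1 + n_perm)
```

Repeating the entire cross-validation inside each permutation is what keeps the null distribution honest: the model is refit on permuted outcomes rather than reusing predictions from the original fit.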
Prediction of Incident Diabetes in the Jackson Heart Study Using High-Dimensional Machine Learning
Casanova, Ramon; Saldana, Santiago; Simpson, Sean L.; Lacy, Mary E.; Subauste, Angela R.; Blackshear, Chad; Wagenknecht, Lynne; Bertoni, Alain G.
2016-01-01
Statistical models to predict incident diabetes are often based on limited variables. Here we pursued two main goals: 1) investigate the relative performance of a machine learning method such as Random Forests (RF) for detecting incident diabetes in a high-dimensional setting defined by a large set of observational data, and 2) uncover potential predictors of diabetes. The Jackson Heart Study collected data at baseline and in two follow-up visits from 5,301 African Americans. We excluded those with baseline diabetes and no follow-up, leaving 3,633 individuals for analyses. Over a mean 8-year follow-up, 584 participants developed diabetes. The full RF model evaluated 93 variables including demographic, anthropometric, blood biomarker, medical history, and echocardiogram data. We also used RF metrics of variable importance to rank variables according to their contribution to diabetes prediction. We implemented other models based on logistic regression and RF where features were preselected. The RF full model performance was similar (AUC = 0.82) to those more parsimonious models. The top-ranked variables according to RF included hemoglobin A1C, fasting plasma glucose, waist circumference, adiponectin, c-reactive protein, triglycerides, leptin, left ventricular mass, high-density lipoprotein cholesterol, and aldosterone. This work shows the potential of RF for incident diabetes prediction while dealing with high-dimensional data. PMID:27727289
The end of gating? An introduction to automated analysis of high dimensional cytometry data.
Mair, Florian; Hartmann, Felix J; Mrdjen, Dunja; Tosevski, Vinko; Krieg, Carsten; Becher, Burkhard
2016-01-01
Ever since its invention half a century ago, flow cytometry has been a major tool for single-cell analysis, fueling advances in our understanding of a variety of complex cellular systems, in particular the immune system. The last decade has witnessed significant technical improvements in available cytometry platforms, such that more than 20 parameters can be analyzed on a single-cell level by fluorescence-based flow cytometry. The advent of mass cytometry has pushed this limit up to, currently, 50 parameters. However, traditional analysis approaches for the resulting high-dimensional datasets, such as gating on bivariate dot plots, have proven to be inefficient. Although a variety of novel computational analysis approaches to interpret these datasets are already available, they have not yet made it into the mainstream and remain largely unknown to many immunologists. Therefore, this review aims at providing a practical overview of novel analysis techniques for high-dimensional cytometry data including SPADE, t-SNE, Wanderlust, Citrus, and PhenoGraph, and how these applications can be used advantageously not only for the most complex datasets, but also for standard 14-parameter cytometry datasets.
Array-representation Integration Factor Method for High-dimensional Systems
Wang, Dongyong; Zhang, Lei; Nie, Qing
2013-01-01
High order spatial derivatives and stiff reactions often introduce severe temporal stability constraints on the time step in numerical methods. The implicit integration factor (IIF) method, which treats diffusion exactly and reaction implicitly, provides excellent stability properties with good efficiency by decoupling the treatment of reactions and diffusions. One major challenge for IIF is storage and calculation of the potentially dense exponential matrices of the sparse discretization matrices resulting from the linear differential operators. Motivated by a compact representation for IIF (cIIF) for Laplacian operators in two and three dimensions, we introduce an array-representation technique for efficient handling of exponential matrices from a general linear differential operator that may include cross-derivatives and non-constant diffusion coefficients. In this approach, exponentials are only needed for matrices of small size that depend only on the order of derivatives and number of discretization points, independent of the size of spatial dimensions. This method is particularly advantageous for high dimensional systems, and it can be easily incorporated with IIF to preserve the excellent stability of IIF. Implementation and direct simulations of the array-representation compact IIF (AcIIF) on systems, such as Fokker-Planck equations in three and four dimensions and chemical master equations, in addition to reaction-diffusion equations, show efficiency, accuracy, and robustness of the new method. Such array-representation based methods may have broad applications for simulating other complex systems involving high-dimensional data. PMID:24415797
Unbiased Prediction and Feature Selection in High-Dimensional Survival Regression
Laimighofer, Michael; Krumsiek, Jan; Theis, Fabian J.
2016-01-01
With widespread availability of omics profiling techniques, the analysis and interpretation of high-dimensional omics data, for example, for biomarkers, is becoming an increasingly important part of clinical medicine because such datasets constitute a promising resource for predicting survival outcomes. However, early experience has shown that biomarkers often generalize poorly. Thus, it is crucial that models are not overfitted and give accurate results with new data. In addition, reliable detection of multivariate biomarkers with high predictive power (feature selection) is of particular interest in clinical settings. We present an approach that addresses both aspects in high-dimensional survival models. Within a nested cross-validation (CV), we fit a survival model, evaluate a dataset in an unbiased fashion, and select features with the best predictive power by applying a weighted combination of CV runs. We evaluate our approach using simulated toy data, as well as three breast cancer datasets, to predict the survival of breast cancer patients after treatment. In all datasets, we achieve more reliable estimation of predictive power for unseen cases and better predictive performance compared to the standard CoxLasso model. Taken together, we present a comprehensive and flexible framework for survival models, including performance estimation, final feature selection, and final model construction. The proposed algorithm is implemented in an open source R package (SurvRank) available on CRAN. PMID:26894327
Multivariate multidistance tests for high-dimensional low sample size case-control studies.
Marozzi, Marco
2015-04-30
A class of multivariate tests for case-control studies with high-dimensional low sample size data and with complex dependence structure, which are common in medical imaging and molecular biology, is proposed. The tests can be applied when the number of variables is much larger than the number of subjects and when the underlying population distributions are heavy-tailed or skewed. As a motivating application, we consider a case-control study where phase-contrast cinematic cardiovascular magnetic resonance imaging has been used to compare many cardiovascular characteristics of young healthy smokers and young healthy non-smokers. The tests are based on the combination of tests on interpoint distances. It is theoretically proved that the tests are exact, unbiased and consistent. It is shown that the tests are very powerful under normal, heavy-tailed and skewed distributions. The tests can also be applied to case-control studies with high-dimensional low sample size data from other medical imaging techniques (like computed tomography or X-ray radiography), chemometrics and microarray data (proteomics and transcriptomics).
2006-09-01
extending manifold learning to stratification learning. 1 Introduction Data in high dimensions is becoming ubiquitous, from image analysis and finances to...hard clustering technique based on the fractal dimension (box-counting). Starting from an initial clustering, they incrementally add points into the...References [1] D. Barbara and P. Chen. Using the fractal dimension to cluster datasets. In Proceedings of the Sixth ACM SIGKDD, pages 260–264, 2000. [2
Online clustering algorithms for radar emitter classification.
Liu, Jun; Lee, Jim P Y; Li, Lingjie; Luo, Zhi-Quan; Wong, K Max
2005-08-01
Radar emitter classification is a special application of data clustering for classifying unknown radar emitters from received radar pulse samples. The main challenges of this task are the high dimensionality of radar pulse samples, small sample group size, and closely located radar pulse clusters. In this paper, two new online clustering algorithms are developed for radar emitter classification: One is model-based using the Minimum Description Length (MDL) criterion and the other is based on competitive learning. Computational complexity is analyzed for each algorithm and then compared. Simulation results show the superior performance of the model-based algorithm over competitive learning in terms of better classification accuracy, flexibility, and stability.
Multiple frame cluster tracking
NASA Astrophysics Data System (ADS)
Gadaleta, Sabino; Klusman, Mike; Poore, Aubrey; Slocumb, Benjamin J.
2002-08-01
Tracking a large number of closely spaced objects is a challenging problem for any tracking system. In missile defense systems, countermeasures in the form of debris, chaff, spent fuel, and balloons can overwhelm tracking systems that track only individual objects. Thus, tracking these groups or clusters of objects followed by transitions to individual object tracking (if and when individual objects separate from the groups) is a necessary capability for a robust and real-time tracking system. The objectives of this paper are to describe the group tracking problem in the context of multiple frame target tracking and to formulate a general assignment problem for the multiple frame cluster/group tracking problem. The proposed approach forms multiple clustering hypotheses on each frame of data and bases individual frame clustering decisions on the information from multiple frames of data in much the same way that MFA or MHT work for individual object tracking. The formulation of the assignment problem for resolved object tracking and candidate clustering methods for use in multiple frame cluster tracking are briefly reviewed. Then, three different formulations are presented for the combination of multiple clustering hypotheses on each frame of data and the multiple frame assignments of clusters between frames.
Steinwand, Daniel R.; Maddox, Brian; Beckmann, Tim; Hamer, George
2003-01-01
Beowulf clusters can provide a cost-effective way to compute numerical models and process large amounts of remote sensing image data. Usually a Beowulf cluster is designed to accomplish a specific set of processing goals, and processing is very efficient when the problem remains inside the constraints of the original design. There are cases, however, when one might wish to compute a problem that is beyond the capacity of the local Beowulf system. In these cases, spreading the problem to multiple clusters or to other machines on the network may provide a cost-effective solution.
An Efficient Initialization Method for K-Means Clustering of Hyperspectral Data
NASA Astrophysics Data System (ADS)
Alizade Naeini, A.; Jamshidzadeh, A.; Saadatseresht, M.; Homayouni, S.
2014-10-01
K-means is definitely the most frequently used partitional clustering algorithm in the remote sensing community. Unfortunately, due to its gradient descent nature, this algorithm is highly sensitive to the initial placement of cluster centers. This problem worsens for high-dimensional data such as hyperspectral remotely sensed imagery. To tackle this problem, in this paper, the spectral signatures of the endmembers in the image scene are extracted and used as the initial positions of the cluster centers. For this purpose, in the first step, a Neyman-Pearson detection theory based eigen-thresholding method (i.e., the HFC method) has been employed to estimate the number of endmembers in the image. Afterwards, the spectral signatures of the endmembers are obtained using the Minimum Volume Enclosing Simplex (MVES) algorithm. Eventually, these spectral signatures are used to initialize the k-means clustering algorithm. The proposed method is implemented on a hyperspectral dataset acquired by the ROSIS sensor with 103 spectral bands over the Pavia University campus, Italy. For comparative evaluation, two other commonly used initialization methods (i.e., Bradley & Fayyad (BF) and Random methods) are implemented and compared. The confusion matrix, overall accuracy and Kappa coefficient are employed to assess the methods' performance. The evaluations demonstrate that the proposed solution outperforms the other initialization methods and can be applied for unsupervised classification of hyperspectral imagery for landcover mapping.
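The final step described in this abstract, running k-means from caller-supplied centers rather than random ones, can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function name `kmeans`, the toy two-band "pixels", and the `seeds` list (standing in for endmember signatures that HFC and MVES would provide) are all assumptions made for the example.

```python
from math import dist


def kmeans(points, init_centers, max_iter=100):
    """Lloyd's algorithm started from caller-supplied initial centers."""
    centers = [tuple(c) for c in init_centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda k: dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                mean = tuple(sum(col) / len(members) for col in zip(*members))
                new_centers.append(mean)
            else:
                new_centers.append(old)  # an empty cluster keeps its center
        if new_centers == centers:  # converged: no center moved
            break
        centers = new_centers
    return centers, labels


# Toy two-band spectra; the seeds are hypothetical stand-ins for
# extracted endmember signatures.
pixels = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9)]
seeds = [(0.0, 0.0), (1.0, 1.0)]
centers, labels = kmeans(pixels, seeds)
```

Because the starting centers are fixed, repeated runs give the same partition, which is the property the initialization strategy is meant to exploit.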
Pal, Ranjan; Chelmis, Charalampos; Aman, Saima; Frincu, Marc; Prasanna, Viktor
2015-07-15
The advent of smart meters and advanced communication infrastructures catalyzes numerous smart grid applications such as dynamic demand response, and paves the way to solve challenging research problems in sustainable energy consumption. The space of solution possibilities is restricted primarily by the huge amount of generated data requiring considerable computational resources and efficient algorithms. To overcome this Big Data challenge, data clustering techniques have been proposed. Current approaches, however, do not scale in the face of the “increasing dimensionality” problem where a cluster point is represented by the entire customer consumption time series. To overcome this aspect we first rethink the way cluster points are created and designed, and then design an efficient online clustering technique for demand response (DR) in order to analyze high volume, high dimensional energy consumption time series data at scale, and on the fly. Our online algorithm is randomized in nature, and provides optimal performance guarantees in a computationally efficient manner. Unlike prior work we (i) study the consumption properties of the whole population simultaneously rather than developing individual models for each customer separately, claiming it to be a ‘killer’ approach that breaks the “curse of dimensionality” in online time series clustering, and (ii) provide tight performance guarantees in theory to validate our approach. Our insights are driven by the field of sociology, where collective behavior often emerges as the result of individual patterns and lifestyles.
High-dimensional quantum key distribution with the entangled single-photon-added coherent state
NASA Astrophysics Data System (ADS)
Wang, Yang; Bao, Wan-Su; Bao, Hai-Ze; Zhou, Chun; Jiang, Mu-Sheng; Li, Hong-Wei
2017-04-01
High-dimensional quantum key distribution (HD-QKD) can generate more secure bits for one detection event so that it can achieve long distance key distribution with a high secret key capacity. In this Letter, we present a decoy state HD-QKD scheme with the entangled single-photon-added coherent state (ESPACS) source. We present two tight formulas to estimate the single-photon fraction of postselected events and Eve's Holevo information and derive lower bounds on the secret key capacity and the secret key rate of our protocol. We also present finite-key analysis for our protocol by using the Chernoff bound. Our numerical results show that our protocol using one decoy state can perform better than the previous HD-QKD protocol with the spontaneous parametric down conversion (SPDC) source using two decoy states. Moreover, when considering finite resources, the advantage is more obvious.
A two-state hysteresis model from high-dimensional friction
Biswas, Saurabh; Chatterjee, Anindya
2015-01-01
In prior work (Biswas & Chatterjee 2014 Proc. R. Soc. A 470, 20130817 (doi:10.1098/rspa.2013.0817)), we developed a six-state hysteresis model from a high-dimensional frictional system. Here, we use a more intuitively appealing frictional system that resembles one studied earlier by Iwan. The basis functions now have a simple analytical description. The number of states required decreases further, from six to the theoretical minimum of two. The number of fitted parameters is reduced by an order of magnitude, to just six. An explicit and faster numerical solution method is developed. Parameter fitting to match different specified hysteresis loops is demonstrated. In summary, a new two-state model of hysteresis is presented that is ready for practical implementation. Essential Matlab code is provided. PMID:26587279
Testing interaction between treatment and high-dimensional covariates in randomized clinical trials.
Callegaro, Andrea; Spiessens, Bart; Dizier, Benjamin; Montoya, Fernando U; van Houwelingen, Hans C
2016-10-20
In this paper, we considered different methods to test the interaction between treatment and a potentially large number (p) of covariates in randomized clinical trials. The simplest approach was to fit univariate (marginal) models and to combine the univariate statistics or p-values (e.g., minimum p-value). Another possibility was to reduce the dimension of the covariates using the principal components (PCs) and to test the interaction between treatment and PCs. Finally, we considered the Goeman global test applied to the high-dimensional interaction matrix, adjusted for the main (treatment and covariates) effects. These tests can be used for personalized medicine to test if a large set of biomarkers can be useful to identify a subset of patients who may be more responsive to treatment. We evaluated the performance of these methods on simulated data and we applied them to data from two early-phase oncology clinical trials.
A two-state hysteresis model from high-dimensional friction.
Biswas, Saurabh; Chatterjee, Anindya
2015-07-01
In prior work (Biswas & Chatterjee 2014 Proc. R. Soc. A 470, 20130817 (doi:10.1098/rspa.2013.0817)), we developed a six-state hysteresis model from a high-dimensional frictional system. Here, we use a more intuitively appealing frictional system that resembles one studied earlier by Iwan. The basis functions now have a simple analytical description. The number of states required decreases further, from six to the theoretical minimum of two. The number of fitted parameters is reduced by an order of magnitude, to just six. An explicit and faster numerical solution method is developed. Parameter fitting to match different specified hysteresis loops is demonstrated. In summary, a new two-state model of hysteresis is presented that is ready for practical implementation. Essential Matlab code is provided.
DecisionFlow: Visual Analytics for High-Dimensional Temporal Event Sequence Data.
Gotz, David; Stavropoulos, Harry
2014-12-01
Temporal event sequence data is increasingly commonplace, with applications ranging from electronic medical records to financial transactions to social media activity. Previously developed techniques have focused on low-dimensional datasets (e.g., with fewer than 20 distinct event types). Real-world datasets are often far more complex. This paper describes DecisionFlow, a visual analysis technique designed to support the analysis of high-dimensional temporal event sequence data (e.g., thousands of event types). DecisionFlow combines a scalable and dynamic temporal event data structure with interactive multi-view visualizations and ad hoc statistical analytics. We provide a detailed review of our methods, and present the results from a 12-person user study. The study results demonstrate that DecisionFlow enables the quick and accurate completion of a range of sequence analysis tasks for datasets containing thousands of event types and millions of individual events.
Multiple imputation for high-dimensional mixed incomplete continuous and binary data.
He, Ren; Belin, Thomas
2014-06-15
It is common in applied research to have large numbers of variables measured on a modest number of cases. Even with low rates of missingness of individual variables, such data sets can have a large number of incomplete cases with a mix of data types. Here, we propose a new joint modeling approach to address the high-dimensional incomplete data with a mix of continuous and binary data. Specifically, we propose a multivariate normal model encompassing both continuous variables and latent variables corresponding to binary variables. We apply a parameter-extended Metropolis–Hastings algorithm to generate the covariance matrix of a mixture of continuous and binary variables. We also introduce prior distribution families for unstructured covariance matrices to reduce the dimension of the parameter space. In several simulation settings, the method is compared with available-case analysis, a rounding method, and a sequential regression method.
Wang, Zhiping; Chen, Jinyu; Yu, Benli
2017-02-20
We investigate the two-dimensional (2D) and three-dimensional (3D) atom localization behaviors via spontaneously generated coherence in a microwave-driven four-level atomic system. Owing to the space-dependent atom-field interaction, it is found that the detecting probability and precision of 2D and 3D atom localization behaviors can be significantly improved via adjusting the system parameters, the phase, amplitude, and initial population distribution. Interestingly, the atom can be localized in volumes that are substantially smaller than a cubic optical wavelength. Our scheme opens a promising way to achieve high-precision and high-efficiency atom localization, which provides some potential applications in high-dimensional atom nanolithography.
High-Dimensional Circular Quantum Secret Sharing Using Orbital Angular Momentum
NASA Astrophysics Data System (ADS)
Tang, Dawei; Wang, Tie-jun; Mi, Sichen; Geng, Xiao-Meng; Wang, Chuan
2016-11-01
Quantum secret sharing distributes a secret message securely among multiple parties. Here, exploiting the orbital angular momentum (OAM) state of single photons as the information carrier, we propose a high-dimensional circular quantum secret sharing protocol which greatly increases the channel capacity. In the proposed protocol, the secret message is split into two parts, each encoded on the OAM state of single photons. The security of the protocol is guaranteed by the no-cloning theorem, and the secret messages cannot be recovered unless the two receivers collaborate with each other. Moreover, the proposed protocol could be extended to high-level quantum systems, and enhanced security could be achieved.
Transient times and periods in the high-dimensional shape-space model for immune systems
NASA Astrophysics Data System (ADS)
Zorzenon dos Santos, Rita Maria
1993-05-01
A simplified version of the cellular automata approximation introduced by De Boer, Segel and Perelson in the shape-space model, to describe the interaction of different types of B cells in the immune system, indicates the existence of a threshold separating the periodic regime from the chaotic one, on high-dimensional finite lattices. We study the behavior of the periods of the limit cycles near the transition threshold as well as the behavior of the transient times necessary to attain the attractors in the periodic regime. We find that both become large close to the threshold. We also find that even before the chaotic regime is reached, the system is already trapped in a sort of non-healthy state. Nevertheless the system will never attain it, because the transient times in this region are much larger than the usual average lifetime of the system.
The role of high-dimensional diffusive search, stabilization, and frustration in protein folding.
Rimratchada, Supreecha; McLeish, Tom C B; Radford, Sheena E; Paci, Emanuele
2014-04-15
Proteins are polymeric molecules with many degrees of conformational freedom whose internal energetic interactions are typically screened to small distances. Therefore, in the high-dimensional conformation space of a protein, the energy landscape is locally relatively flat, in contrast to low-dimensional representations, where, because of the induced entropic contribution to the full free energy, it appears funnel-like. Proteins explore the conformation space by searching these flat subspaces to find a narrow energetic alley that we call a hypergutter and then explore the next, lower-dimensional, subspace. Such a framework provides an effective representation of the energy landscape and folding kinetics that does justice to the essential characteristic of high-dimensionality of the search-space. It also illuminates the important role of nonnative interactions in defining folding pathways. This principle is here illustrated using a coarse-grained model of a family of three-helix bundle proteins whose conformations, once secondary structure has formed, can be defined by six rotational degrees of freedom. Two folding mechanisms are possible, one of which involves an intermediate. The stabilization of intermediate subspaces (or states in low-dimensional projection) in protein folding can either speed up or slow down the folding rate depending on the amount of native and nonnative contacts made in those subspaces. The folding rate increases due to reduced-dimension pathways arising from the mere presence of intermediate states, but decreases if the contacts in the intermediate are very stable and introduce sizeable topological or energetic frustration that needs to be overcome. Remarkably, the hypergutter framework, although depending on just a few physically meaningful parameters, can reproduce all the types of experimentally observed curvature in chevron plots for realizations of this fold.
Technology innovation clusters are geographic concentrations of interconnected companies, universities, and other organizations with a focus on environmental technology. They play a key role in addressing the nation’s pressing environmental problems.
Dishion, Thomas J; Ha, Thao; Véronneau, Marie-Hélène
2012-05-01
The authors propose that peer relationships should be included in a life history perspective on adolescent problem behavior. Longitudinal analyses were used to examine deviant peer clustering as the mediating link between attenuated family ties, peer marginalization, and social disadvantage in early adolescence and sexual promiscuity in middle adolescence and childbearing by early adulthood. Specifically, 998 youths, along with their families, were assessed at age 11 years and periodically through age 24 years. Structural equation modeling revealed that the peer-enhanced life history model provided a good fit to the longitudinal data, with deviant peer clustering strongly predicting adolescent sexual promiscuity and other correlated problem behaviors. Sexual promiscuity, as expected, also strongly predicted the number of children by ages 22-24 years. Consistent with a life history perspective, family social disadvantage directly predicted deviant peer clustering and number of children in early adulthood, controlling for all other variables in the model. These data suggest that deviant peer clustering is a core dimension of a fast life history strategy, with strong links to sexual activity and childbearing. The implications of these findings are discussed with respect to the need to integrate an evolutionary-based model of self-organized peer groups in developmental and intervention science.
ConsensusCluster: a software tool for unsupervised cluster discovery in numerical data.
Seiler, Michael; Huang, C Chris; Szalma, Sandor; Bhanot, Gyan
2010-02-01
We have created a stand-alone software tool, ConsensusCluster, for the analysis of high-dimensional single nucleotide polymorphism (SNP) and gene expression microarray data. Our software implements the consensus clustering algorithm and principal component analysis to stratify the data into a given number of robust clusters. The robustness is achieved by combining clustering results from data and sample resampling as well as by averaging over various algorithms and parameter settings to achieve accurate, stable clustering results. We have implemented several different clustering algorithms in the software, including K-Means, Partition Around Medoids, Self-Organizing Map, and Hierarchical clustering methods. After clustering the data, ConsensusCluster generates a consensus matrix heatmap to give a useful visual representation of cluster membership, and automatically generates a log of selected features that distinguish each pair of clusters. ConsensusCluster gives more robust and more reliable clusters than common software packages and, therefore, is a powerful unsupervised learning tool that finds hidden patterns in data that might shed light on its biological interpretation. This software is free and available from http://code.google.com/p/consensus-cluster .
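The resampling idea behind a consensus matrix can be sketched in a few lines. This is a hedged illustration of the general technique, not ConsensusCluster's code: the toy base clusterer `two_means`, the function `consensus_matrix`, the 80% subsample fraction, and the example `data` are all assumptions. Each entry of the matrix is the fraction of resamples in which two items were drawn together and placed in the same cluster.

```python
import random
from math import dist


def two_means(points, iters=20):
    """Toy 2-cluster k-means seeded with the first and last point."""
    centers = [points[0], points[-1]]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            0 if dist(p, centers[0]) <= dist(p, centers[1]) else 1
            for p in points
        ]
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def consensus_matrix(data, n_resamples=50, frac=0.8, seed=0):
    """Fraction of resamples in which items i and j co-cluster."""
    rng = random.Random(seed)
    n = len(data)
    together = [[0] * n for _ in range(n)]  # co-clustered counts
    sampled = [[0] * n for _ in range(n)]   # drawn-together counts
    size = max(2, int(frac * n))
    for _ in range(n_resamples):
        idx = sorted(rng.sample(range(n), size))
        labels = two_means([data[i] for i in idx])
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                together[i][j] += labels[a] == labels[b]
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0
         for j in range(n)]
        for i in range(n)
    ]


# Two well-separated toy groups: items 0-2 and items 3-5.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cm = consensus_matrix(data)
```

Entries near 1 mark pairs that cluster together stably and entries near 0 mark pairs that are stably separated; intermediate values flag unstable assignments, which is what a consensus heatmap makes visible.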
NASA Astrophysics Data System (ADS)
Haussaire, Jean-Matthieu; Bocquet, Marc
2016-04-01
Atmospheric chemistry models are becoming increasingly complex, with multiphasic chemistry, size-resolved particulate matter, and possibly coupled to numerical weather prediction models. In the meantime, data assimilation methods have also become more sophisticated. Hence, it will become increasingly difficult to disentangle the merits of data assimilation schemes, of models, and of their numerical implementation in a successful high-dimensional data assimilation study. That is why we believe that the increasing variety of problems encountered in the field of atmospheric chemistry data assimilation puts forward the need for simple low-order models, albeit complex enough to capture the relevant dynamics, physics and chemistry that could impact the performance of data assimilation schemes. Following this analysis, we developed a low-order coupled chemistry meteorology model named L95-GRS [1]. The advective wind is simulated by the Lorenz-95 model, while the chemistry is made of 6 reactive species and simulates ozone concentrations. With this model, we carried out data assimilation experiments to estimate the state of the system as well as the forcing parameter of the wind and the emissions of chemical compounds. This model proved to be a powerful playground giving insights on the hardships of online and offline estimation of atmospheric pollution. Building on the results from this low-order model, we test advanced data assimilation methods on a state-of-the-art chemical transport model to check if the conclusions obtained with our low-order model still stand. References [1] Haussaire, J.-M. and Bocquet, M.: A low-order coupled chemistry meteorology model for testing online and offline data assimilation schemes, Geosci. Model Dev. Discuss., 8, 7347-7394, doi:10.5194/gmdd-8-7347-2015, 2015.
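For readers unfamiliar with the Lorenz-95 model that supplies the advective wind here, a minimal sketch of its dynamics follows. It covers only the standard Lorenz-95 equations, dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F on a periodic ring, integrated with a classical fourth-order Runge-Kutta step. The 40 variables, forcing F = 8, and time step 0.05 are conventional choices assumed for illustration, not settings taken from L95-GRS, and the six-species chemistry is not modeled.

```python
def lorenz95_tendency(x, forcing=8.0):
    """Right-hand side of Lorenz-95 on a periodic ring of len(x) variables."""
    n = len(x)
    # Negative indices wrap around in Python, giving the periodic boundary.
    return [
        (x[(i + 1) % n] - x[i - 2]) * x[i - 1] - x[i] + forcing
        for i in range(n)
    ]


def rk4_step(x, dt=0.05, forcing=8.0):
    """Advance the state by one classical Runge-Kutta step."""
    k1 = lorenz95_tendency(x, forcing)
    k2 = lorenz95_tendency([xi + 0.5 * dt * k for xi, k in zip(x, k1)], forcing)
    k3 = lorenz95_tendency([xi + 0.5 * dt * k for xi, k in zip(x, k2)], forcing)
    k4 = lorenz95_tendency([xi + dt * k for xi, k in zip(x, k3)], forcing)
    return [
        xi + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for xi, a, b, c, d in zip(x, k1, k2, k3, k4)
    ]


# Start from the uniform steady state x_i = F with one small perturbation.
state = [8.0] * 40
state[0] += 0.01
for _ in range(100):
    state = rk4_step(state)
```

The uniform state x_i = F is a fixed point of the equations, so a common way to start an experiment is to perturb it slightly, as above, and let the perturbation grow.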
Muetterties, Earl L.
1980-05-01
Metal cluster chemistry is one of the most rapidly developing areas of inorganic and organometallic chemistry. Prior to 1960 only a few metal clusters were well characterized. However, shortly after the early development of boron cluster chemistry, the field of metal cluster chemistry began to grow at a very rapid rate and a structural and a qualitative theoretical understanding of clusters came quickly. Analyzed here are the chemistry and the general significance of clusters with particular emphasis on the cluster research within my group. The importance of coordinately unsaturated, very reactive metal clusters is the major subject of discussion.
Slonim, Noam; Atwal, Gurinder Singh; Tkačik, Gašper; Bialek, William
2005-01-01
In an age of increasingly large data sets, investigators in many different disciplines have turned to clustering as a tool for data analysis and exploration. Existing clustering methods, however, typically depend on several nontrivial assumptions about the structure of data. Here, we reformulate the clustering problem from an information theoretic perspective that avoids many of these assumptions. In particular, our formulation obviates the need for defining a cluster “prototype,” does not require an a priori similarity metric, is invariant to changes in the representation of the data, and naturally captures nonlinear relations. We apply this approach to different domains and find that it consistently produces clusters that are more coherent than those extracted by existing algorithms. Finally, our approach provides a way of clustering based on collective notions of similarity rather than the traditional pairwise measures. PMID:16352721
Elastic SCAD as a novel penalization method for SVM classification tasks in high-dimensional data
2011-01-01
Background Classification and variable selection play an important role in knowledge discovery in high-dimensional data. Although Support Vector Machine (SVM) algorithms are among the most powerful classification and prediction methods with a wide range of scientific applications, the SVM does not include automatic feature selection and therefore a number of feature selection procedures have been developed. Regularisation approaches extend SVM to a feature selection method in a flexible way using penalty functions like LASSO, SCAD and Elastic Net. We propose a novel penalty function for SVM classification tasks, Elastic SCAD, a combination of SCAD and ridge penalties which overcomes the limitations of each penalty alone. Since SVM models are extremely sensitive to the choice of tuning parameters, we adopted an interval search algorithm, which in comparison to a fixed grid search finds rapidly and more precisely a global optimal solution. Results Feature selection methods with combined penalties (Elastic Net and Elastic SCAD SVMs) are more robust to a change of the model complexity than methods using single penalties. Our simulation study showed that Elastic SCAD SVM outperformed LASSO (L1) and SCAD SVMs. Moreover, Elastic SCAD SVM provided sparser classifiers in terms of median number of features selected than Elastic Net SVM and often better predicted than Elastic Net in terms of misclassification error. Finally, we applied the penalization methods described above on four publicly available breast cancer data sets. Elastic SCAD SVM was the only method providing robust classifiers in sparse and non-sparse situations. Conclusions The proposed Elastic SCAD SVM algorithm provides the advantages of the SCAD penalty and at the same time avoids sparsity limitations for non-sparse data. We were first to demonstrate that the integration of the interval search algorithm and penalized SVM classification techniques provides fast solutions on the optimization of tuning
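The combined penalty described in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it uses the usual piecewise SCAD penalty of Fan and Li with the conventional default a = 3.7, and it assumes the ridge part enters as lam2 times the squared L2 norm of the weights. The function and parameter names are hypothetical.

```python
def scad_penalty(w, lam, a=3.7):
    """SCAD penalty for a single coefficient w (requires a > 2, lam >= 0)."""
    t = abs(w)
    if t <= lam:
        # Small coefficients: behaves like the LASSO (L1) penalty.
        return lam * t
    if t <= a * lam:
        # Intermediate coefficients: quadratic transition region.
        return -(t * t - 2 * a * lam * t + lam * lam) / (2 * (a - 1))
    # Large coefficients: constant penalty, so they are not shrunk further.
    return (a + 1) * lam * lam / 2


def elastic_scad_penalty(weights, lam1, lam2, a=3.7):
    """Assumed form of the combined penalty: SCAD(lam1) plus a ridge term."""
    scad_part = sum(scad_penalty(w, lam1, a) for w in weights)
    ridge_part = lam2 * sum(w * w for w in weights)
    return scad_part + ridge_part
```

With lam2 = 0 the sketch reduces to the plain SCAD penalty, while a positive lam2 keeps some shrinkage on large coefficients, which is the mechanism the abstract credits for handling non-sparse data.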
Cluster-based control of a separating flow over a smoothly contoured ramp
NASA Astrophysics Data System (ADS)
Kaiser, Eurika; Noack, Bernd R.; Spohn, Andreas; Cattafesta, Louis N.; Morzyński, Marek
2017-01-01
The ability to manipulate and control fluid flows is of great importance in many scientific and engineering applications. The proposed closed-loop control framework addresses a key issue of model-based control: The actuation effect often results from slow dynamics of strongly nonlinear interactions which the flow reveals at timescales much longer than the prediction horizon of any model. Hence, we employ a probabilistic approach based on a cluster-based discretization of the Liouville equation for the evolution of the probability distribution. The proposed methodology frames high-dimensional, nonlinear dynamics into low-dimensional, probabilistic, linear dynamics which considerably simplifies the optimal control problem while preserving nonlinear actuation mechanisms. The data-driven approach builds upon a state space discretization using a clustering algorithm which groups kinematically similar flow states into a low number of clusters. The temporal evolution of the probability distribution on this set of clusters is then described by a control-dependent Markov model. This Markov model can be used as predictor for the ergodic probability distribution for a particular control law. This probability distribution approximates the long-term behavior of the original system on which basis the optimal control law is determined. We examine how the approach can be used to improve the open-loop actuation in a separating flow dominated by Kelvin-Helmholtz shedding. For this purpose, the feature space, in which the model is learned, and the admissible control inputs are tailored to strongly oscillatory flows.
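The Markov-model step described in the abstract above can be illustrated with a small sketch. It assumes the flow snapshots have already been assigned to clusters (for example by k-means), so the input is just a time-ordered list of integer cluster labels. The row-stochastic convention P[i][j] = probability of moving from cluster i to cluster j, the power iteration, and all names are assumptions for illustration; the control dependence of the model in the paper is not represented.

```python
def transition_matrix(labels, n_clusters):
    """Estimate P[i][j] = Pr(next cluster is j | current cluster is i)."""
    counts = [[0] * n_clusters for _ in range(n_clusters)]
    for src, dst in zip(labels, labels[1:]):
        counts[src][dst] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        if total:
            matrix.append([c / total for c in row])
        else:
            # Cluster with no observed departures: fall back to a uniform row
            # so that every row still sums to one.
            matrix.append([1 / n_clusters] * n_clusters)
    return matrix


def stationary_distribution(matrix, iterations=1000):
    """Approximate the long-term cluster probabilities by power iteration."""
    n = len(matrix)
    p = [1 / n] * n
    for _ in range(iterations):
        p = [sum(p[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return p


# Hypothetical label sequence for a two-cluster partition of the flow states.
labels = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
P = transition_matrix(labels, 2)
pi = stationary_distribution(P)
```

In a control setting one such matrix would be estimated per admissible control input, and the resulting long-term distributions compared against a cost function; that part is left out here.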
PCA leverage: outlier detection for high-dimensional functional magnetic resonance imaging data.
Mejia, Amanda F; Nebel, Mary Beth; Eloyan, Ani; Caffo, Brian; Lindquist, Martin A
2017-02-27
Outlier detection for high-dimensional (HD) data is a popular topic in modern statistical research. However, one source of HD data that has received relatively little attention is functional magnetic resonance images (fMRI), which consist of hundreds of thousands of measurements sampled at hundreds of time points. At a time when the availability of fMRI data is rapidly growing (primarily through large, publicly available grassroots datasets), automated quality control and outlier detection methods are greatly needed. We propose principal components analysis (PCA) leverage and demonstrate how it can be used to identify outlying time points in an fMRI run. Furthermore, PCA leverage is a measure of the influence of each observation on the estimation of principal components, which are often of interest in fMRI data. We also propose an alternative measure, PCA robust distance, which is less sensitive to outliers and has controllable statistical properties. The proposed methods are validated through simulation studies and are shown to be highly accurate. We also conduct a reliability study using resting-state fMRI data from the Autism Brain Imaging Data Exchange and find that removal of outliers using the proposed methods results in more reliable estimation of subject-level resting-state networks using independent components analysis.
Quantum tomography of near-unitary processes in high-dimensional quantum systems
NASA Astrophysics Data System (ADS)
Lysne, Nathan; Sosa Martinez, Hector; Jessen, Poul; Baldwin, Charles; Kalev, Amir; Deutsch, Ivan
2016-05-01
Quantum Tomography (QT) is often considered the ideal tool for experimental debugging of quantum devices, capable of delivering complete information about quantum states (QST) or processes (QPT). In practice, the protocols used for QT are resource intensive and scale poorly with system size. In this situation, a well behaved model system with access to large state spaces (qudits) can serve as a useful platform for examining the tradeoffs between resource cost and accuracy inherent in QT. In past years we have developed one such experimental testbed, consisting of the electron-nuclear spins in the electronic ground state of individual Cs atoms. Our available toolkit includes high fidelity state preparation, complete unitary control, arbitrary orthogonal measurements, and accurate and efficient QST in Hilbert space dimensions up to d = 16. Using these tools, we have recently completed a comprehensive study of QPT in 4, 7 and 16 dimensions. Our results show that QPT of near-unitary processes is quite feasible if one chooses optimal input states and efficient QST on the outputs. We further show that for unitary processes in high dimensional spaces, one can use informationally incomplete QPT to achieve high-fidelity process reconstruction (90% in d = 16) with greatly reduced resource requirements.
NASA Astrophysics Data System (ADS)
Sun, Yifei; Kumar, Mrinal
2015-05-01
In this paper, a tensor decomposition approach combined with Chebyshev spectral differentiation is presented to solve the high dimensional transient Fokker-Planck equations (FPE) arising in the simulation of polymeric fluids via multi-bead-spring (MBS) model. Generalizing the authors' previous work on the stationary FPE, the transient solution is obtained in a single CANDECOMP/PARAFAC decomposition (CPD) form for all times via the alternating least squares algorithm. This is accomplished by treating the temporal dimension in the same manner as all other spatial dimensions, thereby decoupling it from them. As a result, the transient solution is obtained without resorting to expensive time stepping schemes. A new, relaxed approach for imposing the vanishing boundary conditions is proposed, improving the quality of the approximation. The asymptotic behavior of the temporal basis functions is studied. The proposed solver scales very well with the dimensionality of the MBS model. Numerical results for systems up to 14 dimensional state space are successfully obtained on a regular personal computer and compared with the corresponding matrix Riccati differential equation (for linear models) or Monte Carlo simulations (for nonlinear models).
Zhang, Zheng; Yang, Xiu; Oseledets, Ivan V.; Karniadakis, George E.; Daniel, Luca
2015-01-01
Hierarchical uncertainty quantification can reduce the computational cost of stochastic circuit simulation by employing spectral methods at different levels. This paper presents an efficient framework to simulate hierarchically some challenging stochastic circuits/systems that include high-dimensional subsystems. Due to the high parameter dimensionality, it is challenging both to extract surrogate models at the low level of the design hierarchy and to handle them in the high-level simulation. In this paper, we develop an efficient analysis of variance-based stochastic circuit/microelectromechanical systems simulator to efficiently extract the surrogate models at the low level. In order to avoid the curse of dimensionality, we employ tensor-train decomposition at the high level to construct the basis functions and Gauss quadrature points. As a demonstration, we verify our algorithm on a stochastic oscillator with four MEMS capacitors and 184 random parameters. This challenging example is efficiently simulated by our simulator at the cost of only 10 min in MATLAB on a regular personal computer.
VARIABLE SELECTION AND PREDICTION WITH INCOMPLETE HIGH-DIMENSIONAL DATA
Liu, Ying; Wang, Yuanjia; Feng, Yang; Wall, Melanie M.
2016-01-01
We propose a Multiple Imputation Random Lasso (MIRL) method to select important variables and to predict the outcome for an epidemiological study of Eating and Activity in Teens. In this study 80% of individuals have at least one variable missing. Therefore, using variable selection methods developed for complete data after listwise deletion substantially reduces prediction power. Recent work on prediction models in the presence of incomplete data cannot adequately account for large numbers of variables with arbitrary missing patterns. We propose MIRL to combine penalized regression techniques with multiple imputation and stability selection. Extensive simulation studies are conducted to compare MIRL with several alternatives. MIRL outperforms other methods in high-dimensional scenarios in terms of both reduced prediction error and improved variable selection performance, and it has a greater advantage when the correlation among variables is high and the missing proportion is high. MIRL is shown to have improved performance compared with other applicable methods when applied to the study of Eating and Activity in Teens for the boys and girls separately, and to a subgroup of low socioeconomic status (SES) Asian boys who are at high risk of developing obesity. PMID:27213023
NASA Astrophysics Data System (ADS)
Hernández, C.; Tutsch, R.
2013-09-01
The Statistical Dynamic Specifications Method (SDSM) relies on in-line measurements to manage dimensional specifications, target and tolerance. SDSM is the cornerstone of an innovative assembling technique based on the Statistical Feed-Forward Control Model (SFFCM) for processes in which the stacked dimensional variation of the assembled components reaches the order of the total allowed tolerance. Since the magnitude of such variation might jeopardize the process capability, it is a matter of interest to study the inclusion of the measurement uncertainty when applying SDSM on a target subprocess. By means of simulating the production of assemblies made of parts having high dimensional variation, a set of experiments were designed to compare the impact of different levels of measurement uncertainty on the capability of a target subprocess. Simulation results showed that depending on the magnitude of the uncertainty the capability index cp of the target subprocess increases between 2.5% and 34.5% from 1.27 to 1.82 as a direct consequence of adjusting the respective tolerance. Thus, the inclusion of the measurement uncertainty in the proposed SDSM has a significant impact in its practical realization since a decrement in cp implies an increment of the scrap percentage of the target subprocess.
Biomarkers for combat-related PTSD: focus on molecular networks from high-dimensional data
Neylan, Thomas C.; Schadt, Eric E.; Yehuda, Rachel
2014-01-01
Posttraumatic stress disorder (PTSD) and other deployment-related outcomes originate from a complex interplay between constellations of changes in DNA, environmental traumatic exposures, and other biological risk factors. These factors affect not only individual genes or bio-molecules but also the entire biological networks that in turn increase or decrease the risk of illness or affect illness severity. This review focuses on recent developments in the field of systems biology which use multidimensional data to discover biological networks affected by combat exposure and post-deployment disease states. By integrating large-scale, high-dimensional molecular, physiological, clinical, and behavioral data, the molecular networks that directly respond to perturbations that can lead to PTSD can be identified and causally associated with PTSD, providing a path to identify key drivers. Reprogrammed neural progenitor cells from fibroblasts from PTSD patients could be established as an in vitro assay for high throughput screening of approved drugs to determine which drugs reverse the abnormal expression of the pathogenic biomarkers or neuronal properties. PMID:25206954
Spanning high-dimensional expression space using ribosome-binding site combinatorics
Zelcbuch, Lior; Antonovsky, Niv; Bar-Even, Arren; Levin-Karp, Ayelet; Barenholz, Uri; Dayagi, Michal; Liebermeister, Wolfram; Flamholz, Avi; Noor, Elad; Amram, Shira; Brandis, Alexander; Bareia, Tasneem; Yofe, Ido; Jubran, Halim; Milo, Ron
2013-01-01
Protein levels are a dominant factor shaping natural and synthetic biological systems. Although proper functioning of metabolic pathways relies on precise control of enzyme levels, the experimental ability to balance the levels of many genes in parallel is a major outstanding challenge. Here, we introduce a rapid and modular method to span the expression space of several proteins in parallel. By combinatorially pairing genes with a compact set of ribosome-binding sites, we modulate protein abundance by several orders of magnitude. We demonstrate our strategy by using a synthetic operon containing fluorescent proteins to span a 3D color space. Using the same approach, we modulate a recombinant carotenoid biosynthesis pathway in Escherichia coli to reveal a diversity of phenotypes, each characterized by a distinct carotenoid accumulation profile. In a single combinatorial assembly, we achieve a yield of the industrially valuable compound astaxanthin 4-fold higher than previously reported. The methodology presented here provides an efficient tool for exploring a high-dimensional expression space to locate desirable phenotypes. PMID:23470993
Dan Maljovec; Bei Wang; Valerio Pascucci; Peer-Timo Bremer; Michael Pernice; Robert Nourgaliev
2013-05-01
The next generation of methodologies for nuclear reactor Probabilistic Risk Assessment (PRA) explicitly accounts for the time element in modeling the probabilistic system evolution and uses numerical simulation tools to account for possible dependencies between failure events. The Monte-Carlo (MC) and the Dynamic Event Tree (DET) approaches belong to this new class of dynamic PRA methodologies. A challenge of dynamic PRA algorithms is the large amount of data they produce, which may be difficult to visualize and analyze in order to extract useful information. We present a software tool that is designed to address this challenge. We model a large-scale nuclear simulation dataset as a high-dimensional scalar function defined over a discrete sample of the domain. First, we provide structural analysis of such a function at multiple scales and provide insight into the relationship between the input parameters and the output. Second, we enable exploratory analysis for users, where we help the users to differentiate features from noise through multi-scale analysis on an interactive platform, based on domain knowledge and data characterization. Our analysis is performed by exploiting the topological and geometric properties of the domain, building statistical models based on its topological segmentations and providing interactive visual interfaces to facilitate such explorations. We provide a user’s guide to our software tool by highlighting its analysis and visualization capabilities, along with a use case involving a dataset from a nuclear reactor safety simulation.
A sparse grid based method for generative dimensionality reduction of high-dimensional data
NASA Astrophysics Data System (ADS)
Bohn, Bastian; Garcke, Jochen; Griebel, Michael
2016-03-01
Generative dimensionality reduction methods play an important role in machine learning applications because they construct an explicit mapping from a low-dimensional space to the high-dimensional data space. We discuss a general framework to describe generative dimensionality reduction methods, where the main focus lies on a regularized principal manifold learning variant. Since most generative dimensionality reduction algorithms exploit the representer theorem for reproducing kernel Hilbert spaces, their computational costs grow at least quadratically in the number n of data. Instead, we introduce a grid-based discretization approach which automatically scales just linearly in n. To circumvent the curse of dimensionality of full tensor product grids, we use the concept of sparse grids. Furthermore, in real-world applications, some embedding directions are usually more important than others and it is reasonable to refine the underlying discretization space only in these directions. To this end, we employ a dimension-adaptive algorithm which is based on the ANOVA (analysis of variance) decomposition of a function. In particular, the reconstruction error is used to measure the quality of an embedding. As an application, the study of large simulation data from an engineering application in the automotive industry (car crash simulation) is performed.
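To make the motivation for sparse grids in the abstract above concrete, the sketch below counts grid points for a full tensor-product grid and for a regular sparse grid. It assumes the common construction without boundary points, in which hierarchical level l contributes 2^(l-1) points per dimension and a sparse grid of level n keeps the level combinations whose sum is at most n + d - 1. This is a generic illustration, not the dimension-adaptive scheme of the paper, and the function names are hypothetical.

```python
from itertools import product


def full_grid_points(level, dim):
    """Interior points of a full tensor-product grid with mesh width 2**-level."""
    return (2 ** level - 1) ** dim


def sparse_grid_points(level, dim):
    """Interior points of a regular sparse grid of the given level."""
    total = 0
    for levels in product(range(1, level + 1), repeat=dim):
        # Keep only hierarchical subspaces below the sparse-grid diagonal.
        if sum(levels) <= level + dim - 1:
            points = 1
            for l in levels:
                points *= 2 ** (l - 1)
            total += points
    return total


# Compare the growth of both grids with the dimension at a fixed level.
comparison = [(d, sparse_grid_points(5, d), full_grid_points(5, d)) for d in (1, 2, 3, 4)]
```

In one dimension the two counts coincide; at level 3 in two dimensions the sparse grid has 17 points against 49 for the full grid, and the gap widens quickly as the dimension grows.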
Semi-implicit Integration Factor Methods on Sparse Grids for High-Dimensional Systems
Wang, Dongyong; Chen, Weitao; Nie, Qing
2015-01-01
Numerical methods for partial differential equations in high-dimensional spaces are often limited by the curse of dimensionality. Though the sparse grid technique, based on a one-dimensional hierarchical basis through tensor products, is popular for handling challenges such as those associated with spatial discretization, the stability conditions on time step size due to temporal discretization, such as those associated with high-order derivatives in space and stiff reactions, remain. Here, we incorporate the sparse grids with the implicit integration factor method (IIF) that is advantageous in terms of stability conditions for systems containing stiff reactions and diffusions. We combine IIF, in which the reaction is treated implicitly and the diffusion is treated explicitly and exactly, with various sparse grid techniques based on the finite element and finite difference methods and a multi-level combination approach. The overall method is found to be efficient in terms of both storage and computational time for solving a wide range of PDEs in high dimensions. In particular, the IIF with the sparse grid combination technique is flexible and effective in solving systems that may include cross-derivatives and non-constant diffusion coefficients. Extensive numerical simulations in both linear and nonlinear systems in high dimensions, along with applications of diffusive logistic equations and Fokker-Planck equations, demonstrate the accuracy, efficiency, and robustness of the new methods, indicating potential broad applications of the sparse grid-based integration factor method. PMID:25897178
Revealing the diversity of extracellular vesicles using high-dimensional flow cytometry analyses.
Marcoux, Geneviève; Duchez, Anne-Claire; Cloutier, Nathalie; Provost, Patrick; Nigrovic, Peter A; Boilard, Eric
2016-10-27
Extracellular vesicles (EV) are small membrane vesicles produced by cells upon activation and apoptosis. EVs are heterogeneous according to their origin, mode of release, membrane composition, organelle and biochemical content, and other factors. Whereas it is apparent that EVs are implicated in intercellular communication, they can also be used as biomarkers. Continuous improvements in pre-analytical parameters and flow cytometry permit more efficient assessment of EVs; however, methods to more objectively distinguish EVs from cells and background, and to interpret multiple single-EV parameters are lacking. We used spanning-tree progression analysis of density-normalized events (SPADE) as a computational approach for the organization of EV subpopulations released by platelets and erythrocytes. SPADE distinguished EVs, and logically organized EVs detected by high-sensitivity flow cytofluorometry based on size estimation, granularity, mitochondrial content, and phosphatidylserine and protein receptor surface expression. Plasma EVs were organized by hierarchy, permitting appreciation of their heterogeneity. Furthermore, SPADE was used to analyze EVs present in the synovial fluid of patients with inflammatory arthritis. Its algorithm efficiently revealed subtypes of arthritic patients based on EV heterogeneity patterns. Our study reveals that computational algorithms are useful for the analysis of high-dimensional single EV data, thereby facilitating comprehension of EV functions and biomarker development.
McCarthy, John F; Marx, Kenneth A; Hoffman, Patrick E; Gee, Alexander G; O'Neil, Philip; Ujwal, M L; Hotchkiss, John
2004-05-01
Recent technical advances in combinatorial chemistry, genomics, and proteomics have made available large databases of biological and chemical information that have the potential to dramatically improve our understanding of cancer biology at the molecular level. Such an understanding of cancer biology could have a substantial impact on how we detect, diagnose, and manage cancer cases in the clinical setting. One of the biggest challenges facing clinical oncologists is how to extract clinically useful knowledge from the overwhelming amount of raw molecular data that are currently available. In this paper, we discuss how the exploratory data analysis techniques of machine learning and high-dimensional visualization can be applied to extract clinically useful knowledge from a heterogeneous assortment of molecular data. After an introductory overview of machine learning and visualization techniques, we describe two proprietary algorithms (PURS and RadViz) that we have found to be useful in the exploratory analysis of large biological data sets. We next illustrate, by way of three examples, the applicability of these techniques to cancer detection, diagnosis, and management using three very different types of molecular data. We first discuss the use of our exploratory analysis techniques on proteomic mass spectroscopy data for the detection of ovarian cancer. Next, we discuss the diagnostic use of these techniques on gene expression data to differentiate between squamous and adenocarcinoma of the lung. Finally, we illustrate the use of such techniques in selecting from a database of chemical compounds those most effective in managing patients with melanoma versus leukemia.
Feature Augmentation via Nonparametrics and Selection (FANS) in High-Dimensional Classification
Feng, Yang; Jiang, Jiancheng; Tong, Xin
2015-01-01
We propose a high dimensional classification method that involves nonparametric feature augmentation. Knowing that marginal density ratios are the most powerful univariate classifiers, we use the ratio estimates to transform the original feature measurements. Subsequently, penalized logistic regression is invoked, taking as input the newly transformed or augmented features. This procedure trains models equipped with local complexity and global simplicity, thereby avoiding the curse of dimensionality while creating a flexible nonlinear decision boundary. The resulting method is called Feature Augmentation via Nonparametrics and Selection (FANS). We motivate FANS by generalizing the Naive Bayes model, writing the log ratio of joint densities as a linear combination of those of marginal densities. It is related to generalized additive models, but has better interpretability and computability. Risk bounds are developed for FANS. In numerical analysis, FANS is compared with competing methods, so as to provide a guideline on its best application domain. Real data analysis demonstrates that FANS performs very competitively on benchmark email spam and gene expression data sets. Moreover, FANS is implemented by an extremely fast algorithm through parallel computing. PMID:27185970
On Varying-coefficient Independence Screening for High-dimensional Varying-coefficient Models
Song, Rui; Yi, Feng; Zou, Hui
2014-01-01
Varying coefficient models have been widely used in longitudinal data analysis, nonlinear time series, survival analysis, and so on. They are natural non-parametric extensions of the classical linear models in many contexts, keeping good interpretability and allowing us to explore the dynamic nature of the model. Recently, penalized estimators have been used for fitting varying-coefficient models for high-dimensional data. In this paper, we propose a new computationally attractive algorithm called IVIS for fitting varying-coefficient models in ultra-high dimensions. The algorithm first fits a gSCAD penalized varying-coefficient model using a subset of covariates selected by a new varying-coefficient independence screening (VIS) technique. The sure screening property is established for VIS. The proposed algorithm then iterates between a greedy conditional VIS step and a gSCAD penalized fitting step. Simulation and a real data analysis demonstrate that IVIS has very competitive performance for moderate sample size and high dimension. PMID:25484548
Valdivia, Fernando
2014-01-01
Introduction. Medial temporal lobe atrophy assessment via magnetic resonance imaging (MRI) has been proposed in recent criteria as an in vivo diagnostic biomarker of Alzheimer's disease (AD). However, practical application of these criteria in a clinical setting will require automated MRI analysis techniques. To this end, we wished to validate our automated, high-dimensional morphometry technique for the hypothetical prediction of future clinical status from baseline data in a cohort of subjects in a large, multicentric setting, compared to currently known clinical status for these subjects. Materials and Methods. The study group consisted of 214 controls, 371 mild cognitive impairment (147 having progressed to probable AD and 224 stable), and 181 probable AD from the Alzheimer's Disease Neuroimaging Initiative, with data acquired on 58 different 1.5 T scanners. We measured the sensitivity and specificity of our technique in a hierarchical fashion, first testing the effect of intensity standardization, then between different volumes of interest, and finally its generalizability for a large, multicentric cohort. Results. We obtained 73.2% prediction accuracy with 79.5% sensitivity for the prediction of MCI progression to clinically probable AD. The positive predictive value was 81.6% for MCI progressing on average within 1.5 (0.3 s.d.) year. Conclusion. With high accuracy, the technique's ability to identify discriminant medial temporal lobe atrophy has been demonstrated in a large, multicentric environment. It is suitable as an aid for clinical diagnosis of AD. PMID:25254139
Large $\nu$-$\overline{\nu}$ oscillations from high-dimensional lepton number violating operator
NASA Astrophysics Data System (ADS)
Geng, Chao-Qiang; Huang, Da
2017-03-01
It is usually believed that the observation of neutrino-antineutrino ($\nu$-$\overline{\nu}$) oscillations is almost impossible, since the oscillation probabilities are expected to be greatly suppressed by the square of the tiny ratio of neutrino masses to energies. Such an argument is applicable to most models for neutrino mass generation based on the Weinberg operator, including the seesaw models. However, in the present paper, we shall give a counterexample to this argument, and show that large $\nu$-$\overline{\nu}$ oscillation probabilities can be obtained in a class of models in which both neutrino masses and neutrinoless double beta ($0\nu\beta\beta$) decays are induced by the high-dimensional lepton number violating operator $\mathcal{O}_7 = \overline{u}_R\, l_R^c\, \overline{L}_L H^{\ast} d_R + \mathrm{H.c.}$, with $u$ and $d$ representing the first two generations of quarks. In particular, we find that the predicted $0\nu\beta\beta$ decay rates have already placed interesting constraints on the $\nu_e \leftrightarrow \overline{\nu}_e$ oscillation. Moreover, we provide a UV-complete model to realize this scenario, in which a dark matter candidate naturally appears due to the new $U(1)_d$ symmetry.
Snyder, Abigail C.; Jiao, Yu
2010-10-01
Neutron experiments at the Spallation Neutron Source (SNS) at Oak Ridge National Laboratory (ORNL) frequently generate large amounts of data (on the order of $10^{6}$-$10^{12}$ data points). Hence, traditional data analysis tools run on a single CPU take too long to be practical and scientists are unable to efficiently analyze all data generated by experiments. Our goal is to develop a scalable algorithm to efficiently compute high-dimensional integrals of arbitrary functions. This algorithm can then be used to integrate the four-dimensional integrals that arise as part of modeling intensity from the experiments at the SNS. Here, three different one-dimensional numerical integration solvers from the GNU Scientific Library were modified and implemented to solve four-dimensional integrals. The results of these solvers on a final integrand provided by scientists at the SNS can be compared to the results of other methods, such as quasi-Monte Carlo methods, computing the same integral. A parallelized version of the most efficient method can allow scientists the opportunity to more effectively analyze all experimental data.
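The idea of building a four-dimensional integrator out of a one-dimensional solver, as described in the abstract above, can be sketched by nesting the 1D rule once per dimension. The sketch below uses a fixed composite Simpson rule as a simple stand-in; it is not one of the GNU Scientific Library solvers used in the study, and the function names and the test integrand are assumptions for illustration.

```python
def simpson(f, a, b, n=8):
    """Composite Simpson rule on [a, b] with an even number n of subintervals."""
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def integrate_nd(f, bounds, n=8):
    """Integrate f over a box by applying the 1D rule to each dimension in turn.

    bounds is a list of (lower, upper) pairs, one per argument of f.
    """
    (a, b), rest = bounds[0], bounds[1:]
    if not rest:
        return simpson(f, a, b, n)
    # Fix the first coordinate and integrate the remaining ones recursively.
    return simpson(
        lambda x: integrate_nd(lambda *ys: f(x, *ys), rest, n), a, b, n
    )


# Hypothetical 4D integrand over the unit hypercube.
value = integrate_nd(lambda x, y, z, w: x * y * z * w, [(0.0, 1.0)] * 4)
```

The cost grows as (n + 1) to the power of the dimension, which is why the outer loop of such a scheme is a natural target for the parallelization mentioned in the abstract.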
Qiu, Huitong; Xu, Sheng; Han, Fang; Liu, Han; Caffo, Brian
2016-01-01
Gaussian vector autoregressive (VAR) processes have been extensively studied in the literature. However, Gaussian assumptions are stringent for heavy-tailed time series that frequently arise in finance and economics. In this paper, we develop a unified framework for modeling and estimating heavy-tailed VAR processes. In particular, we generalize the Gaussian VAR model by an elliptical VAR model that naturally accommodates heavy-tailed time series. Under this model, we develop a quantile-based robust estimator for the transition matrix of the VAR process. We show that the proposed estimator achieves parametric rates of convergence in high dimensions. This is the first work analyzing heavy-tailed high dimensional VAR processes. As an application of the proposed framework, we investigate Granger causality in the elliptical VAR process, and show that the robust transition matrix estimator induces sign-consistent estimators of Granger causality. The empirical performance of the proposed methodology is demonstrated by both synthetic and real data. We show that the proposed estimator is robust to heavy tails, and exhibits superior performance in stock price prediction. PMID:28133642
Relating high dimensional stochastic complex systems to low-dimensional intermittency
NASA Astrophysics Data System (ADS)
Diaz-Ruelas, Alvaro; Jensen, Henrik Jeldtoft; Piovani, Duccio; Robledo, Alberto
2017-02-01
We evaluate the implication and outlook of an unanticipated simplification in the macroscopic behavior of two high-dimensional stochastic models: the Replicator Model with Mutations and the Tangled Nature Model (TaNa) of evolutionary ecology. This simplification consists of the apparent display of low-dimensional dynamics in the non-stationary intermittent time evolution of the model on a coarse-grained scale. Evolution on this time scale spans generations of individuals, rather than single reproduction, death or mutation events. While a local one-dimensional map close to a tangent bifurcation can be derived from a mean-field version of the TaNa model, a nonlinear dynamical model consisting of successive tangent bifurcations generates time evolution patterns resembling those of the full TaNa model. To advance the interpretation of this finding, here we consider parallel results on a game-theoretic version of the TaNa model that in discrete time yields a coupled map lattice. This in turn is represented, a la Langevin, by a one-dimensional nonlinear map. Among various kinds of behaviours we obtain intermittent evolution associated with tangent bifurcations. We discuss our results.
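As a rough illustration of why a one-dimensional map close to a tangent bifurcation produces intermittency, the sketch below iterates the generic normal form x -> x + x^2 + eps and counts how many steps an orbit needs to pass through the narrow channel near x = 0. The map, the function name `passage_time`, and the chosen interval are assumptions for illustration; this is not the map derived from the TaNa model in the paper.

```python
def passage_time(eps: float,
                 x_start: float = -0.5,
                 x_end: float = 0.5,
                 max_steps: int = 10_000_000) -> int:
    """Number of iterations of x -> x + x**2 + eps needed to go from x_start past x_end.

    For small eps > 0 the map has no fixed point, but orbits linger near x = 0:
    this slow passage is the 'laminar' phase of intermittent dynamics.
    """
    if eps <= 0:
        raise ValueError("eps must be positive (just past the tangent bifurcation)")
    x = x_start
    steps = 0
    while x < x_end:
        x = x + x * x + eps
        steps += 1
        if steps >= max_steps:
            raise RuntimeError("orbit did not leave the channel")
    return steps


if __name__ == "__main__":
    for eps in (1e-2, 1e-4, 1e-6):
        print(eps, passage_time(eps))
```

The passage time grows as eps shrinks (roughly in proportion to 1/sqrt(eps) for this normal form), so long quiet stretches separated by rapid excursions appear as the bifurcation is approached.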
Revealing the diversity of extracellular vesicles using high-dimensional flow cytometry analyses
Marcoux, Geneviève; Duchez, Anne-Claire; Cloutier, Nathalie; Provost, Patrick; Nigrovic, Peter A.; Boilard, Eric
2016-01-01
Extracellular vesicles (EV) are small membrane vesicles produced by cells upon activation and apoptosis. EVs are heterogeneous according to their origin, mode of release, membrane composition, organelle and biochemical content, and other factors. Whereas it is apparent that EVs are implicated in intercellular communication, they can also be used as biomarkers. Continuous improvements in pre-analytical parameters and flow cytometry permit more efficient assessment of EVs; however, methods to more objectively distinguish EVs from cells and background, and to interpret multiple single-EV parameters are lacking. We used spanning-tree progression analysis of density-normalized events (SPADE) as a computational approach for the organization of EV subpopulations released by platelets and erythrocytes. SPADE distinguished EVs, and logically organized EVs detected by high-sensitivity flow cytofluorometry based on size estimation, granularity, mitochondrial content, and phosphatidylserine and protein receptor surface expression. Plasma EVs were organized by hierarchy, permitting appreciation of their heterogeneity. Furthermore, SPADE was used to analyze EVs present in the synovial fluid of patients with inflammatory arthritis. Its algorithm efficiently revealed subtypes of arthritic patients based on EV heterogeneity patterns. Our study reveals that computational algorithms are useful for the analysis of high-dimensional single EV data, thereby facilitating comprehension of EV functions and biomarker development. PMID:27786276
A common, high-dimensional model of the representational space in human ventral temporal cortex
Haxby, James V.; Guntupalli, J. Swaroop; Connolly, Andrew C.; Halchenko, Yaroslav O.; Conroy, Bryan R.; Gobbini, M. Ida; Hanke, Michael; Ramadge, Peter J.
2011-01-01
We present a high-dimensional model of the representational space in human ventral temporal (VT) cortex in which dimensions are response-tuning functions that are common across individuals and patterns of response are modeled as weighted sums of basis patterns associated with these response-tunings. We map response pattern vectors, measured with fMRI, from individual subjects’ voxel spaces into this common model space using a new method, ‘hyperalignment’. Hyperalignment parameters based on responses during one experiment – movie-viewing – identified 35 common response-tuning functions that captured fine-grained distinctions among a wide range of stimuli in the movie and in two category perception experiments. Between-subject classification (BSC, multivariate pattern classification based on other subjects’ data) of response pattern vectors in common model space greatly exceeded BSC of anatomically-aligned responses and matched within-subject classification. Results indicate that population codes for complex visual stimuli in VT cortex are based on response-tuning functions that are common across individuals. PMID:22017997
A common, high-dimensional model of the representational space in human ventral temporal cortex.
Haxby, James V; Guntupalli, J Swaroop; Connolly, Andrew C; Halchenko, Yaroslav O; Conroy, Bryan R; Gobbini, M Ida; Hanke, Michael; Ramadge, Peter J
2011-10-20
We present a high-dimensional model of the representational space in human ventral temporal (VT) cortex in which dimensions are response-tuning functions that are common across individuals and patterns of response are modeled as weighted sums of basis patterns associated with these response tunings. We map response-pattern vectors, measured with fMRI, from individual subjects' voxel spaces into this common model space using a new method, "hyperalignment." Hyperalignment parameters based on responses during one experiment--movie viewing--identified 35 common response-tuning functions that captured fine-grained distinctions among a wide range of stimuli in the movie and in two category perception experiments. Between-subject classification (BSC, multivariate pattern classification based on other subjects' data) of response-pattern vectors in common model space greatly exceeded BSC of anatomically aligned responses and matched within-subject classification. Results indicate that population codes for complex visual stimuli in VT cortex are based on response-tuning functions that are common across individuals.
Viewpoints: A High-Performance High-Dimensional Exploratory Data Analysis Tool
NASA Astrophysics Data System (ADS)
Gazis, P. R.; Levit, C.; Way, M. J.
2010-12-01
Scientific data sets continue to increase in both size and complexity. In the past, dedicated graphics systems at supercomputing centers were required to visualize large data sets, but as the price of commodity graphics hardware has dropped and its capability has increased, it is now possible, in principle, to view large complex data sets on a single workstation. To do this in practice, an investigator will need software that is written to take advantage of the relevant graphics hardware. The Viewpoints visualization package described herein is an example of such software. Viewpoints is an interactive tool for exploratory visual analysis of large high-dimensional (multivariate) data. It leverages the capabilities of modern graphics boards (GPUs) to run on a single workstation or laptop. Viewpoints is minimalist: it attempts to do a small set of useful things very well (or at least very quickly) in comparison with similar packages today. Its basic feature set includes linked scatter plots with brushing, dynamic histograms, normalization, and outlier detection/removal. Viewpoints was originally designed for astrophysicists, but it has since been used in a variety of fields that range from astronomy, quantum chemistry, fluid dynamics, machine learning, bioinformatics, and finance to information technology server log mining. In this article, we describe the Viewpoints package and show examples of its usage.
Hou, Jiayi
2015-01-01
An ordinal scale is commonly used to measure health status and disease related outcomes in hospital settings as well as in translational medical research. In addition, repeated measurements are common in clinical practice for tracking and monitoring the progression of complex diseases. Classical methodology based on statistical inference, in particular ordinal modeling, has contributed to the analysis of data in which the response categories are ordered and the number of covariates (p) remains smaller than the sample size (n). With genomic technologies increasingly being applied for more accurate diagnosis and prognosis, high-dimensional data, in which the number of covariates (p) is much larger than the number of samples (n), are generated. To meet these emerging needs, we propose a two-stage algorithm: first, extend the Generalized Monotone Incremental Forward Stagewise (GMIFS) method to the cumulative logit ordinal model; second, combine the GMIFS procedure with the classical mixed-effects model for classifying disease status as the disease progresses over time. We demonstrate the efficiency and accuracy of the proposed models in classification using a time-course microarray dataset collected from the Inflammation and the Host Response to Injury study. PMID:25720102
On the History of Cluster Beams
NASA Astrophysics Data System (ADS)
Becker, E. W.
1986-06-01
The methods to produce and investigate cluster beams have been developed primarily with the use of permanent gases. A summary is given of related work carried out at Marburg and Karlsruhe. The report deals with the effect of carrier gases on cluster beam production; ionization, electrical acceleration and magnetic deflection of cluster beams; the retarding potential mass spectrometry of cluster beams; cluster size measurement by atomic beam attenuation; reflection of cluster beams at solid surfaces; scattering properties of 4He and 3He clusters; the application of cluster beams in plasma physics, and the reduction of space charge problems by acceleration of cluster ions.
ERIC Educational Resources Information Center
Snellings, Patrick; van der Leij, Aryan; Blok, Henk; de Jong, Peter F.
2010-01-01
This study investigated the role of speech perception accuracy and speed in fluent word decoding of reading disabled (RD) children. A same-different phoneme discrimination task with natural speech tested the perception of single consonants and consonant clusters by young but persistent RD children. RD children were slower than chronological age…
Improving the text classification using clustering and a novel HMM to reduce the dimensionality.
Seara Vieira, A; Borrajo, L; Iglesias, E L
2016-11-01
In text classification problems, the representation of a document has a strong impact on the performance of learning systems. The high dimensionality of the classical structured representations can lead to burdensome computations due to the great size of real-world data. Consequently, there is a need for reducing the quantity of handled information to improve the classification process. In this paper, we propose a method to reduce the dimensionality of a classical text representation based on a clustering technique to group documents, and a previously developed Hidden Markov Model to represent them. We have applied tests with the k-NN and SVM classifiers on the OHSUMED and TREC benchmark text corpora using the proposed dimensionality reduction technique. The experimental results obtained are very satisfactory compared to commonly used techniques like InfoGain and the statistical tests performed demonstrate the suitability of the proposed technique for the preprocessing step in a text classification task.
From Ambiguities to Insights: Query-based Comparisons of High-Dimensional Data
NASA Astrophysics Data System (ADS)
Kowalski, Jeanne; Talbot, Conover; Tsai, Hua L.; Prasad, Nijaguna; Umbricht, Christopher; Zeiger, Martha A.
2007-11-01
Genomic technologies will revolutionize drug discovery and development; that much is universally agreed upon. The high dimension of data from such technologies has challenged available data analytic methods; that much is apparent. To date, large-scale data repositories have not been utilized in ways that permit their wealth of information to be efficiently processed for knowledge, presumably due in large part to inadequate analytical tools to address numerous comparisons of high-dimensional data. In candidate gene discovery, expression comparisons are often made between two features (e.g., cancerous versus normal), such that the enumeration of outcomes is manageable. With multiple features, the setting becomes more complex, in terms of comparing expression levels of tens of thousands of transcripts across hundreds of features. In this case, the number of outcomes, while enumerable, rapidly becomes large and unmanageable, and scientific inquiries become more abstract, such as "which one of these (compounds, stimuli, etc.) is not like the others?" We develop analytical tools that promote more extensive, efficient, and rigorous utilization of the public data resources generated by the massive support of genomic studies. Our work innovates by enabling access to such metadata with logically formulated scientific inquiries that define, compare and integrate query-comparison pair relations for analysis. We demonstrate our computational tool's potential to address an outstanding biomedical informatics issue of identifying reliable molecular markers in thyroid cancer. Our proposed query-based comparison (QBC) facilitates access to and efficient utilization of metadata through logically formed inquiries expressed as query-based comparisons by organizing and comparing results from biotechnologies to address applications in biomedicine.
Multivariate linear regression of high-dimensional fMRI data with multiple target variables.
Valente, Giancarlo; Castellanos, Agustin Lage; Vanacore, Gianluca; Formisano, Elia
2014-05-01
Multivariate regression is increasingly used to study the relation between fMRI spatial activation patterns and experimental stimuli or behavioral ratings. With linear models, informative brain locations are identified by mapping the model coefficients. This is a central aspect in neuroimaging, as it provides the sought-after link between the activity of neuronal populations and subject's perception, cognition or behavior. Here, we show that mapping of informative brain locations using multivariate linear regression (MLR) may lead to incorrect conclusions and interpretations. MLR algorithms for high dimensional data are designed to deal with targets (stimuli or behavioral ratings, in fMRI) separately, and the predictive map of a model integrates information deriving from both neural activity patterns and experimental design. Not accounting explicitly for the presence of other targets whose associated activity spatially overlaps with the one of interest may lead to predictive maps of troublesome interpretation. We propose a new model that can correctly identify the spatial patterns associated with a target while achieving good generalization. For each target, the training is based on an augmented dataset, which includes all remaining targets. The estimation on such datasets produces both maps and interaction coefficients, which are then used to generalize. The proposed formulation is independent of the regression algorithm employed. We validate this model on simulated fMRI data and on a publicly available dataset. Results indicate that our method achieves high spatial sensitivity and good generalization and that it helps disentangle specific neural effects from interaction with predictive maps associated with other targets.
Zhao, Lue Ping; Bolouri, Hamid
2016-01-01
Maturing omics technologies enable researchers to generate high dimension omics data (HDOD) routinely in translational clinical studies. In the field of oncology, The Cancer Genome Atlas (TCGA) provided funding support to researchers to generate different types of omics data on a common set of biospecimens with accompanying clinical data and to make the data available for the research community to mine. One important application, and the focus of this manuscript, is to build predictive models for prognostic outcomes based on HDOD. To complement prevailing regression-based approaches, we propose to use an object-oriented regression (OOR) methodology to identify exemplars specified by HDOD patterns and to assess their associations with prognostic outcome. Through computing patient’s similarities to these exemplars, the OOR-based predictive model produces a risk estimate using a patient’s HDOD. The primary advantages of OOR are twofold: reducing the penalty of high dimensionality and retaining the interpretability to clinical practitioners. To illustrate its utility, we apply OOR to gene expression data from non-small cell lung cancer patients in TCGA and build a predictive model for prognostic survivorship among stage I patients, i.e., we stratify these patients by their prognostic survival risks beyond histological classifications. Identification of these high-risk patients helps oncologists to develop effective treatment protocols and post-treatment disease management plans. Using the TCGA data, the total sample is divided into training and validation data sets. After building up a predictive model in the training set, we compute risk scores from the predictive model, and validate associations of risk scores with prognostic outcome in the validation data (p=0.015). PMID:26972839
Giorla, Jean; Masson, Annie; Poggi, Francoise; Quach, Robert; Seytor, Patricia; Garnier, Josselin
2009-03-15
Inertial confinement fusion targets must be carefully designed to ignite their central hot spots and burn. Changes in the optimal implosion could reduce the fusion energy or even prevent ignition. Since there are unavoidable uncertainties due to technological defects and imperfect reproducibility from shot to shot, the fusion energy will remain uncertain. The degree to which a target can tolerate specifications larger than those specified, and the probability with which a particular yield is exceeded, are possible measures of the robustness of that design. This robustness must be assessed in a very high-dimensional parameter space whose variables include every characteristic of the given target and of the associated laser pulse shape, using high-fidelity simulations. Therefore, these studies would remain computationally very intensive. In this paper we propose an approach that consists first of constructing an accurate metamodel of the yield on the whole parameter space with a reasonable data set of simulations. Then the robustness is very quickly assessed for any set of specifications with this surrogate. The yield is approximated by a neural network, and an iterative method adds new points in the data set by means of D-optimal experimental designs. The robustness study of the baseline Laser Megajoule target against one-dimensional defects illustrates this approach. A set of 2000 simulations is sufficient to metamodel the fusion energy on a large 22-dimensional parameter space around the nominal point. Furthermore, a metamodel of the robustness margin against all specifications has been obtained, providing guidance for target fabrication research and development.
Self-Consistent Chaotic Transport in a High-Dimensional Mean-Field Hamiltonian Map Model
Martínez-del-Río, D.; del-Castillo-Negrete, D.; Olvera, A.; ...
2015-10-30
We studied the self-consistent chaotic transport in a Hamiltonian mean-field model. This model provides a simplified description of transport in marginally stable systems including vorticity mixing in strong shear flows and electron dynamics in plasmas. Self-consistency is incorporated through a mean-field that couples all the degrees-of-freedom. The model is formulated as a large set of N coupled standard-like area-preserving twist maps in which the amplitude and phase of the perturbation, rather than being constant like in the standard map, are dynamical variables. Of particular interest is the study of the impact of periodic orbits on the chaotic transport and coherent structures. Furthermore, numerical simulations show that self-consistency leads to the formation of a coherent macro-particle trapped around the elliptic fixed point of the system that appears together with an asymptotic periodic behavior of the mean field. To model this asymptotic state, we introduced a non-autonomous map that allows a detailed study of the onset of global transport. A turnstile-type transport mechanism that allows transport across instantaneous KAM invariant circles in non-autonomous systems is discussed. As a first step to understand transport, we study a special type of orbits referred to as sequential periodic orbits. Using symmetry properties we show that, through replication, high-dimensional sequential periodic orbits can be generated starting from low-dimensional periodic orbits. We show that sequential periodic orbits in the self-consistent map can be continued from trivial (uncoupled) periodic orbits of standard-like maps using numerical and asymptotic methods. Normal forms are used to describe these orbits and to find the values of the map parameters that guarantee their existence. Numerical simulations are used to verify the prediction from the asymptotic methods.
Self-Consistent Chaotic Transport in a High-Dimensional Mean-Field Hamiltonian Map Model
Martínez-del-Río, D.; del-Castillo-Negrete, D.; Olvera, A.; Calleja, R.
2015-10-30
We studied the self-consistent chaotic transport in a Hamiltonian mean-field model. This model provides a simplified description of transport in marginally stable systems including vorticity mixing in strong shear flows and electron dynamics in plasmas. Self-consistency is incorporated through a mean-field that couples all the degrees-of-freedom. The model is formulated as a large set of N coupled standard-like area-preserving twist maps in which the amplitude and phase of the perturbation, rather than being constant like in the standard map, are dynamical variables. Of particular interest is the study of the impact of periodic orbits on the chaotic transport and coherent structures. Furthermore, numerical simulations show that self-consistency leads to the formation of a coherent macro-particle trapped around the elliptic fixed point of the system that appears together with an asymptotic periodic behavior of the mean field. To model this asymptotic state, we introduced a non-autonomous map that allows a detailed study of the onset of global transport. A turnstile-type transport mechanism that allows transport across instantaneous KAM invariant circles in non-autonomous systems is discussed. As a first step to understand transport, we study a special type of orbits referred to as sequential periodic orbits. Using symmetry properties we show that, through replication, high-dimensional sequential periodic orbits can be generated starting from low-dimensional periodic orbits. We show that sequential periodic orbits in the self-consistent map can be continued from trivial (uncoupled) periodic orbits of standard-like maps using numerical and asymptotic methods. Normal forms are used to describe these orbits and to find the values of the map parameters that guarantee their existence. Numerical simulations are used to verify the prediction from the asymptotic methods.
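For orientation, the constant-amplitude standard map that the abstract contrasts with its self-consistent model can be sketched as below. This is only the textbook baseline in one common convention (p' = p + K sin x, x' = x + p' mod 2*pi); the self-consistent model, in which the amplitude and phase become dynamical variables coupled to N such maps, is not reproduced here. The function names and the sample parameters are assumptions for illustration.

```python
import math
from typing import List, Tuple


def standard_map(x: float, p: float, K: float) -> Tuple[float, float]:
    """One step of the area-preserving standard (twist) map with fixed amplitude K."""
    p_new = p + K * math.sin(x)            # kick with constant amplitude and phase
    x_new = (x + p_new) % (2.0 * math.pi)  # twist: the angle advances by the new momentum
    return x_new, p_new


def orbit(x0: float, p0: float, K: float, n: int) -> List[Tuple[float, float]]:
    """Return the initial point followed by n iterates of the map."""
    points = [(x0, p0)]
    for _ in range(n):
        x, p = points[-1]
        points.append(standard_map(x, p, K))
    return points


if __name__ == "__main__":
    # Hypothetical initial condition and kick strength.
    for x, p in orbit(1.0, 0.5, K=0.9, n=5):
        print(f"{x:.6f} {p:.6f}")
```

With K = 0 the momentum is conserved and every orbit is a rigid rotation; in this convention, for 0 < K < 4 the point (pi, 0) is an elliptic fixed point, the kind of structure around which the abstract reports a trapped coherent macro-particle.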
Zhao, Lue Ping; Bolouri, Hamid
2016-04-01
Maturing omics technologies enable researchers to generate high dimension omics data (HDOD) routinely in translational clinical studies. In the field of oncology, The Cancer Genome Atlas (TCGA) provided funding support to researchers to generate different types of omics data on a common set of biospecimens with accompanying clinical data and has made the data available for the research community to mine. One important application, and the focus of this manuscript, is to build predictive models for prognostic outcomes based on HDOD. To complement prevailing regression-based approaches, we propose to use an object-oriented regression (OOR) methodology to identify exemplars specified by HDOD patterns and to assess their associations with prognostic outcome. Through computing patient's similarities to these exemplars, the OOR-based predictive model produces a risk estimate using a patient's HDOD. The primary advantages of OOR are twofold: reducing the penalty of high dimensionality and retaining the interpretability to clinical practitioners. To illustrate its utility, we apply OOR to gene expression data from non-small cell lung cancer patients in TCGA and build a predictive model for prognostic survivorship among stage I patients, i.e., we stratify these patients by their prognostic survival risks beyond histological classifications. Identification of these high-risk patients helps oncologists to develop effective treatment protocols and post-treatment disease management plans. Using the TCGA data, the total sample is divided into training and validation data sets. After building up a predictive model in the training set, we compute risk scores from the predictive model, and validate associations of risk scores with prognostic outcome in the validation data (P-value=0.015).
Shimada, Makoto K; Nishida, Tsunetoshi
2017-04-01
Felsenstein's PHYLIP package of molecular phylogeny tools has been used globally since 1980. The programs are receiving renewed attention because of their character-based user interface, which has the advantage of being scriptable for use with large-scale data studies based on super-computers or massively parallel computing clusters. However, we occasionally found that the PHYLIP Consense program output text file displays two or more divided bootstrap values for the same cluster in its result table; when this happens, the output Newick tree file incorrectly assigns only the last value to that cluster, which disturbs correct estimation of a consensus tree. We ascertained the cause of this aberrant behavior in the bootstrapping calculation. Our rewrite of the Consense program source code outputs bootstrap values, without redundancy, in its result table, and a Newick tree file with appropriate, corresponding bootstrap values. Furthermore, we developed an add-on program and shell script, add_bootstrap.pl and fasta2tre_bs.bsh, to generate a Newick tree containing the topology and branch lengths inferred from the original data along with valid bootstrap values, and to actualize the automated inference of a phylogenetic tree containing the originally inferred topology and branch lengths with bootstrap values, from multiple unaligned sequences, respectively. These programs can be downloaded at: https://github.com/ShimadaMK/PHYLIP_enhance/.
Bayesian Decision Theoretical Framework for Clustering
ERIC Educational Resources Information Center
Chen, Mo
2011-01-01
In this thesis, we establish a novel probabilistic framework for the data clustering problem from the perspective of Bayesian decision theory. The Bayesian decision theory view justifies the important questions: what is a cluster and what a clustering algorithm should optimize. We prove that the spectral clustering (to be specific, the…
Histamine headache; Headache - histamine; Migrainous neuralgia; Headache - cluster; Horton's headache; Vascular headache - cluster ... be related to the body's sudden release of histamine (chemical in the body released during an allergic ...
Sanfilippo, Antonio P.; Calapristi, Augustin J.; Crow, Vernon L.; Hetzler, Elizabeth G.; Turner, Alan E.
2004-05-26
We present an approach to the disambiguation of cluster labels that capitalizes on the notion of semantic similarity to assign WordNet senses to cluster labels. The approach provides interesting insights on how document clustering can provide the basis for developing a novel approach to word sense disambiguation.
Clustering cancer gene expression data by projective clustering ensemble
Yu, Xianxue; Yu, Guoxian
2017-01-01
Gene expression data analysis has paramount implications for gene treatments, cancer diagnosis and other domains. Clustering is an important and promising tool to analyze gene expression data. Gene expression data is often characterized by a large number of genes but limited samples; thus various projective clustering techniques and ensemble techniques have been suggested to combat these challenges. However, it is rather challenging to combine these two kinds of techniques to avoid the curse of dimensionality and to boost the performance of gene expression data clustering. In this paper, we employ a projective clustering ensemble (PCE) to integrate the advantages of projective clustering and ensemble clustering, and to avoid the dilemma of combining multiple projective clusterings. Our experimental results on publicly available cancer gene expression data show PCE can improve the quality of clustering gene expression data by at least 4.5% (on average) compared with other related techniques, including dimensionality reduction based single clustering and ensemble approaches. The empirical study demonstrates that, to further boost the performance of clustering cancer gene expression data, it is necessary and promising to combine projective clustering with ensemble clustering. PCE can serve as an effective alternative technique for clustering gene expression data. PMID:28234920
The Environmental Technology Innovation Clusters Program advises cluster organizations, encourages collaboration between clusters, tracks U.S. environmental technology clusters, and connects EPA programs to cluster needs.
Diffuse Interface Models on Graphs for Classification of High Dimensional Data
2011-01-01
not be appropriate, however, given the original set of labels. This problem is mitigated by including a fidelity term in the minimization problem...2/τ) (2.10) is a common similarity function. Depending on the choice of metric, this similarity function includes the Yaroslavsky filter [58] and...the non-local means filter [9]. 2. Zelnik-Manor and Perona introduced local scaling weights for sparse matrix computations [60]. They start with a
Simon, Richard M; Subramanian, Jyothi; Li, Ming-Chung; Menezes, Supriya
2011-05-01
Developments in whole genome biotechnology have stimulated statistical focus on prediction methods. We review here methodology for classifying patients into survival risk groups and for using cross-validation to evaluate such classifications. Measures of discrimination for survival risk models include separation of survival curves, time-dependent ROC curves and Harrell's concordance index. For high-dimensional data applications, however, computing these measures as re-substitution statistics on the same data used for model development results in highly biased estimates. Most developments in methodology for survival risk modeling with high-dimensional data have utilized separate test data sets for model evaluation. Cross-validation has sometimes been used for optimization of tuning parameters. In many applications, however, the data available are too limited for effective division into training and test sets and consequently authors have often either reported re-substitution statistics or analyzed their data using binary classification methods in order to utilize familiar cross-validation. In this article we have tried to indicate how to utilize cross-validation for the evaluation of survival risk models; specifically how to compute cross-validated estimates of survival distributions for predicted risk groups and how to compute cross-validated time-dependent ROC curves. We have also discussed evaluation of the statistical significance of a survival risk model and evaluation of whether high-dimensional genomic data adds predictive accuracy to a model based on standard covariates alone.
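One of the discrimination measures named above, Harrell's concordance index, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `harrell_c_index` and the toy data are hypothetical, ties in risk score count as half-concordant, and pairs with tied times are simply skipped. It computes only the index itself; the article's point is that the inputs should be cross-validated predictions rather than predictions on the data used to fit the model.

```python
from typing import Sequence


def harrell_c_index(times: Sequence[float],
                    events: Sequence[int],
                    risk: Sequence[float]) -> float:
    """Harrell's C: fraction of comparable pairs ordered correctly by the risk score.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if the observation was censored
    risk   : predicted risk scores (higher score = shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0   # earlier event had the higher predicted risk
                elif risk[i] == risk[j]:
                    concordant += 0.5   # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Hypothetical toy data: the third subject is censored at time 3.
    print(harrell_c_index([1, 2, 3, 4], [1, 1, 0, 1], [0.9, 0.7, 0.2, 0.4]))
```

To follow the article's recommendation, `risk` would be assembled from held-out predictions, with each subject scored by a model fitted without that subject's fold, so that the index is not a re-substitution statistic.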
Cool Cluster Correctly Correlated
Varganov, Sergey Aleksandrovich
2005-01-01
tens of atoms. Therefore, they are quantum objects. Some qualitative information about the geometries of such clusters can be obtained with classical empirical methods, for example geometry optimization using an empirical Lennard-Jones potential. However, to predict their accurate geometries and other physical and chemical properties it is necessary to solve a Schroedinger equation. If one is not interested in dynamics of clusters it is enough to solve the stationary (time-independent) Schroedinger equation (HΦ=EΦ). This equation represents a multidimensional eigenvalue problem. The solution of the Schroedinger equation is a set of eigenvectors (wave functions) and their eigenvalues (energies). The lowest energy solution (wave function) corresponds to the ground state of the cluster. The other solutions correspond to excited states. The wave function gives all information about the quantum state of the cluster and can be used to calculate different physical and chemical properties, such as photoelectron, X-ray, NMR, EPR spectra, dipole moment, polarizability etc. The dimensionality of the Schroedinger equation is determined by the number of particles (nuclei and electrons) in the cluster. The analytic solution is only known for a two particle problem. In order to solve the equation for clusters of interest it is necessary to make a number of approximations and use numerical methods.
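The statement that the stationary Schroedinger equation is an eigenvalue problem whose lowest eigenvalue is the ground-state energy can be illustrated with a toy case. The sketch below is an assumption-laden example, not a method from the thesis: it treats a single particle in a one-dimensional box of unit length (units with hbar = m = 1), discretizes the Hamiltonian by finite differences, and finds the lowest eigenvalue by inverse power iteration. All function names are hypothetical.

```python
import math
from typing import List


def solve_tridiagonal(diag: float, off: float, rhs: List[float]) -> List[float]:
    """Thomas algorithm for a symmetric tridiagonal system with constant entries."""
    n = len(rhs)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = off / diag
    d[0] = rhs[0] / diag
    for i in range(1, n):
        denom = diag - off * c[i - 1]
        c[i] = off / denom
        d[i] = (rhs[i] - off * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def apply_hamiltonian(diag: float, off: float, v: List[float]) -> List[float]:
    """Multiply the tridiagonal Hamiltonian matrix by the vector v."""
    n = len(v)
    out = []
    for i in range(n):
        s = diag * v[i]
        if i > 0:
            s += off * v[i - 1]
        if i < n - 1:
            s += off * v[i + 1]
        out.append(s)
    return out


def ground_state_energy(n_points: int = 99, iterations: int = 60) -> float:
    """Lowest eigenvalue of H = -(1/2) d^2/dx^2 on [0, 1] with zero boundary values."""
    h = 1.0 / (n_points + 1)
    diag, off = 1.0 / h ** 2, -0.5 / h ** 2
    v = [1.0] * n_points
    for _ in range(iterations):
        # Inverse iteration: solving H y = v amplifies the lowest-energy component.
        y = solve_tridiagonal(diag, off, v)
        norm = math.sqrt(sum(t * t for t in y))
        v = [t / norm for t in y]
    # Rayleigh quotient of the normalized vector gives the energy estimate.
    return sum(a * b for a, b in zip(v, apply_hamiltonian(diag, off, v)))


if __name__ == "__main__":
    print(ground_state_energy(), math.pi ** 2 / 2)  # numerical vs. analytic value
```

For clusters the same structure applies, but the dimensionality grows with the number of nuclei and electrons, which is why the approximations and numerical methods mentioned in the text become necessary.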
NASA Astrophysics Data System (ADS)
Talib, Imran; Belgacem, Fethi Bin Muhammad; Asif, Naseer Ahmad; Khalil, Hammad
2017-01-01
In this research article, we derive and analyze an efficient spectral method based on the operational matrices of three-dimensional orthogonal Jacobi polynomials to numerically solve a generalized class of multi-term, high-dimensional fractional-order partial differential equations with mixed partial derivatives. With the aid of the operational matrices, we transform the fractional-order problem into an easily solvable system of algebraic equations, whose solution yields the solution of the original problem. Some test problems are considered to confirm the accuracy and validity of the proposed numerical method. The convergence of the method is verified by comparing the results of our Matlab simulations with the exact solutions in the literature, yielding negligible errors. Moreover, comparative results discussed in the literature are extended and improved in this study.
Improvements in Ionized Cluster-Beam Deposition
NASA Technical Reports Server (NTRS)
Fitzgerald, D. J.; Compton, L. E.; Pawlik, E. V.
1986-01-01
Lower temperatures result in higher purity and fewer equipment problems. In cluster-beam deposition, clusters of atoms are formed by an adiabatic-expansion nozzle: with proper nozzle design, the expanding vapor cools sufficiently to become supersaturated and form clusters of the material to be deposited. The clusters are ionized, accelerated in an electric field, and then impacted on a substrate, where films form. The improved cluster-beam technique is useful for deposition of refractory metals.
Timmerman, Marieke E; Ceulemans, Eva; De Roover, Kim; Van Leeuwen, Karla
2013-12-01
To achieve an insightful clustering of multivariate data, we propose subspace K-means. Its central idea is to model the centroids and cluster residuals in reduced spaces, which allows for dealing with a wide range of cluster types and yields rich interpretations of the clusters. We review the existing related clustering methods, including deterministic, stochastic, and unsupervised learning approaches. To evaluate subspace K-means, we performed a comparative simulation study, in which we manipulated the overlap of subspaces, the between-cluster variance, and the error variance. The study shows that the subspace K-means algorithm is sensitive to local minima but that the problem can be reasonably dealt with by using partitions of various cluster procedures as a starting point for the algorithm. Subspace K-means performs very well in recovering the true clustering across all conditions considered and appears to be superior to its competitor methods: K-means, reduced K-means, factorial K-means, mixtures of factor analyzers (MFA), and MCLUST. The best competitor method, MFA, showed a performance similar to that of subspace K-means in easy conditions but deteriorated in more difficult ones. Using data from a study on parental behavior, we show that subspace K-means analysis provides a rich insight into the cluster characteristics, in terms of both the relative positions of the clusters (via the centroids) and the shape of the clusters (via the within-cluster residuals).
NASA Astrophysics Data System (ADS)
Jazaeri, S.; Amiri-Simkooei, A. R.; Sharifi, M. A.
2012-02-01
GNSS ambiguity resolution is the key issue in the high-precision relative geodetic positioning and navigation applications. It is a problem of integer programming plus integer quality evaluation. Different integer search estimation methods have been proposed for the integer solution of ambiguity resolution. Slow rate of convergence is the main obstacle to the existing methods where tens of ambiguities are involved. Herein, integer search estimation for the GNSS ambiguity resolution based on the lattice theory is proposed. It is mathematically shown that the closest lattice point problem is the same as the integer least-squares (ILS) estimation problem and that the lattice reduction speeds up searching process. We have implemented three integer search strategies: Agrell, Eriksson, Vardy, Zeger (AEVZ), modification of Schnorr-Euchner enumeration (M-SE) and modification of Viterbo-Boutros enumeration (M-VB). The methods have been numerically implemented in several simulated examples under different scenarios and over 100 independent runs. The decorrelation process (or unimodular transformations) has been first used to transform the original ILS problem to a new one in all simulations. We have then applied different search algorithms to the transformed ILS problem. The numerical simulations have shown that AEVZ, M-SE, and M-VB are about 320, 120 and 50 times faster than LAMBDA, respectively, for a search space of dimension 40. This number could change to about 350, 160 and 60 for dimension 45. The AEVZ is shown to be faster than MLAMBDA by a factor of 5. Similar conclusions could be made using the application of the proposed algorithms to the real GPS data.
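To make the integer least-squares (ILS) problem concrete, here is a minimal brute-force sketch that minimises (a - z)^T Q^{-1} (a - z) over integer vectors z near the rounded float solution. It is not one of the AEVZ, M-SE or M-VB searches described in the abstract, and the function name, the `radius` parameter and the example numbers are assumptions for illustration only. Exhaustive search is only feasible in very low dimensions, which is why lattice reduction and smarter enumeration matter.

```python
from itertools import product


def ils_brute_force(a_float, q_inv, radius=2):
    """Minimise (a - z)^T Q^{-1} (a - z) over integer z in a box around round(a).

    a_float : real-valued (float) ambiguity estimates
    q_inv   : inverse of their covariance matrix, as a list of rows
    radius  : half-width of the integer search box (hypothetical parameter)
    """
    n = len(a_float)
    centre = [round(x) for x in a_float]
    best_z, best_cost = None, float("inf")
    for offset in product(range(-radius, radius + 1), repeat=n):
        z = [c + o for c, o in zip(centre, offset)]
        d = [a - zi for a, zi in zip(a_float, z)]
        # Quadratic form d^T Q^{-1} d.
        cost = sum(d[i] * q_inv[i][j] * d[j] for i in range(n) for j in range(n))
        if cost < best_cost:
            best_z, best_cost = z, cost
    return best_z, best_cost


# With strongly correlated ambiguities, the ILS solution can differ from
# simple component-wise rounding (which would give [0, 1] here).
q_inv = [[1.0, -0.9], [-0.9, 1.0]]
z_hat, cost = ils_brute_force([0.4, 0.7], q_inv)
```

The example shows why ambiguities are not simply rounded one at a time: the correlation encoded in Q changes which integer vector is closest.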
Hierarchical clustering in minimum spanning trees.
Yu, Meichen; Hillebrand, Arjan; Tewarie, Prejaas; Meier, Jil; van Dijk, Bob; Van Mieghem, Piet; Stam, Cornelis Jan
2015-02-01
The identification of clusters or communities in complex networks is a recurring problem. The minimum spanning tree (MST), the tree connecting all nodes with minimum total weight, is regarded as an important transport backbone of the original weighted graph. We hypothesize that the clustering of the MST gives insight into the hierarchical structure of weighted graphs. However, existing theories and algorithms have difficulty defining and identifying clusters in trees. Here, we first define clustering in trees and then propose a tree agglomerative hierarchical clustering (TAHC) method for the detection of clusters in MSTs. We then demonstrate that the TAHC method can detect clusters in artificial trees, and also in MSTs of weighted social networks, for which the clusters are in agreement with the previously reported clusters of the original weighted networks. Our results therefore not only indicate that clusters can be found in MSTs, but also that the MSTs contain information about the underlying clusters of the original weighted network.
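The idea that an MST carries cluster information can be sketched with a much simpler stand-in than TAHC: build the MST with Kruskal's algorithm and remove its heaviest edges. This is not the TAHC method of the abstract; the function names and the toy graph are assumptions for illustration, and the graph is assumed to be connected.

```python
def minimum_spanning_tree(n_nodes, edges):
    """Kruskal's algorithm. `edges` is a list of (weight, u, v) tuples."""
    parent = list(range(n_nodes))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:              # edge joins two different components
            parent[ru] = rv
            tree.append((w, u, v))
    return tree


def cut_into_clusters(n_nodes, tree, n_clusters):
    """Drop the (n_clusters - 1) heaviest tree edges; return one label per node."""
    kept = sorted(tree)[: max(len(tree) - (n_clusters - 1), 0)]
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for _, u, v in kept:
        parent[find(u)] = find(v)
    labels = {}
    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    return [labels.setdefault(find(i), len(labels)) for i in range(n_nodes)]


# Toy graph: two tight triangles joined by one heavy edge.
edges = [(1, 0, 1), (2, 1, 2), (1.5, 0, 2), (10, 2, 3),
         (1, 3, 4), (2, 4, 5), (1.5, 3, 5)]
mst = minimum_spanning_tree(6, edges)
clusters = cut_into_clusters(6, mst, n_clusters=2)
```

Cutting the single heaviest MST edge separates the two triangles, which is the kind of structure a tree-based clustering method is meant to recover.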
Hierarchical clustering in minimum spanning trees
NASA Astrophysics Data System (ADS)
Yu, Meichen; Hillebrand, Arjan; Tewarie, Prejaas; Meier, Jil; van Dijk, Bob; Van Mieghem, Piet; Stam, Cornelis Jan
2015-02-01
The identification of clusters or communities in complex networks is a recurring problem. The minimum spanning tree (MST), the tree connecting all nodes with minimum total weight, is regarded as an important transport backbone of the original weighted graph. We hypothesize that the clustering of the MST gives insight into the hierarchical structure of weighted graphs. However, existing theories and algorithms have difficulty defining and identifying clusters in trees. Here, we first define clustering in trees and then propose a tree agglomerative hierarchical clustering (TAHC) method for the detection of clusters in MSTs. We then demonstrate that the TAHC method can detect clusters in artificial trees, and also in MSTs of weighted social networks, for which the clusters are in agreement with the previously reported clusters of the original weighted networks. Our results therefore not only indicate that clusters can be found in MSTs, but also that the MSTs contain information about the underlying clusters of the original weighted network.
NASA Technical Reports Server (NTRS)
Stothers, Richard B.; Chin, Chao-Wen
1992-01-01
New theoretical evolutionary sequences of models for stars with low metallicities, appropriate to the Small Magellanic Cloud, are derived with both standard Cox-Stewart opacities and the new Rogers-Iglesias opacities. Only those sequences with little or no convective core overshooting are found to be capable of reproducing the two most critical observations: the maximum effective temperature displayed by the hot evolved stars and the difference between the average bolometric magnitudes of the hot and cool evolved stars. An upper limit to the ratio of the mean overshoot distance beyond the classical Schwarzschild core boundary to the local pressure scale height is set at 0.2. It is inferred from the frequency of cool supergiants in NGC 330 that the Ledoux criterion, rather than the Schwarzschild criterion, for convection and semiconvection in the envelopes of massive stars is strongly favored. Residuals from the fitting for NGC 330 suggest the possibility of fast interior rotation in the stars of this cluster. NGC 330 and NGC 458 have ages of about 3 × 10^7 and about 1 × 10^8 yr, respectively.
Overview on techniques in cluster analysis.
Frades, Itziar; Matthiesen, Rune
2010-01-01
Clustering is the unsupervised, semisupervised, and supervised classification of patterns into groups. The clustering problem has been addressed in many contexts and disciplines. Cluster analysis encompasses different methods and algorithms for grouping objects of similar kinds into respective categories. In this chapter, we describe a number of methods and algorithms for cluster analysis in a stepwise framework. The steps of a typical clustering analysis process include sequentially pattern representation, the choice of the similarity measure, the choice of the clustering algorithm, the assessment of the output, and the representation of the clusters.
Medicolegal issues in cluster headache.
Loder, Elizabeth; Loder, John
2004-04-01
This paper identifies legal issues of relevance to the diagnosis and treatment of cluster headache, including areas of actual and potential malpractice liability. Legal topics that are relevant to cluster headache can be divided into five categories: diagnostic-related issues, risks inherent in the disease process, prescribing and treatment-related problems, research-related issues, and disability determination.
Sample-Based Motion Planning in High-Dimensional and Differentially-Constrained Systems
2010-02-01
Sample-Based Planning The book by LaValle [LaValle, 2006] provides a broad overview of state of the art planning problems and algorithms. Much of the...or to describe the LittleDog robot standing on its two hind legs. Although underactuated systems are typically not feedback linearizable, it is...Grizzle, 2007]). Full and Koditschek, [Full and Koditschek, 1999] gave this idea some broad appeal by describing “templates and anchors”, and suggesting
Local-learning-based feature selection for high-dimensional data analysis.
Sun, Yijun; Todorovic, Sinisa; Goodison, Steve
2010-09-01
This paper considers feature selection for data classification in the presence of a huge number of irrelevant features. We propose a new feature-selection algorithm that addresses several major issues with prior work, including problems with algorithm implementation, computational complexity, and solution accuracy. The key idea is to decompose an arbitrarily complex nonlinear problem into a set of locally linear ones through local learning, and then learn feature relevance globally within the large margin framework. The proposed algorithm is based on well-established machine learning and numerical analysis techniques, without making any assumptions about the underlying data distribution. It is capable of processing many thousands of features within minutes on a personal computer while maintaining a very high accuracy that is nearly insensitive to a growing number of irrelevant features. Theoretical analyses of the algorithm's sample complexity suggest that the algorithm has a logarithmical sample complexity with respect to the number of features. Experiments on 11 synthetic and real-world data sets demonstrate the viability of our formulation of the feature-selection problem for supervised learning and the effectiveness of our algorithm.
Local-Learning-Based Feature Selection for High-Dimensional Data Analysis
Sun, Yijun; Todorovic, Sinisa; Goodison, Steve
2012-01-01
This paper considers feature selection for data classification in the presence of a huge number of irrelevant features. We propose a new feature-selection algorithm that addresses several major issues with prior work, including problems with algorithm implementation, computational complexity, and solution accuracy. The key idea is to decompose an arbitrarily complex nonlinear problem into a set of locally linear ones through local learning, and then learn feature relevance globally within the large margin framework. The proposed algorithm is based on well-established machine learning and numerical analysis techniques, without making any assumptions about the underlying data distribution. It is capable of processing many thousands of features within minutes on a personal computer while maintaining a very high accuracy that is nearly insensitive to a growing number of irrelevant features. Theoretical analyses of the algorithm’s sample complexity suggest that the algorithm has a logarithmical sample complexity with respect to the number of features. Experiments on 11 synthetic and real-world data sets demonstrate the viability of our formulation of the feature-selection problem for supervised learning and the effectiveness of our algorithm. PMID:20634556
Pyne, Saumyadipta; Lee, Sharon X; Wang, Kui; Irish, Jonathan; Tamayo, Pablo; Nazaire, Marc-Danie; Duong, Tarn; Ng, Shu-Kay; Hafler, David; Levy, Ronald; Nolan, Garry P; Mesirov, Jill; McLachlan, Geoffrey J
2014-01-01
In biomedical applications, an experimenter encounters different potential sources of variation in data such as individual samples, multiple experimental conditions, and multivariate responses of a panel of markers such as from a signaling network. In multiparametric cytometry, which is often used for analyzing patient samples, such issues are critical. While computational methods can identify cell populations in individual samples, without the ability to automatically match them across samples, it is difficult to compare and characterize the populations in typical experiments, such as those responding to various stimulations or distinctive of particular patients or time-points, especially when there are many samples. Joint Clustering and Matching (JCM) is a multi-level framework for simultaneous modeling and registration of populations across a cohort. JCM models every population with a robust multivariate probability distribution. Simultaneously, JCM fits a random-effects model to construct an overall batch template--used for registering populations across samples, and classifying new samples. By tackling systems-level variation, JCM supports practical biomedical applications involving large cohorts. Software for fitting the JCM models has been implemented in an R package, EMMIX-JCM, available from http://www.maths.uq.edu.au/~gjm/mix_soft/EMMIX-JCM/.
NASA Technical Reports Server (NTRS)
Benediktsson, J. A.; Swain, P. H.; Ersoy, O. K.
1993-01-01
Application of neural networks to classification of remote sensing data is discussed. Conventional two-layer backpropagation is found to give good results in classification of remote sensing data but is not efficient in training. A more efficient variant, based on conjugate-gradient optimization, is used for classification of multisource remote sensing and geographic data and very-high-dimensional data. The conjugate-gradient neural networks give excellent performance in classification of multisource data, but do not compare as well with statistical methods in classification of very-high-dimensional data.
Okuno, Yuta; Small, Michael; Gotoda, Hiroshi
2015-04-01
We have examined the dynamics of self-excited thermoacoustic instability in a fundamentally and practically important gas-turbine model combustion system on the basis of complex network approaches. We have incorporated sophisticated complex networks consisting of cycle networks and phase space networks, neither of which has been considered in the areas of combustion physics and science. Pseudo-periodicity and high-dimensionality exist in the dynamics of thermoacoustic instability, including the possible presence of a clear power-law distribution and small-world-like nature.
Garashchuk, Sophya; Rassolov, Vitaly A
2008-07-14
Semiclassical implementation of the quantum trajectory formalism [J. Chem. Phys. 120, 1181 (2004)] is further developed to give a stable long-time description of zero-point energy in anharmonic systems of high dimensionality. The method is based on a numerically cheap linearized quantum force approach; stabilizing terms compensating for the linearization errors are added into the time-evolution equations for the classical and nonclassical components of the momentum operator. The wave function normalization and energy are rigorously conserved. Numerical tests are performed for model systems of up to 40 degrees of freedom.
A Selective Review of Group Selection in High-Dimensional Models.
Huang, Jian; Breheny, Patrick; Ma, Shuangge
2012-01-01
Grouping structures arise naturally in many statistical modeling problems. Several methods have been proposed for variable selection that respect grouping structure in variables. Examples include the group LASSO and several concave group selection methods. In this article, we give a selective review of group selection concerning methodological developments, theoretical properties and computational algorithms. We pay particular attention to group selection methods involving concave penalties. We address both group selection and bi-level selection methods. We describe several applications of these methods in nonparametric additive models, semiparametric regression, seemingly unrelated regressions, genomic data analysis and genome wide association studies. We also highlight some issues that require further study.
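A building block shared by many group LASSO solvers is group-wise soft thresholding, the proximal operator of the penalty lam * sum_g ||beta_g||_2. The sketch below is a minimal illustration under that assumption; the function name and inputs are hypothetical, and the group-size weights that are often included in the penalty are omitted for brevity.

```python
import math


def group_soft_threshold(z, groups, lam):
    """Shrink each group of coefficients towards zero as a block.

    z      : list of coefficients (e.g. after a gradient step)
    groups : list of index lists, one per group (assumed non-overlapping)
    lam    : penalty level; a group with Euclidean norm <= lam is zeroed out
    """
    out = list(z)
    for idx in groups:
        norm = math.sqrt(sum(z[i] ** 2 for i in idx))
        # Scale factor max(0, 1 - lam / ||z_g||); an all-zero group stays zero.
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        for i in idx:
            out[i] = scale * z[i]
    return out


# First group has norm 5 and is shrunk; second has norm 0.5 and is removed.
shrunk = group_soft_threshold([3.0, 4.0, 0.3, -0.4], [[0, 1], [2, 3]], lam=1.0)
```

Because the whole group is scaled by one factor, variables in a group enter or leave the model together, which is the sense in which the selection respects the grouping structure.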
Absolute classification with unsupervised clustering
NASA Technical Reports Server (NTRS)
Jeon, Byeungwoo; Landgrebe, D. A.
1992-01-01
An absolute classification algorithm is proposed in which the class definition through training samples or otherwise is required only for a particular class of interest. The absolute classification is considered as a problem of unsupervised clustering when one cluster is known initially. The definitions and statistics of the other classes are automatically developed through the weighted unsupervised clustering procedure, which is developed to keep the cluster corresponding to the class of interest from losing its identity as the class of interest. Once all the classes are developed, a conventional relative classifier such as the maximum-likelihood classifier is used in the classification.
Hemmelmann, Claudia; Ziegler, Andreas; Guiard, Volker; Weiss, Sabine; Walther, Mario; Vollandt, Rüdiger
2008-05-15
Frequency analyses of EEG data yield large data sets, which are high-dimensional and have to be evaluated statistically without a large number of false positive statements. There exist several methods to deal with this problem in multiple comparisons. Knowing the number of true hypotheses increases the power of some multiple test procedures; however, the number of true hypotheses is generally unknown and must be estimated. In this paper, we derive two new multiple test procedures by using an upper bound for the number of true hypotheses. Our first procedure controls the generalized family-wise error rate, and thus is an improvement of the step-down procedure of Hommel and Hoffmann [Hommel G., Hoffmann T. Controlled uncertainty. In: Bauer P. Hommel G. Sonnemann E., editors. Multiple Hypotheses Testing, Heidelberg: Springer 1987;ISBN 3540505598:p. 154-61]. The second new procedure controls the false discovery proportion and improves upon the approach of Lehmann and Romano [Lehmann E.L., Romano J.P. Generalizations of the familywise error rate. Ann. Stat. 2005;33:1138-54]. By Monte-Carlo simulations, we show how the gain in power depends upon the accuracy of the estimate of the number of true hypotheses. The gain in power of our procedures is demonstrated in an example using EEG data on the processing of memorized lexical items.
Noe, F; Oswald, Marcus; Reinelt, Gerhard; Fischer, S.; Smith, Jeremy C
2006-01-01
The direct computation of rare transitions in high-dimensional dynamical systems such as biomolecules via numerical integration or Monte Carlo is limited by the sampling problem. Alternatively, the dynamics of these systems can be modeled by transition networks (TNs) which are weighted graphs whose edges represent transitions between stable states of the system. The computation of the globally best transition paths connecting two selected stable states is straightforward with available graph-theoretical methods. However, these methods require that the energy barriers of all TN edges be determined, which is often computationally infeasible for large systems. Here, we introduce energy-bounded TNs, in which the transition barriers are specified in terms of lower and upper bounds. We present algorithms permitting the determination of the globally best paths on these TNs while requiring the computation of only a small subset of the true transition barriers. Several variants of the algorithm are given which achieve improved performance, including a parallel version. The effectiveness of the approach is demonstrated by various benchmarks on random TNs and by computing the refolding pathways of a polypeptide: the best transition pathways between the alphaL helix, alphaR helix, and beta-hairpin conformations of the octaalanine (Ala8) molecule in aqueous solution.
NASA Technical Reports Server (NTRS)
1999-01-01
Penetrating 25,000 light-years of obscuring dust and myriad stars, NASA's Hubble Space Telescope has provided the clearest view yet of one of the largest young clusters of stars inside our Milky Way galaxy, located less than 100 light-years from the very center of the Galaxy. With a mass equivalent to more than 10,000 stars like our sun, the monster cluster is ten times larger than typical young star clusters scattered throughout our Milky Way. It is destined to be ripped apart in just a few million years by gravitational tidal forces in the galaxy's core. But in its brief lifetime it shines more brightly than any other star cluster in the Galaxy. The Quintuplet Cluster is 4 million years old. It has stars on the verge of blowing up as supernovae. It is the home of the brightest star seen in the galaxy, called the Pistol star. This image was taken in infrared light by Hubble's NICMOS camera in September 1997. The false colors correspond to infrared wavelengths. The galactic center stars are white, the red stars are enshrouded in dust or behind dust, and the blue stars are foreground stars between us and the Milky Way's center. The cluster is hidden from direct view behind black dust clouds in the constellation Sagittarius. If the cluster could be seen from earth it would appear to the naked eye as a 3rd magnitude star, 1/6th of a full moon's diameter across.
ℓ(1)-penalized linear mixed-effects models for high dimensional data with application to BCI.
Fazli, Siamac; Danóczy, Márton; Schelldorfer, Jürg; Müller, Klaus-Robert
2011-06-15
Recently, a novel statistical model has been proposed to estimate population effects and individual variability between subgroups simultaneously, by extending Lasso methods. We will for the first time apply this so-called ℓ(1)-penalized linear regression mixed-effects model for a large scale real world problem: we study a large set of brain computer interface data and through the novel estimator are able to obtain a subject-independent classifier that compares favorably with prior zero-training algorithms. This unifying model inherently compensates shifts in the input space attributed to the individuality of a subject. In particular we are now for the first time able to differentiate within-subject and between-subject variability. Thus a deeper understanding both of the underlying statistical and physiological structures of the data is gained.
NASA Astrophysics Data System (ADS)
Sjöstrand, Karl; Cardenas, Valerie A.; Larsen, Rasmus; Studholme, Colin
2008-03-01
Whole-brain morphometry denotes a group of methods with the aim of relating clinical and cognitive measurements to regions of the brain. Typically, such methods require the statistical analysis of a data set with many variables (voxels and exogenous variables) paired with few observations (subjects). A common approach to this ill-posed problem is to analyze each spatial variable separately, dividing the analysis into manageable subproblems. A disadvantage of this method is that the correlation structure of the spatial variables is not taken into account. This paper investigates the use of ridge regression to address this issue, allowing for a gradual introduction of correlation information into the model. We make the connections between ridge regression and voxel-wise procedures explicit and discuss relations to other statistical methods. Results are given on an in-vivo data set of deformation based morphometry from a study of cognitive decline in an elderly population.
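For readers unfamiliar with ridge regression, the following minimal sketch solves the ridge normal equations (X^T X + lam I) beta = X^T y with plain Gauss-Jordan elimination. It is a generic illustration, not the authors' implementation; the function name and the toy data are assumptions. The penalty keeps the system solvable even with more variables than observations, and for very large lam the estimate approaches X^T y / lam, in which each variable is weighted on its own.

```python
def ridge_fit(X, y, lam):
    """Solve (X^T X + lam * I) beta = X^T y by Gauss-Jordan elimination.

    X   : list of observation rows (n rows, p columns)
    y   : list of n responses
    lam : ridge penalty, assumed > 0 so the system is non-singular
    """
    n, p = len(X), len(X[0])
    # Augmented matrix [X^T X + lam I | X^T y].
    A = [
        [sum(X[k][i] * X[k][j] for k in range(n)) + (lam if i == j else 0.0)
         for j in range(p)]
        + [sum(X[k][i] * y[k] for k in range(n))]
        for i in range(p)
    ]
    for col in range(p):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# One observation, two perfectly correlated variables: ordinary least
# squares has no unique solution, but the ridge estimate is well defined.
beta = ridge_fit([[1.0, 1.0]], [2.0], lam=1.0)
```

In the toy example the two identical columns share the signal equally, which illustrates how the penalty handles the many-variables, few-observations setting.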
NASA Astrophysics Data System (ADS)
Krick, Kessica
This proposal is a specific response to the strategic goal of NASA's research program to "discover how the universe works and explore how the universe evolved into its present form." Towards this goal, we propose to mine the Spitzer archive for all observations of galaxy groups and clusters for the purpose of studying galaxy evolution in clusters, contamination rates for Sunyaev-Zeldovich cluster surveys, and to provide a database of Spitzer-observed clusters to the broader community. Funding from this proposal will go towards two years of support for a postdoc to do this work. After searching the Spitzer Heritage Archive, we have found 194 unique galaxy groups and clusters that have data from both the Infrared Array Camera (IRAC; Fazio et al. 2004) at 3.6 - 8 microns and the Multiband Imaging Photometer for Spitzer (MIPS; Rieke et al. 2004) at 24 microns. This large sample will add value beyond the individual datasets because it will be a larger sample of IR clusters than ever before and will have sufficient diversity in mass, redshift, and dynamical state to allow us to differentiate amongst the effects of these cluster properties. An infrared sample is important because it is unaffected by dust extinction while at the same time is an excellent measure of both stellar mass (IRAC wavelengths) and star formation rate (MIPS wavelengths). Additionally, IRAC can be used to differentiate star forming galaxies (SFG) from active galactic nuclei (AGN), due to their different spectral shapes in this wavelength regime. Specifically, we intend to identify SFG and AGN in galaxy groups and clusters. Groups and clusters differ from the field because the galaxy densities are higher, there is a large potential well due mainly to the mass of the dark matter, and there is hot X-ray gas (the intracluster medium; ICM). We will examine the impact of these differences in environment on galaxy formation by comparing cluster properties of AGN and SFG to those in the field. Also, we will
NASA Astrophysics Data System (ADS)
Miller, Christopher J.
2012-03-01
There are many examples of clustering in astronomy. Stars in our own galaxy are often seen as being gravitationally bound into tight globular or open clusters. The Solar System's Trojan asteroids cluster at the gravitational Lagrangian point ahead of Jupiter in its orbit. On the largest of scales, we find gravitationally bound clusters of galaxies, the Virgo cluster (in the constellation of Virgo at a distance of ~50 million light years) being a prime nearby example. The Virgo cluster subtends an angle of nearly 8° on the sky and is known to contain over a thousand member galaxies. Galaxy clusters play an important role in our understanding of the Universe. Clusters exist at peaks in the three-dimensional large-scale matter density field. Their sky (2D) locations are easy to detect in astronomical imaging data and their mean galaxy redshifts (redshift is related to the third spatial dimension: distance) are often better (spectroscopically) and cheaper (photometrically) when compared with the entire galaxy population in large sky surveys. Photometric redshift (z) [Photometric techniques use the broad band filter magnitudes of a galaxy to estimate the redshift. Spectroscopic techniques use the galaxy spectra and emission/absorption line features to measure the redshift] determinations of galaxies within clusters are accurate to better than delta_z = 0.05 [7] and when studied as a cluster population, the central galaxies form a line in color-magnitude space (called the E/S0 ridgeline and visible in Figure 16.3) that contains galaxies with similar stellar populations [15]. The shape of this E/S0 ridgeline enables astronomers to measure the cluster redshift to within delta_z = 0.01 [23]. The most accurate cluster redshift determinations come from spectroscopy of the member galaxies, where only a fraction of the members need to be spectroscopically observed [25,42] to get an accurate redshift to the whole system. If light traces mass in the Universe, then the locations
NASA Astrophysics Data System (ADS)
Labhardt, Lukas; Binggeli, Bruno
Star clusters are at the heart of astronomy, being key objects for our understanding of stellar evolution and galactic structure. Observations with the Hubble Space Telescope and other modern equipment have revealed fascinating new facts about these galactic building blocks. This book provides two comprehensive and up-to-date, pedagogically designed reviews on star clusters by two well-known experts in the field. Bruce Carney presents our current knowledge of the relative and absolute ages of globular clusters and the chemical history of our Galaxy. Bill Harris addresses globular clusters in external galaxies and their use as tracers of galaxy formation and cosmic distance indicators. The book is written for graduate students as well as professionals in astronomy and astrophysics.
ERIC Educational Resources Information Center
Pottawattamie County School System, Council Bluffs, IA.
The 15 occupational clusters (transportation, fine arts and humanities, communications and media, personal service occupations, construction, hospitality and recreation, health occupations, marine science occupations, consumer and homemaking-related occupations, agribusiness and natural resources, environment, public service, business and office…
Donchev, Todor I.; Petrov, Ivan G.
2011-05-31
Described herein is an apparatus and a method for producing atom clusters based on a gas discharge within a hollow cathode. The hollow cathode includes one or more walls. The one or more walls define a sputtering chamber within the hollow cathode and include a material to be sputtered. A hollow anode is positioned at an end of the sputtering chamber, and atom clusters are formed when a gas discharge is generated between the hollow anode and the hollow cathode.
Robust High-dimensional Bioinformatics Data Streams Mining by ODR-ioVFDT
Wang, Dantong; Fong, Simon; Wong, Raymond K.; Mohammed, Sabah; Fiaidhi, Jinan; Wong, Kelvin K. L.
2017-01-01
Outlier detection in bioinformatics data stream mining has received significant attention from research communities in recent years. The problems of how to distinguish noise from an exception, and of deciding whether to discard it or to devise an extra decision path to accommodate it, pose a dilemma. In this paper, we propose a novel algorithm called ODR with incrementally Optimized Very Fast Decision Tree (ODR-ioVFDT) for handling outliers in the course of continuous data learning. Using an adaptive interquartile-range-based identification method, a tolerance threshold is set. It is then used to judge whether a data value that appears exceptional should be included in training. This differs from traditional outlier detection/removal approaches, which are two separate steps in processing the data. The proposed algorithm is tested on datasets from five bioinformatics scenarios, comparing the performance of our model with that of others without ODR. The results show that ODR-ioVFDT has better performance in classification accuracy, kappa statistics, and time consumption. Applied to bioinformatics streaming data processing for detecting and quantifying information on life phenomena, states, characters, variables and components of the organism, ODR-ioVFDT can help to diagnose and treat disease more effectively. PMID:28230161
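The general idea of an interquartile-range tolerance threshold on a data stream can be sketched as follows. This is not the ODR-ioVFDT algorithm itself: the class name, the sliding-window design, and the `window`, `k` and `warmup` parameters are all assumptions made for illustration, and the decision-tree learner is left out entirely.

```python
import statistics
from collections import deque


class IqrGate:
    """Hypothetical streaming filter based on interquartile-range fences.

    A new value is rejected when it falls outside
    [Q1 - k * IQR, Q3 + k * IQR], where the quartiles are computed from a
    sliding window of previously accepted values.
    """

    def __init__(self, window=50, k=1.5, warmup=4):
        self.recent = deque(maxlen=window)   # recently accepted values
        self.k = k
        self.warmup = warmup                 # accept unconditionally at first

    def accept(self, x):
        if len(self.recent) >= self.warmup:
            q1, _, q3 = statistics.quantiles(self.recent, n=4)
            spread = self.k * (q3 - q1)
            if x < q1 - spread or x > q3 + spread:
                return False                 # treated as an outlier; not stored
        self.recent.append(x)
        return True


# Made-up stream with one obvious spike.
gate = IqrGate()
stream = [10, 11, 12, 13, 14, 15, 16, 100, 12]
decisions = [gate.accept(x) for x in stream]
```

Only accepted values would be passed on to an incremental learner, so identification and training happen in a single pass over the stream, and the fences adapt as the window moves.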
Statistical Significance of Clustering using Soft Thresholding
Huang, Hanwen; Liu, Yufeng; Yuan, Ming; Marron, J. S.
2015-01-01
Clustering methods have led to a number of important discoveries in bioinformatics and beyond. A major challenge in their use is determining which clusters represent important underlying structure, as opposed to spurious sampling artifacts. This challenge is especially serious, and very few methods are available, when the data are very high in dimension. Statistical Significance of Clustering (SigClust) is a recently developed cluster evaluation tool for high dimensional low sample size data. An important component of the SigClust approach is the very definition of a single cluster as a subset of data sampled from a multivariate Gaussian distribution. The implementation of SigClust requires the estimation of the eigenvalues of the covariance matrix for the null multivariate Gaussian distribution. We show that the original eigenvalue estimation can lead to a test that suffers from severe inflation of type-I error, in the important case where there are a few very large eigenvalues. This paper addresses this critical challenge using a novel likelihood based soft thresholding approach to estimate these eigenvalues, which leads to a much improved SigClust. Major improvements in SigClust performance are shown by both mathematical analysis, based on the new notion of Theoretical Cluster Index, and extensive simulation studies. Applications to some cancer genomic data further demonstrate the usefulness of these improvements. PMID:26755893
The molecular matching problem
NASA Technical Reports Server (NTRS)
Kincaid, Rex K.
1993-01-01
Molecular chemistry contains many difficult optimization problems that have begun to attract the attention of optimizers in the Operations Research community. Problems including protein folding, molecular conformation, molecular similarity, and molecular matching have been addressed. Minimum energy conformations for simple molecular structures such as water clusters, Lennard-Jones microclusters, and short polypeptides have dominated the literature to date. However, a variety of interesting problems exist and we focus here on a molecular structure matching (MSM) problem.
Big Data Analytics for Demand Response: Clustering Over Space and Time
Chelmis, Charalampos; Kolte, Jahanvi; Prasanna, Viktor K.
2015-10-29
The pervasive deployment of advanced sensing infrastructure in Cyber-Physical systems, such as the Smart Grid, has resulted in an unprecedented data explosion. Such data exhibit both large volumes and high velocity characteristics, two of the three pillars of Big Data, and have a time-series notion as datasets in this context typically consist of successive measurements made over a time interval. Time-series data can be valuable for data mining and analytics tasks such as identifying the “right” customers among a diverse population, to target for Demand Response programs. However, time series are challenging to mine due to their high dimensionality. In this paper, we motivate this problem using a real application from the smart grid domain. We explore novel representations of time-series data for Big Data analytics, and propose a clustering technique for determining natural segmentation of customers and identification of temporal consumption patterns. Our method is generalizable to large-scale, real-world scenarios, without making any assumptions about the data. We evaluate our technique using real datasets from smart meters, totaling ~18,200,000 data points, and show the efficacy of our technique in efficiently detecting the optimal number of clusters.
Pseudospectral sampling of Gaussian basis sets as a new avenue to high-dimensional quantum dynamics
NASA Astrophysics Data System (ADS)
Heaps, Charles
This thesis presents a novel approach to modeling quantum molecular dynamics (QMD). Theoretical approaches to QMD are essential to understanding and predicting chemical reactivity and spectroscopy. We implement a method based on a trajectory-guided basis set. In this case, the nuclei are propagated in time using classical mechanics. Each nuclear configuration corresponds to a basis function in the quantum mechanical expansion. Using the time-dependent configurations as a basis set, we are able to evolve in time using relatively little information at each time step. We use a basis set of moving frozen (time-independent width) Gaussian functions that are well-known to provide a simple and efficient basis set for nuclear dynamics. We introduce a new perspective to trajectory-guided Gaussian basis sets based on existing numerical methods. The distinction is based on the Galerkin and collocation methods. In the former, the basis set is tested using basis functions, projecting the solution onto the functional space of the problem and requiring integration over all space. In the collocation method, the Dirac delta function tests the basis set, projecting the solution onto discrete points in space. This effectively reduces the integral evaluation to function evaluation, a fundamental characteristic of pseudospectral methods. We adopt this idea for independent trajectory-guided Gaussian basis functions. We investigate a series of anharmonic vibrational models describing dynamics in up to six dimensions. The pseudospectral sampling is found to be as accurate as full integral evaluation, while the former method is fully general and integration is only possible on very particular model potential energy surfaces. Nonadiabatic dynamics are also investigated in models of photodissociation and collinear triatomic vibronic coupling. Using Ehrenfest trajectories to guide the basis set on multiple surfaces, we observe convergence to exact results using hundreds of basis functions
Clustering of High Throughput Gene Expression Data
Pirim, Harun; Ekşioğlu, Burak; Perkins, Andy; Yüceer, Çetin
2012-01-01
High throughput biological data need to be processed, analyzed, and interpreted to address problems in life sciences. Bioinformatics, computational biology, and systems biology deal with biological problems using computational methods. Clustering is one of the methods used to gain insight into biological processes, particularly at the genomics level. Clearly, clustering can be used in many areas of biological data analysis. However, this paper presents a review of the current clustering algorithms designed especially for analyzing gene expression data. It is also intended to introduce one of the main problems in bioinformatics - clustering gene expression data - to the operations research community. PMID:23144527
NASA Astrophysics Data System (ADS)
Watanabe, Tomohiko; Sugitani, Yoshiki; Konishi, Keiji; Hara, Naoyuki
2017-01-01
The present paper studies amplitude death in high-dimensional maps coupled by time-delay connections. A linear stability analysis provides several sufficient conditions for an amplitude death state to be unstable, i.e., an odd number property and its extended properties. Furthermore, necessary conditions for stability are provided. These conditions, which reduce trial-and-error tasks for design, and the convex direction, which is a popular concept in the field of robust control, allow us to propose a design procedure for system parameters, such as coupling strength, connection delay, and input-output matrices, for a given network topology. These analytical results are confirmed numerically using delayed logistic maps, generalized Henon maps, and piecewise linear maps.
NASA Astrophysics Data System (ADS)
Mehta, Piyush M.; Kubicek, Martin; Minisci, Edmondo; Vasile, Massimiliano
2017-01-01
Well-known tools developed for satellite and debris re-entry perform break-up and trajectory simulations in a deterministic sense and do not perform any uncertainty treatment. The treatment of uncertainties associated with the re-entry of a space object requires a probabilistic approach. A Monte Carlo campaign is the intuitive approach to performing a probabilistic analysis; however, it is computationally very expensive. In this work, we use a recently developed approach based on a new derivation of the high dimensional model representation method for implementing a computationally efficient probabilistic analysis approach for re-entry. Both aleatoric and epistemic uncertainties that affect aerodynamic trajectory and ground impact location are considered. The method is applicable to both controlled and uncontrolled re-entry scenarios. The resulting ground impact distributions are far from the typically used Gaussian or ellipsoid distributions.
Li, Ke; Liu, Yi; Wang, Quanxin; Wu, Yalei; Song, Shimin; Sun, Yi; Liu, Tengchong; Wang, Jun; Li, Yang; Du, Shaoyi
2015-01-01
This paper proposes a novel multi-label classification method for resolving problems in spacecraft electrical characteristics that involve processing many unlabeled test data, high-dimensional features, long computing time and a slow identification rate. Firstly, both the fuzzy c-means (FCM) offline clustering and the principal component feature extraction algorithms are applied for the feature selection process. Secondly, the approximate weighted proximal support vector machine (WPSVM) online classification algorithm is used to reduce the feature dimension and further improve the recognition rate for spacecraft electrical characteristics. Finally, a data capture contribution method using thresholds is proposed to guarantee the validity and consistency of the data selection. The experimental results indicate that the proposed method can obtain better data features of the spacecraft electrical characteristics, improve the accuracy of identification and shorten the computing time effectively. PMID:26544549
2013-01-01
optimization problem (2)–(3) is convex and can 1We adopt the convention that yii = 1 for any node i that belongs to a cluster. 2We assume aii = 1 for all i. 3The...relaxations: The formulation (2)–(3) is not the only way to relax the non-convex ML estimator. Instead of the nuclear norm regularizer, a hard constraint ...presented a convex optimization formulation, essentially a convexification of the maximum likelihood estimator. Our theoretic analysis shows that this
Active matter clusters at interfaces.
NASA Astrophysics Data System (ADS)
Copenhagen, Katherine; Gopinathan, Ajay
2016-03-01
Collective and directed motility or swarming is an emergent phenomenon displayed by many self-organized assemblies of active biological matter such as clusters of embryonic cells during tissue development, cancerous cells during tumor formation and metastasis, colonies of bacteria in a biofilm, or even flocks of birds and schools of fish at the macro-scale. Such clusters typically encounter very heterogeneous environments. What happens when a cluster encounters an interface between two different environments has implications for its function and fate. Here we study this problem by using a mathematical model of a cluster that treats it as a single cohesive unit that moves in two dimensions by exerting a force/torque per unit area whose magnitude depends on the nature of the local environment. We find that low speed (overdamped) clusters encountering an interface with a moderate difference in properties can lead to refraction or even total internal reflection of the cluster. For large speeds (underdamped), where inertia dominates, the clusters show more complex behaviors crossing the interface multiple times and deviating from the predictable refraction and reflection for the low velocity clusters. We then present an extreme limit of the model in the absence of rotational damping where clusters can become stuck spiraling along the interface or move in large circular trajectories after leaving the interface. Our results show a wide range of behaviors that occur when collectively moving active biological matter moves across interfaces and these insights can be used to control motion by patterning environments.
Ride, Jemimah; Rowe, Heather; Wynter, Karen; Fisher, Jane; Lorgelly, Paula
2014-01-01
Introduction Postnatal mental health problems, which are an international public health priority, are a suitable target for preventive approaches. The financial burden of these disorders is borne across sectors in society, including health, early childhood, education, justice and the workforce. This paper describes the planned economic evaluation of What Were We Thinking, a psychoeducational intervention for the prevention of postnatal mental health problems in first-time mothers. Methods and analysis The evaluation will be conducted alongside a cluster-randomised controlled trial of its clinical effectiveness. Cost-effectiveness and cost-utility analyses will be conducted, resulting in estimates of cost per percentage point reduction in combined 30-day prevalence of depression, anxiety and adjustment disorders and cost per quality-adjusted life year gained. Uncertainty surrounding these estimates will be addressed using non-parametric bootstrapping and represented using cost-effectiveness acceptability curves. Additional cost analyses relevant for implementation will also be conducted. Modelling will be employed to estimate longer term cost-effectiveness if the intervention is found to be clinically effective during the period of the trial. Ethics and dissemination Approval to conduct the study was granted by the Southern Health (now Monash Health) Human Research Ethics Committee (24 April 2013; 11388B). The study was registered with the Monash University Human Research Ethics Committee (30 April 2013; CF12/1022-2012000474). The Education and Policy Research Committee, Victorian Government Department of Education and Early Childhood Development approved the study (22 March 2012; 2012_001472). Use of the EuroQol was registered with the EuroQol Group; 16 August 2012. Trial registration number The trial was registered with the Australian New Zealand Clinical Trials Registry on 7 May 2012 (registration number ACTRN12613000506796). PMID:25280810
HCsnip: An R Package for Semi-supervised Snipping of the Hierarchical Clustering Tree.
Obulkasim, Askar; van de Wiel, Mark A
2015-01-01
Hierarchical clustering (HC) is one of the most frequently used methods in computational biology for the analysis of high-dimensional genomics data. Given a data set, HC outputs a binary tree whose leaves are the data points and whose internal nodes represent clusters of various sizes. Normally, a fixed-height cut on the HC tree is chosen, and each contiguous branch of data points below that height is considered a separate cluster. However, the fixed-height branch cut may not be ideal in situations where one expects a complicated tree structure with nested clusters. Furthermore, because related background information is not used in selecting the cutoff, the induced clusters are often difficult to interpret. This paper describes a novel procedure that aims to automatically extract meaningful clusters from the HC tree in a semi-supervised way. The procedure is implemented in the R package HCsnip, available from Bioconductor. Rather than cutting the HC tree at a fixed height, HCsnip probes the various ways of snipping, possibly at variable heights, to tease out hidden clusters ensconced deep down in the tree. The cluster extraction process utilizes, along with the data set from which the HC tree is derived, commonly available background information. Consequently, the extracted clusters are highly reproducible and robust against the various sources of variation that "haunt" high-dimensional genomics data. Since the clustering process is guided by the background information, clusters are easy to interpret. Unlike existing packages, no constraint is placed on the data type on which clustering is desired. In particular, the package accepts patient follow-up data for guiding the cluster extraction process. To our knowledge, HCsnip is the first package that is able to decompose the HC tree into clusters with piecewise snipping under the guidance of patient time-to-event information. Our implementation of the semi-supervised HC tree snipping framework is generic, and can
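For readers unfamiliar with the baseline that HCsnip departs from, the fixed-height cut described above can be sketched in a few lines. This is not HCsnip and does not use its API; it is a toy single-linkage agglomeration on one-dimensional points, where stopping the merges once the smallest inter-cluster distance exceeds `height` is equivalent to cutting the dendrogram at that height. The function name and the sample data are assumptions for illustration.

```python
def single_linkage_cut(points, height):
    """Agglomerate 1-D points with single linkage, stopping at a fixed height.

    Returns the clusters that a fixed-height cut of the single-linkage
    dendrogram would produce.
    """
    clusters = [[p] for p in points]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > height:
            break  # every remaining merge would sit above the cut
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


data = [1.0, 1.2, 1.1, 5.0, 5.3, 9.0]
print(sorted(single_linkage_cut(data, height=0.5)))
print(sorted(single_linkage_cut(data, height=4.0)))
```

A single global `height` either separates all three groups or merges them all, which is the limitation that motivates snipping at variable heights.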
2013-08-27
implemented using extremely simple linear algebra operations such as singular value decomposition and tensor power iterations. Method of Moments for...independent component analysis), optimization (e.g. convex relaxation techniques and tensor algebra), statistical physics (e.g. phase transitions) and social...underlying tensor algebra in many popular latent variable models such as Gaussian mixture, latent Dirichlet allocation and hidden Markov models. These
Splitting Methods for Convex Clustering
Chi, Eric C.; Lange, Kenneth
2016-01-01
Clustering is a fundamental problem in many scientific applications. Standard methods such as k-means, Gaussian mixture models, and hierarchical clustering, however, are beset by local minima, which are sometimes drastically suboptimal. Recently introduced convex relaxations of k-means and hierarchical clustering shrink cluster centroids toward one another and ensure a unique global minimizer. In this work we present two splitting methods for solving the convex clustering problem. The first is an instance of the alternating direction method of multipliers (ADMM); the second is an instance of the alternating minimization algorithm (AMA). In contrast to previously considered algorithms, our ADMM and AMA formulations provide simple and unified frameworks for solving the convex clustering problem under the previously studied norms and open the door to potentially novel norms. We demonstrate the performance of our algorithm on both simulated and real data examples. While the differences between the two algorithms appear to be minor on the surface, complexity analysis and numerical experiments show AMA to be significantly more efficient. This article has supplemental materials available online. PMID:27087770
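The abstract above does not write out the optimization problem. As a reading aid, the convex clustering objective is commonly stated in the following form; the symbols (data points $x_i$, centroids $u_i$, weights $w_{ij}$, tuning parameter $\gamma$) are notation assumed here rather than quoted from the abstract.

```latex
\min_{u_1,\dots,u_n}\;
\frac{1}{2}\sum_{i=1}^{n} \lVert x_i - u_i \rVert_2^2
\;+\;
\gamma \sum_{i<j} w_{ij}\, \lVert u_i - u_j \rVert
```

Each point $x_i$ gets its own centroid $u_i$. The first term keeps centroids close to their points, and the penalty shrinks centroids toward one another, so that points whose centroids coincide at a given $\gamma \ge 0$ form one cluster. Because both terms are convex in the $u_i$ for nonnegative weights, and the first is strictly convex, the minimizer is unique, which is the property the abstract contrasts with the local minima of k-means.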
Clustering Millions of Faces by Identity.
Otto, Charles; Wang, Dayong; Jain, Anil
2017-03-07
Given a large collection of unlabeled face images, we address the problem of clustering faces into an unknown number of identities. This problem is of interest in social media, law enforcement, and other applications, where the number of faces can be of the order of hundreds of millions, while the number of identities (clusters) can range from a few thousand to millions. To address the challenges of run-time complexity and cluster quality, we present an approximate Rank-Order clustering algorithm that performs better than popular clustering algorithms (k-Means and Spectral). Our experiments include clustering up to 123 million face images into over 10 million clusters. Clustering results are analyzed in terms of external (known face labels) and internal (unknown face labels) quality measures, and run-time. Our algorithm achieves an F-measure of 0.87 on the LFW benchmark (13K faces of 5,749 individuals), which drops to 0.27 on the largest dataset considered (13K faces in LFW + 123M distractor images). Additionally, we show that frames in the YouTube benchmark can be clustered with an F-measure of 0.71. An internal per-cluster quality measure is developed to rank individual clusters for manual exploration of high quality clusters that are compact and isolated.
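The abstract reports F-measures but does not say how they are computed. One common external measure for clusterings is the pairwise F-measure, sketched below under that assumption; the function name and the toy labels are hypothetical, and the paper may use a different variant.

```python
from itertools import combinations


def pairwise_f_measure(true_labels, cluster_labels):
    """Pairwise F-measure of a clustering against known identity labels.

    A pair of items counts as a true positive when it shares both a cluster
    and an identity, a false positive when it shares only a cluster, and a
    false negative when it shares only an identity.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_identity = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_cluster and same_identity:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_identity:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


identities = ["a", "a", "a", "b", "b"]
print(pairwise_f_measure(identities, [0, 0, 1, 1, 1]))
```

This direct version enumerates all pairs, so it is only practical for small examples; at the scale described in the abstract the pair counts would have to be derived from contingency counts instead.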
Zemmour, Christophe; Bertucci, François; Finetti, Pascal; Chetrit, Bernard; Birnbaum, Daniel; Filleron, Thomas; Boher, Jean-Marie
2015-01-01
BACKGROUND DNA microarray studies identified gene expression signatures predictive of metastatic relapse in early breast cancer. Standard feature selection procedures applied to reduce the set of predictive genes did not take into account the correlation between genes. In this paper, we studied the performances of three high-dimensional regression methods – CoxBoost, LASSO (Least Absolute Shrinkage and Selection Operator), and Elastic net – to identify prognostic signatures in patients with early breast cancer. METHODS We analyzed three public retrospective datasets, including a total of 384 patients with axillary lymph node-negative breast cancer. The Amsterdam van’t Veer’s training set of 78 patients was used to determine the optimal gene sets and classifiers using sensitivity thresholds resulting in mis-classification of no more than 10% of the poor-prognosis group. To ensure the comparability between different methods, an automatic selection procedure was used to determine the number of genes included in each model. The van de Vijver’s and Desmedt’s datasets were used as validation sets to evaluate separately the prognostic performances of our classifiers. The results were compared to the original Amsterdam 70-gene classifier. RESULTS The automatic selection procedure reduced the number of predictive genes up to a minimum of six genes. In the two validation sets, the three models (Elastic net, LASSO, and CoxBoost) led to the definition of genomic classifiers predicting the 5-year metastatic status with similar performances, with respective 59, 56, and 54% accuracy, 83, 75, and 83% sensitivity, and 53, 52, and 48% specificity in the Desmedt’s dataset. In comparison, the Amsterdam 70-gene signature showed 45% accuracy, 97% sensitivity, and 34% specificity. The gene overlap and the classification concordance between the three classifiers were high. All the classifiers added significant prognostic information to that provided by the traditional
NASA Technical Reports Server (NTRS)
Socolovsky, Eduardo A.; Bushnell, Dennis M. (Technical Monitor)
2002-01-01
The cosine or correlation measures of similarity used to cluster high dimensional data are interpreted as projections, and the orthogonal components are used to define a complementary dissimilarity measure to form a similarity-dissimilarity measure pair. Using a geometrical approach, a number of properties of this pair are established. This approach is also extended to general inner-product spaces of any dimension. These properties include the triangle inequality for the defined dissimilarity measure, error estimates for the triangle inequality and bounds on both measures that can be obtained with a few floating-point operations from previously computed values of the measures. The bounds and error estimates for the similarity and dissimilarity measures can be used to reduce the computational complexity of clustering algorithms and enhance their scalability, and the triangle inequality allows the design of clustering algorithms for high dimensional distributed data.
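A minimal sketch of the projection reading described above: for unit vectors, the cosine is the signed length of the projection of one vector onto the other, and the length of the orthogonal remainder (the sine of the angle) can serve as the complementary dissimilarity. The exact definition used in the report is not given in the abstract, so this pairing and the function name are assumptions.

```python
import math


def similarity_dissimilarity(x, y):
    """Return (cosine similarity, norm of the orthogonal component).

    Both values refer to the normalized vectors x/|x| and y/|y|.
    """
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("zero vector has no direction")
    cos = dot / (norm_x * norm_y)
    cos = max(-1.0, min(1.0, cos))  # guard against rounding slightly past 1
    # By Pythagoras, projection^2 + orthogonal^2 = 1 for unit vectors.
    orthogonal = math.sqrt(1.0 - cos * cos)
    return cos, orthogonal


print(similarity_dissimilarity([1.0, 0.0], [1.0, 1.0]))
```

Under this reading the two measures are tied by one identity, so either can be recovered from the other (up to sign) with a handful of floating-point operations.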
Two generalizations of Kohonen clustering
NASA Technical Reports Server (NTRS)
Bezdek, James C.; Pal, Nikhil R.; Tsao, Eric C. K.
1993-01-01
The relationship between the sequential hard c-means (SHCM), learning vector quantization (LVQ), and fuzzy c-means (FCM) clustering algorithms is discussed. LVQ and SHCM suffer from several major problems. For example, they depend heavily on initialization. If the initial values of the cluster centers are outside the convex hull of the input data, such algorithms, even if they terminate, may not produce meaningful results in terms of prototypes for cluster representation. This is due in part to the fact that they update only the winning prototype for every input vector. The impact and interaction of these two families with Kohonen's self-organizing feature mapping (SOFM), which is not a clustering method but which often lends ideas to clustering algorithms, is discussed. Then two generalizations of LVQ that are explicitly designed as clustering algorithms are presented; these algorithms are referred to as generalized LVQ (GLVQ) and fuzzy LVQ (FLVQ). Learning rules are derived to optimize an objective function whose goal is to produce 'good clusters'. GLVQ/FLVQ (may) update every node in the clustering net for each input vector. Neither GLVQ nor FLVQ depends upon a choice for the update neighborhood or learning rate distribution; these are taken care of automatically. Segmentation of a gray tone image is used as a typical application of these algorithms to illustrate the performance of GLVQ/FLVQ.
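The winner-only update that the abstract identifies as a weakness of LVQ and SHCM can be sketched as follows. This is a generic winner-take-all step, not the GLVQ or FLVQ learning rules, which the abstract does not state; the function name, the learning rate, and the sample prototypes are assumptions for illustration.

```python
def winner_take_all_step(prototypes, x, rate):
    """Move only the prototype nearest to x a fraction `rate` toward x.

    Updates `prototypes` in place and returns the index of the winner.
    """
    def sq_dist(p):
        return sum((pi - xi) ** 2 for pi, xi in zip(p, x))

    winner = min(range(len(prototypes)), key=lambda k: sq_dist(prototypes[k]))
    prototypes[winner] = [
        pi + rate * (xi - pi) for pi, xi in zip(prototypes[winner], x)
    ]
    return winner


protos = [[0.0, 0.0], [10.0, 10.0]]
for sample in ([1.0, 1.0], [0.0, 2.0], [2.0, 0.0]):
    winner_take_all_step(protos, sample, rate=0.5)
print(protos)
```

In this run the second prototype starts far outside the data and never wins, so it is never moved, which illustrates the initialization sensitivity described above.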
Active matter clusters at interfaces
NASA Astrophysics Data System (ADS)
Copenhagen, Katherine; Gopinathan, Ajay
Collective and directed motility or swarming is an emergent phenomenon displayed by many self-organized assemblies of active biological matter such as clusters of embryonic cells during tissue development and flocks of birds. Such clusters typically encounter very heterogeneous environments. What happens when a cluster encounters an interface between two different environments has implications for its function and fate. Here we study this problem by using a mathematical model of a cluster that treats it as a single cohesive unit whose movement depends on the nature of the local environment. We find that low speed clusters which exert forces but no active torques, encountering an interface with a moderate difference in properties can lead to refraction or even total internal reflection of the cluster. For large speeds and clusters with active torques, they show more complex behaviors crossing the interface multiple times, becoming trapped at the interface and deviating from the predictable refraction and reflection of the low velocity clusters. Our results show a wide range of behaviors that occur when collectively moving active biological matter moves across interfaces and these insights can be used to control motion by patterning environments.
NASA Astrophysics Data System (ADS)
Massey, Richard; Kitching, Thomas; Nagai, Daisuke
2011-05-01
The unique properties of dark matter are revealed during collisions between clusters of galaxies, such as the bullet cluster (1E 0657-56) and baby bullet (MACS J0025-12). These systems provide evidence for an additional, invisible mass in the separation between the distributions of their total mass, measured via gravitational lensing, and their ordinary 'baryonic' matter, measured via its X-ray emission. Unfortunately, the information available from these systems is limited by their rarity. Constraints on the properties of dark matter, such as its interaction cross-section, are therefore restricted by uncertainties in the individual systems' impact velocity, impact parameter and orientation with respect to the line of sight. Here we develop a complementary, statistical measurement in which every piece of substructure falling into every massive cluster is treated as a bullet. We define 'bulleticity' as the mean separation between dark matter and ordinary matter, and we measure the signal in hydrodynamical simulations. The phase space of substructure orbits also exhibits symmetries that provide an equivalent control test. Any detection of bulleticity in real data would indicate a difference in the interaction cross-sections of baryonic and dark matter that may rule out hypotheses of non-particulate dark matter that are otherwise able to model individual systems. A subsequent measurement of bulleticity could constrain the dark matter cross-section. Even with conservative estimates, the existing Hubble Space Telescope archive should yield an independent constraint tighter than that from the bullet cluster. This technique is then trivially extendable to and benefits enormously from larger, future surveys.
Multitask spectral clustering by exploring intertask correlation.
Yang, Yang; Ma, Zhigang; Yang, Yi; Nie, Feiping; Shen, Heng Tao
2015-05-01
Clustering, as one of the most classical research problems in pattern recognition and data mining, has been widely explored and applied to various applications. Due to the rapid evolution of data on the Web, new challenges have been posed for traditional clustering techniques: 1) correlations among related clustering tasks and/or within an individual task are not well captured; 2) the problem of clustering out-of-sample data is seldom considered; and 3) the discriminative property of the cluster label matrix is not well explored. In this paper, we propose a novel clustering model, namely multitask spectral clustering (MTSC), to cope with the above challenges. Specifically, two types of correlations are considered: 1) intertask clustering correlation, which refers to the relations among different clustering tasks and 2) intratask learning correlation, which enables the processes of learning cluster labels and learning the mapping function to reinforce each other. We incorporate a novel l2,p-norm regularizer to control the coherence of all the tasks based on an assumption that related tasks should share a common low-dimensional representation. Moreover, for each individual task, an explicit mapping function is simultaneously learnt for predicting cluster labels by mapping features to the cluster label matrix. Meanwhile, we show that the learning process can naturally incorporate discriminative information to further improve clustering performance. We explore and discuss the relationships between our proposed model and several representative clustering techniques, including spectral clustering, k-means and discriminative k-means. Extensive experiments on various real-world datasets illustrate the advantage of the proposed MTSC model compared to state-of-the-art clustering approaches.
Cuny, Jérôme; Xie, Yu; Pickard, Chris J; Hassanali, Ali A
2016-02-09
Nuclear magnetic resonance (NMR) spectroscopy is one of the most powerful experimental tools to probe the local atomic order of a wide range of solid-state compounds. However, due to the complexity of the related spectra, in particular for amorphous materials, their interpretation in terms of structural information is often challenging. These difficulties can be overcome by combining molecular dynamics simulations to generate realistic structural models with an ab initio evaluation of the corresponding chemical shift and quadrupolar coupling tensors. However, due to computational constraints, this approach is limited to relatively small system sizes which, for amorphous materials, prevents an adequate statistical sampling of the distribution of the local environments that is required to quantitatively describe the system. In this work, we present an approach to efficiently and accurately predict the NMR parameters of very large systems. This is achieved by using a high-dimensional neural-network representation of NMR parameters that are calculated using an ab initio formalism. To illustrate the potential of this approach, we applied this neural-network NMR (NN-NMR) method to the ¹⁷O and ²⁹Si quadrupolar coupling and chemical shift parameters of various crystalline silica polymorphs and silica glasses. This approach is, in principle, general and has the potential to be applied to predict the NMR properties of various materials.
NASA Astrophysics Data System (ADS)
Cavaglieri, Daniele; Bewley, Thomas
2015-04-01
Implicit/explicit (IMEX) Runge-Kutta (RK) schemes are effective for time-marching ODE systems with both stiff and nonstiff terms on the RHS; such schemes implement an (often A-stable or better) implicit RK scheme for the stiff part of the ODE, which is often linear, and, simultaneously, a (more convenient) explicit RK scheme for the nonstiff part of the ODE, which is often nonlinear. Low-storage RK schemes are especially effective for time-marching high-dimensional ODE discretizations of PDE systems on modern (cache-based) computational hardware, in which memory management is often the most significant computational bottleneck. In this paper, we develop and characterize eight new low-storage implicit/explicit RK schemes which have higher accuracy and better stability properties than the only low-storage implicit/explicit RK scheme available previously, the venerable second-order Crank-Nicolson/Runge-Kutta-Wray (CN/RKW3) algorithm that has dominated the DNS/LES literature for the last 25 years, while requiring similar storage (two, three, or four registers of length N) and comparable floating-point operations per timestep.
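The implicit/explicit splitting described in the abstract above can be illustrated with a minimal sketch. It assumes the simplest possible case, a first-order IMEX Euler step on a scalar model problem y' = -stiff_rate * y + nonstiff(y); it is not one of the low-storage RK schemes of the paper, and all function and parameter names are hypothetical.

```python
def imex_euler_step(y, h, stiff_rate, nonstiff):
    """One first-order IMEX Euler step for y' = -stiff_rate * y + nonstiff(y).

    The stiff linear term is treated implicitly and the nonstiff term explicitly:
        (y_new - y) / h = -stiff_rate * y_new + nonstiff(y)
    Solving this linear equation for y_new gives the update below.
    """
    return (y + h * nonstiff(y)) / (1.0 + h * stiff_rate)


def explicit_euler_step(y, h, stiff_rate, nonstiff):
    """Fully explicit Euler step, shown only for comparison."""
    return y + h * (-stiff_rate * y + nonstiff(y))


def integrate(step, y0, h, n_steps, stiff_rate, nonstiff):
    """Apply the given one-step method n_steps times starting from y0."""
    y = y0
    for _ in range(n_steps):
        y = step(y, h, stiff_rate, nonstiff)
    return y


def forcing(y):
    """Hypothetical nonstiff right-hand side (constant here for simplicity)."""
    return 1.0


# Stiff model problem: the exact solution decays to the steady state 1/1000.
y_imex = integrate(imex_euler_step, 1.0, 0.1, 50, 1000.0, forcing)
y_explicit = integrate(explicit_euler_step, 1.0, 0.1, 50, 1000.0, forcing)
print(y_imex, y_explicit)
```

With a step size far above the explicit stability limit, the IMEX update settles at the steady state while the fully explicit update diverges, which is the motivation for treating only the stiff part implicitly.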
Rassen, Jeremy A; Schneeweiss, Sebastian
2012-01-01
Distributed medical product safety monitoring systems such as the Sentinel System, to be developed as a part of Food and Drug Administration's Sentinel Initiative, will require automation of large parts of the safety evaluation process to achieve the necessary speed and scale at reasonable cost without sacrificing validity. Although certain functions will require investigator intervention, confounding control is one area that can largely be automated. The high-dimensional propensity score (hd-PS) algorithm is one option for automated confounding control in longitudinal healthcare databases. In this article, we discuss the use of hd-PS for automating confounding control in sequential database cohort studies, as applied to safety monitoring systems. In particular, we discuss the robustness of the covariate selection process, the potential for over- or under-selection of variables including the possibilities of M-bias and Z-bias, the computation requirements, the practical considerations in a federated database network, and the cases where automated confounding adjustment may not function optimally. We also outline recent improvements to the algorithm and show how the algorithm has performed in several published studies. We conclude that despite certain limitations, hd-PS offers substantial advantages over non-automated alternatives in active product safety monitoring systems.
McMahon, Sean M.; Metcalf, Charlotte J. E.; Woodall, Christopher W.
2011-01-01
Theoretical models indicate that trade-offs between growth and survival strategies of tree species can lead to coexistence across life history stages (ontogeny) and physical conditions experienced by individuals. There exist predicted physiological mechanisms regulating these trade-offs, such as an investment in leaf characters that may increase survival in stressful environments at the expense of investment in bole or root growth. Confirming these mechanisms, however, requires that potential environmental, ontogenetic, and trait influences are analyzed together. Here, we infer growth and mortality of tree species given size, site, and light characteristics from forest inventory data from Wisconsin to test hypotheses about growth-survival trade-offs given species functional trait values under different ontogenetic and environmental states. A series of regression analyses including traits and their interactions with environmental and ontogenetic stages supported the relationships between traits and vital rates expected from tree physiology. A combined model including interactions between all variables indicated that relationships between demographic rates and functional traits support growth-survival trade-offs and their differences across species in high-dimensional niche space. The combined model explained 65% of the variation in tree growth and supports a concept of community coexistence similar to Hutchinson's n-dimensional hypervolume and not a low-dimensional niche model or neutral model. PMID:21305020
Meng, Xi; Nguyen, Bao D.; Ridge, Clark; Shaka, A. J.
2009-01-01
High-dimensional (HD) NMR spectra have poorer digital resolution than low-dimensional (LD) spectra, for a fixed amount of experiment time. This has led to “reduced-dimensionality” strategies, in which several LD projections of the HD NMR spectrum are acquired, each with higher digital resolution; an approximate HD spectrum is then inferred by some means. We propose a strategy that moves in the opposite direction, by adding more time dimensions to increase the information content of the data set, even if only a very sparse time grid is used in each dimension. The full HD time-domain data can be analyzed by the Filter Diagonalization Method (FDM), yielding very narrow resonances along all of the frequency axes, even those with sparse sampling. Integrating over the added dimensions of HD FDM NMR spectra reconstitutes LD spectra with enhanced resolution, often more quickly than direct acquisition of the LD spectrum with a larger number of grid points in each of the fewer dimensions. If the extra dimensions do not appear in the final spectrum, and are used solely to boost information content, we propose the moniker hidden-dimension NMR. This work shows that HD peaks have unmistakable frequency signatures that can be detected as single HD objects by an appropriate algorithm, even though their patterns would be tricky for a human operator to visualize or recognize, and even if digital resolution in an HD FT spectrum is very coarse compared with natural line widths. PMID:18926747
Cluster synchronization induced by one-node clusters in networks with asymmetric negative couplings
Zhang, Jianbao; Ma, Zhongjun; Zhang, Gang
2013-12-15
This paper deals with the problem of cluster synchronization in networks with asymmetric negative couplings. By decomposing the coupling matrix into three matrices, and employing Lyapunov function method, sufficient conditions are derived for cluster synchronization. The conditions show that the couplings of multi-node clusters from one-node clusters have beneficial effects on cluster synchronization. Based on the effects of the one-node clusters, an effective and universal control scheme is put forward for the first time. The obtained results may help us better understand the relation between cluster synchronization and cluster structures of the networks. The validity of the control scheme is confirmed through two numerical simulations, in a network with no cluster structure and in a scale-free network.
Progeny Clustering: A Method to Identify Biological Phenotypes
Hu, Chenyue W.; Kornblau, Steven M.; Slater, John H.; Qutub, Amina A.
2015-01-01
Estimating the optimal number of clusters is a major challenge in applying cluster analysis to any type of dataset, especially to biomedical datasets, which are high-dimensional and complex. Here, we introduce an improved method, Progeny Clustering, which is stability-based and exceptionally efficient in computing, to find the ideal number of clusters. The algorithm employs a novel Progeny Sampling method to reconstruct cluster identity, a co-occurrence probability matrix to assess the clustering stability, and a set of reference datasets to overcome inherent biases in the algorithm and data space. Our method was shown to be successful and robust when applied to two synthetic datasets (a two-dimensional dataset and a ten-dimensional dataset containing eight dimensions of pure noise), two standard biological datasets (the Iris dataset and Rat CNS dataset) and two biological datasets (a cell phenotype dataset and an acute myeloid leukemia (AML) reverse phase protein array (RPPA) dataset). Progeny Clustering outperformed some popular clustering evaluation methods in the ten-dimensional synthetic dataset as well as in the cell phenotype dataset, and it was the only method that successfully discovered clinically meaningful patient groupings in the AML RPPA dataset. PMID:26267476
Clustering PPI data by combining FA and SHC method.
Lei, Xiujuan; Ying, Chao; Wu, Fang-Xiang; Xu, Jin
2015-01-01
Clustering is one of the main methods to identify functional modules from protein-protein interaction (PPI) data. Nevertheless, traditional clustering methods may not be effective for clustering PPI data. In this paper, we propose a novel method for clustering PPI data by combining the firefly algorithm (FA) and the synchronization-based hierarchical clustering (SHC) algorithm. Firstly, the PPI data are preprocessed via spectral clustering (SC), which transforms the high-dimensional similarity matrix into a low-dimensional matrix. Then the SHC algorithm is used to perform clustering. In the SHC algorithm, hierarchical clustering is achieved by continuously enlarging the neighborhood radius of synchronized objects; however, it is very difficult for the hierarchical search to find the optimal neighborhood radius of synchronization, and its efficiency is low. So we adopt the firefly algorithm to determine the optimal threshold of the neighborhood radius of synchronization automatically. The proposed algorithm is tested on the MIPS PPI dataset. The results show that our proposed algorithm is better than the traditional algorithms in precision, recall and f-measure value. PMID:25707632
NASA Astrophysics Data System (ADS)
Chang, Yang-Lang; Han, Chin-Chuan; Fan, Kuo-Chin; Chen, Kun S.; Chang, Jeng-Horng
2002-11-01
In this paper, a novel filter-based greedy modular subspace (GMS) technique is proposed to improve the accuracy of high-dimensional remote sensing image supervised classification. The approach initially divides the whole set of high-dimensional features into an arbitrary number of highly correlated subgroups by performing a greedy correlation matrix reordering transformation for each class. These GMS can be regarded as a unique feature for each distinguishable class in high-dimensional data sets. The similarity measures are next calculated by projecting the samples into different modular feature subspaces. Finally, a supervised multi-class classifier which is implemented based on positive Boolean function (PBF) schemes is adopted to build a non-linear optimal classifier. A PBF is exactly one sum-of-product form without any negative components. The PBF possesses the well-known threshold decomposition and stacking properties. The classification errors can be calculated from the summation of the absolute errors incurred at each level. The optimal PBF is found and designed as a classifier by minimizing the classification error rate among the training samples. Experimental results demonstrate that the proposed GMS feature extraction method suits the PBF classifier best as a classification preprocess. It significantly improves the precision of image classification compared with conventional feature extraction schemes. Moreover, a practicable and convenient "vague" boundary sampling property of PBF is introduced to visually select training samples from high-dimensional data sets more efficiently.
On evaluating clustering procedures for use in classification
NASA Technical Reports Server (NTRS)
Pore, M. D.; Moritz, T. E.; Register, D. T.; Yao, S. S.; Eppler, W. G. (Principal Investigator)
1979-01-01
The problem of evaluating clustering algorithms and their respective computer programs for use in a preprocessing step for classification is addressed. In clustering for classification the probability of correct classification is suggested as the ultimate measure of accuracy on training data. A means of implementing this criterion and a measure of cluster purity are discussed. Examples are given. A procedure for cluster labeling that is based on cluster purity and sample size is presented.
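The abstract above does not define its purity measure or labeling rule. As a minimal sketch, assume the common definition (the fraction of a cluster's training samples that belong to its majority class) and a hypothetical labeling rule that accepts the majority class only when purity and sample size both exceed thresholds. The thresholds and function names are illustrative assumptions, not the report's procedure.

```python
from collections import Counter


def cluster_purity(labels):
    """Return (majority_class, purity) for the training labels in one cluster.

    Purity is assumed here to be the share of samples in the majority class.
    """
    counts = Counter(labels)
    majority, count = counts.most_common(1)[0]
    return majority, count / len(labels)


def label_cluster(labels, min_purity=0.8, min_size=5):
    """Assign a class label to a cluster, or None if it is too small or impure.

    min_purity and min_size are hypothetical thresholds chosen for illustration.
    """
    if len(labels) < min_size:
        return None
    majority, purity = cluster_purity(labels)
    return majority if purity >= min_purity else None


# Example: a cluster containing nine "wheat" samples and one "corn" sample.
cluster = ["wheat"] * 9 + ["corn"]
print(cluster_purity(cluster))
print(label_cluster(cluster))
```

Combining purity with a minimum sample size avoids trusting a high purity value that rests on only a handful of training samples.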
2012-01-01
Background Based on available biological information, genomic data can often be partitioned into pre-defined sets (e.g. pathways) and subsets within sets. Biologists are often interested in determining whether some pre-defined sets of variables (e.g. genes) are differentially expressed under varying experimental conditions. Several procedures are available in the literature for making such determinations; however, they do not take into account information regarding the subsets within each set. Secondly, variables (e.g. genes) belonging to a set or a subset are potentially correlated, yet such information is often ignored and univariate methods are used. This may result in loss of power and/or inflated false positive rate. Results We introduce a multiple testing-based methodology which makes use of available information regarding biologically relevant subsets within each pre-defined set of variables while exploiting the underlying dependence structure among the variables. Using this methodology, a biologist may not only determine whether a set of variables is differentially expressed between two experimental conditions, but may also test whether specific subsets within a significant set are also significant. Conclusions The proposed methodology (a) is easy to implement, (b) does not require inverting potentially singular covariance matrices, (c) controls the family wise error rate (FWER) at the desired nominal level, and (d) is robust to the underlying distribution and covariance structures. Although for simplicity of exposition the methodology is described for microarray gene expression data, it is also applicable to any high dimensional data, such as the mRNA seq data, CpG methylation data etc. PMID:22827252
NASA Astrophysics Data System (ADS)
Nagarajan, Mahesh B.; Huber, Markus B.; Schlossbauer, Thomas; Leinsinger, Gerda; Wismueller, Axel
2010-03-01
Haralick texture features derived from gray-level co-occurrence matrices (GLCM) were used to classify the character of suspicious breast lesions as benign or malignant on dynamic contrast-enhanced MRI studies. Lesions were identified and annotated by an experienced radiologist on 54 MRI exams of female patients where histopathological reports were available prior to this investigation. GLCMs were then extracted from these 2D regions of interest (ROI) for four principal directions (0°, 45°, 90° & 135°) and used to compute Haralick texture features. A fuzzy k-nearest neighbor (k-NN) classifier was optimized in ten-fold cross-validation for each texture feature and the classification performance was calculated on an independent test set as a function of area under the ROC curve. The lesion ROIs were characterized by texture feature vectors containing the Haralick feature values computed from each directional GLCM, and the classifier results obtained were compared to a previously used approach where the directional GLCMs were summed to a nondirectional GLCM which could further yield a set of texture feature values. The impact of varying the inter-pixel distance while generating the GLCMs on the classifier's performance was also investigated. The classifier's AUC was found to significantly increase when the high-dimensional texture feature vector approach was pursued, and when features derived from GLCMs generated using different inter-pixel distances were incorporated into the classification task. These results indicate that lesion character classification accuracy could be improved by retaining the texture features derived from the different directional GLCMs rather than combining these to yield a set of scalar feature values instead.
Cluster randomization and political philosophy.
Chwang, Eric
2012-11-01
In this paper, I will argue that, while the ethical issues raised by cluster randomization can be challenging, they are not new. My thesis divides neatly into two parts. In the first, easier part I argue that many of the ethical challenges posed by cluster randomized human subjects research are clearly present in other types of human subjects research, and so are not novel. In the second, more difficult part I discuss the thorniest ethical challenge for cluster randomized research--cases where consent is genuinely impractical to obtain. I argue that once again these cases require no new analytic insight; instead, we should look to political philosophy for guidance. In other words, the most serious ethical problem that arises in cluster randomized research also arises in political philosophy.
A framework for feature selection in clustering
Witten, Daniela M.; Tibshirani, Robert
2010-01-01
We consider the problem of clustering observations using a potentially large set of features. One might expect that the true underlying clusters present in the data differ only with respect to a small fraction of the features, and will be missed if one clusters the observations using the full set of features. We propose a novel framework for sparse clustering, in which one clusters the observations using an adaptively chosen subset of the features. The method uses a lasso-type penalty to select the features. We use this framework to develop simple methods for sparse K-means and sparse hierarchical clustering. A single criterion governs both the selection of the features and the resulting clusters. These approaches are demonstrated on simulated data and on genomic data sets. PMID:20811510
Local matrix learning in clustering and applications for manifold visualization.
Arnonkijpanich, Banchar; Hasenfuss, Alexander; Hammer, Barbara
2010-05-01
Electronic data sets are increasing rapidly with respect to both the size of the data sets and the data resolution, i.e. dimensionality, such that adequate data inspection and data visualization have become central issues of data mining. In this article, we present an extension of classical clustering schemes by local matrix adaptation, which allows a better representation of data by means of clusters with an arbitrary spherical shape. Unlike previous proposals, the method is derived from a global cost function. The focus of this article is to demonstrate the applicability of this matrix clustering scheme to low-dimensional data embedding for data inspection. The proposed method is based on matrix learning for neural gas and manifold charting. This provides an explicit mapping of a given high-dimensional data space to low dimensionality. We demonstrate the usefulness of this method for data inspection and manifold visualization.
3D visualization of gene clusters and networks
NASA Astrophysics Data System (ADS)
Zhang, Leishi; Sheng, Weiguo; Liu, Xiaohui
2005-03-01
In this paper, we try to provide a global view of the DNA microarray gene expression data analysis and modeling process by combining novel and effective visualization techniques with data mining algorithms. An integrated framework has been proposed to model and visualize short, high-dimensional gene expression data. The framework reduces the dimensionality of variables before applying an appropriate temporal modeling method. A prototype has been built using Java3D to visualize the framework. The prototype takes gene expression data as input, clusters the genes, displays the clustering results using a novel graph layout algorithm, models individual gene clusters using a Dynamic Bayesian Network and then visualizes the modeling results using simple but effective visualization techniques.
SAIL: Summation-bAsed Incremental Learning for Information-Theoretic Text Clustering.
Cao, Jie; Wu, Zhiang; Wu, Junjie; Xiong, Hui
2013-04-01
Information-theoretic clustering aims to exploit information-theoretic measures as the clustering criteria. A common practice on this topic is the so-called Info-Kmeans, which performs K-means clustering with KL-divergence as the proximity function. While expert efforts on Info-Kmeans have shown promising results, a remaining challenge is to deal with high-dimensional sparse data such as text corpora. Indeed, it is possible that the centroids contain many zero-value features for high-dimensional text vectors, which leads to infinite KL-divergence values and creates a dilemma in assigning objects to centroids during the iteration process of Info-Kmeans. To meet this challenge, in this paper, we propose a Summation-bAsed Incremental Learning (SAIL) algorithm for Info-Kmeans clustering. Specifically, by using an equivalent objective function, SAIL replaces the computation of KL-divergence by the incremental computation of Shannon entropy. This can avoid the zero-feature dilemma caused by the use of KL-divergence. To improve the clustering quality, we further introduce the variable neighborhood search scheme and propose the V-SAIL algorithm, which is then accelerated by a multithreaded scheme in PV-SAIL. Our experimental results on various real-world text collections have shown that, with SAIL as a booster, the clustering performance of Info-Kmeans can be significantly improved. Also, V-SAIL and PV-SAIL indeed help improve the clustering quality at a lower cost of computation.
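The zero-feature dilemma described above can be reproduced in a few lines. The sketch below shows that KL-divergence becomes infinite when a centroid has a zero where a document does not, and it checks a standard identity that rewrites a weighted sum of KL-divergences to the centroid in terms of Shannon entropies only. This identity is offered as an illustration of why an entropy-based objective avoids the problem; it is an assumption that it matches the spirit of SAIL, not a reproduction of the paper's algorithm, and all names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; infinite if q is 0 where p is not."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # the term 0 * log(0 / q) is taken as 0
        if qi == 0.0:
            return math.inf
        total += pi * math.log(pi / qi)
    return total


def entropy(p):
    """Shannon entropy in nats; zero entries contribute nothing, so it is always finite."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


# A sparse "document" compared with a centroid that lacks one of its terms.
doc = [0.5, 0.5, 0.0]
bad_centroid = [1.0, 0.0, 0.0]
print(kl_divergence(doc, bad_centroid))  # infinite

# Identity: sum_i w_i * KL(x_i || m) = H(m) - sum_i w_i * H(x_i), with m = sum_i w_i * x_i.
docs = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
weights = [0.5, 0.5]
mean = [sum(w * x[k] for w, x in zip(weights, docs)) for k in range(3)]
kl_side = sum(w * kl_divergence(x, mean) for w, x in zip(weights, docs))
entropy_side = entropy(mean) - sum(w * entropy(x) for w, x in zip(weights, docs))
print(kl_side, entropy_side)
```

The entropy side never divides by a centroid entry, so it stays finite even for very sparse vectors.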
Unbridled growth of spin-glass clusters
NASA Astrophysics Data System (ADS)
Kessler, David A.; Bretz, Michael
1990-03-01
We investigate the application of the recent cluster-based acceleration methods of Wolff and of Kandel et al. to the problem of simulating spin glasses. We find the techniques offer no improvement as the clusters generated by these algorithms are infinitely large or interact infinitely strongly, respectively. We comment on the reasons for this failure.
Feature Clustering for Accelerating Parallel Coordinate Descent
Scherrer, Chad; Tewari, Ambuj; Halappanavar, Mahantesh; Haglin, David J.
2012-12-06
We demonstrate an approach for accelerating calculation of the regularization path for L1 sparse logistic regression problems. We show the benefit of feature clustering as a preconditioning step for parallel block-greedy coordinate descent algorithms.
Annest, Amalia; Bumgarner, Roger E; Raftery, Adrian E; Yeung, Ka Yee
2009-01-01
Background Microarray technology is increasingly used to identify potential biomarkers for cancer prognostics and diagnostics. Previously, we have developed the iterative Bayesian Model Averaging (BMA) algorithm for use in classification. Here, we extend the iterative BMA algorithm for application to survival analysis on high-dimensional microarray data. The main goal in applying survival analysis to microarray data is to determine a highly predictive model of patients' time to event (such as death, relapse, or metastasis) using a small number of selected genes. Our multivariate procedure combines the effectiveness of multiple contending models by calculating the weighted average of their posterior probability distributions. Our results demonstrate that our iterative BMA algorithm for survival analysis achieves high prediction accuracy while consistently selecting a small and cost-effective number of predictor genes. Results We applied the iterative BMA algorithm to two cancer datasets: breast cancer and diffuse large B-cell lymphoma (DLBCL) data. On the breast cancer data, the algorithm selected a total of 15 predictor genes across 84 contending models from the training data. The maximum likelihood estimates of the selected genes and the posterior probabilities of the selected models from the training data were used to divide patients in the test (or validation) dataset into high- and low-risk categories. Using the genes and models determined from the training data, we assigned patients from the test data into highly distinct risk groups (as indicated by a p-value of 7.26e-05 from the log-rank test). Moreover, we achieved comparable results using only the 5 top selected genes with 100% posterior probabilities. On the DLBCL data, our iterative BMA procedure selected a total of 25 genes across 3 contending models from the training data. Once again, we assigned the patients in the validation set to significantly distinct risk groups (p-value = 0.00139). Conclusion
NASA Astrophysics Data System (ADS)
Chowdhary, K.; Guo, Z.; Wang, M.; Lucas, D. D.; Debusschere, B.
2014-12-01
High-dimensional parametric uncertainty exists in many parts of atmospheric climate models. It is computationally intractable to fully understand their impact on the climate without a significant reduction in the number of dimensions. We employ Bayesian Compressed Sensing (BCS) to perform adaptive sensitivity analysis in order to determine which parameters affect the Quantity of Interest (QoI) the most and the least. In short, BCS fits a polynomial to the QoI via a Bayesian framework with an L1 (Laplace) prior. Thus, BCS tries to find the sparsest polynomial representation of the QoI, i.e., the fewest terms, while still trying to retain high accuracy. This procedure is adaptive in the sense that higher order polynomial terms can be added to the polynomial model when it is likely that particular parameters have a significant effect on the QoI. This helps avoid overfitting and is much more computationally efficient. We apply the BCS algorithm to two sets of single column CAM (Community Atmosphere Model) simulations. In the first application, we analyze liquid cloud fraction as modeled by CLUBB (Cloud Layers Unified By Binormals), an atmospheric cloud and turbulence model. This liquid cloud fraction QoI depends on 29 different input parameters. We compare main Sobol sensitivity indices obtained with the BCS algorithm for the liquid cloud fraction in 6 cases, with a previous approach to sensitivity analysis using deviance. We show BCS can provide almost identical sensitivity analysis results. Additionally, BCS can provide an improved, lower-dimensional, higher order model for prediction. In the second application, we study the time averaged ozone concentration, at varying altitudes, as a function of 95 photochemical parameters, in order to study the sensitivity to these parameters. To further improve model prediction, we also explore k-fold cross validation to obtain a better model for both liquid cloud fraction in CLUBB and ozone concentration in CAM. This material is based upon work
Leroux, Elizabeth; Ducros, Anne
2008-01-01
Cluster headache (CH) is a primary headache disease characterized by recurrent short-lasting attacks (15 to 180 minutes) of excruciating unilateral periorbital pain accompanied by ipsilateral autonomic signs (lacrimation, nasal congestion, ptosis, miosis, lid edema, redness of the eye). It affects young adults, predominantly males. Prevalence is estimated at 0.5–1.0/1,000. CH has a circannual and circadian periodicity, attacks being clustered (hence the name) in bouts that can occur during specific months of the year. Alcohol is the only dietary trigger of CH; strong odors (mainly solvents and cigarette smoke) and napping may also trigger CH attacks. During bouts, attacks may happen at precise hours, especially during the night. During the attacks, patients tend to be restless. CH may be episodic or chronic, depending on the presence of remission periods. CH is associated with trigeminovascular activation and neuroendocrine and vegetative disturbances; however, the precise causative mechanisms remain unknown. Involvement of the hypothalamus (a structure regulating endocrine function and sleep-wake rhythms) has been confirmed, explaining, at least in part, the cyclic aspects of CH. The disease is familial in about 10% of cases. Genetic factors play a role in CH susceptibility, and a causative role has been suggested for the hypocretin receptor gene. Diagnosis is clinical. Differential diagnoses include other primary headache diseases such as migraine, paroxysmal hemicrania and SUNCT syndrome. At present, there is no curative treatment. There are efficient treatments to shorten the painful attacks (acute treatments) and to reduce the number of daily attacks (prophylactic treatments). Acute treatment is based on subcutaneous administration of sumatriptan and high-flow oxygen. Verapamil, lithium, methysergide, prednisone, greater occipital nerve blocks and topiramate may be used for prophylaxis. In refractory cases, deep-brain stimulation of the hypothalamus and
Formation of Cluster Complexes by Cluster-Cluster-Collisions
NASA Astrophysics Data System (ADS)
Ichihashi, Masahiko; Odaka, Hideho
2015-03-01
Multi-element clusters are of interest for their chemical and physical properties, and they are expected to be useful as catalysts, for example. Their properties depend critically on the size, composition and atomic ordering, so it is important to control these parameters to obtain the desired functionality. One way to form a multi-element cluster is to employ a low-energy collision between clusters. Here, we show characteristic results obtained in the collision between a neutral Ar cluster and a size-selected Co cluster ion. The low-energy collision experiment was carried out using a newly developed merging-beam apparatus. Cobalt cluster ions were produced by laser ablation and mass-selected, while argon clusters were prepared by the supersonic expansion of Ar gas. Both cluster beams were merged together in an ion guide, and the ionic cluster complexes were mass-analyzed. In the collision of Co2+ and ArN, Co2Arn+ (n = 1 - 30) were observed, and the total intensity of Co2Arn+ (n >= 1) is inversely proportional to the relative velocity between Co2+ and ArN. This suggests that the charge-induced dipole interaction between Co2+ and a neutral Ar cluster is dominant in the formation of the cluster complex, Co2+Arn.
A hierarchical clustering algorithm for MIMD architecture.
Du, Zhihua; Lin, Feng
2004-12-01
Hierarchical clustering is the most often used method for grouping similar patterns of gene expression data. A fundamental problem with existing implementations of this clustering method is the inability to handle large data sets within a reasonable time and memory resources. We propose a parallelized algorithm of hierarchical clustering to solve this problem. Our implementation on a multiple instruction multiple data (MIMD) architecture shows considerable reduction in computational time and inter-node communication overhead, especially for large data sets. We use the standard message passing library, message passing interface (MPI) for any MIMD systems.
ERIC Educational Resources Information Center
Kinsella, John J.
1970-01-01
Discussed are the nature of a mathematical problem, problem solving in the traditional and modern mathematics programs, problem solving and psychology, research related to problem solving, and teaching problem solving in algebra and geometry. (CT)
Swarm Intelligence in Text Document Clustering
Cui, Xiaohui; Potok, Thomas E
2008-01-01
Social animals or insects in nature often exhibit a form of emergent collective behavior. The research field that attempts to design algorithms or distributed problem-solving devices inspired by the collective behavior of social insect colonies is called Swarm Intelligence. Compared to traditional algorithms, swarm algorithms are usually flexible, robust, decentralized and self-organized. These characteristics make swarm algorithms suitable for solving complex problems, such as document collection clustering. A major challenge in today's information society is that users are overwhelmed with information on any topic they search for. Fast and high-quality document clustering algorithms play an important role in helping users to effectively navigate, summarize, and organize this overwhelming amount of information. In this chapter, we introduce three nature-inspired swarm intelligence clustering approaches for document clustering analysis. These clustering algorithms use stochastic and heuristic principles discovered from observing bird flocks, fish schools and ant food foraging.
Bipartite graph partitioning and data clustering
Zha, Hongyuan; He, Xiaofeng; Ding, Chris; Gu, Ming; Simon, Horst D.
2001-05-07
Many data types arising from data mining applications can be modeled as bipartite graphs; examples include terms and documents in a text corpus, customers and purchased items in market basket analysis, and reviewers and movies in a movie recommender system. In this paper, the authors propose a new data clustering method based on partitioning the underlying bipartite graph. The partition is constructed by minimizing a normalized sum of edge weights between unmatched pairs of vertices of the bipartite graph. They show that an approximate solution to the minimization problem can be obtained by computing a partial singular value decomposition (SVD) of the associated edge weight matrix of the bipartite graph. They point out the connection of their clustering algorithm to correspondence analysis used in multivariate analysis. They also briefly discuss the issue of assigning data objects to multiple clusters. In the experimental results, they apply their clustering algorithm to the problem of document clustering to illustrate its effectiveness and efficiency.
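The SVD-based partition can be sketched in a few lines. The example below is a minimal sketch under stated assumptions, not the authors' code: it assumes the edge weight matrix is scaled by the square roots of the vertex degrees (a common choice in spectral co-clustering that the abstract does not spell out), finds the second singular vector pair by power iteration with deflation, and splits terms and documents by sign. The function name `bipartition` and the toy matrix are hypothetical.

```python
from math import sqrt


def bipartition(A, iters=200):
    """Split rows (terms) and columns (documents) of a weight matrix into two groups."""
    m, n = len(A), len(A[0])
    d1 = [sum(row) for row in A]                             # term degrees
    d2 = [sum(A[i][j] for i in range(m)) for j in range(n)]  # document degrees
    # Degree-scaled weight matrix An = D1^{-1/2} A D2^{-1/2} (assumed scaling).
    An = [[A[i][j] / sqrt(d1[i] * d2[j]) for j in range(n)] for i in range(m)]
    # The leading right singular vector of An is known: sqrt(d2), normalised.
    total = sqrt(sum(d2))
    v1 = [sqrt(d) / total for d in d2]
    v = [float(j + 1) for j in range(n)]  # arbitrary deterministic start
    for _ in range(iters):
        # Deflate the trivial direction, then apply An^T An once.
        proj = sum(a * b for a, b in zip(v, v1))
        v = [a - proj * b for a, b in zip(v, v1)]
        u = [sum(An[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [sum(An[i][j] * u[i] for i in range(m)) for j in range(n)]
        norm = sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    u = [sum(An[i][j] * v[j] for j in range(n)) for i in range(m)]
    # Undo the degree scaling and split by sign, relative to the first document.
    y = [v[j] / sqrt(d2[j]) for j in range(n)]
    x = [u[i] / sqrt(d1[i]) for i in range(m)]
    ref = 1.0 if y[0] >= 0 else -1.0
    docs = [0 if ref * val >= 0 else 1 for val in y]
    terms = [0 if ref * val >= 0 else 1 for val in x]
    return terms, docs


# Hypothetical term-by-document counts: two topics joined by one weak link.
A = [
    [2, 2, 0, 0],
    [2, 2, 0, 0],
    [0, 1, 2, 2],
    [0, 0, 2, 2],
]
terms, docs = bipartition(A)
print(terms, docs)
```

Terms and documents receive labels from the same singular vector pair, so both sides of the graph are clustered at once. The sketch assumes every row and column has a positive degree.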
On the clustering of multidimensional pictorial data
NASA Technical Reports Server (NTRS)
Bryant, J. D. (Principal Investigator)
1979-01-01
Obvious approaches to reducing the cost (in computer resources) of applying current clustering techniques to the problem of remote sensing are discussed. The use of spatial information in finding fields and in classifying mixture pixels is examined, and the AMOEBA clustering program is described. AMOEBA is internally a pattern recognition program, but from without it appears to be an unsupervised clustering program. It is fast and automatic. No choices (such as arbitrary thresholds to set split/combine sequences) need be made. The problem of finding the number of clusters is solved automatically. At the conclusion of the program, all points in the scene are classified; however, a provision is included for a reject classification of some points which, within the theoretical framework, cannot rationally be assigned to any cluster.
Pattern Clustering Using a Swarm Intelligence Approach
NASA Astrophysics Data System (ADS)
Das, Swagatam; Abraham, Ajith
Clustering aims at representing large datasets by a smaller number of prototypes or clusters. It brings simplicity to modeling data and thus plays a central role in the process of knowledge discovery and data mining. Data mining tasks these days require fast and accurate partitioning of huge datasets, which may come with a variety of attributes or features. This, in turn, imposes severe computational requirements on the relevant clustering techniques. A family of bio-inspired algorithms, known as Swarm Intelligence (SI), has recently emerged that meets these requirements and has successfully been applied to a number of real-world clustering problems. This chapter explores the role of SI in clustering different kinds of datasets. It finally describes a new SI technique for partitioning a linearly non-separable dataset into an optimal number of clusters in the kernel-induced feature space. Computer simulations undertaken in this research are also provided to demonstrate the effectiveness of the proposed algorithm.
Segmentation of dynamic PET images with kinetic spectral clustering
NASA Astrophysics Data System (ADS)
Mouysset, S.; Zbib, H.; Stute, S.; Girault, J. M.; Charara, J.; Noailles, J.; Chalon, S.; Buvat, I.; Tauber, C.
2013-10-01
Segmentation is often required for the analysis of dynamic positron emission tomography (PET) images. However, noise and low spatial resolution make it a difficult task and several supervised and unsupervised methods have been proposed in the literature to perform the segmentation based on semi-automatic clustering of the time activity curves of voxels. In this paper we propose a new method based on spectral clustering that does not require any prior information on the shape of clusters in the space in which they are identified. In our approach, the p-dimensional data, where p is the number of time frames, is first mapped into a high dimensional space and then clustering is performed in a low-dimensional space of the Laplacian matrix. An estimation of the bounds for the scale parameter involved in the spectral clustering is derived. The method is assessed using dynamic brain PET images simulated with GATE and results on real images are presented. We demonstrate the usefulness of the method and its superior performance over three other clustering methods from the literature. The proposed approach appears as a promising pre-processing tool before parametric map calculation or ROI-based quantification tasks.
NASA Astrophysics Data System (ADS)
Brandl, Miriam B.; Beck, Dominik; Pham, Tuan D.
2011-06-01
The high dimensionality of image-based datasets can be a drawback for classification accuracy. In this study, we propose the application of fuzzy c-means clustering, cluster validity indices and the notion of a joint-feature-clustering matrix to find redundancies among image features. The introduced matrix indicates how frequently features are grouped in a mutual cluster. The resulting information can be used to find data-derived feature prototypes with a common biological meaning, reduce data storage as well as computation times, and improve the classification accuracy.
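A matrix that records how often features share a cluster can be sketched as follows. This is an illustrative reading of the idea, not the authors' definition: it assumes each clustering run has already been reduced to one hard label per feature (for example by taking the largest fuzzy membership), and the name `joint_clustering_matrix` is hypothetical.

```python
def joint_clustering_matrix(runs):
    """Fraction of runs in which each pair of features falls in the same cluster."""
    n = len(runs[0])
    counts = [[0] * n for _ in range(n)]
    for labels in runs:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(runs) for c in row] for row in counts]


# Three hypothetical clustering runs over three image features.
runs = [[0, 0, 1], [0, 1, 1], [0, 0, 1]]
M = joint_clustering_matrix(runs)
print(M)
```

Pairs with entries close to 1 are candidates for redundant features that could be merged into a single prototype.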
Large scale cluster computing workshop
Dane Skow; Alan Silverman
2002-12-23
Recent revolutions in computer hardware and software technologies have paved the way for the large-scale deployment of clusters of commodity computers to address problems heretofore the domain of tightly coupled SMP processors. Near-term projects within High Energy Physics and other computing communities will deploy clusters of scale 1000s of processors and be used by 100s to 1000s of independent users. This will expand the reach in both dimensions by an order of magnitude from the current successful production facilities. The goals of this workshop were: (1) to determine what tools exist which can scale up to the cluster sizes foreseen for the next generation of HENP experiments (several thousand nodes) and by implication to identify areas where some investment of money or effort is likely to be needed; (2) to compare and record experiences gained with such tools; (3) to produce a practical guide to all stages of planning, installing, building and operating a large computing cluster in HENP; (4) to identify and connect groups with similar interest within HENP and the larger clustering community.
... often, it could be a sign of a balance problem. Balance problems can make you feel unsteady or as ... fall-related injuries, such as hip fracture. Some balance problems are due to problems in the inner ...
ON CLUSTERING TECHNIQUES OF CITATION GRAPHS.
ERIC Educational Resources Information Center
CHIEN, R.T.; PREPARATA, F.P.
ONE OF THE PROBLEMS ENCOUNTERED IN CLUSTERING TECHNIQUES AS APPLIED TO DOCUMENT RETRIEVAL SYSTEMS USING BIBLIOGRAPHIC COUPLING DEVICES IS THAT THE COMPUTATIONAL EFFORT REQUIRED GROWS ROUGHLY AS THE SQUARE OF THE COLLECTION SIZE. IN THIS STUDY GRAPH THEORY IS APPLIED TO THIS PROBLEM BY FIRST MAPPING THE CITATION GRAPH OF THE DOCUMENT COLLECTION…
A reduced basis Landweber method for nonlinear inverse problems
NASA Astrophysics Data System (ADS)
Garmatter, Dominik; Haasdonk, Bernard; Harrach, Bastian
2016-03-01
We consider parameter identification problems in parametrized partial differential equations (PDEs). These lead to nonlinear ill-posed inverse problems. One way of solving them is to use iterative regularization methods, which typically require a large number of forward solves during the solution process. In this article we consider the nonlinear Landweber method and couple it with the reduced basis method as a model order reduction technique in order to reduce the overall computational time. In particular, we consider PDEs with a high-dimensional parameter space, which are known to pose difficulties in the context of reduced basis methods. We present a new method that is able to handle such high-dimensional parameter spaces by combining the nonlinear Landweber method with adaptive online reduced basis updates. It is then applied to the inverse problem of reconstructing the conductivity in the stationary heat equation.
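For orientation, the plain nonlinear Landweber iteration updates the parameter by x_{k+1} = x_k + ω F'(x_k)^T (y − F(x_k)). The sketch below shows that update on a toy two-parameter problem. It does not include the reduced basis coupling the abstract describes, and the forward operator, step size `omega` and iteration count are hypothetical choices standing in for an expensive PDE solve.

```python
def forward(x):
    """Toy nonlinear forward operator F (a stand-in for a PDE solve)."""
    return [x[0] ** 2, x[1]]


def jacobian_transpose_apply(x, r):
    """Apply F'(x)^T to a residual r; here F'(x) = diag(2 * x[0], 1)."""
    return [2.0 * x[0] * r[0], r[1]]


def landweber(y, x0, omega=0.1, iters=200):
    """Fixed-step nonlinear Landweber iteration."""
    x = list(x0)
    for _ in range(iters):
        residual = [yi - fi for yi, fi in zip(y, forward(x))]
        step = jacobian_transpose_apply(x, residual)
        x = [xi + omega * si for xi, si in zip(x, step)]
    return x


x_true = [1.0, 2.0]
x_rec = landweber(forward(x_true), [0.5, 0.0])
print(x_rec)
```

Each iteration costs one forward evaluation and one adjoint application, which is why a cheap surrogate for the forward solve pays off when many iterations are needed. With noisy data the loop would normally be stopped early, for example by a discrepancy principle, rather than run for a fixed count.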
Performance Comparison Of Evolutionary Algorithms For Image Clustering
NASA Astrophysics Data System (ADS)
Civicioglu, P.; Atasever, U. H.; Ozkan, C.; Besdok, E.; Karkinli, A. E.; Kesikoglu, A.
2014-09-01
Evolutionary computation tools are able to process real-valued numerical sets in order to extract a suboptimal solution of a designed problem. Data clustering algorithms have been intensively used for image segmentation in remote sensing applications. Despite the wide usage of evolutionary algorithms in data clustering, their clustering performance has scarcely been studied using clustering validation indexes. In this paper, recently proposed evolutionary algorithms (i.e., Artificial Bee Colony Algorithm (ABC), Gravitational Search Algorithm (GSA), Cuckoo Search Algorithm (CS), Adaptive Differential Evolution Algorithm (JADE), Differential Search Algorithm (DSA) and Backtracking Search Optimization Algorithm (BSA)) and some classical image clustering techniques (i.e., k-means, fcm, som networks) have been used to cluster images, and their performances have been compared using four clustering validation indexes. Experimental results showed that evolutionary algorithms give more reliable cluster centers than classical clustering techniques, but their convergence time is quite long.
Convergence and Energy Landscape for Cheeger Cut Clustering
2014-12-01
...theoretical and algorithmic results for the ℓ1-relaxation of the Cheeger cut problem. The ℓ2-relaxation, known as spectral clustering, only loosely relates to the Cheeger cut... and improve upon the standard ℓ2 (spectral clustering) method [11, 14
PREFACE: Nuclear Cluster Conference; Cluster'07
NASA Astrophysics Data System (ADS)
Freer, Martin
2008-05-01
The Cluster Conference is a long-running conference series dating back to the 1960s, the first being initiated by Wildermuth in Bochum, Germany, in 1969. The most recent meeting was held in Nara, Japan, in 2003, and in 2007 the 9th Cluster Conference was held in Stratford-upon-Avon, UK. As the name suggests, the town of Stratford lies upon the River Avon, and shortly before the conference, due to unprecedented rainfall in the area (approximately 10 cm within half a day), lay in the River Avon! Stratford is the birthplace of the `Bard of Avon' William Shakespeare, and this formed an intriguing conference backdrop. The meeting was attended by some 90 delegates and the programme contained 65-70 oral presentations, and was opened by a historical perspective presented by Professor Brink (Oxford) and closed by Professor Horiuchi (RCNP) with an overview of the conference and future perspectives. In between, the conference covered aspects of clustering in exotic nuclei (both neutron and proton-rich), molecular structures in which valence neutrons are exchanged between cluster cores, condensates in nuclei, neutron-clusters, superheavy nuclei, clusters in nuclear astrophysical processes and exotic cluster decays such as 2p and ternary cluster decay. The field of nuclear clustering has become strongly influenced by the physics of radioactive beam facilities (reflected in the programme), and by the excitement that clustering may have an important impact on the structure of nuclei at the neutron drip-line. It was clear that since Nara the field had progressed substantially and that new themes had emerged and others had crystallized. Two particular topics resonated strongly: condensates and nuclear molecules. These topics are thus likely to be central in the next cluster conference, which will be held in 2011 in the Hungarian city of Debrecen. Martin Freer
CLAG: an unsupervised non hierarchical clustering algorithm handling biological data
2012-01-01
Background: Searching for similarities in a set of biological data is intrinsically difficult due to possible data points that should not be clustered, or that should group within several clusters. Under these hypotheses, hierarchical agglomerative clustering is not appropriate. Moreover, if the dataset is not sufficiently well known, as is often the case, supervised classification is not appropriate either. Results: CLAG (for CLusters AGgregation) is an unsupervised non-hierarchical clustering algorithm designed to cluster a large variety of biological data and to provide a clustered matrix and numerical values indicating cluster strength. CLAG clusters correlation matrices for residues in protein families, gene-expression and miRNA data related to various cancer types, sets of species described by multidimensional vectors of characters, and binary matrices. It does not require all data points to be clustered, and it converges, yielding the same result at each run. Its simplicity and speed allow it to run on reasonably large datasets. Conclusions: CLAG can be used to investigate the cluster structure present in biological datasets and to identify its underlying graph. It proved to be more informative and accurate than several known clustering methods, such as hierarchical agglomerative clustering, k-means, fuzzy c-means, model-based clustering, and affinity propagation clustering, and it does not suffer from the convergence problem of the latter. PMID:23216858
A revised moving cluster distance to the Pleiades open cluster
NASA Astrophysics Data System (ADS)
Galli, P. A. B.; Moraux, E.; Bouy, H.; Bouvier, J.; Olivares, J.; Teixeira, R.
2017-01-01
Context. The distance to the Pleiades open cluster has been extensively debated in the literature over several decades. Although different methods point to a discrepancy in the trigonometric parallaxes produced by the Hipparcos mission, the number of individual stars with known distances is still small compared to the number of cluster members to help solve this problem. Aims: We provide a new distance estimate for the Pleiades based on the moving cluster method, which will be useful to further discuss the so-called Pleiades distance controversy and compare it with the very precise parallaxes from the Gaia space mission. Methods: We apply a refurbished implementation of the convergent point search method to an updated census of Pleiades stars to calculate the convergent point position of the cluster from stellar proper motions. Then, we derive individual parallaxes for 64 cluster members using radial velocities compiled from the literature, and approximate parallaxes for another 1146 stars based on the spatial velocity of the cluster. This represents the largest sample of Pleiades stars with individual distances to date. Results: The parallaxes derived in this work are in good agreement with previous results obtained in different studies (excluding Hipparcos) for individual stars in the cluster. We report a mean parallax of 7.44 ± 0.08 mas and distance of pc that is consistent with the weighted mean of 135.0 ± 0.6 pc obtained from the non-Hipparcos results in the literature. Conclusions: Our result for the distance to the Pleiades open cluster is not consistent with the Hipparcos catalog, but favors the recent and more precise distance determination of 136.2 ± 1.2 pc obtained from Very Long Baseline Interferometry observations. It is also in good agreement with the mean distance of 133 ± 5 pc obtained from the first trigonometric parallaxes delivered by the Gaia satellite for the brightest cluster members in common with our sample. Full Table B.2 is only
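As a quick consistency check on the quoted numbers, a parallax in milliarcseconds converts to a distance in parsecs through d = 1000 / parallax. The sketch below applies that naive inversion with first-order error propagation to the mean parallax given above. It is a back-of-the-envelope illustration, not the authors' estimator, and the function name is hypothetical.

```python
def parallax_to_distance(parallax_mas, sigma_mas):
    """Naive inversion d = 1000 / parallax, with first-order symmetric uncertainty."""
    distance_pc = 1000.0 / parallax_mas
    sigma_pc = distance_pc * sigma_mas / parallax_mas
    return distance_pc, sigma_pc


distance_pc, sigma_pc = parallax_to_distance(7.44, 0.08)
print(round(distance_pc, 1), round(sigma_pc, 1))
```

The inversion gives roughly 134 pc, which sits between the 133 pc and 136.2 pc determinations quoted in the abstract. The published uncertainty may differ from this first-order figure.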
Improved Ant Colony Clustering Algorithm and Its Performance Study
Gao, Wei
2016-01-01
Clustering analysis is used in many disciplines and applications; it is an important tool that descriptively identifies homogeneous groups of objects based on attribute values. The ant colony clustering algorithm is a swarm-intelligent method used for clustering problems that is inspired by the behavior of ant colonies that cluster their corpses and sort their larvae. A new abstraction ant colony clustering algorithm using a data combination mechanism is proposed to improve the computational efficiency and accuracy of the ant colony clustering algorithm. The abstraction ant colony clustering algorithm is used to cluster benchmark problems, and its performance is compared with the ant colony clustering algorithm and other methods used in existing literature. Based on similar computational difficulties and complexities, the results show that the abstraction ant colony clustering algorithm produces results that are not only more accurate but also more efficiently determined than the ant colony clustering algorithm and the other methods. Thus, the abstraction ant colony clustering algorithm can be used for efficient multivariate data clustering. PMID:26839533
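The corpse-clustering behaviour mentioned above is often modelled with two probabilities: an unladen ant picks up an isolated item with high probability, and a laden ant drops its item where similar items are dense. The sketch below uses the classic pick-up and drop rules from the basic ant clustering model as an assumption. It does not reproduce the abstraction mechanism proposed in the paper, and the constants `k1` and `k2` are illustrative values.

```python
def pick_probability(f, k1=0.1):
    """Chance of picking up an item; f in [0, 1] is the local density of similar items."""
    return (k1 / (k1 + f)) ** 2


def drop_probability(f, k2=0.3):
    """Chance of dropping a carried item at a site with local similarity density f."""
    return (f / (k2 + f)) ** 2


for f in (0.0, 0.5, 0.9):
    print(f, pick_probability(f), drop_probability(f))
```

Items in sparse or dissimilar neighbourhoods are therefore moved often, while items already sitting among similar ones tend to stay, which gradually sorts the data into clusters.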
NASA Astrophysics Data System (ADS)
Tran, Binh; Xue, Bing; Zhang, Mengjie; Nguyen, Su
2016-07-01
Feature selection is an essential step in classification tasks with a large number of features, such as in gene expression data. Recent research has shown that particle swarm optimisation (PSO) is a promising approach to feature selection. However, it can get stuck in local optima, especially for gene selection problems with a huge search space. Therefore, we developed a PSO algorithm (PSO-LSRG) with a fast "local search" combined with a gbest resetting mechanism as a way to improve the performance of PSO for feature selection. Furthermore, since many existing PSO-based feature selection approaches on gene expression data have feature selection bias, i.e. no unseen test data is used, 2 sets of experiments on 10 gene expression datasets were designed: with and without feature selection bias. Compared to standard PSO, PSO with gbest resetting only, and PSO with local search only, PSO-LSRG obtained a substantial dimensionality reduction and a significant improvement in classification performance in both sets of experiments. PSO-LSRG outperforms the other three algorithms when feature selection bias exists. When there is no feature selection bias, PSO-LSRG selects the smallest number of features in all cases, but the classification performance is slightly worse in a few cases, which may be caused by overfitting. This shows that feature selection bias should be avoided when designing a feature selection algorithm to ensure its generalisation ability on unseen data.
Bernardi, Alessandro; Femoni, Cristina; Iapalucci, Maria Carmela; Longoni, Giuliano; Zacchini, Stefano
2009-06-07
The new tetra-acetylide carbonyl clusters [H(4-n)Ni(22)(C(2))(4)(CO)(28)(CdBr)(2)](n-) (n = 2-4) have been prepared by reacting [Ni(10)C(2)(CO)(15)](2-) with a large excess of CdBr(2).xH(2)O and the molecular structure of the di-anion [H(2)Ni(22)(C(2))(4)(CO)(28)(CdBr)(2)](2-) has been fully elucidated by means of X-ray crystallography. The corresponding [HNi(22)(C(2))(4)(CO)(28)(CdBr)(2)](3-) and [Ni(22)(C(2))(4)(CO)(28)(CdBr)(2)](4-) conjugated bases are quantitatively obtained upon dissolution of [H(2)Ni(22)(C(2))(4)(CO)(28)(CdBr)(2)](2-) salts in more basic solvents such as acetonitrile and DMSO, respectively. The hydride nature of both [H(2)Ni(22)(C(2))(4)(CO)(28)(CdBr)(2)](2-) and [HNi(22)(C(2))(4)(CO)(28)(CdBr)(2)](3-) has been directly proved by (1)H NMR spectroscopy. Their resonances are very broad under all experimental conditions and their chemical shift greatly depends on solvent as well as temperature. Observation of the hydride resonances in [H(4-n)Ni(22)(C(2))(4)(CO)(28)(CdBr)(2)](n-) (n = 2, 3) makes these clusters a case study of the phenomena behind the loss of any NMR signal in higher-nuclearity metal carbonyl cluster anions (MCCA). In the attempt to obtain a better insight on this experimental spectroscopic behaviour, solutions of [NMe(4)](3)[HNi(22)(C(2))(4)(CO)(28)(CdBr)(2)] have been investigated by dynamic light scattering (DLS) at various concentrations. The DLS experiments point out the presence in solution of a distribution of particles with nominal hydrodynamic diameters enormously greater than those of the free cluster ions resulting, probably, from aggregation in solution. This could formally justify the observed NMR behaviour, even if the present observations are preliminary and their quantitative assessment requires further systematic studies on MCCA aggregation in solution.
Web document clustering using hyperlink structures
He, Xiaofeng; Zha, Hongyuan; Ding, Chris H.Q; Simon, Horst D.
2001-05-07
With the exponential growth of information on the World Wide Web, there is great demand for developing efficient and effective methods for organizing and retrieving the information available. Document clustering plays an important role in information retrieval and taxonomy management for the World Wide Web and remains an interesting and challenging problem in the field of web computing. In this paper we consider document clustering methods exploring textual information, hyperlink structure and co-citation relations. In particular, we apply the normalized cut clustering method developed in computer vision to the task of hyperdocument clustering. We also explore some theoretical connections of the normalized-cut method to the K-means method. We then experiment with the normalized-cut method in the context of clustering query result sets for web search engines.
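The normalized cut objective itself is easy to state in code. The sketch below evaluates it for a candidate two-way split of a small weighted graph. The weight matrix is a hypothetical stand-in for combined text, hyperlink and co-citation similarities, and the function only scores a given partition rather than searching for the best one.

```python
def ncut(W, part_a):
    """Normalized cut of the split (part_a, rest) for a symmetric weight matrix W."""
    n = len(W)
    a = set(part_a)
    b = set(range(n)) - a
    cut = sum(W[i][j] for i in a for j in b)              # weight crossing the split
    assoc_a = sum(W[i][j] for i in a for j in range(n))   # total weight attached to a
    assoc_b = sum(W[i][j] for i in b for j in range(n))   # total weight attached to b
    return cut / assoc_a + cut / assoc_b


# Four hypothetical pages: strong links 0-1 and 2-3, one weak link 1-2.
W = [
    [0, 3, 0, 0],
    [3, 0, 1, 0],
    [0, 1, 0, 3],
    [0, 0, 3, 0],
]
balanced = ncut(W, [0, 1])
lopsided = ncut(W, [0])
print(balanced, lopsided)
```

Dividing by the association terms penalises splits that isolate a single page, so the balanced split through the weak link scores lower than cutting off page 0 alone.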
State estimation and prediction using clustered particle filters
Lee, Yoonsang; Majda, Andrew J.
2016-01-01
Particle filtering is an essential tool to improve uncertain model predictions by incorporating noisy observational data from complex systems including non-Gaussian features. A class of particle filters, clustered particle filters, is introduced for high-dimensional nonlinear systems, which uses relatively few particles compared with the standard particle filter. The clustered particle filter captures non-Gaussian features of the true signal, which are typical in complex nonlinear dynamical systems such as geophysical systems. The method is also robust in the difficult regime of high-quality sparse and infrequent observations. The key features of the clustered particle filtering are coarse-grained localization through the clustering of the state variables and particle adjustment to stabilize the method; each observation affects only neighbor state variables through clustering and particles are adjusted to prevent particle collapse due to high-quality observations. The clustered particle filter is tested for the 40-dimensional Lorenz 96 model with several dynamical regimes including strongly non-Gaussian statistics. The clustered particle filter shows robust skill in both achieving accurate filter results and capturing non-Gaussian statistics of the true signal. It is further extended to multiscale data assimilation, which provides the large-scale estimation by combining a cheap reduced-order forecast model and mixed observations of the large- and small-scale variables. This approach enables the use of a larger number of particles due to the computational savings in the forecast model. The multiscale clustered particle filter is tested for one-dimensional dispersive wave turbulence using a forecast model with model errors. PMID:27930332
Cluster Physics with Merging Galaxy Clusters
NASA Astrophysics Data System (ADS)
Molnar, Sandor
Collisions between galaxy clusters provide a unique opportunity to study matter in a parameter space which cannot be explored in our laboratories on Earth. In the standard ΛCDM model, where the total density is dominated by the cosmological constant (Λ) and the matter density by cold dark matter (CDM), structure formation is hierarchical, and clusters grow mostly by merging. Mergers of two massive clusters are the most energetic events in the universe after the Big Bang, hence they provide a unique laboratory to study cluster physics. The two main mass components in clusters behave differently during collisions: the dark matter is nearly collisionless, responding only to gravity, while the gas is subject to pressure forces and dissipation, and shocks and turbulence are developed during collisions. In the present contribution we review the different methods used to derive the physical properties of merging clusters. Different physical processes leave their signatures on different wavelengths, thus our review is based on a multifrequency analysis. In principle, the best way to analyze multifrequency observations of merging clusters is to model them using N-body/HYDRO numerical simulations. We discuss the results of such detailed analyses. New high spatial and spectral resolution ground and space based telescopes will come online in the near future. Motivated by these new opportunities, we briefly discuss methods which will be feasible in the near future in studying merging clusters.
Collins, Anne Gabrielle Eva; Frank, Michael Joshua
2016-07-01
Often the world is structured such that distinct sensory contexts signify the same abstract rule set. Learning from feedback thus informs us not only about the value of stimulus-action associations but also about which rule set applies. Hierarchical clustering models suggest that learners discover structure in the environment, clustering distinct sensory events into a single latent rule set. Such structure enables a learner to transfer any newly acquired information to other contexts linked to the same rule set, and facilitates re-use of learned knowledge in novel contexts. Here, we show that humans exhibit this transfer, generalization and clustering during learning. Trial-by-trial model-based analysis of EEG signals revealed that subjects' reward expectations incorporated this hierarchical structure; these structured neural signals were predictive of behavioral transfer and clustering. These results further our understanding of how humans learn and generalize flexibly by building abstract, behaviorally relevant representations of the complex, high-dimensional sensory environment.
[Autism Spectrum Disorder and DSM-5: Spectrum or Cluster?].
Kienle, Xaver; Freiberger, Verena; Greulich, Heide; Blank, Rainer
2015-01-01
Within the new DSM-5, the currently differentiated subgroups of "Autistic Disorder" (299.0), "Asperger's Disorder" (299.80) and "Pervasive Developmental Disorder" (299.80) are replaced by the more general "Autism Spectrum Disorder". With regard to patient-oriented and expedient counselling and therapy planning, however, the issue of an empirically reproducible and clinically feasible differentiation into subgroups must still be raised. Based on two autism rating scales (ASDS and FSK), an exploratory two-step cluster analysis was conducted with N=103 children (age: 5-18) seen in our social-pediatric health care centre to examine potentially autistic symptoms. In the two-cluster solution of both rating scales, it was mainly the problems in social communication that grouped the children into a cluster "with communication problems" (51% and 41%) and a cluster "without communication problems". Within the three-cluster solution of the ASDS, sensory hypersensitivity, adherence to routines and social-communicative problems generated an "autistic" subgroup (22%). The children of the second cluster ("communication problems", 35%) were described only by social-communicative problems, and the third group did not show any problems (38%). In the three-cluster solution of the FSK, the "autistic cluster" of the two-cluster solution split into a subgroup with mainly social-communicative problems (cluster 1) and a second subgroup described by restrictive, repetitive behavior. The different cluster solutions are discussed with a view to the new DSM-5 diagnostic criteria; for subsequent studies, a further specification of some of the ASDS and FSK items could be helpful.
HCsnip: An R Package for Semi-supervised Snipping of the Hierarchical Clustering Tree
Obulkasim, Askar; van de Wiel, Mark A
2015-01-01
Hierarchical clustering (HC) is one of the most frequently used methods in computational biology in the analysis of high-dimensional genomics data. Given a data set, HC outputs a binary tree whose leaves are the data points and whose internal nodes represent clusters of various sizes. Normally, a fixed-height cut on the HC tree is chosen, and each contiguous branch of data points below that height is considered as a separate cluster. However, the fixed-height branch cut may not be ideal in situations where one expects a complicated tree structure with nested clusters. Furthermore, because related background information is not used in selecting the cutoff, the induced clusters are often difficult to interpret. This paper describes a novel procedure that aims to automatically extract meaningful clusters from the HC tree in a semi-supervised way. The procedure is implemented in the R package HCsnip, available from Bioconductor. Rather than cutting the HC tree at a fixed height, HCsnip probes the various ways of snipping, possibly at variable heights, to tease out hidden clusters ensconced deep down in the tree. The cluster extraction process utilizes, along with the data set from which the HC tree is derived, commonly available background information. Consequently, the extracted clusters are highly reproducible and robust against the various sources of variation that “haunt” high-dimensional genomics data. Since the clustering process is guided by the background information, clusters are easy to interpret. Unlike existing packages, no constraint is placed on the data type on which clustering is desired. In particular, the package accepts patient follow-up data for guiding the cluster extraction process. To our knowledge, HCsnip is the first package that is able to decompose the HC tree into clusters with piecewise snipping under the guidance of patient time-to-event information. Our implementation of the semi-supervised HC tree snipping framework is generic, and
Digital image analysis of haematopoietic clusters.
Benzinou, A; Hojeij, Y; Roudot, A-C
2005-02-01
Counting and differentiating cell clusters is a tedious task when performed with a light microscope. Moreover, biased counts and interpretation are difficult to avoid because of the difficulty of evaluating the boundaries between different types of clusters. Presented here is a computer-based application able to solve these problems. The image analysis system is entirely automatic, from the stage screening to the statistical analysis of the results of each experimental plate. Good correlations are found with measurements made by a specialised technician.
Martínez, Ana
2014-08-07
Metal clusters have interesting characteristics, such as the relationship between properties and size of the cluster. This is not always apparent, so theoretical studies can provide relevant information. In this report, optimized structures and electron donor-acceptor properties of AunBim clusters are reported (n + m = 2-7, 20). Density functional theory calculations were performed to obtain optimized structures. The ground states of gold clusters formed with up to seven atoms are planar. The presence of Bi modifies the structure, and the clusters become 3-D. Several optimized geometries have at least one Bi atom bonded to gold or bismuth atoms and form structures similar to NH3. This fragment is also present in clusters with 20 atoms, where the formation of Au3Bi stabilizes the structures. Bismuth clusters are better electron donors and worse electron acceptors than gold clusters. Mixed clusters fall in between these two extremes. The presence of Bi atoms in gold clusters modifies the electron donor-acceptor properties of the clusters, but there is no correlation between the number of Bi atoms present in the cluster and the capacity for donating electrons. The effect of planarity in Au19Bi clusters is the same as that in Au20 clusters. The properties of pure gold clusters are certainly interesting, but clusters formed by Bi and Au are more important because the introduction of different atoms modifies the geometry, the stability, and consequently the physical and chemical properties. Apparently, the presence of Bi may increase the reactivity of gold clusters, but further studies are necessary to corroborate this hypothesis.
Handwritten text line segmentation by spectral clustering
NASA Astrophysics Data System (ADS)
Han, Xuecheng; Yao, Hui; Zhong, Guoqiang
2017-02-01
Since handwritten text lines are generally skewed and not obviously separated, text line segmentation of handwritten document images is still a challenging problem. In this paper, we propose a novel text line segmentation algorithm based on spectral clustering. Given a handwritten document image, we first convert it to a binary image and then compute the adjacency matrix of the pixel points. We apply spectral clustering to this similarity matrix and use the orthogonal k-means clustering algorithm to group the text lines. Experiments on a Chinese handwritten document database (HIT-MW) demonstrate the effectiveness of the proposed method.
Nuclear Clusters in Astrophysics
NASA Astrophysics Data System (ADS)
Kubono, S.; Binh, Dam N.; Hayakawa, S.; Hashimoto, H.; Kahl, D.; Wakabayashi, Y.; Yamaguchi, H.; Teranishi, T.; Iwasa, N.; Komatsubara, T.; Kato, S.; Khiem, Le H.
2010-03-01
The role of nuclear clustering in nucleosynthesis during stellar evolution is discussed using the previously proposed Cluster Nucleosynthesis Diagram (CND). Special emphasis is placed on α-induced stellar reactions together with molecular states for O and C burning.
Cluster Stability Estimation Based on a Minimal Spanning Trees Approach
NASA Astrophysics Data System (ADS)
Volkovich, Zeev (Vladimir); Barzily, Zeev; Weber, Gerhard-Wilhelm; Toledano-Kitai, Dvora
2009-08-01
Among the areas of data and text mining employed today in science, economy and technology, clustering theory serves as a preprocessing step in data analysis. However, many open questions still await theoretical and practical treatment; for example, the problem of determining the true number of clusters has not been satisfactorily solved. In the current paper, this problem is addressed by the cluster stability approach. For several possible numbers of clusters we estimate the stability of partitions obtained from clustering of samples. Partitions are considered consistent if their clusters are stable. Cluster validity is measured as the total number of edges, in the clusters' minimal spanning trees, connecting points from different samples. In effect, we use the Friedman and Rafsky two-sample test statistic. The homogeneity hypothesis, of well-mingled samples within the clusters, leads to an asymptotically normal distribution of the considered statistic. Resting upon this fact, the standard score of this edge count is computed, and the partition quality is represented by the worst cluster, corresponding to the minimal standard score value. It is natural to expect that the true number of clusters can be characterized by the empirical distribution having the shortest left tail. The proposed methodology sequentially creates the described value distribution and estimates its left-asymmetry. Numerical experiments, presented in the paper, demonstrate the ability of the approach to detect the true number of clusters.
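The edge count at the heart of this abstract can be illustrated with a short sketch. It builds a Euclidean minimal spanning tree with Prim's algorithm and counts the edges that join points drawn from different samples. The function names `mst_edges` and `cross_sample_edges` are hypothetical, and the standard-score normalization described in the abstract is not reproduced here.

```python
import math

def mst_edges(points):
    """Return the edges (i, j) of a Euclidean minimal spanning tree (Prim's algorithm)."""
    n = len(points)
    if n < 2:
        return []
    in_tree = [False] * n
    best_dist = [math.inf] * n   # shortest known distance from the tree to each vertex
    best_from = [-1] * n         # tree vertex that realizes that distance
    in_tree[0] = True
    for j in range(1, n):
        best_dist[j] = math.dist(points[0], points[j])
        best_from[j] = 0
    edges = []
    for _ in range(n - 1):
        # Attach the vertex outside the tree that is closest to the tree.
        k = min((j for j in range(n) if not in_tree[j]), key=lambda j: best_dist[j])
        edges.append((best_from[k], k))
        in_tree[k] = True
        for j in range(n):
            if not in_tree[j]:
                d = math.dist(points[k], points[j])
                if d < best_dist[j]:
                    best_dist[j] = d
                    best_from[j] = k
    return edges

def cross_sample_edges(points, sample_labels):
    """Count MST edges whose endpoints come from different samples."""
    return sum(1 for i, j in mst_edges(points) if sample_labels[i] != sample_labels[j])
```

Under this sketch, well-mingled samples inside a cluster give a large count, while samples that occupy separate regions give a small one.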
[Pathophysiology of cluster headache].
Donnet, Anne
2015-11-01
The aetiology of cluster headache is partially unknown. Three areas are involved in the pathogenesis of cluster headache: the trigeminal nociceptive pathways, the autonomic system and the hypothalamus. The cluster headache attack involves activation of the trigeminal autonomic reflex. A dysfunction located in posterior hypothalamic gray matter is probably pivotal in the process. There is a probable association between smoke exposure, a possible genetic predisposition and the development of cluster headache.
Nikooienejad, Amir; Wang, Wenyi; Johnson, Valen E.
2016-01-01
Motivation: The advent of new genomic technologies has resulted in the production of massive data sets. Analyses of these data require new statistical and computational methods. In this article, we propose one such method that is useful in selecting explanatory variables for prediction of a binary response. Although this problem has recently been addressed using penalized likelihood methods, we adopt a Bayesian approach that utilizes a mixture of non-local prior densities and point masses on the binary regression coefficient vectors. Results: The resulting method, which we call iMOMLogit, provides improved performance in identifying true models and reducing estimation and prediction error in a number of simulation studies. More importantly, its application to several genomic datasets produces predictions that have high accuracy using far fewer explanatory variables than competing methods. We also describe a novel approach for setting prior hyperparameters by examining the total variation distance between the prior distributions on the regression parameters and the distribution of the maximum likelihood estimator under the null distribution. Finally, we describe a computational algorithm that can be used to implement iMOMLogit in ultrahigh-dimensional settings (p>>n) and provide diagnostics to assess the probability that this algorithm has identified the highest posterior probability model. Availability and implementation: Software to implement this method can be downloaded at: http://www.stat.tamu.edu/∼amir/code.html. Contact: wwang7@mdanderson.org or vjohnson@stat.tamu.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26740524
Matlab Cluster Ensemble Toolbox v. 1.0
2009-04-27
This is a Matlab toolbox for investigating the application of cluster ensembles to data classification, with the objective of improving the accuracy and/or speed of clustering. The toolbox divides the cluster ensemble problem into four areas and provides functionality for each: (1) synthetic data generation, (2) clustering to generate individual data partitions and similarity matrices, (3) consensus function generation and final clustering to generate ensemble data partitioning, and (4) implementation of accuracy metrics. With regard to data generation, Gaussian data of arbitrary dimension can be generated. The kcenters algorithm can then be used to generate individual data partitions either by (a) subsampling the data and clustering each subsample, or by (b) randomly initializing the algorithm and generating a clustering for each initialization. In either case an overall similarity matrix can be computed using a consensus function operating on the individual similarity matrices. A final clustering can be performed, and performance metrics are provided for evaluation purposes.
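One common way to combine individual partitions into an overall similarity matrix is a co-association matrix: entry (i, j) is the fraction of partitions that place items i and j in the same cluster. The sketch below is written in Python for illustration only; it is not code from the Matlab toolbox, the name `co_association` is hypothetical, and averaging is assumed as the consensus function because the abstract does not specify one.

```python
def co_association(partitions):
    """Average the same-cluster indicator over several partitions of the same items.

    partitions: list of label lists, all of the same length n.
    Returns an n x n matrix with entries in [0, 1].
    """
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    runs = len(partitions)
    return [[c / runs for c in row] for row in counts]
```

The resulting matrix can then be passed to any clustering method that accepts pairwise similarities to obtain the final ensemble partition.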
NASA Astrophysics Data System (ADS)
Graf, Norman A.
2001-07-01
An object-oriented framework for undertaking clustering algorithm studies has been developed. We present here the definitions for the abstract Cells and Clusters as well as the interface for the algorithm. We intend to use this framework to investigate the interplay between various clustering algorithms and the resulting jet reconstruction efficiency and energy resolutions to assist in the design of the calorimeter detector.
Star clusters as simple stellar populations.
Bruzual A, Gustavo
2010-02-28
In this paper, I review to what extent we can understand the photometric properties of star clusters, and of low-mass, unresolved galaxies, in terms of population-synthesis models designed to describe 'simple stellar populations' (SSPs), i.e. groups of stars born at the same time, in the same volume of space and from a gas cloud of homogeneous chemical composition. The photometric properties predicted by these models do not readily match the observations of most star clusters, unless we properly take into account the expected variation in the number of stars occupying sparsely populated evolutionary stages, owing to stochastic fluctuations in the stellar initial mass function. In this case, population-synthesis models reproduce remarkably well the full ranges of observed integrated colours and absolute magnitudes of star clusters of various ages and metallicities. The disagreement between the model predictions and observations of cluster colours and magnitudes may indicate problems with or deficiencies in the modelling, and does not necessarily tell us that star clusters do not behave like SSPs. Matching the photometric properties of star clusters using SSP models is a necessary (but not sufficient) condition for clusters to be considered SSPs. Composite models, characterized by complex star-formation histories, also match the observed cluster colours.
Collaborative Clustering for Sensor Networks
NASA Technical Reports Server (NTRS)
Wagstaff, Kiri L.; Green, Jillian; Lane, Terran
2011-01-01
Traditionally, nodes in a sensor network simply collect data and then pass it on to a centralized node that archives, distributes, and possibly analyzes the data. However, analysis at the individual nodes could enable faster detection of anomalies or other interesting events, as well as faster responses such as sending out alerts or increasing the data collection rate. There is an additional opportunity for increased performance if individual nodes can communicate directly with their neighbors. Previously, a method was developed by which machine learning classification algorithms could collaborate to achieve high performance autonomously (without requiring human intervention). This method worked for supervised learning algorithms, in which labeled data is used to train models. The learners collaborated by exchanging labels describing the data. The new advance enables clustering algorithms, which do not use labeled data, to also collaborate. This is achieved by defining a new language for collaboration that uses pair-wise constraints to encode useful information for other learners. These constraints specify that two items must, or cannot, be placed into the same cluster. Previous work has shown that clustering with these constraints (in isolation) already improves performance. In the problem formulation, each learner resides at a different node in the sensor network and makes observations (collects data) independently of the other learners. Each learner clusters its data and then selects a pair of items about which it is uncertain and uses them to query its neighbors. The resulting feedback (a must and cannot constraint from each neighbor) is combined by the learner into a consensus constraint, and it then reclusters its data while incorporating the new constraint. A strategy was also proposed for cleaning the resulting constraint sets, which may contain conflicting constraints; this improves performance significantly. This approach has been applied to collaborative
Improving performance through concept formation and conceptual clustering
NASA Technical Reports Server (NTRS)
Fisher, Douglas H.
1992-01-01
Research from June 1989 through October 1992 focussed on concept formation, clustering, and supervised learning for purposes of improving the efficiency of problem-solving, planning, and diagnosis. These projects resulted in two dissertations on clustering, explanation-based learning, and means-ends planning, and publications in conferences and workshops, several book chapters, and journals; a complete Bibliography of NASA Ames supported publications is included. The following topics are studied: clustering of explanations and problem-solving experiences; clustering and means-end planning; and diagnosis of space shuttle and space station operating modes.
NASA Astrophysics Data System (ADS)
Feng, Jian-xin; Tang, Jia-fu; Wang, Guang-xing
2007-04-01
Based on an analysis of clustering algorithms previously proposed for MANETs, this paper proposes a novel clustering strategy. Trust is defined through statistical hypothesis testing in probability theory, and the cluster head is selected according to node trust and node mobility. The strategy can thereby detect malicious nodes, a function neglected by other clustering algorithms. It also overcomes a deficiency of the MOBIC algorithm, which cannot compute the relative mobility metric of the corresponding nodes when the received power of two consecutive HELLO packets cannot be measured. It is an effective solution for clustering a MANET securely.
Vesperini, Enrico
2010-02-28
Dynamical evolution plays a key role in shaping the current properties of star clusters and star cluster systems. A detailed understanding of the effects of evolutionary processes is essential to be able to disentangle the properties that result from dynamical evolution from those imprinted at the time of cluster formation. In this review, I focus my attention on globular clusters, and review the main physical ingredients driving their early and long-term evolution, describe the possible evolutionary routes and show how cluster structure and stellar content are affected by dynamical evolution.
NASA Astrophysics Data System (ADS)
Lee, J. H.; Yoon, H.; Kitanidis, P. K.; Werth, C. J.; Valocchi, A. J.
2015-12-01
Characterizing subsurface properties, particularly hydraulic conductivity, is crucial for reliable and cost-effective groundwater supply management, contaminant remediation, and emerging deep subsurface activities such as geologic carbon storage and unconventional resources recovery. With recent advances in sensor technology, a large volume of hydro-geophysical and chemical data can be obtained to achieve high-resolution images of subsurface properties, which can be used for accurate subsurface flow and reactive transport predictions. However, subsurface characterization with a plethora of information requires high, often prohibitive, computational costs associated with "big data" processing and large-scale numerical simulations. As a result, traditional inversion techniques are not well-suited for problems that require coupled multi-physics simulation models with massive data. In this work, we apply a scalable inversion method called the Principal Component Geostatistical Approach (PCGA) to characterize the heterogeneous hydraulic conductivity (K) distribution in a 3-D sand box. The PCGA is a Jacobian-free geostatistical inversion approach that uses the leading principal components of the prior information to reduce computational costs, sometimes dramatically, and can be easily linked with any simulation software. Sequential images of transient tracer concentrations in the sand box were obtained using magnetic resonance imaging (MRI), resulting in 6 million tracer-concentration data points [Yoon et al., 2008]. Since each individual tracer observation carries little information on the K distribution, the dimension of the data was reduced using temporal moments and the discrete cosine transform (DCT). Consequently, 100,000 unknown K values consistent with the scale of the MRI data (at a scale of 0.25^3 cm^3) were estimated by matching temporal moments and DCT coefficients of the original tracer data. Estimated K fields are close to the true K field, and even small
Image segmentation using fuzzy LVQ clustering networks
NASA Technical Reports Server (NTRS)
Tsao, Eric Chen-Kuo; Bezdek, James C.; Pal, Nikhil R.
1992-01-01
In this note we formulate image segmentation as a clustering problem. Feature vectors extracted from a raw image are clustered into subregions, thereby segmenting the image. A fuzzy generalization of a Kohonen learning vector quantization (LVQ) which integrates the Fuzzy c-Means (FCM) model with the learning rate and updating strategies of the LVQ is used for this task. This network, which segments images in an unsupervised manner, is thus related to the FCM optimization problem. Numerical examples on photographic and magnetic resonance images are given to illustrate this approach to image segmentation.
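For readers unfamiliar with the Fuzzy c-Means model mentioned above, the following sketch computes the standard FCM membership of one feature vector in each cluster, given fixed cluster centers. It shows only the textbook membership formula, not the fuzzy LVQ network of the paper; the function name `fcm_memberships` and the default fuzzifier `m=2.0` are assumptions made for illustration.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means memberships of `point` in each cluster.

    u_i = 1 / sum_j (d_i / d_j) ** (2 / (m - 1)), where d_i is the distance
    from the point to center i and m > 1 is the fuzzifier.
    """
    dists = [math.dist(point, c) for c in centers]
    # A point that coincides with one or more centers belongs fully to them.
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0 for i in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** exponent for d_j in dists) for d_i in dists]
```

For a pixel feature at distance 1 from one center and 2 from another, with m = 2, the memberships are 0.8 and 0.2, so the nearer cluster dominates without excluding the other.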
Hierarchical modeling of cluster size in wildlife surveys
Royle, J. Andrew
2008-01-01
Clusters or groups of individuals are the fundamental unit of observation in many wildlife sampling problems, including aerial surveys of waterfowl, marine mammals, and ungulates. Explicit accounting of cluster size in models for estimating abundance is necessary because detection of individuals within clusters is not independent and detectability of clusters is likely to increase with cluster size. This induces a cluster size bias in which the average cluster size in the sample is larger than in the population at large. Thus, failure to account for the relationship between detectability and cluster size will tend to yield a positive bias in estimates of abundance or density. I describe a hierarchical modeling framework for accounting for cluster-size bias in animal sampling. The hierarchical model consists of models for the observation process conditional on the cluster size distribution and the cluster size distribution conditional on the total number of clusters. Optionally, a spatial model can be specified that describes variation in the total number of clusters per sample unit. Parameter estimation, model selection, and criticism may be carried out using conventional likelihood-based methods. An extension of the model is described for the situation where measurable covariates at the level of the sample unit are available. Several candidate models within the proposed class are evaluated for aerial survey data on mallard ducks (Anas platyrhynchos).
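The cluster-size bias described above can be made concrete with a small calculation. If clusters of size s occur with probability p(s) in the population and are detected with probability g(s), the expected size of a detected cluster is the sum of s·p(s)·g(s) divided by the sum of p(s)·g(s). The sketch below evaluates that ratio; the function name and the example detection function are hypothetical and are not taken from the paper's hierarchical model.

```python
def mean_observed_cluster_size(size_probs, detect_prob):
    """Expected size of a detected cluster.

    size_probs: dict mapping cluster size -> probability in the population.
    detect_prob: function mapping cluster size -> detection probability.
    """
    # Weight each size by how often it occurs and how likely it is to be seen.
    weights = {s: p * detect_prob(s) for s, p in size_probs.items()}
    total = sum(weights.values())
    return sum(s * w for s, w in weights.items()) / total
```

With half the clusters of size 1 and half of size 3, the population mean is 2. If detection probability is proportional to size, the mean size among detected clusters rises to 2.5, which is the positive bias the abstract refers to.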
Spin alignment of stars in old open clusters
NASA Astrophysics Data System (ADS)
Corsaro, Enrico; Lee, Yueh-Ning; García, Rafael A.; Hennebelle, Patrick; Mathur, Savita; Beck, Paul G.; Mathis, Stephane; Stello, Dennis; Bouvier, Jérôme
2017-03-01
Stellar clusters form by gravitational collapse of turbulent molecular clouds, with up to several thousand stars per cluster. They are thought to be the birthplace of most stars and therefore play an important role in our understanding of star formation, a fundamental problem in astrophysics. The initial conditions of the molecular cloud establish its dynamical history until the stellar cluster is born. However, the evolution of the cloud's angular momentum during cluster formation is not well understood. Current observations have suggested that turbulence scrambles the angular momentum of the cluster-forming cloud, preventing spin alignment among stars within a cluster. Here we use asteroseismology to measure the inclination angles of spin axes in 48 stars from the two old open clusters NGC 6791 and NGC 6819. The stars within each cluster show strong alignment. Three-dimensional hydrodynamical simulations of proto-cluster formation show that at least 50% of the initial proto-cluster kinetic energy has to be rotational in order to obtain strong stellar-spin alignment within a cluster. Our result indicates that the global angular momentum of the cluster-forming clouds was efficiently transferred to each star and that its imprint has survived several gigayears since the clusters formed.
Unconventional methods for clustering
NASA Astrophysics Data System (ADS)
Kotyrba, Martin
2016-06-01
Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. The topic of this paper is one of the modern methods of clustering, namely the SOM (Self-Organising Map). The paper presents the theory needed to understand the principle of clustering and describes the algorithms used for clustering in our experiments.
Full Text Clustering and Relationship Network Analysis of Biomedical Publications
Guan, Renchu; Yang, Chen; Marchese, Maurizio; Liang, Yanchun; Shi, Xiaohu
2014-01-01
Rapid developments in the biomedical sciences have increased the demand for automatic clustering of biomedical publications. In contrast to current approaches to text clustering, which focus exclusively on the contents of abstracts, a novel method is proposed for clustering and analysis of complete biomedical article texts. To reduce dimensionality, Cosine Coefficient is used on a sub-space of only two vectors, instead of computing the Euclidean distance within the space of all vectors. Then a strategy and algorithm is introduced for Semi-supervised Affinity Propagation (SSAP) to improve analysis efficiency, using biomedical journal names as an evaluation background. Experimental results show that by avoiding high-dimensional sparse matrix computations, SSAP outperforms conventional k-means methods and improves upon the standard Affinity Propagation algorithm. In constructing a directed relationship network and distribution matrix for the clustering results, it can be noted that overlaps in scope and interests among BioMed publications can be easily identified, providing a valuable analytical tool for editors, authors and readers. PMID:25250864
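The cosine coefficient used above compares two documents by the angle between their term vectors rather than by their Euclidean distance. A minimal sketch follows, assuming each document is already represented as a list of term weights over a shared vocabulary; the function name `cosine_coefficient` is hypothetical and the paper's exact weighting scheme is not reproduced.

```python
import math

def cosine_coefficient(a, b):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document shares no terms with anything
    return dot / (norm_a * norm_b)
```

Because the measure depends only on direction, a long article and a short one with the same term proportions score close to 1, which is one reason it suits full-text comparison.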
Lie, Octavian V; van Mierlo, Pieter
2017-01-01
The visual interpretation of intracranial EEG (iEEG) is the standard method used in complex epilepsy surgery cases to map the regions of seizure onset targeted for resection. Still, visual iEEG analysis is labor-intensive and biased due to interpreter dependency. Multivariate parametric functional connectivity measures using adaptive autoregressive (AR) modeling of the iEEG signals based on the Kalman filter algorithm have been used successfully to localize the electrographic seizure onsets. Due to their high computational cost, these methods have been applied to a limited number of iEEG time-series (<60). The aim of this study was to test two Kalman filter implementations, a well-known multivariate adaptive AR model (Arnold et al. 1998) and a simplified, computationally efficient derivation of it, for their potential application to connectivity analysis of high-dimensional (up to 192 channels) iEEG data. When used on simulated seizures together with a multivariate connectivity estimator, the partial directed coherence, the two AR models were compared for their ability to reconstitute the designed seizure signal connections from noisy data. Next, focal seizures from iEEG recordings (73-113 channels) in three patients rendered seizure-free after surgery were mapped with the outdegree, a graph-theory index of outward directed connectivity. Simulation results indicated high levels of mapping accuracy for the two models in the presence of low-to-moderate noise cross-correlation. Accordingly, both AR models correctly mapped the real seizure onset to the resection volume. This study supports the possibility of conducting fully data-driven multivariate connectivity estimations on high-dimensional iEEG datasets using the Kalman filter approach.
Modeling and visualizing uncertainty in gene expression clusters using dirichlet process mixtures.
Rasmussen, Carl Edward; de la Cruz, Bernard J; Ghahramani, Zoubin; Wild, David L
2009-01-01
Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data, little attention has been paid to uncertainty in the results obtained. Dirichlet process mixture (DPM) models provide a nonparametric Bayesian alternative to the bootstrap approach to modeling uncertainty in gene expression clustering. Most previously published applications of Bayesian model-based clustering methods have been to short time series data. In this paper, we present a case study of the application of nonparametric Bayesian clustering methods to the clustering of high-dimensional non-time-series gene expression data using full Gaussian covariances. We use the probability that two genes belong to the same cluster in a DPM model as a measure of the similarity of these gene expression profiles. Conversely, this probability can be used to define a dissimilarity measure, which, for the purposes of visualization, can be input to one of the standard linkage algorithms used for hierarchical clustering. Biologically plausible results are obtained from the Rosetta compendium of expression profiles, extending previously published cluster analyses of these data.
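The similarity and dissimilarity described above can be written compactly. The notation is assumed for illustration: c_i is the cluster assignment of gene i, and c_i^{(t)} is its assignment in the t-th of T posterior samples. The Monte Carlo estimate and the choice d_ij = 1 - s_ij are one natural reading of the abstract, not necessarily the authors' exact definitions.

```latex
s_{ij} = P(c_i = c_j \mid \text{data})
       \approx \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\!\left[ c_i^{(t)} = c_j^{(t)} \right],
\qquad
d_{ij} = 1 - s_{ij}
```

Under this reading, genes that always share a cluster have d_ij = 0 and genes that never do have d_ij = 1, so the matrix of d_ij values can be passed directly to a linkage algorithm.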
THE STELLAR MASS GROWTH OF BRIGHTEST CLUSTER GALAXIES IN THE IRAC SHALLOW CLUSTER SURVEY
Lin, Yen-Ting; Brodwin, Mark; Gonzalez, Anthony H.; Bode, Paul; Eisenhardt, Peter R. M.; Stanford, S. A.; Vikhlinin, Alexey
2013-07-01
The details of the stellar mass assembly of brightest cluster galaxies (BCGs) remain an unresolved problem in galaxy formation. We have developed a novel approach that allows us to construct a sample of clusters that form an evolutionary sequence, and have applied it to the Spitzer IRAC Shallow Cluster Survey (ISCS) to examine the evolution of BCGs in progenitors of present-day clusters with mass of (2.5-4.5) × 10^14 M_sun. We follow the cluster mass growth history extracted from a high resolution cosmological simulation, and then use an empirical method that infers the cluster mass based on the ranking of cluster luminosity to select high-z clusters of appropriate mass from ISCS to be progenitors of the given set of z = 0 clusters. We find that, between z = 1.5 and 0.5, the BCGs have grown in stellar mass by a factor of 2.3, which is well-matched by the predictions from a state-of-the-art semi-analytic model. Below z = 0.5 we see hints of differences in behavior between the model and observation.
Alleviating Comprehension Problems in Movies. Working Paper.
ERIC Educational Resources Information Center
Tatsuki, Donna
This paper describes the various barriers to comprehension that learners may encounter when viewing feature films in a second language. Two clusters of interfacing factors that may contribute to comprehension hot spots emerged from a quantitative analysis of problems noted in student logbooks. One cluster had a strong acoustic basis, whereas the…
Clusters of polyhedra in spherical confinement
Teich, Erin G.; van Anders, Greg; Klotsa, Daphne; Dshemuchadse, Julia; Glotzer, Sharon C.
2016-01-01
Dense particle packing in a confining volume remains a rich, largely unexplored problem, despite applications in blood clotting, plasmonics, industrial packaging and transport, colloidal molecule design, and information storage. Here, we report densest found clusters of the Platonic solids in spherical confinement, for up to N=60 constituent polyhedral particles. We examine the interplay between anisotropic particle shape and isotropic 3D confinement. Densest clusters exhibit a wide variety of symmetry point groups and form in up to three layers at higher N. For many N values, icosahedra and dodecahedra form clusters that resemble sphere clusters. These common structures are layers of optimal spherical codes in most cases, a surprising fact given the significant faceting of the icosahedron and dodecahedron. We also investigate cluster density as a function of N for each particle shape. We find that, in contrast to what happens in bulk, polyhedra often pack less densely than spheres. We also find especially dense clusters at so-called magic numbers of constituent particles. Our results showcase the structural diversity and experimental utility of families of solutions to the packing in confinement problem. PMID:26811458
Clusters of polyhedra in spherical confinement.
Teich, Erin G; van Anders, Greg; Klotsa, Daphne; Dshemuchadse, Julia; Glotzer, Sharon C
2016-02-09
Dense particle packing in a confining volume remains a rich, largely unexplored problem, despite applications in blood clotting, plasmonics, industrial packaging and transport, colloidal molecule design, and information storage. Here, we report densest found clusters of the Platonic solids in spherical confinement, for up to N=60 constituent polyhedral particles. We examine the interplay between anisotropic particle shape and isotropic 3D confinement. Densest clusters exhibit a wide variety of symmetry point groups and form in up to three layers at higher N. For many N values, icosahedra and dodecahedra form clusters that resemble sphere clusters. These common structures are layers of optimal spherical codes in most cases, a surprising fact given the significant faceting of the icosahedron and dodecahedron. We also investigate cluster density as a function of N for each particle shape. We find that, in contrast to what happens in bulk, polyhedra often pack less densely than spheres. We also find especially dense clusters at so-called magic numbers of constituent particles. Our results showcase the structural diversity and experimental utility of families of solutions to the packing in confinement problem.
NASA Astrophysics Data System (ADS)
Borgelt, Christian
In clustering we often face the situation that only a subset of the available attributes is relevant for forming clusters, even though this may not be known beforehand. In such cases it is desirable to have a clustering algorithm that automatically weights attributes or even selects a proper subset. In this paper I study such an approach for fuzzy clustering, which is based on the idea to transfer an alternative to the fuzzifier (Klawonn and Höppner, What is fuzzy about fuzzy clustering? Understanding and improving the concept of the fuzzifier, In: Proc. 5th Int. Symp. on Intelligent Data Analysis, 254-264, Springer, Berlin, 2003) to attribute weighting fuzzy clustering (Keller and Klawonn, Int J Uncertain Fuzziness Knowl Based Syst 8:735-746, 2000). In addition, by reformulating Gustafson-Kessel fuzzy clustering, a scheme for weighting and selecting principal axes can be obtained. While in Borgelt (Feature weighting and feature selection in fuzzy clustering, In: Proc. 17th IEEE Int. Conf. on Fuzzy Systems, IEEE Press, Piscataway, NJ, 2008) I already presented such an approach for a global selection of attributes and principal axes, this paper extends it to a cluster-specific selection, thus arriving at a fuzzy subspace clustering algorithm (Parsons, Haque, and Liu, 2004).
Formation and Assembly of Massive Star Clusters
NASA Astrophysics Data System (ADS)
McMillan, Stephen
The formation of stars and star clusters is a major unresolved problem in astrophysics. It is central to modeling stellar populations and understanding galaxy luminosity distributions in cosmological models. Young massive clusters are major components of starburst galaxies, while globular clusters are cornerstones of the cosmic distance scale and represent vital laboratories for studies of stellar dynamics and stellar evolution. Yet how these clusters form and how rapidly and efficiently they expel their natal gas remain unclear, as do the consequences of this gas expulsion for cluster structure and survival. Also unclear is how the properties of low-mass clusters, which form from small-scale instabilities in galactic disks and inform much of our understanding of cluster formation and star-formation efficiency, differ from those of more massive clusters, which probably formed in starburst events driven by fast accretion at high redshift, or colliding gas flows in merging galaxies. Modeling cluster formation requires simulating many simultaneous physical processes, placing stringent demands on both software and hardware. Simulations of galaxies evolving in cosmological contexts usually lack the numerical resolution to simulate star formation in detail. They do not include detailed treatments of important physical effects such as magnetic fields, radiation pressure, ionization, and supernova feedback. Simulations of smaller clusters include these effects, but fall far short of the mass of even single young globular clusters. With major advances in computing power and software, we can now directly address this problem. We propose to model the formation of massive star clusters by integrating the FLASH adaptive mesh refinement magnetohydrodynamics (MHD) code into the Astrophysical Multi-purpose Software Environment (AMUSE) framework, to work with existing stellar-dynamical and stellar evolution modules in AMUSE. All software will be freely distributed on-line, allowing
Growth of Pt Clusters from Mixture Film of Pt-C and Dynamics of Pt Clusters
NASA Astrophysics Data System (ADS)
Shintaku, Masayuki; Kumamoto, Akihito; Suzuki, Hitoshi; Kaito, Chihiro
2007-06-01
A complete mixture film of carbon and platinum produced by coevaporation in a vacuum was directly heated in a transmission electron microscope. It was found that the diffusion and crystal growth of Pt clusters in the mixture film take place at approximately 500 °C. Pt clusters with a size of 2-5 nm were connected with each other in a parallel orientation or twin-crystal configuration in the mixture film. The growth of onion-like carbon with a hole at the center also occurred. The grown Pt clusters with twin-crystal structures appeared on and in the carbon film. The diffusion of Pt atoms in carbon is discussed in relation to the problem of elution in fuel cells. The movement of Pt clusters on and in the carbon film was observed directly, and the difference between the two cases is presented.
Cluster Size Optimization in Sensor Networks with Decentralized Cluster-Based Protocols
Amini, Navid; Vahdatpour, Alireza; Xu, Wenyao; Gerla, Mario; Sarrafzadeh, Majid
2011-01-01
Network lifetime and energy efficiency are viewed as the dominating considerations in designing cluster-based communication protocols for wireless sensor networks. This paper analytically provides the optimal cluster size that minimizes the total energy expenditure in such networks, where all sensors communicate data through their elected cluster heads to the base station in a decentralized fashion. LEACH, LEACH-Coverage, and DBS are the three cluster-based protocols investigated in this paper; none of them requires centralized support from a particular node. The analytical outcomes are given in the form of closed-form expressions for various widely used network configurations. Extensive simulations on different networks are used to confirm the expectations based on the analytical results. To obtain a thorough understanding of the results, the cluster number variability problem is identified and inspected from the energy consumption point of view. PMID:22267882
NASA Astrophysics Data System (ADS)
Chen, Jian
Available from UMI in association with The British Library. Requires signed TDF. In this thesis, we apply the tight-binding Hubbard model to alkali metal clusters with Hartree-Fock self-consistent methods and perturbation methods for the numerical calculations. We have studied the relation between the equilibrium structures and the range of the hopping matrix elements in the Hubbard Hamiltonian. The results show that the structures are not sensitive to the interaction range but are determined by the number of valence electrons each atom has. Inertia tensors are used to analyse the symmetries of the clusters. The principal axes of the clusters are determined; they are the axes of rotational symmetry of the clusters, if the clusters have any. The eigenvalues of the inertia tensors, which indicate the deformation of the clusters, are compared between our model and the ellipsoidal jellium model. The agreement is good for large clusters. At finite temperature, thermal motion causes the structures to fluctuate. We defined a fluctuation function with the distance matrix of a cluster. The fluctuation has been studied with the Monte-Carlo simulation method. Our studies show that the clusters remain in the solid state when the temperature is low. The small values of the fluctuation functions indicate thermal vibration of atoms around their equilibrium positions. If the temperature is high, the atoms are delocalized. The cluster melts and enters the liquid region. The cluster melting is simulated by the Monte-Carlo simulation with the fluctuation function we defined. Energy levels of clusters are calculated from the Hubbard model. Ionization potentials and magic numbers are also obtained from these energy levels. The results confirm that the Hubbard model is a good approximation for a small cluster. The excitation energy is presented by the difference between the original level and excited level, and the electron-hole interactions. We also have studied cooling of clusters
... Parkinson's disease; diseases such as arthritis or multiple sclerosis; vision or balance problems. Treatment of walking problems depends on the cause. Physical therapy, surgery, or mobility aids may help.
Convalescing Cluster Configuration Using a Superlative Framework
Sabitha, R.; Karthik, S.
2015-01-01
Competent data mining methods are vital to discover knowledge from databases which are built as a result of enormous growth of data. Various techniques of data mining are applied to obtain knowledge from these databases. Data clustering is one such descriptive data mining technique which guides in partitioning data objects into disjoint segments. K-means algorithm is a versatile algorithm among the various approaches used in data clustering. The algorithm and its diverse adaptation methods suffer certain problems in their performance. To overcome these issues a superlative algorithm has been proposed in this paper to perform data clustering. The specific feature of the proposed algorithm is discretizing the dataset, thereby improving the accuracy of clustering, and also adopting the binary search initialization method to generate cluster centroids. The generated centroids are fed as input to K-means approach which iteratively segments the data objects into respective clusters. The clustered results are measured for accuracy and validity. Experiments conducted by testing the approach on datasets from the UC Irvine Machine Learning Repository evidently show that the accuracy and validity measure is higher than the other two approaches, namely, simple K-means and Binary Search method. Thus, the proposed approach proves that discretization process will improve the efficacy of descriptive data mining tasks. PMID:26543895
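The pipeline described in this abstract (discretize the data, generate initial centroids, then hand them to K-means) can be illustrated with a minimal Python sketch. Several parts are assumptions rather than the authors' method: the abstract does not say how the data are discretized, so equal-width binning is used here, and the binary search initialization is not described, so the initial centroids are simply supplied by the caller. The function names `discretize` and `kmeans` are hypothetical.

```python
import math


def discretize(values, n_bins):
    """Equal-width binning (an assumed scheme): map each value to a bin index
    in the range 0 .. n_bins - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def kmeans(points, centroids, max_iter=100):
    """Standard Lloyd iteration starting from caller-supplied centroids.

    The initialization step from the abstract is NOT reproduced here; any
    method that yields starting centroids can be plugged in.
    """
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Toy data: two well-separated groups, discretized column by column.
raw = [(1.0, 2.0), (1.2, 1.9), (0.9, 2.1), (8.0, 9.0), (8.2, 9.1), (7.9, 8.8)]
columns = [discretize(list(col), 4) for col in zip(*raw)]
points = list(zip(*columns))
labels, centers = kmeans(points, [points[0], points[3]])
```

Whether discretization helps or hurts accuracy depends on the data and the bin count; the sketch only shows where the step sits relative to K-means.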
Quantum cluster algebras and quantum nilpotent algebras
Goodearl, Kenneth R.; Yakimov, Milen T.
2014-01-01
A major direction in the theory of cluster algebras is to construct (quantum) cluster algebra structures on the (quantized) coordinate rings of various families of varieties arising in Lie theory. We prove that all algebras in a very large axiomatically defined class of noncommutative algebras possess canonical quantum cluster algebra structures. Furthermore, they coincide with the corresponding upper quantum cluster algebras. We also establish analogs of these results for a large class of Poisson nilpotent algebras. Many important families of coordinate rings are subsumed in the class we are covering, which leads to a broad range of applications of the general results to the above-mentioned types of problems. As a consequence, we prove the Berenstein–Zelevinsky conjecture [Berenstein A, Zelevinsky A (2005) Adv Math 195:405–455] for the quantized coordinate rings of double Bruhat cells and construct quantum cluster algebra structures on all quantum unipotent groups, extending the theorem of Geiß et al. [Geiß C, et al. (2013) Selecta Math 19:337–397] for the case of symmetric Kac–Moody groups. Moreover, we prove that the upper cluster algebras of Berenstein et al. [Berenstein A, et al. (2005) Duke Math J 126:1–52] associated with double Bruhat cells coincide with the corresponding cluster algebras. PMID:24982197
Sample Size Determination for Clustered Count Data
Amatya, A.; Bhaumik, D.; Gibbons, R.D.
2013-01-01
We consider the problem of sample size determination for count data. Such data arise naturally in the context of multi-center (or cluster) randomized clinical trials, where patients are nested within research centers. We consider cluster-specific and population-average estimators (maximum likelihood based on generalized mixed-effects regression and generalized estimating equations respectively) for subject-level and cluster-level randomized designs respectively. We provide simple expressions for calculating number of clusters when comparing event rates of two groups in cross-sectional studies. The expressions we derive have closed form solutions and are based on either between-cluster variation or inter-cluster correlation for cross-sectional studies. We provide both theoretical and numerical comparisons of our methods with other existing methods. We specifically show that the performance of the proposed method is better for subject-level randomized designs, whereas the comparative performance depends on the rate ratio for the cluster-level randomized designs. We also provide a versatile method for longitudinal studies. Results are illustrated by three real data examples. PMID:23589228
Fast and accurate estimation for astrophysical problems in large databases
NASA Astrophysics Data System (ADS)
Richards, Joseph W.
2010-10-01
A recent flood of astronomical data has created much demand for sophisticated statistical and machine learning tools that can rapidly draw accurate inferences from large databases of high-dimensional data. In this Ph.D. thesis, methods for statistical inference in such databases will be proposed, studied, and applied to real data. I use methods for low-dimensional parametrization of complex, high-dimensional data that are based on the notion of preserving the connectivity of data points in the context of a Markov random walk over the data set. I show how this simple parameterization of data can be exploited to: define appropriate prototypes for use in complex mixture models, determine data-driven eigenfunctions for accurate nonparametric regression, and find a set of suitable features to use in a statistical classifier. In this thesis, methods for each of these tasks are built up from simple principles, compared to existing methods in the literature, and applied to data from astronomical all-sky surveys. I examine several important problems in astrophysics, such as estimation of star formation history parameters for galaxies, prediction of redshifts of galaxies using photometric data, and classification of different types of supernovae based on their photometric light curves. Fast methods for high-dimensional data analysis are crucial in each of these problems because they all involve the analysis of complicated high-dimensional data in large, all-sky surveys. Specifically, I estimate the star formation history parameters for the nearly 800,000 galaxies in the Sloan Digital Sky Survey (SDSS) Data Release 7 spectroscopic catalog, determine redshifts for over 300,000 galaxies in the SDSS photometric catalog, and estimate the types of 20,000 supernovae as part of the Supernova Photometric Classification Challenge. Accurate predictions and classifications are imperative in each of these examples because these estimates are utilized in broader inference problems
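The idea of a Markov random walk over the data set can be made concrete with a short Python sketch of its first step, the construction of a row-stochastic transition matrix. This is only an illustration under stated assumptions: the Gaussian kernel, the bandwidth parameter `epsilon`, and the function name `transition_matrix` are choices made here, not details taken from the thesis, and the later eigen-decomposition that produces the low-dimensional coordinates is omitted.

```python
import math


def transition_matrix(points, epsilon):
    """Build a row-stochastic matrix for a random walk over the data.

    Assumption: pairwise affinities come from a Gaussian kernel
    exp(-||p - q||^2 / epsilon); each row is then normalized to sum to 1.
    """
    weights = [
        [math.exp(-math.dist(p, q) ** 2 / epsilon) for q in points]
        for p in points
    ]
    return [[w / sum(row) for w in row] for row in weights]


# Two nearby points and one distant point on a line.
P = transition_matrix([(0.0,), (0.1,), (5.0,)], epsilon=1.0)
```

In this toy case the walk starting at the first point is far more likely to step to its close neighbour than to the distant point, which is the sense in which the walk encodes the connectivity of the data.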
NASA Astrophysics Data System (ADS)
Vikhlinin, A. A.; Kravtsov, A. V.; Markevich, M. L.; Sunyaev, R. A.; Churazov, E. M.
2014-04-01
Galaxy clusters are formed via nonlinear growth of primordial density fluctuations and are the most massive gravitationally bound objects in the present Universe. Their number density at different epochs and their properties depend strongly on the properties of dark matter and dark energy, making clusters a powerful tool for observational cosmology. Observations of the hot gas filling the gravitational potential well of a cluster allow the study of gasdynamic and plasma effects and of the effect of supermassive black holes on the heating and cooling of gas on cluster scales. The work of Yakov Borisovich Zeldovich has had a profound impact on virtually all cosmological and astrophysical studies of galaxy clusters, introducing concepts such as the Harrison-Zeldovich spectrum, the Zeldovich approximation, baryon acoustic peaks, and the Sunyaev-Zeldovich effect. Here, we review the most basic properties of clusters and their role in modern astrophysics and cosmology.
Zhang, Zhaoyang; Fang, Hua; Wang, Honggang
2016-06-01
Web-delivered trials are an important component in eHealth services. These trials, mostly behavior-based, generate big heterogeneous data that are longitudinal and high dimensional, with missing values. Unsupervised learning methods have been widely applied in this area; however, validating the optimal number of clusters has been challenging. Building upon our multiple imputation (MI) based fuzzy clustering, MIfuzzy, we propose a new multiple imputation based validation (MIV) framework and corresponding MIV algorithms for clustering big longitudinal eHealth data with missing values, and more generally for fuzzy-logic based clustering methods. Specifically, we detect the optimal number of clusters by auto-searching and -synthesizing a suite of MI-based validation methods and indices, including conventional (bootstrap or cross-validation based) and emerging (modularity-based) validation indices for general clustering methods as well as the specific one (Xie and Beni) for fuzzy clustering. The MIV performance was demonstrated on a big longitudinal dataset from a real web-delivered trial and using simulation. The results indicate that the MI-based Xie and Beni index for fuzzy clustering is more appropriate for detecting the optimal number of clusters for such complex data. The MIV concept and algorithms could be easily adapted to different types of clustering that could process big incomplete longitudinal trial data in eHealth services.
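For readers unfamiliar with the Xie and Beni index mentioned above, the following Python sketch computes the standard form of the index for a single fuzzy partition: membership-weighted within-cluster scatter divided by the number of points times the smallest squared distance between two cluster centers. Lower values indicate a more compact and better separated partition. The function name `xie_beni` and the data layout are assumptions made here, and the multiple-imputation part of the MIV framework (how the index is pooled across imputed data sets) is not shown because the abstract does not describe it.

```python
import math


def xie_beni(points, centers, memberships, m=2.0):
    """Xie-Beni validity index for one fuzzy partition (needs >= 2 centers).

    memberships[i][k] is the membership of point k in cluster i;
    m is the fuzzifier exponent (2.0 is the conventional default).
    """
    n_clusters, n_points = len(centers), len(points)
    # Numerator: membership-weighted squared distances to the centers.
    compactness = sum(
        (memberships[i][k] ** m) * math.dist(points[k], centers[i]) ** 2
        for i in range(n_clusters)
        for k in range(n_points)
    )
    # Denominator term: smallest squared distance between two centers.
    separation = min(
        math.dist(centers[i], centers[j]) ** 2
        for i in range(n_clusters)
        for j in range(i + 1, n_clusters)
    )
    return compactness / (n_points * separation)


# Toy example with crisp (0/1) memberships for two tight, distant groups.
pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
ctr = [(0.5,), (10.5,)]
u = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
xb = xie_beni(pts, ctr, u)
```

To choose the number of clusters, one would typically compute the index for several candidate values and prefer the one with the smallest index.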
Chemistry Within Molecular Clusters
1990-01-01
DME)nCH3OCH2+). We speculate that this is due to the fragments being consumed by an ion-molecule reaction within the cluster. One likely candidate is...the ion-molecule reaction of the fragment cations with a neutral DME, within the bulk cluster to form a trimethyloxonium cation intermediate. This...the observed products. We therefore speculate that the DME cluster reactions leading to the same products, should involve the same mechanism found to