Science.gov

Sample records for high-dimensional clustering problems

  1. Clustering high dimensional data using RIA

    SciTech Connect

    Aziz, Nazrina

    2015-05-15

    Clustering may simply represent a convenient method for organizing a large data set so that it can easily be understood and information can be efficiently retrieved. However, identifying clusters in high-dimensional data sets is a difficult task because of the curse of dimensionality. Another challenge in clustering is that some traditional dissimilarity functions cannot capture the pattern dissimilarity among objects. In this article, we use an alternative dissimilarity measurement called the Robust Influence Angle (RIA) in the partitioning method. RIA is developed using the eigenstructure of the covariance matrix and robust principal component scores. We observe that it can obtain clusters easily and hence avoids the curse of dimensionality. It can also cluster large data sets with mixed numeric and categorical values.
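
    The abstract does not spell out the RIA formula, so the following is only a minimal sketch of an angle-based dissimilarity computed on principal component scores; the robust covariance and robust score estimation steps are replaced by ordinary PCA for brevity, and all function names are illustrative assumptions rather than the authors' method.

```python
# Sketch of an angle-based dissimilarity on principal component scores.
# Illustrative only: the exact RIA formula is not given in the abstract,
# and the robust estimation step is replaced by ordinary PCA.
import numpy as np

def pc_scores(X, n_components=2):
    """Project centered data onto the leading eigenvectors of the covariance."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return Xc @ eigvecs[:, order]

def angle_dissimilarity(s_i, s_j, eps=1e-12):
    """Angle (radians) between two score vectors, used as a dissimilarity."""
    cos_ij = s_i @ s_j / (np.linalg.norm(s_i) * np.linalg.norm(s_j) + eps)
    return np.arccos(np.clip(cos_ij, -1.0, 1.0))

X = np.random.default_rng(0).normal(size=(100, 50))   # toy high-dimensional data
S = pc_scores(X)
d01 = angle_dissimilarity(S[0], S[1])                 # dissimilarity between objects 0 and 1
```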

  2. Clustering high dimensional data using RIA

    NASA Astrophysics Data System (ADS)

    Aziz, Nazrina

    2015-05-01

    Clustering may simply represent a convenient method for organizing a large data set so that it can easily be understood and information can be efficiently retrieved. However, identifying clusters in high-dimensional data sets is a difficult task because of the curse of dimensionality. Another challenge in clustering is that some traditional dissimilarity functions cannot capture the pattern dissimilarity among objects. In this article, we use an alternative dissimilarity measurement called the Robust Influence Angle (RIA) in the partitioning method. RIA is developed using the eigenstructure of the covariance matrix and robust principal component scores. We observe that it can obtain clusters easily and hence avoids the curse of dimensionality. It can also cluster large data sets with mixed numeric and categorical values.

  3. Enabling the Discovery of Recurring Anomalies in Aerospace System Problem Reports using High-Dimensional Clustering Techniques

    NASA Technical Reports Server (NTRS)

    Srivastava, Ashok, N.; Akella, Ram; Diev, Vesselin; Kumaresan, Sakthi Preethi; McIntosh, Dawn M.; Pontikakis, Emmanuel D.; Xu, Zuobing; Zhang, Yi

    2006-01-01

    This paper describes the results of a significant research and development effort conducted at NASA Ames Research Center to develop new text mining techniques to discover anomalies in free-text reports regarding system health and safety of two aerospace systems. We discuss two problems of significant importance in the aviation industry. The first problem is that of automatic anomaly discovery about an aerospace system through the analysis of tens of thousands of free-text problem reports that are written about the system. The second problem that we address is that of automatic discovery of recurring anomalies, i.e., anomalies that may be described in different ways by different authors, at varying times and under varying conditions, but that are truly about the same part of the system. The intent of recurring anomaly identification is to determine project or system weaknesses or high-risk issues. The discovery of recurring anomalies is a key goal in building safe, reliable, and cost-effective aerospace systems. We address the anomaly discovery problem on thousands of free-text reports using two strategies: (1) as an unsupervised learning problem where an algorithm takes free-text reports as input and automatically groups them into different bins, where each bin corresponds to a different unknown anomaly category; and (2) as a supervised learning problem where the algorithm classifies the free-text reports into one of a number of known anomaly categories. We then discuss the application of these methods to the problem of discovering recurring anomalies. In fact, the special nature of recurring anomalies (very small cluster sizes) requires incorporating new methods and measures to enhance the original approach for anomaly detection.

  4. Adaptive dimension reduction for clustering high dimensional data

    SciTech Connect

    Ding, Chris; He, Xiaofeng; Zha, Hongyuan; Simon, Horst

    2002-10-01

    It is well known that for high-dimensional data clustering, standard algorithms such as EM and K-means are often trapped in local minima. Many initialization methods have been proposed to tackle this problem, but with only limited success. In this paper the authors propose a new approach that resolves this problem by repeated dimension reductions such that K-means or EM is performed only in very low dimensions. Cluster membership is utilized as a bridge between the reduced-dimensional subspace and the original space, providing flexibility and ease of implementation. Clustering analysis performed on highly overlapped Gaussians, DNA gene expression profiles and internet newsgroups demonstrates the effectiveness of the proposed algorithm.
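
    A minimal sketch of the kind of alternation the abstract describes: cluster in a low-dimensional subspace, then use the resulting cluster membership to re-derive the subspace (here, the span of the cluster centroids in the original space). The PCA initialization and centroid-span subspace are assumed choices in the spirit of the idea, not the authors' exact formulation.

```python
# Alternate between K-means in a low-dimensional subspace and re-deriving
# that subspace from the current cluster membership (span of the centroids).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def adaptive_dr_kmeans(X, k=3, n_iter=10, seed=0):
    Xc = X - X.mean(axis=0)
    # Initial subspace from plain PCA with k-1 components.
    P = PCA(n_components=k - 1).fit(Xc).components_.T        # shape (d, k-1)
    labels = None
    for _ in range(n_iter):
        Z = Xc @ P                                            # project to low dimension
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Z)
        # Re-derive the subspace: orthonormal basis spanning the centroids
        # computed in the ORIGINAL space (cluster membership is the bridge).
        centroids = np.stack([Xc[labels == c].mean(axis=0) for c in range(k)])
        P, _ = np.linalg.qr(centroids.T)                      # (d, k) orthonormal
        P = P[:, : k - 1]
    return labels

labels = adaptive_dr_kmeans(np.random.default_rng(1).normal(size=(300, 100)), k=3)
```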

  5. Semi-supervised high-dimensional clustering by tight wavelet frames

    NASA Astrophysics Data System (ADS)

    Dong, Bin; Hao, Ning

    2015-08-01

    High-dimensional clustering arises frequently in many areas of the natural sciences, technical disciplines and social media. In this paper, we consider the problem of binary clustering of high-dimensional data, i.e. classification of a data set into 2 classes. We assume that the correct (or mostly correct) classification of a small portion of the given data is known. Based on such partial classification, we design optimization models that complete the clustering of the entire data set using the recently introduced tight wavelet frames on graphs. Numerical experiments of the proposed models applied to some real data sets are conducted. In particular, the performance of the models on some very high-dimensional data sets is examined; and combinations of the models with some existing dimension reduction techniques are also considered.

  6. Modification of DIRECT for high-dimensional design problems

    NASA Astrophysics Data System (ADS)

    Tavassoli, Arash; Haji Hajikolaei, Kambiz; Sadeqi, Soheil; Wang, G. Gary; Kjeang, Erik

    2014-06-01

    DIviding RECTangles (DIRECT), as a well-known derivative-free global optimization method, has been found to be effective and efficient for low-dimensional problems. When facing high-dimensional black-box problems, however, DIRECT's performance deteriorates. This work proposes a series of modifications to DIRECT for high-dimensional problems (dimensionality d>10). The principal idea is to increase the convergence speed by breaking its single initialization-to-convergence approach into several more intricate steps. Specifically, starting with the entire feasible area, the search domain will shrink gradually and adaptively to the region enclosing the potential optimum. Several stopping criteria have been introduced to avoid premature convergence. A diversification subroutine has also been developed to prevent the algorithm from being trapped in local minima. The proposed approach is benchmarked using nine standard high-dimensional test functions and one black-box engineering problem. All these tests show a significant efficiency improvement over the original DIRECT for high-dimensional design problems.

  7. Model-based Clustering of High-Dimensional Data in Astrophysics

    NASA Astrophysics Data System (ADS)

    Bouveyron, C.

    2016-05-01

    The nature of data in Astrophysics has changed, as in other scientific fields, in the past decades due to the increase in measurement capabilities. As a consequence, data are nowadays frequently of high dimensionality and available in mass or as streams. Model-based techniques for clustering are popular tools which are renowned for their probabilistic foundations and their flexibility. However, classical model-based techniques show disappointing behavior in high-dimensional spaces, which is mainly due to their dramatic over-parameterization. The recent developments in model-based classification overcome these drawbacks and allow high-dimensional data to be classified efficiently, even in the "small n / large p" situation. This work presents a comprehensive review of these recent approaches, including regularization-based techniques, parsimonious modeling, subspace classification methods and classification methods based on variable selection. The use of these model-based methods is also illustrated on real-world classification problems in Astrophysics using R packages.

  8. Visualization of high-dimensional clusters using nonlinear magnification

    SciTech Connect

    Keahey, T.A.

    1998-12-31

    This paper describes a cluster visualization system used for data-mining fraud detection. The system can simultaneously show 6 dimensions of data, and a unique technique of 3D nonlinear magnification allows individual clusters of data points to be magnified while still maintaining a view of the global context. The author first describes the fraud detection problem, along with the data which is to be visualized. Then he describes general characteristics of the visualization system, and shows how nonlinear magnification can be used in this system. Finally he concludes and describes options for further work.

  9. Penalized mixtures of factor analyzers with application to clustering high-dimensional microarray data

    PubMed Central

    Xie, Benhuai; Pan, Wei; Shen, Xiaotong

    2010-01-01

    Motivation: Model-based clustering has been widely used, e.g. in microarray data analysis. Since for high-dimensional data variable selection is necessary, several penalized model-based clustering methods have been proposed to realize simultaneous variable selection and clustering. However, the existing methods all assume that the variables are independent with the use of diagonal covariance matrices. Results: To model non-independence of variables (e.g. correlated gene expressions) while alleviating the problem with the large number of unknown parameters associated with a general non-diagonal covariance matrix, we generalize the mixture of factor analyzers to that with penalization, which, among others, can effectively realize variable selection. We use simulated data and real microarray data to illustrate the utility and advantages of the proposed method over several existing ones. Contact: weip@biostat.umn.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:20031967

  10. A multistage mathematical approach to automated clustering of high-dimensional noisy data

    PubMed Central

    Friedman, Alexander; Keselman, Michael D.; Gibb, Leif G.; Graybiel, Ann M.

    2015-01-01

    A critical problem faced in many scientific fields is the adequate separation of data derived from individual sources. Often, such datasets require analysis of multiple features in a highly multidimensional space, with overlap of features and sources. The datasets generated by simultaneous recording from hundreds of neurons emitting phasic action potentials have produced the challenge of separating the recorded signals into independent data subsets (clusters) corresponding to individual signal-generating neurons. Mathematical methods have been developed over the past three decades to achieve such spike clustering, but a complete solution with fully automated cluster identification has not been achieved. We propose here a fully automated mathematical approach that identifies clusters in multidimensional space through recursion, which combats the multidimensionality of the data. Recursion is paired with an approach to dimensional evaluation, in which each dimension of a dataset is examined for its informational importance for clustering. The dimensions offering greater informational importance are given added weight during recursive clustering. To combat strong background activity, our algorithm takes an iterative approach of data filtering according to a signal-to-noise ratio metric. The algorithm finds cluster cores, which are thereafter expanded to include complete clusters. This mathematical approach can be extended from its prototype context of spike sorting to other datasets that suffer from high dimensionality and background activity. PMID:25831512

  11. Scalable Clustering of High-Dimensional Data Technique Using SPCM with Ant Colony Optimization Intelligence.

    PubMed

    Srinivasan, Thenmozhi; Palanisamy, Balasubramanie

    2015-01-01

    Techniques for clustering high-dimensional data are emerging in response to the challenges of noisy, poor-quality data. This paper develops a clustering technique based on high-dimensional similarity-based PCM (SPCM) combined with ant colony optimization intelligence, which is effective for clustering nonspatial data without requiring the user to supply the number of clusters. The PCM is made similarity-based by combining it with the mountain method. Although this already yields efficient clustering, the result is further optimized using an ant colony algorithm with swarm intelligence. A scalable clustering technique is thus obtained, and the evaluation results are verified on synthetic datasets. PMID:26495413

  12. Scalable Clustering of High-Dimensional Data Technique Using SPCM with Ant Colony Optimization Intelligence

    PubMed Central

    Srinivasan, Thenmozhi; Palanisamy, Balasubramanie

    2015-01-01

    Techniques for clustering high-dimensional data are emerging in response to the challenges of noisy, poor-quality data. This paper develops a clustering technique based on high-dimensional similarity-based PCM (SPCM) combined with ant colony optimization intelligence, which is effective for clustering nonspatial data without requiring the user to supply the number of clusters. The PCM is made similarity-based by combining it with the mountain method. Although this already yields efficient clustering, the result is further optimized using an ant colony algorithm with swarm intelligence. A scalable clustering technique is thus obtained, and the evaluation results are verified on synthetic datasets. PMID:26495413

  13. Improved Cluster Identification and Visualization in High-Dimensional Data Using Self-Organizing Maps

    NASA Astrophysics Data System (ADS)

    Manukyan, N.; Eppstein, M. J.; Rizzo, D. M.

    2011-12-01

    data to demonstrate how the proposed methods facilitate automatic identification and visualization of clusters in real-world, high-dimensional biogeochemical data with complex relationships. The proposed methods are quite general and are applicable to a wide range of geophysical problems. [1] Pearce, A., Rizzo, D., and Mouser, P., "Subsurface characterization of groundwater contaminated by landfill leachate using microbial community profile data and a nonparametric decision-making process", Water Resources Research, 47:W06511, 11 pp, 2011. [2] Mouser, P., Rizzo, D., Druschel, G., Morales, S, O'Grady, P., Hayden, N., Stevens, L., "Enhanced detection of groundwater contamination from a leaking waste disposal site by microbial community profiles", Water Resources Research, 46:W12506, 12 pp., 2010.

  14. Visualization of high-dimensional clusters using nonlinear magnification

    NASA Astrophysics Data System (ADS)

    Keahey, T. A.

    1999-03-01

    This paper describes a visualization system which has been used as part of a data-mining effort to detect fraud and abuse within state medicare programs. The data-mining process generates a set of N attributes for each medicare provider and beneficiary in the state; these attributes can be numeric, categorical, or derived from the scoring process of the data-mining routines. The attribute list can be considered as an N-dimensional space, which is subsequently partitioned into some fixed number of cluster partitions. The sparse nature of the clustered space provides room for the simultaneous visualization of more than 3 dimensions; examples in the paper will show 6-dimensional visualization. This ability to view higher dimensional data allows the data-mining researcher to compare the clustering effectiveness of the different attributes. Transparency based rendering is also used in conjunction with filtering techniques to provide selective rendering of only those data which are of greatest interest. Nonlinear magnification techniques are used to stretch the N-dimensional space to allow focus on one or more regions of interest while still allowing a view of the global context. The magnification can either be applied globally, or in a constrained fashion to expand individual clusters within the space.

  15. High dimensional data clustering by partitioning the hypergraphs using dense subgraph partition

    NASA Astrophysics Data System (ADS)

    Sun, Xili; Tian, Shoucai; Lu, Yonggang

    2015-12-01

    Due to the curse of dimensionality, traditional clustering methods usually fail to produce meaningful results for high-dimensional data. Hypergraph partitioning is believed to be a promising method for dealing with this challenge. In this paper, we first construct a graph G from the data by defining an adjacency relationship between the data points using Shared Reverse k Nearest Neighbors (SRNN). Then a hypergraph is created from the graph G by defining the hyperedges to be all the maximal cliques in the graph G. After the hypergraph is produced, a powerful hypergraph partitioning method called dense subgraph partition (DSP), combined with the k-medoids method, is used to produce the final clustering results. The proposed method is evaluated on several real high-dimensional datasets, and the experimental results show that it improves the clustering results for high-dimensional data compared with applying the k-medoids method directly to the original data.
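
    The abstract does not give the exact SRNN adjacency rule, so the following sketch uses one plausible reading: connect two points when the sets of points that list them among their k nearest neighbors overlap sufficiently. The parameter choices are illustrative assumptions.

```python
# Build an adjacency graph from Shared Reverse k-Nearest Neighbors (SRNN),
# under an assumed overlap rule (not necessarily the paper's definition).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def srnn_adjacency(X, k=10, min_shared=3):
    n = len(X)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    knn = nn.kneighbors(X, return_distance=False)[:, 1:]     # drop self
    # reverse-kNN sets: rnn[i] = {j : i is among the k nearest neighbors of j}
    rnn = [set() for _ in range(n)]
    for j in range(n):
        for i in knn[j]:
            rnn[i].add(j)
    A = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            if len(rnn[i] & rnn[j]) >= min_shared:
                A[i, j] = A[j, i] = True
    return A

A = srnn_adjacency(np.random.default_rng(2).normal(size=(200, 30)))
```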

  16. Variational Bayesian strategies for high-dimensional, stochastic design problems

    NASA Astrophysics Data System (ADS)

    Koutsourelakis, P. S.

    2016-03-01

    This paper is concerned with a lesser-studied problem in the context of model-based uncertainty quantification (UQ), that of optimization/design/control under uncertainty. The solution of such problems is hindered not only by the usual difficulties encountered in UQ tasks (e.g. the high computational cost of each forward simulation, the large number of random variables) but also by the need to solve a nonlinear optimization problem involving large numbers of design variables and, potentially, constraints. We propose a framework that is suitable for a class of such problems and is based on the idea of recasting them as probabilistic inference tasks. To that end, we propose a Variational Bayesian (VB) formulation and an iterative VB-Expectation-Maximization scheme that is capable of identifying a local maximum as well as a low-dimensional set of directions in the design space, along which the objective exhibits the largest sensitivity. We demonstrate the validity of the proposed approach in the context of two numerical examples involving thousands of random and design variables. In all cases considered, the cost of the computations in terms of calls to the forward model was of the order of 100 or less. The accuracy of the approximations provided is assessed by information-theoretic metrics.

  17. Nuclear Potential Clustering As a New Tool to Detect Patterns in High Dimensional Datasets

    NASA Astrophysics Data System (ADS)

    Tonkova, V.; Paulus, D.; Neeb, H.

    2013-02-01

    We present a new approach for the clustering of high dimensional data without prior assumptions about the structure of the underlying distribution. The proposed algorithm is based on a concept adapted from nuclear physics. To partition the data, we model the dynamic behaviour of nucleons interacting in an N-dimensional space. An adaptive nuclear potential, comprising a short-range attractive term (strong interaction) and a long-range repulsive term (Coulomb force), is assigned to each data point. By modelling the dynamics, nucleons that are densely distributed in space fuse to build nuclei (clusters), whereas single point clusters repel each other. The formation of clusters is completed when the system reaches the state of minimal potential energy. The data are then grouped according to the particles' final effective potential energy level. The performance of the algorithm is tested with several synthetic datasets showing that the proposed method can robustly identify clusters even when complex configurations are present. Furthermore, quantitative MRI data from 43 multiple sclerosis patients were analyzed, showing a reasonable splitting into subgroups according to the individual patients' disease grade. The good performance of the algorithm on such highly correlated non-spherical datasets, which are typical for MRI derived image features, shows that Nuclear Potential Clustering is a valuable tool for automated data analysis, not only in the MRI domain.
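
    As an illustration of the kind of potential described above, the sketch below combines a short-range attractive term with a long-range Coulomb-like repulsion and sums it over all pairs of points. The functional form and parameters are assumptions, not the authors' model.

```python
# Illustrative pairwise potential: short-range attraction, long-range repulsion.
import numpy as np

def pairwise_potential(r, a=1.0, b=0.2, r0=1.0):
    """Attractive well at short range (exp decay), repulsion ~1/r at long range."""
    return -a * np.exp(-r / r0) + b / np.maximum(r, 1e-9)

def total_energy(X):
    """Sum of pairwise potentials over all point pairs of a configuration X."""
    diffs = X[:, None, :] - X[None, :, :]
    r = np.linalg.norm(diffs, axis=-1)
    iu = np.triu_indices(len(X), k=1)        # upper triangle: each pair once
    return pairwise_potential(r[iu]).sum()

E = total_energy(np.random.default_rng(3).normal(size=(50, 5)))
```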

  18. Clustering High-Dimensional Landmark-based Two-dimensional Shape Data

    PubMed Central

    Huang, Chao; Styner, Martin; Zhu, Hongtu

    2015-01-01

    An important goal in image analysis is to cluster and recognize objects of interest according to the shapes of their boundaries. Clustering such objects faces at least four major challenges including a curved shape space, a high-dimensional feature space, a complex spatial correlation structure, and shape variation associated with some covariates (e.g., age or gender). The aim of this paper is to develop a penalized model-based clustering framework to cluster landmark-based planar shape data, while explicitly addressing these challenges. Specifically, a mixture of offset-normal shape factor analyzers (MOSFA) is proposed with mixing proportions defined through a regression model (e.g., logistic) and an offset-normal shape distribution in each component for data in the curved shape space. A latent factor analysis model is introduced to explicitly model the complex spatial correlation. A penalized likelihood approach with both adaptive pairwise fusion Lasso penalty function and L2 penalty function is used to automatically realize variable selection via thresholding and deliver a sparse solution. Our real data analysis has confirmed the excellent finite-sample performance of MOSFA in revealing meaningful clusters in the corpus callosum shape data obtained from the Attention Deficit Hyperactivity Disorder-200 (ADHD-200) study. PMID:26604425

  19. Mining High-Dimensional Data

    NASA Astrophysics Data System (ADS)

    Wang, Wei; Yang, Jiong

    With the rapid growth of computational biology and e-commerce applications, high-dimensional data becomes very common. Thus, mining high-dimensional data is an urgent problem of great practical importance. However, there are some unique challenges for mining data of high dimensions, including (1) the curse of dimensionality and, more crucially, (2) the meaningfulness of the similarity measure in the high-dimensional space. In this chapter, we present several state-of-the-art techniques for analyzing high-dimensional data, e.g., frequent pattern mining, clustering, and classification. We will discuss how these methods deal with the challenges of high dimensionality.

  20. CHARACTERIZATION OF DISCONTINUITIES IN HIGH-DIMENSIONAL STOCHASTIC PROBLEMS ON ADAPTIVE SPARSE GRIDS

    SciTech Connect

    Jakeman, John D; Archibald, Richard K; Xiu, Dongbin

    2011-01-01

    In this paper we present a set of efficient algorithms for detection and identification of discontinuities in high dimensional space. The method is based on extension of polynomial annihilation for edge detection in low dimensions. Compared to the earlier work, the present method poses significant improvements for high dimensional problems. The core of the algorithms relies on adaptive refinement of sparse grids. It is demonstrated that in the commonly encountered cases where a discontinuity resides on a small subset of the dimensions, the present method becomes optimal, in the sense that the total number of points required for function evaluations depends linearly on the dimensionality of the space. The details of the algorithms will be presented and various numerical examples are utilized to demonstrate the efficacy of the method.

  1. An interactive visual testbed system for dimension reduction and clustering of large-scale high-dimensional data

    NASA Astrophysics Data System (ADS)

    Choo, Jaegul; Lee, Hanseung; Liu, Zhicheng; Stasko, John; Park, Haesun

    2013-01-01

    Many of the modern data sets such as text and image data can be represented in high-dimensional vector spaces and have benefited from advanced computational methods. Visual analytics approaches have contributed greatly to data understanding and analysis due to their capability of leveraging humans' ability for quick visual perception. However, visual analytics targeting large-scale data such as text and image data has been challenging due to the limited screen space in terms of both the numbers of data points and features to represent. Among various computational methods supporting visual analytics, dimension reduction and clustering have played essential roles by reducing these numbers in an intelligent way to visually manageable sizes. Given the numerous dimension reduction and clustering methods available, however, the decision on the choice of algorithms and their parameters becomes difficult. In this paper, we present an interactive visual testbed system for dimension reduction and clustering in large-scale high-dimensional data analysis. The testbed system enables users to apply various dimension reduction and clustering methods with different settings, visually compare the results from different algorithmic methods to obtain rich knowledge for the data and tasks at hand, and eventually choose the most appropriate path for a collection of algorithms and parameters. Using various data sets such as documents, images, and others that are already encoded in vectors, we demonstrate how the testbed system can support these tasks.

  2. A comparative study of three simulation optimization algorithms for solving high dimensional multi-objective optimization problems in water resources

    NASA Astrophysics Data System (ADS)

    Schütze, Niels; Wöhling, Thomas; de Play, Michael

    2010-05-01

    Some real-world optimization problems in water resources have a high-dimensional space of decision variables and more than one objective function. In this work, we compare three general-purpose, multi-objective simulation optimization algorithms, namely NSGA-II, AMALGAM, and CMA-ES-MO when solving three real case Multi-objective Optimization Problems (MOPs): (i) a high-dimensional soil hydraulic parameter estimation problem; (ii) a multipurpose multi-reservoir operation problem; and (iii) a scheduling problem in deficit irrigation. We analyze the behaviour of the three algorithms on these test problems considering their formulations ranging from 40 up to 120 decision variables and 2 to 4 objectives. The computational effort required by each algorithm in order to reach the true Pareto front is also analyzed.

  3. SWIFT—Scalable Clustering for Automated Identification of Rare Cell Populations in Large, High-Dimensional Flow Cytometry Datasets, Part 2: Biological Evaluation

    PubMed Central

    Mosmann, Tim R; Naim, Iftekhar; Rebhahn, Jonathan; Datta, Suprakash; Cavenaugh, James S; Weaver, Jason M; Sharma, Gaurav

    2014-01-01

    A multistage clustering and data processing method, SWIFT (detailed in a companion manuscript), has been developed to detect rare subpopulations in large, high-dimensional flow cytometry datasets. An iterative sampling procedure initially fits the data to multidimensional Gaussian distributions, then splitting and merging stages use a criterion of unimodality to optimize the detection of rare subpopulations, to converge on a consistent cluster number, and to describe non-Gaussian distributions. Probabilistic assignment of cells to clusters, visualization, and manipulation of clusters by their cluster medians, facilitate application of expert knowledge using standard flow cytometry programs. The dual problems of rigorously comparing similar complex samples, and enumerating absent or very rare cell subpopulations in negative controls, were solved by assigning cells in multiple samples to a cluster template derived from a single or combined sample. Comparison of antigen-stimulated and control human peripheral blood cell samples demonstrated that SWIFT could identify biologically significant subpopulations, such as rare cytokine-producing influenza-specific T cells. A sensitivity of better than one part per million was attained in very large samples. Results were highly consistent on biological replicates, yet the analysis was sensitive enough to show that multiple samples from the same subject were more similar than samples from different subjects. A companion manuscript (Part 1) details the algorithmic development of SWIFT. © 2014 The Authors. Published by Wiley Periodicals Inc. PMID:24532172

  4. Semi-Supervised Clustering for High-Dimensional and Sparse Features

    ERIC Educational Resources Information Center

    Yan, Su

    2010-01-01

    Clustering is one of the most common data mining tasks, used frequently for data organization and analysis in various application domains. Traditional machine learning approaches to clustering are fully automated and unsupervised where class labels are unknown a priori. In real application domains, however, some "weak" form of side…

  5. L2-Boosting algorithm applied to high-dimensional problems in genomic selection.

    PubMed

    González-Recio, Oscar; Weigel, Kent A; Gianola, Daniel; Naya, Hugo; Rosa, Guilherme J M

    2010-06-01

    The L2-Boosting algorithm is one of the most promising machine-learning techniques that has appeared in recent decades. It may be applied to high-dimensional problems such as whole-genome studies, and it is relatively simple from a computational point of view. In this study, we used this algorithm in a genomic selection context to make predictions of yet to be observed outcomes. Two data sets were used: (1) productive lifetime predicted transmitting abilities from 4702 Holstein sires genotyped for 32 611 single nucleotide polymorphisms (SNPs) derived from the Illumina BovineSNP50 BeadChip, and (2) progeny averages of food conversion rate, pre-corrected by environmental and mate effects, in 394 broilers genotyped for 3481 SNPs. Each of these data sets was split into training and testing sets, the latter comprising dairy or broiler sires whose ancestors were in the training set. Two weak learners, ordinary least squares (OLS) and non-parametric (NP) regression, were used for the L2-Boosting algorithm, to provide a stringent evaluation of the procedure. This algorithm was compared with BL [Bayesian LASSO (least absolute shrinkage and selection operator)] and BayesA regression. Learning tasks were carried out in the training set, whereas validation of the models was performed in the testing set. Pearson correlations between predicted and observed responses in the dairy cattle (broiler) data set were 0.65 (0.33), 0.53 (0.37), 0.66 (0.26) and 0.63 (0.27) for OLS-Boosting, NP-Boosting, BL and BayesA, respectively. The smallest bias and mean-squared errors (MSEs) were obtained with OLS-Boosting in both the dairy cattle (0.08 and 1.08, respectively) and broiler (-0.011 and 0.006, respectively) data sets. In the dairy cattle data set, the BL was more accurate (bias=0.10 and MSE=1.10) than BayesA (bias=1.26 and MSE=2.81), whereas no differences between these two methods were found in the broiler data set. L2-Boosting with a suitable learner was found to be a competitive
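
    A minimal sketch of L2-Boosting with a componentwise ordinary least squares weak learner, in the spirit of the OLS-Boosting variant mentioned above: at each step the single predictor (e.g. one SNP) that best fits the current residuals is selected and the fit is advanced by a small step. The step size, number of steps and toy data below are assumptions.

```python
# Componentwise OLS L2-Boosting sketch: repeatedly fit residuals with the
# single best-fitting column and take a small step toward that fit.
import numpy as np

def l2_boost_ols(X, y, n_steps=200, nu=0.1):
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    f = np.full(n, y.mean())                 # start from the mean response
    coef = np.zeros(p)
    for _ in range(n_steps):
        r = y - f                            # current residuals
        # componentwise OLS: slope of residuals on each centered column
        slopes = Xc.T @ r / np.maximum((Xc ** 2).sum(axis=0), 1e-12)
        sse = ((r[:, None] - Xc * slopes) ** 2).sum(axis=0)
        j = int(np.argmin(sse))              # best single predictor this step
        coef[j] += nu * slopes[j]
        f += nu * slopes[j] * Xc[:, j]
    return coef, f

rng = np.random.default_rng(4)
X = rng.integers(0, 3, size=(300, 1000)).astype(float)   # toy SNP matrix (0/1/2)
y = X[:, :5] @ rng.normal(size=5) + rng.normal(size=300)
coef, fitted = l2_boost_ols(X, y)
```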

  6. Haplotyping Problem, A Clustering Approach

    SciTech Connect

    Eslahchi, Changiz; Sadeghi, Mehdi; Pezeshk, Hamid; Kargar, Mehdi; Poormohammadi, Hadi

    2007-09-06

    Construction of two haplotypes from a set of Single Nucleotide Polymorphism (SNP) fragments is called the haplotype reconstruction problem. One of the most popular computational models for this problem is Minimum Error Correction (MEC). Since MEC is an NP-hard problem, here we propose a novel heuristic algorithm, based on clustering analysis in data mining, for the haplotype reconstruction problem. Based on the Hamming distance and similarity between two fragments, our iterative algorithm produces two clusters of fragments; then, in each iteration, the algorithm assigns a fragment to one of the clusters. Our results suggest that the algorithm has a lower reconstruction error rate in comparison with other algorithms.
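
    An illustrative sketch of the two-cluster idea: SNP fragments are assigned to whichever of two evolving consensus sequences they are closer to in Hamming distance over their covered positions. The gap handling, seeding and update rules here are simplifying assumptions, not the published algorithm.

```python
# Two-cluster assignment of SNP fragments by Hamming distance to consensus.
def hamming(frag, consensus):
    """Count mismatches at positions covered by both strings ('-' = gap)."""
    return sum(1 for a, b in zip(frag, consensus) if a != '-' and b != '-' and a != b)

def consensus(frags):
    """Per-position majority vote over non-gap characters."""
    out = []
    for col in zip(*frags):
        votes = [c for c in col if c != '-']
        out.append(max(set(votes), key=votes.count) if votes else '-')
    return ''.join(out)

def two_cluster(frags, n_iter=10):
    c0, c1 = frags[0], frags[-1]               # naive seeds
    for _ in range(n_iter):
        g0 = [f for f in frags if hamming(f, c0) <= hamming(f, c1)]
        g1 = [f for f in frags if hamming(f, c0) > hamming(f, c1)]
        if g0: c0 = consensus(g0)
        if g1: c1 = consensus(g1)
    return g0, g1

fragments = ["0101-", "01011", "1-100", "10100", "0-011", "1010-"]
h0, h1 = two_cluster(fragments)
```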

  7. A Fast Exact k-Nearest Neighbors Algorithm for High Dimensional Search Using k-Means Clustering and Triangle Inequality

    PubMed Central

    Wang, Xueyi

    2011-01-01

    The k-nearest neighbors (k-NN) algorithm is a widely used machine learning method that finds nearest neighbors of a test object in a feature space. We present a new exact k-NN algorithm called kMkNN (k-Means for k-Nearest Neighbors) that uses k-means clustering and the triangle inequality to accelerate the search for nearest neighbors in a high-dimensional space. The kMkNN algorithm has two stages. In the buildup stage, instead of using complex tree structures such as metric trees, kd-trees, or ball-trees, kMkNN uses a simple k-means clustering method to preprocess the training dataset. In the searching stage, given a query object, kMkNN finds nearest training objects starting from the nearest cluster to the query object and uses the triangle inequality to reduce the distance calculations. Experiments show that the performance of kMkNN is surprisingly good compared to the traditional k-NN algorithm and tree-based k-NN algorithms such as kd-trees and ball-trees. On a collection of 20 datasets with up to 10^6 records and 10^4 dimensions, kMkNN shows a 2- to 80-fold reduction of distance calculations and a 2- to 60-fold speedup over the traditional k-NN algorithm for 16 datasets. Furthermore, kMkNN performs significantly better than a kd-tree based k-NN algorithm for all datasets and performs better than a ball-tree based k-NN algorithm for most datasets. The results show that kMkNN is effective for searching nearest neighbors in high dimensional spaces. PMID:22247818
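
    A sketch of the two stages described above: preprocess the training set with k-means, then answer queries by scanning clusters from nearest centroid outward and using the triangle inequality |d(q,c) - d(x,c)| <= d(q,x) to skip points that cannot beat the current k-th best distance. Function names and parameter values are illustrative, not the paper's code.

```python
# kMkNN-style exact k-NN search: k-means preprocessing + triangle-inequality pruning.
import heapq
import numpy as np
from sklearn.cluster import KMeans

def build(X, n_clusters=20, seed=0):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    # per-cluster: member indices and their distances to the cluster centre
    members = [np.where(km.labels_ == c)[0] for c in range(n_clusters)]
    d_to_centre = [np.linalg.norm(X[m] - km.cluster_centers_[c], axis=1)
                   for c, m in enumerate(members)]
    return km.cluster_centers_, members, d_to_centre

def query(q, X, centres, members, d_to_centre, k=5):
    d_q_centres = np.linalg.norm(centres - q, axis=1)
    best = []                                   # max-heap via (-dist, idx)
    for c in np.argsort(d_q_centres):           # nearest cluster first
        for idx, dxc in zip(members[c], d_to_centre[c]):
            bound = abs(d_q_centres[c] - dxc)   # lower bound on d(q, x)
            if len(best) == k and bound >= -best[0][0]:
                continue                        # cannot improve the k-th best
            d = np.linalg.norm(X[idx] - q)
            heapq.heappush(best, (-d, idx))
            if len(best) > k:
                heapq.heappop(best)
    return sorted((-negd, i) for negd, i in best)

X = np.random.default_rng(5).normal(size=(2000, 50))
centres, members, d_to_centre = build(X)
neighbours = query(X[0], X, centres, members, d_to_centre, k=5)
```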

  8. Cluster expression in fission and fusion in high-dimensional macroscopic-microscopic calculations

    SciTech Connect

    Iwamoto, A.; Ichikawa, T.; Moller, P.; Sierk, A. J.

    2004-01-01

    We discuss the relation between the fission-fusion potential-energy surfaces of very heavy nuclei and the formation process of these nuclei in cold-fusion reactions. In the potential-energy surfaces, we find a pronounced valley structure, with one valley corresponding to the cold-fusion reaction, the other to fission. As the touching point is approached in the cold-fusion entrance channel, an instability towards dynamical deformation of the projectile occurs, which enhances the fusion cross section. These two 'cluster effects' enhance the production of superheavy nuclei in cold-fusion reactions, in addition to the effect of the low compound-system excitation energy in these reactions. Heavy-ion fusion reactions have been used extensively to synthesize heavy elements beyond actinide nuclei. In order to proceed further in this direction, we need to understand the formation process more precisely, not just the decay process. The dynamics of the formation process are considerably more complex than the dynamics necessary to interpret the spontaneous-fission decay of heavy elements. However, before implementing a full dynamical description it is useful to understand the basic properties of the potential-energy landscape encountered in the initial stages of the collision. The collision process and entrance-channel landscape can conveniently be separated into two parts, namely the early-stage separated system before touching and the late-stage composite system after touching. The transition between these two stages is particularly important, but not very well understood until now. To understand better the transition between the two stages we analyze here in detail the potential energy landscape or 'collision surface' of the system both outside and inside the touching configuration of the target and projectile. In Sec. 2, we discuss calculated five-dimensional potential-energy landscapes inside touching and identify major features. In Sec. 3, we present calculated

  9. SWIFT-scalable clustering for automated identification of rare cell populations in large, high-dimensional flow cytometry datasets, part 1: algorithm design.

    PubMed

    Naim, Iftekhar; Datta, Suprakash; Rebhahn, Jonathan; Cavenaugh, James S; Mosmann, Tim R; Sharma, Gaurav

    2014-05-01

    We present a model-based clustering method, SWIFT (Scalable Weighted Iterative Flow-clustering Technique), for digesting high-dimensional large-sized datasets obtained via modern flow cytometry into more compact representations that are well-suited for further automated or manual analysis. Key attributes of the method include the following: (a) the analysis is conducted in the multidimensional space retaining the semantics of the data, (b) an iterative weighted sampling procedure is utilized to maintain modest computational complexity and to retain discrimination of extremely small subpopulations (hundreds of cells from datasets containing tens of millions), and (c) a splitting and merging procedure is incorporated in the algorithm to preserve distinguishability between biologically distinct populations, while still providing a significant compaction relative to the original data. This article presents a detailed algorithmic description of SWIFT, outlining the application-driven motivations for the different design choices, a discussion of computational complexity of the different steps, and results obtained with SWIFT for synthetic data and relatively simple experimental data that allow validation of the desirable attributes. A companion paper (Part 2) highlights the use of SWIFT, in combination with additional computational tools, for more challenging biological problems. PMID:24677621

  10. Constrained optimization by radial basis function interpolation for high-dimensional expensive black-box problems with infeasible initial points

    NASA Astrophysics Data System (ADS)

    Regis, Rommel G.

    2014-02-01

    This article develops two new algorithms for constrained expensive black-box optimization that use radial basis function surrogates for the objective and constraint functions. These algorithms are called COBRA and Extended ConstrLMSRBF and, unlike previous surrogate-based approaches, they can be used for high-dimensional problems where all initial points are infeasible. They both follow a two-phase approach where the first phase finds a feasible point while the second phase improves this feasible point. COBRA and Extended ConstrLMSRBF are compared with alternative methods on 20 test problems and on the MOPTA08 benchmark automotive problem (D.R. Jones, Presented at MOPTA 2008), which has 124 decision variables and 68 black-box inequality constraints. The alternatives include a sequential penalty derivative-free algorithm, a direct search method with kriging surrogates, and two multistart methods. Numerical results show that COBRA algorithms are competitive with Extended ConstrLMSRBF and they generally outperform the alternatives on the MOPTA08 problem and most of the test problems.

  11. Automated fit of high-dimensional potential energy surfaces using cluster analysis and interpolation over descriptors of chemical environment.

    PubMed

    Fournier, René; Orel, Slava

    2013-12-21

    We present a method for fitting high-dimensional potential energy surfaces that is almost fully automated, can be applied to systems with various chemical compositions, and involves no particular choice of function form. We tested it on four systems: Ag20, Sn6Pb6, Si10, and Li8. The cost for energy evaluation is smaller than the cost of a density functional theory (DFT) energy evaluation by a factor of 1500 for Li8, and 60,000 for Ag20. We achieved intermediate accuracy (errors of 0.4 to 0.8 eV on atomization energies, or, 1% to 3% on cohesive energies) with rather small datasets (between 240 and 1400 configurations). We demonstrate that this accuracy is sufficient to correctly screen the configurations with lowest DFT energy, making this function potentially very useful in a hybrid global optimization strategy. We show that, as expected, the accuracy of the function improves with an increase in the size of the fitting dataset. PMID:24359355

  12. On the complexity of some quadratic Euclidean 2-clustering problems

    NASA Astrophysics Data System (ADS)

    Kel'manov, A. V.; Pyatkin, A. V.

    2016-03-01

    Some problems of partitioning a finite set of points of Euclidean space into two clusters are considered. In these problems, the following criteria are minimized: (1) the sum over both clusters of the sums of squared pairwise distances between the elements of the cluster and (2) the sum of the (multiplied by the cardinalities of the clusters) sums of squared distances from the elements of the cluster to its geometric center, where the geometric center (or centroid) of a cluster is defined as the mean value of the elements in that cluster. Additionally, another problem close to (2) is considered, where the desired center of one of the clusters is given as input, while the center of the other cluster is unknown (is the variable to be optimized) as in problem (2). Two variants of the problems are analyzed, in which the cardinalities of the clusters are (1) parts of the input or (2) optimization variables. It is proved that all the considered problems are strongly NP-hard and that, in general, there is no fully polynomial-time approximation scheme for them (unless P = NP).
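
    Written out under assumed notation (two clusters C1 and C2, Euclidean norm, and cluster centroid defined as the mean of its elements), the two minimized criteria described above can be stated as:

```latex
% Criterion (1): sum over both clusters of within-cluster pairwise squared distances.
% Criterion (2): cardinality-weighted sums of squared distances to the cluster centroids.
\[
  F_1(\mathcal{C}_1,\mathcal{C}_2) = \sum_{j=1}^{2} \sum_{x,\,y \in \mathcal{C}_j} \lVert x - y \rVert^2,
  \qquad
  F_2(\mathcal{C}_1,\mathcal{C}_2) = \sum_{j=1}^{2} |\mathcal{C}_j| \sum_{x \in \mathcal{C}_j}
  \bigl\lVert x - \bar{y}(\mathcal{C}_j) \bigr\rVert^2,
  \qquad
  \bar{y}(\mathcal{C}_j) = \frac{1}{|\mathcal{C}_j|} \sum_{x \in \mathcal{C}_j} x .
\]
```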

  13. A facility for using cluster research to study environmental problems

    SciTech Connect

    Not Available

    1991-11-01

    This report begins by describing the general application of cluster-based research to environmental chemistry and the development of a Cluster Structure and Dynamics Research Facility (CSDRF). Next, four important areas of cluster research are described in more detail, including how they can impact environmental problems. These are: surface-supported clusters, water and contaminant interactions, time-resolved dynamic studies in clusters, and cluster structures and reactions. The facilities and equipment required for each area of research are then presented. The appendices contain the workshop agenda and a listing of the researchers who participated in the workshop discussions that led to this report.

  14. Haitian adolescent personality clusters and their problem area correlates.

    PubMed

    McMahon, Robert C; Bryant, Vaughn E; Dévieux, Jessy G; Jean-Gilles, Michèle; Rosenberg, Rhonda; Malow, Robert M

    2013-04-01

    This study identified personality clusters among a community sample of adolescents of Haitian descent and related cluster subgroup membership to problems in the areas of substance abuse, mental and physical health, family and peer relationships, educational and vocational status, social skills, leisure and recreational pursuits, aggressive behavior-delinquency, and sexual risk activity. Three cluster subgroups were identified: dependent/conforming (N = 68), high pathology (N = 30), and confident/extroverted/conforming (N = 111). Although the overall sample was relatively healthy based on low average endorsement of problems across areas of expressed concern, significant physical health, mental health, relationship, educational, and HIV risk problems were identified in the MACI-identified high-psychopathology cluster subgroup. The confident/extroverted/conforming cluster subgroup revealed few problems and appears to reflect a protective style. PMID:22362195

  15. ICANP2: Isoenergetic cluster algorithm for NP-complete Problems

    NASA Astrophysics Data System (ADS)

    Zhu, Zheng; Fang, Chao; Katzgraber, Helmut G.

    NP-complete optimization problems with Boolean variables are of fundamental importance in computer science, mathematics and physics. Most notably, the minimization of general spin-glass-like Hamiltonians remains a difficult numerical task. There has been great interest in designing efficient heuristics to solve these computationally difficult problems. Inspired by the rejection-free isoenergetic cluster algorithm developed for Ising spin glasses, we present a generalized cluster update that can be applied to different NP-complete optimization problems with Boolean variables. The cluster updates allow for widespread sampling of phase space, thus speeding up optimization. By carefully tuning the pseudo-temperature (needed to randomize the configurations) of the problem, we show that the method can efficiently tackle problems on topologies with a large site-percolation threshold. We illustrate the ICANP2 heuristic on paradigmatic optimization problems, such as the satisfiability problem and the vertex cover problem.

  16. The ordered clustered travelling salesman problem: a hybrid genetic algorithm.

    PubMed

    Ahmed, Zakir Hussain

    2014-01-01

    The ordered clustered travelling salesman problem is a variation of the usual travelling salesman problem in which a set of vertices (except the starting vertex) of the network is divided into some prespecified clusters. The objective is to find the least cost Hamiltonian tour in which vertices of any cluster are visited contiguously and the clusters are visited in the prespecified order. The problem is NP-hard, and it arises in practical transportation and sequencing problems. This paper develops a hybrid genetic algorithm using sequential constructive crossover, 2-opt search, and a local search for obtaining heuristic solution to the problem. The efficiency of the algorithm has been examined against two existing algorithms for some asymmetric and symmetric TSPLIB instances of various sizes. The computational results show that the proposed algorithm is very effective in terms of solution quality and computational time. Finally, we present solution to some more symmetric TSPLIB instances. PMID:24701148

  17. Solving global optimization problems on GPU cluster

    NASA Astrophysics Data System (ADS)

    Barkalov, Konstantin; Gergel, Victor; Lebedev, Ilya

    2016-06-01

    The paper contains the results of an investigation of a parallel global optimization algorithm combined with a dimension reduction scheme. This allows solving multidimensional problems by reducing them to data-independent subproblems of smaller dimension that are solved in parallel. The new element implemented in this research consists of using several graphics accelerators at different computing nodes. The paper also includes results of solving problems of the well-known multiextremal test class GKLS on the Lobachevsky supercomputer using tens of thousands of GPU cores.

  18. Existence of a Non-Averaging Regime for the Self-Avoiding Walk on a High-Dimensional Infinite Percolation Cluster

    NASA Astrophysics Data System (ADS)

    Lacoin, Hubert

    2014-03-01

    Consider the number of self-avoiding paths of a given length starting from the origin on the infinite cluster obtained after performing Bernoulli percolation on the lattice with a fixed parameter. The object of this paper is to study the connective constant of the dilute lattice, which is a non-random quantity. We want to investigate whether the inequality between the quenched and annealed connective constants obtained with the Borel-Cantelli Lemma is strict or not. In other words, we want to know if the quenched and annealed versions of the connective constant are equal. On a heuristic level, this indicates whether or not localization of the trajectories occurs. We prove that when the dimension is sufficiently large there exists a range of percolation parameters for which the inequality is strict.

  19. CARE: Finding Local Linear Correlations in High Dimensional Data

    PubMed Central

    Zhang, Xiang; Pan, Feng; Wang, Wei

    2010-01-01

    Finding latent patterns in high dimensional data is an important research problem with numerous applications. Existing approaches can be summarized into three categories: feature selection, feature transformation (or feature projection) and projected clustering. Being widely used in many applications, these methods aim to capture global patterns and are typically performed in the full feature space. In many emerging biomedical applications, however, scientists are interested in the local latent patterns held by feature subsets, which may be invisible via any global transformation. In this paper, we investigate the problem of finding local linear correlations in high dimensional data. Our goal is to find the latent pattern structures that may exist only in some subspaces. We formalize this problem as finding strongly correlated feature subsets which are supported by a large portion of the data points. Due to the combinatorial nature of the problem and the lack of monotonicity of the correlation measurement, it is prohibitively expensive to exhaustively explore the whole search space. In our algorithm, CARE, we utilize spectrum properties and an effective heuristic to prune the search space. Extensive experimental results show that our approach is effective in finding local linear correlations that may not be identified by existing methods. PMID:20419037

  20. An agglomerative hierarchical clustering approach to visualisation in Bayesian clustering problems

    PubMed Central

    Dawson, Kevin J.; Belkhir, Khalid

    2009-01-01

    Clustering problems (including the clustering of individuals into outcrossing populations, hybrid generations, full-sib families and selfing lines) have recently received much attention in population genetics. In these clustering problems, the parameter of interest is a partition of the set of sampled individuals, the sample partition. In a fully Bayesian approach to clustering problems of this type, our knowledge about the sample partition is represented by a probability distribution on the space of possible sample partitions. Since the number of possible partitions grows very rapidly with the sample size, we cannot visualise this probability distribution in its entirety, unless the sample is very small. As a solution to this visualisation problem, we recommend using an agglomerative hierarchical clustering algorithm, which we call the exact linkage algorithm. This algorithm is a special case of the maximin clustering algorithm that we introduced previously. The exact linkage algorithm is now implemented in our software package Partition View. The exact linkage algorithm takes the posterior co-assignment probabilities as input, and yields as output a rooted binary tree or, more generally, a forest of such trees. Each node of this forest defines a set of individuals, and the node height is the posterior co-assignment probability of this set. This provides a useful visual representation of the uncertainty associated with the assignment of individuals to categories. It is also a useful starting point for a more detailed exploration of the posterior distribution in terms of the co-assignment probabilities. PMID:19337306
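
    The sketch below illustrates one plausible reading of a "maximin"-style agglomeration driven by pairwise posterior co-assignment probabilities: at each step the pair of current clusters whose weakest cross-pair co-assignment probability is largest is merged, and that value is recorded as the node height. This is an illustration under assumptions, not a faithful reimplementation of the exact linkage algorithm or Partition View.

```python
# Agglomerative merging driven by a matrix P of posterior co-assignment
# probabilities (higher = more likely to belong together).
import numpy as np

def maximin_agglomerate(P):
    """P: symmetric (n, n) matrix of posterior co-assignment probabilities."""
    clusters = {i: [i] for i in range(len(P))}
    merges = []                                   # (members_a, members_b, height)
    while len(clusters) > 1:
        best = None
        keys = list(clusters)
        for a_i, a in enumerate(keys):
            for b in keys[a_i + 1:]:
                # weakest link between the two candidate clusters
                h = min(P[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or h > best[2]:
                    best = (a, b, h)
        a, b, h = best
        merges.append((clusters[a][:], clusters[b][:], h))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges

P = np.array([[1.0, 0.9, 0.2, 0.1],
              [0.9, 1.0, 0.3, 0.2],
              [0.2, 0.3, 1.0, 0.8],
              [0.1, 0.2, 0.8, 1.0]])
tree = maximin_agglomerate(P)
```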

  1. The Heterogeneous P-Median Problem for Categorization Based Clustering

    ERIC Educational Resources Information Center

    Blanchard, Simon J.; Aloise, Daniel; DeSarbo, Wayne S.

    2012-01-01

    The p-median offers an alternative to centroid-based clustering algorithms for identifying unobserved categories. However, existing p-median formulations typically require data aggregation into a single proximity matrix, resulting in masked respondent heterogeneity. A proposed three-way formulation of the p-median problem explicitly considers…

  2. Automated High-Dimensional Flow Cytometric Data Analysis

    NASA Astrophysics Data System (ADS)

    Pyne, Saumyadipta; Hu, Xinli; Wang, Kui; Rossin, Elizabeth; Lin, Tsung-I.; Maier, Lisa; Baecher-Allan, Clare; McLachlan, Geoffrey; Tamayo, Pablo; Hafler, David; de Jager, Philip; Mesirov, Jill

    Flow cytometry is widely used for single cell interrogation of surface and intracellular protein expression by measuring fluorescence intensity of fluorophore-conjugated reagents. We focus on the recently developed procedure of Pyne et al. (2009, Proceedings of the National Academy of Sciences USA 106, 8519-8524) for automated high-dimensional flow cytometric analysis called FLAME (FLow analysis with Automated Multivariate Estimation). It introduced novel finite mixture models of heavy-tailed and asymmetric distributions to identify and model cell populations in a flow cytometric sample. This approach robustly addresses the complexities of flow data without the need for transformation or projection to lower dimensions. It also addresses the critical task of matching cell populations across samples, which enables downstream analysis. It thus facilitates the application of flow cytometry to new biological and clinical problems. To facilitate pipelining with standard bioinformatic applications such as high-dimensional visualization, subject classification or outcome prediction, FLAME has been incorporated into the GenePattern package of the Broad Institute. Thereby, analysis of flow data can be approached similarly to other genomic platforms. We also consider some new work that proposes a rigorous and robust solution to the registration problem by a multi-level approach that allows us to model and register cell populations simultaneously across a cohort of high-dimensional flow samples. This new approach is called JCM (Joint Clustering and Matching). It enables direct and rigorous comparisons across different time points or phenotypes in a complex biological study as well as classification of new patient samples in a more clinical setting.

  3. Optimization of the K-means algorithm for the solution of high dimensional instances

    NASA Astrophysics Data System (ADS)

    Pérez, Joaquín; Pazos, Rodolfo; Olivares, Víctor; Hidalgo, Miguel; Ruiz, Jorge; Martínez, Alicia; Almanza, Nelva; González, Moisés

    2016-06-01

    This paper addresses the problem of clustering instances with a high number of dimensions. In particular, a new heuristic for reducing the complexity of the K-means algorithm is proposed. Traditionally, there are two approaches that deal with the clustering of instances with high dimensionality. The first executes a preprocessing step to remove those attributes of limited importance. The second, called divide and conquer, creates subsets that are clustered separately and later integrates their results through post-processing. In contrast, this paper proposes a new solution which consists of reducing the number of distance calculations from the objects to the centroids at the classification step. This heuristic is derived from visual observation of the clustering process of K-means, in which it was found that objects can only migrate to adjacent clusters without crossing distant clusters. Therefore, this heuristic can significantly reduce the number of distance calculations from an object to the centroids of the potential clusters that it may be classified to. To validate the proposed heuristic, a set of experiments with synthetic high-dimensional instances was designed. One of the most notable results was obtained for an instance of 25,000 objects and 200 dimensions, where the execution time was reduced by up to 96.5% and the quality of the solution decreased by only 0.24% when compared to the K-means algorithm.
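
    An illustrative sketch of cutting distance calculations in K-means by comparing each object only against its current centroid and a few "adjacent" (nearest) centroids. The adjacency rule, neighbour count and toy data are assumptions; this is a simplified heuristic in the spirit of the abstract, not the paper's exact procedure.

```python
# K-means variant that restricts per-object distance calculations to the
# centroids adjacent to the object's current cluster.
import numpy as np

def reduced_kmeans(X, k=8, n_neighbours=3, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.argmin(np.linalg.norm(X[:, None] - C[None], axis=2), axis=1)
    for _ in range(n_iter):
        # centroid-to-centroid distances define the "adjacent" clusters
        cc = np.linalg.norm(C[:, None] - C[None], axis=2)
        neigh = np.argsort(cc, axis=1)[:, : n_neighbours + 1]   # self + neighbours
        for i in range(len(X)):
            cand = neigh[labels[i]]                             # candidate centroids only
            d = np.linalg.norm(C[cand] - X[i], axis=1)
            labels[i] = cand[int(np.argmin(d))]
        for c in range(k):                                      # recompute centroids
            if np.any(labels == c):
                C[c] = X[labels == c].mean(axis=0)
    return labels, C

labels, C = reduced_kmeans(np.random.default_rng(6).normal(size=(1000, 200)))
```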

  4. Manifold learning to interpret JET high-dimensional operational space

    NASA Astrophysics Data System (ADS)

    Cannas, B.; Fanni, A.; Murari, A.; Pau, A.; Sias, G.; JET EFDA Contributors, the

    2013-04-01

    In this paper, the problem of visualization and exploration of the JET high-dimensional operational space is considered. The data come from plasma discharges selected from JET campaigns from C15 (year 2005) up to C27 (year 2009). The aim is to learn the possible manifold structure embedded in the data and to create representations of the plasma parameters on low-dimensional maps, which are understandable and which preserve the essential properties of the original data. A crucial issue for the design of such mappings is the quality of the dataset. This paper reports the details of the criteria used to properly select suitable signals downloaded from JET databases in order to obtain a dataset of reliable observations. Moreover, a statistical analysis is performed to recognize the presence of outliers. Finally, data reduction, based on clustering methods, is performed to select a limited and representative number of samples for the operational space mapping. The high-dimensional operational space of JET is mapped using a widely used manifold learning method, the self-organizing map. The results are compared with other data visualization methods. The obtained maps can be used to identify characteristic regions of the plasma scenario, making it possible to discriminate between regions with high risk of disruption and those with low risk of disruption.

  5. Scalable Nearest Neighbor Algorithms for High Dimensional Data.

    PubMed

    Muja, Marius; Lowe, David G

    2014-11-01

    For many computer vision and machine learning problems, large training sets are key for good performance. However, the most computationally expensive part of many computer vision and machine learning algorithms consists of finding nearest neighbor matches to high dimensional vectors that represent the training data. We propose new algorithms for approximate nearest neighbor matching and evaluate and compare them with previous algorithms. For matching high dimensional features, we find two algorithms to be the most efficient: the randomized k-d forest and a new algorithm proposed in this paper, the priority search k-means tree. We also propose a new algorithm for matching binary features by searching multiple hierarchical clustering trees and show it outperforms methods typically used in the literature. We show that the optimal nearest neighbor algorithm and its parameters depend on the data set characteristics and describe an automated configuration procedure for finding the best algorithm to search a particular data set. In order to scale to very large data sets that would otherwise not fit in the memory of a single machine, we propose a distributed nearest neighbor matching framework that can be used with any of the algorithms described in the paper. All this research has been released as an open source library called fast library for approximate nearest neighbors (FLANN), which has been incorporated into OpenCV and is now one of the most popular libraries for nearest neighbor matching. PMID:26353063
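
    Since FLANN ships with OpenCV, a short usage sketch may help make the abstract concrete. The snippet below uses OpenCV's FLANN-based matcher with a randomized k-d forest index and Lowe's ratio test; the descriptor arrays are random stand-ins for real SIFT features, and the parameter values are illustrative defaults, not recommendations from the paper.

        # Usage sketch of the FLANN bindings shipped with OpenCV: approximate
        # nearest-neighbor matching with a randomized k-d forest, then a ratio test.
        # Random float descriptors stand in for real SIFT features.
        import numpy as np
        import cv2

        rng = np.random.default_rng(0)
        des1 = rng.random((500, 128), dtype=np.float32)   # descriptors from image 1
        des2 = rng.random((800, 128), dtype=np.float32)   # descriptors from image 2

        FLANN_INDEX_KDTREE = 1
        index_params = dict(algorithm=FLANN_INDEX_KDTREE, trees=5)  # randomized k-d forest
        search_params = dict(checks=50)                             # leaves visited per query

        flann = cv2.FlannBasedMatcher(index_params, search_params)
        matches = flann.knnMatch(des1, des2, k=2)

        # Keep matches whose best distance is clearly smaller than the second best.
        good = []
        for pair in matches:
            if len(pair) == 2 and pair[0].distance < 0.7 * pair[1].distance:
                good.append(pair[0])
        print(f"{len(good)} / {len(matches)} matches pass the ratio test")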

  6. Information technology of clustering problem situations in computing and office equipment

    NASA Astrophysics Data System (ADS)

    Savchuk, T. O.; Petrishyn, S. I.; Kisała, Piotr; Imanbek, Baglan; Smailova, Saule

    2015-12-01

    The article describes an information technology for clustering problem situations in computing and office equipment, based on an information model of clustering and on modified FOREL and K-MEANS clustering methods for such situations.

  7. Distributed Computation of the knn Graph for Large High-Dimensional Point Sets.

    PubMed

    Plaku, Erion; Kavraki, Lydia E

    2007-03-01

    High-dimensional problems arising from robot motion planning, biology, data mining, and geographic information systems often require the computation of k nearest neighbor (knn) graphs. The knn graph of a data set is obtained by connecting each point to its k closest points. As the research in the above-mentioned fields progressively addresses problems of unprecedented complexity, the demand for computing knn graphs based on arbitrary distance metrics and large high-dimensional data sets increases, exceeding resources available to a single machine. In this work we efficiently distribute the computation of knn graphs for clusters of processors with message passing. Extensions to our distributed framework include the computation of graphs based on other proximity queries, such as approximate knn or range queries. Our experiments show nearly linear speedup with over one hundred processors and indicate that similar speedup can be obtained with several hundred processors. PMID:19847318
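
    The object being computed here, the knn graph, is easy to state concretely. The sketch below is a single-machine illustration with scikit-learn; the paper's contribution is distributing this work over many processors with message passing, which is only hinted at by splitting the query points into chunks to show the data-parallel structure such a distribution exploits.

        # Single-machine sketch of building a knn graph with scikit-learn. The
        # query points are split into chunks only to show the data-parallel
        # structure that a distributed, message-passing implementation exploits.
        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        rng = np.random.default_rng(0)
        X = rng.normal(size=(20000, 60))          # large high-dimensional point set
        k = 10

        nn = NearestNeighbors(n_neighbors=k + 1, algorithm='auto').fit(X)

        edges = []
        for chunk in np.array_split(np.arange(len(X)), 8):   # 8 "workers" in this sketch
            dist, idx = nn.kneighbors(X[chunk])
            # Drop the self-neighbor in column 0 and record directed knn edges.
            for row, (d, nbrs) in zip(chunk, zip(dist[:, 1:], idx[:, 1:])):
                edges.extend((row, j, dj) for j, dj in zip(nbrs, d))

        print("edges in knn graph:", len(edges))   # == 20000 * k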

  8. Distributed Computation of the knn Graph for Large High-Dimensional Point Sets

    PubMed Central

    Plaku, Erion; Kavraki, Lydia E.

    2009-01-01

    High-dimensional problems arising from robot motion planning, biology, data mining, and geographic information systems often require the computation of k nearest neighbor (knn) graphs. The knn graph of a data set is obtained by connecting each point to its k closest points. As the research in the above-mentioned fields progressively addresses problems of unprecedented complexity, the demand for computing knn graphs based on arbitrary distance metrics and large high-dimensional data sets increases, exceeding resources available to a single machine. In this work we efficiently distribute the computation of knn graphs for clusters of processors with message passing. Extensions to our distributed framework include the computation of graphs based on other proximity queries, such as approximate knn or range queries. Our experiments show nearly linear speedup with over one hundred processors and indicate that similar speedup can be obtained with several hundred processors. PMID:19847318

  9. Analysis of data separation and recovery problems using clustered sparsity

    NASA Astrophysics Data System (ADS)

    King, Emily J.; Kutyniok, Gitta; Zhuang, Xiaosheng

    2011-09-01

    Data often have two or more fundamental components, like cartoon-like and textured elements in images; point, filament, and sheet clusters in astronomical data; and tonal and transient layers in audio signals. For many applications, separating these components is of interest. Another issue in data analysis is that of incomplete data, for example a photograph with scratches or seismic data collected with fewer than the necessary sensors. A unified approach to solving these problems is to minimize the l1 norm of the analysis coefficients with respect to particular frame(s). This approach, based on the concept of clustered sparsity, leads to similar theoretical bounds and results, which are presented here. Furthermore, necessary conditions for the frames to lead to sufficiently good solutions are also shown.
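
    A toy version of the l1 separation idea may clarify the setup: a 1-D signal made of a few spikes (sparse in the identity basis) plus a few smooth cosines (sparse in the DCT basis) is split by penalizing the l1 norm of coefficients over the concatenated dictionary. The sketch below uses a Lasso solver as a stand-in for the exact l1 program and synthetic data; it illustrates the principle only, not the frame-theoretic analysis of the paper.

        # Toy sketch of l1-based component separation over a concatenated
        # spike/DCT dictionary. A Lasso solver stands in for the exact l1 program.
        import numpy as np
        from scipy.fft import idct
        from sklearn.linear_model import Lasso

        n = 256
        rng = np.random.default_rng(0)

        D_spike = np.eye(n)                                # point-like component
        D_dct = idct(np.eye(n), axis=0, norm='ortho')      # smooth, texture-like component

        spikes = np.zeros(n); spikes[rng.choice(n, 5, replace=False)] = rng.normal(0, 3, 5)
        coeffs = np.zeros(n); coeffs[[3, 7, 12]] = [2.0, -1.5, 1.0]
        y = spikes + D_dct @ coeffs                        # observed mixed signal

        D = np.hstack([D_spike, D_dct])                    # concatenated dictionary
        lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=50000).fit(D, y)
        a = lasso.coef_

        y_spikes = D_spike @ a[:n]                         # recovered point component
        y_smooth = D_dct @ a[n:]                           # recovered smooth component
        print("spike recovery error:", np.linalg.norm(y_spikes - spikes).round(3))
        print("smooth recovery error:", np.linalg.norm(y_smooth - (y - spikes)).round(3))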

  10. High dimensional feature reduction via projection pursuit

    NASA Technical Reports Server (NTRS)

    Jimenez, Luis; Landgrebe, David

    1994-01-01

    The recent development of more sophisticated remote sensing systems enables the measurement of radiation in many more spectral intervals than previously possible. An example of that technology is the AVIRIS system, which collects image data in 220 bands. As a result, new algorithms must be developed in order to analyze the more complex data effectively. Data in a high dimensional space present a substantial challenge, since intuitive concepts valid in a 2-3 dimensional space do not necessarily apply in higher dimensional spaces. For example, high dimensional space is mostly empty; this results from the concentration of data in the corners of hypercubes. Other examples may be cited. Such observations suggest the need to project data to a subspace of much lower dimension on a problem-specific basis, in such a manner that information is not lost. Projection Pursuit is a technique that will accomplish such a goal. Since it processes data in lower dimensions, it should avoid many of the difficulties of high dimensional spaces. In this paper, we begin the investigation of some of the properties of Projection Pursuit for this purpose.
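
    As a crude, hedged illustration of the projection-pursuit idea (not the method developed in the paper): among many random unit directions, keep the one whose 1-D projection maximizes a simple "interestingness" index, here absolute excess kurtosis as a measure of departure from Gaussianity. Practical projection pursuit optimizes richer indices numerically; a random search like this is only viable in modest dimensions.

        # Crude projection pursuit: search random unit directions for the 1-D
        # projection that maximizes |excess kurtosis| (zero for a Gaussian).
        import numpy as np
        from scipy.stats import kurtosis

        rng = np.random.default_rng(0)
        # 10-dimensional data: Gaussian noise with two-cluster structure hidden in
        # the first coordinate. (Random search scales poorly to high dimensions;
        # real projection pursuit optimizes the index with numerical methods.)
        n, p = 1000, 10
        X = rng.normal(size=(n, p))
        X[:, 0] += np.where(rng.random(n) < 0.5, -4.0, 4.0)
        X = X - X.mean(axis=0)

        best_w, best_index = None, -np.inf
        for _ in range(5000):
            w = rng.normal(size=p)
            w /= np.linalg.norm(w)
            index = abs(kurtosis(X @ w))       # excess kurtosis of the projection
            if index > best_index:
                best_w, best_index = w, index

        print("index of best direction:", round(best_index, 3))
        print("weight on the informative coordinate:", round(abs(best_w[0]), 3))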

  11. Statistical Physics of High Dimensional Inference

    NASA Astrophysics Data System (ADS)

    Advani, Madhu; Ganguli, Surya

    To model modern large-scale datasets, we need efficient algorithms to infer a set of P unknown model parameters from N noisy measurements. What are fundamental limits on the accuracy of parameter inference, given limited measurements, signal-to-noise ratios, prior information, and computational tractability requirements? How can we combine prior information with measurements to achieve these limits? Classical statistics gives incisive answers to these questions as the measurement density α = N/P → ∞. However, modern high-dimensional inference problems, in fields ranging from bio-informatics to economics, occur at finite α. We formulate and analyze high-dimensional inference analytically by applying the replica and cavity methods of statistical physics where data serves as quenched disorder and inferred parameters play the role of thermal degrees of freedom. Our analysis reveals that widely cherished Bayesian inference algorithms such as maximum likelihood and maximum a posteriori are suboptimal in the modern setting, and yields new tractable, optimal algorithms to replace them as well as novel bounds on the achievable accuracy of a large class of high-dimensional inference algorithms. Thanks to Stanford Graduate Fellowship and Mind Brain Computation IGERT grant for support.

  12. Problem decomposition by mutual information and force-based clustering

    NASA Astrophysics Data System (ADS)

    Otero, Richard Edward

    The scale of engineering problems has sharply increased over the last twenty years. Larger coupled systems, increasing complexity, and limited resources create a need for methods that automatically decompose problems into manageable sub-problems by discovering and leveraging problem structure. The ability to learn the coupling (inter-dependence) structure and reorganize the original problem could lead to large reductions in the time to analyze complex problems. Such decomposition methods could also provide engineering insight on the fundamental physics driving problem solution. This work forwards the current state of the art in engineering decomposition through the application of techniques originally developed within computer science and information theory. The work describes the current state of automatic problem decomposition in engineering and utilizes several promising ideas to advance the state of the practice. Mutual information is a novel metric for data dependence and works on both continuous and discrete data. Mutual information can measure both the linear and non-linear dependence between variables without the limitations of linear dependence measured through covariance. Mutual information is also able to handle data that does not have derivative information, unlike other metrics that require it. The value of mutual information to engineering design work is demonstrated on a planetary entry problem. This study utilizes a novel tool developed in this work for planetary entry system synthesis. A graphical method, force-based clustering, is used to discover related sub-graph structure as a function of problem structure and links ranked by their mutual information. This method does not require the stochastic use of neural networks and could be used with any link ranking method currently utilized in the field. Application of this method is demonstrated on a large, coupled low-thrust trajectory problem. Mutual information also serves as the basis for an
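
    The central property claimed for mutual information, that it captures nonlinear dependence which covariance misses, is easy to demonstrate. The short example below uses scikit-learn's estimator on a response that depends on one variable only through its square; the variables and data are purely illustrative.

        # Mutual information detects nonlinear dependence that covariance misses.
        # y depends on x1 only through x1**2, so cov(x1, y) is ~0, yet the mutual
        # information estimate is clearly positive; x2 is pure noise for contrast.
        import numpy as np
        from sklearn.feature_selection import mutual_info_regression

        rng = np.random.default_rng(0)
        n = 5000
        x1 = rng.normal(size=n)
        x2 = rng.normal(size=n)                       # irrelevant variable
        y = x1 ** 2 + 0.1 * rng.normal(size=n)        # nonlinear coupling to x1 only

        X = np.column_stack([x1, x2])
        mi = mutual_info_regression(X, y, random_state=0)

        print("covariance(x1, y):", round(float(np.cov(x1, y)[0, 1]), 3))   # ~0
        print("MI(x1; y), MI(x2; y):", np.round(mi, 3))                     # first >> second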

  13. Application of clustering global optimization to thin film design problems.

    PubMed

    Lemarchand, Fabien

    2014-03-10

    Refinement techniques usually calculate an optimized local solution, which is strongly dependent on the initial formula used for the thin film design. In the present study, a clustering global optimization method is used which can iteratively change this initial formula, thereby progressing further than in the case of local optimization techniques. A wide panel of local solutions is found using this procedure, resulting in a large range of optical thicknesses. The efficiency of this technique is illustrated by two thin film design problems, in particular an infrared antireflection coating, and a solar-selective absorber coating. PMID:24663856

  14. Statistical challenges of high-dimensional data

    PubMed Central

    Johnstone, Iain M.; Titterington, D. Michael

    2009-01-01

    Modern applications of statistical theory and methods can involve extremely large datasets, often with huge numbers of measurements on each of a comparatively small number of experimental units. New methodology and accompanying theory have emerged in response: the goal of this Theme Issue is to illustrate a number of these recent developments. This overview article introduces the difficulties that arise with high-dimensional data in the context of the very familiar linear statistical model: we give a taste of what can nevertheless be achieved when the parameter vector of interest is sparse, that is, contains many zero elements. We describe other ways of identifying low-dimensional subspaces of the data space that contain all useful information. The topic of classification is then reviewed along with the problem of identifying, from within a very large set, the variables that help to classify observations. Brief mention is made of the visualization of high-dimensional data and ways to handle computational problems in Bayesian analysis are described. At appropriate points, reference is made to the other papers in the issue. PMID:19805443

  15. Visual Exploration of High Dimensional Scalar Functions

    PubMed Central

    Gerber, Samuel; Bremer, Peer-Timo; Pascucci, Valerio; Whitaker, Ross

    2011-01-01

    An important goal of scientific data analysis is to understand the behavior of a system or process based on a sample of the system. In many instances it is possible to observe both input parameters and system outputs, and to characterize the system as a high-dimensional function. Such data sets arise, for instance, in large numerical simulations, as energy landscapes in optimization problems, or in the analysis of image data relating to biological or medical parameters. This paper proposes an approach to analyzing and visualizing such data sets. The proposed method combines topological and geometric techniques to provide interactive visualizations of discretely sampled high-dimensional scalar fields. The method relies on a segmentation of the parameter space using an approximate Morse-Smale complex on the cloud of point samples. For each crystal of the Morse-Smale complex, a regression of the system parameters with respect to the output yields a curve in the parameter space. The result is a simplified geometric representation of the Morse-Smale complex in the high dimensional input domain. Finally, the geometric representation is embedded in 2D, using dimension reduction, to provide a visualization platform. The geometric properties of the regression curves enable the visualization of additional information about each crystal such as local and global shape, width, length, and sampling densities. The method is illustrated on several synthetic examples of two dimensional functions. Two use cases, using data sets from the UCI machine learning repository, demonstrate the utility of the proposed approach on real data. Finally, in collaboration with domain experts, the proposed method is applied to two scientific challenges: the analysis of climate-simulation parameters and their relationship to predicted global energy flux, and the analysis of chemical species concentrations in a combustion simulation and their integration with temperature. PMID:20975167

  16. A facility for using cluster research to study environmental problems. Workshop proceedings

    SciTech Connect

    Not Available

    1991-11-01

    This report begins by describing the general application of cluster based research to environmental chemistry and the development of a Cluster Structure and Dynamics Research Facility (CSDRF). Next, four important areas of cluster research are described in more detail, including how they can impact environmental problems. These are: surface-supported clusters, water and contaminant interactions, time-resolved dynamic studies in clusters, and cluster structures and reactions. These facilities and equipment required for each area of research are then presented. The appendices contain workshop agenda and a listing of the researchers who participated in the workshop discussions that led to this report.

  17. Six clustering algorithms applied to the WAIS-R: the problem of dissimilar cluster results.

    PubMed

    Fraboni, M; Cooper, D

    1989-11-01

    Clusterings of the Wechsler Adult Intelligence Scale-Revised subtests were obtained from the application of six hierarchical clustering methods (N = 113). These sets of clusters were compared for similarities using the Rand index. The calculated indices suggested similarities of cluster group membership between the Complete Linkage and Centroid methods; Complete Linkage and Ward's methods; Centroid and Ward's methods; and Single Linkage and Average Linkage Between Groups methods. Cautious use of single clustering methods is implied, though the authors suggest some advantages of knowing specific similarities and differences. If between-method comparisons consistently reveal similar cluster membership, a choice could be made from those algorithms that tend to produce similar partitions, thereby enhancing cluster interpretation. PMID:2613904
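
    The between-method comparison described here is straightforward to reproduce in outline. The sketch below clusters the same data with several hierarchical linkage methods and compares the resulting partitions pairwise with the adjusted Rand index; synthetic scores stand in for the WAIS-R subtest data, and the choice of three clusters is arbitrary.

        # Sketch of comparing partitions from different hierarchical clustering
        # methods with the adjusted Rand index. Synthetic scores stand in for data.
        import numpy as np
        from itertools import combinations
        from scipy.cluster.hierarchy import linkage, fcluster
        from sklearn.metrics import adjusted_rand_score

        rng = np.random.default_rng(0)
        X = rng.normal(size=(113, 11))          # 113 cases x 11 subtest scores (synthetic)
        k = 3                                   # number of clusters to extract

        methods = ['single', 'complete', 'average', 'centroid', 'ward']
        labels = {m: fcluster(linkage(X, method=m), t=k, criterion='maxclust')
                  for m in methods}

        for a, b in combinations(methods, 2):
            ari = adjusted_rand_score(labels[a], labels[b])
            print(f"{a:>8} vs {b:<8}: adjusted Rand = {ari:.2f}")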

  18. Random rotation survival forest for high dimensional censored data.

    PubMed

    Zhou, Lifeng; Wang, Hong; Xu, Qingsong

    2016-01-01

    Recently, rotation forest has been extended to regression and survival analysis problems. However, due to the intensive computation incurred by principal component analysis, rotation forest often fails when high-dimensional or big data are confronted. In this study, we extend rotation forest to high dimensional censored time-to-event data analysis by combining random subspaces, bagging and rotation forest. Supported by proper statistical analysis, we show that the proposed method, random rotation survival forest, outperforms state-of-the-art survival ensembles such as random survival forest and popular regularized Cox models. PMID:27625979

  19. An approximation polynomial-time algorithm for a sequence bi-clustering problem

    NASA Astrophysics Data System (ADS)

    Kel'manov, A. V.; Khamidullin, S. A.

    2015-06-01

    We consider a strongly NP-hard problem of partitioning a finite sequence of vectors in Euclidean space into two clusters using the criterion of the minimal sum of the squared distances from the elements of the clusters to the centers of the clusters. The center of one of the clusters is to be optimized and is determined as the mean value over all vectors in this cluster. The center of the other cluster is fixed at the origin. Moreover, the partition is such that the difference between the indices of two successive vectors in the first cluster is bounded above and below by prescribed constants. A 2-approximation polynomial-time algorithm is proposed for this problem.

  20. Optimal M-estimation in high-dimensional regression.

    PubMed

    Bean, Derek; Bickel, Peter J; El Karoui, Noureddine; Yu, Bin

    2013-09-01

    We consider, in the modern setting of high-dimensional statistics, the classic problem of optimizing the objective function in regression using M-estimates when the error distribution is assumed to be known. We propose an algorithm to compute this optimal objective function that takes into account the dimensionality of the problem. Although optimality is achieved under assumptions on the design matrix that will not always be satisfied, our analysis reveals generally interesting families of dimension-dependent objective functions. PMID:23954907

  1. Optimal M-estimation in high-dimensional regression

    PubMed Central

    Bean, Derek; Bickel, Peter J.; El Karoui, Noureddine; Yu, Bin

    2013-01-01

    We consider, in the modern setting of high-dimensional statistics, the classic problem of optimizing the objective function in regression using M-estimates when the error distribution is assumed to be known. We propose an algorithm to compute this optimal objective function that takes into account the dimensionality of the problem. Although optimality is achieved under assumptions on the design matrix that will not always be satisfied, our analysis reveals generally interesting families of dimension-dependent objective functions. PMID:23954907

  2. An Examination of Indexes for Determining the Number of Clusters in Binary Data Sets.

    ERIC Educational Resources Information Center

    Dimitriadou, Evgenia; Dolnicar, Sara; Weingessel, Andreas

    2002-01-01

    Explored the problem of choosing the correct number of clusters in cluster analysis of high dimensional empirical binary data. Findings from a simulation that included 162 binary data sets resulted in recommendations about the number of clusters for each index under consideration. Compared and analyzed the performance of index results. (SLD)

  3. Sparse High Dimensional Models in Economics

    PubMed Central

    Fan, Jianqing; Lv, Jinchi; Qi, Lei

    2010-01-01

    This paper reviews the literature on sparse high dimensional models and discusses some applications in economics and finance. Recent developments of theory, methods, and implementations in penalized least squares and penalized likelihood methods are highlighted. These variable selection methods are proved to be effective in high dimensional sparse modeling. The limits of dimensionality that regularization methods can handle, the role of penalty functions, and their statistical properties are detailed. Some recent advances in ultra-high dimensional sparse modeling are also briefly discussed. PMID:22022635
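
    A minimal example in the spirit of the penalized least squares methods reviewed here: with many more predictors than observations and only a few truly active coefficients, an l1 (Lasso) penalty chosen by cross-validation recovers a sparse model. The data and tuning are purely illustrative.

        # Penalized least squares with p >> n: LassoCV selects a sparse model.
        import numpy as np
        from sklearn.linear_model import LassoCV

        rng = np.random.default_rng(0)
        n, p, s = 100, 1000, 5                       # n samples, p predictors, s active
        X = rng.normal(size=(n, p))
        beta = np.zeros(p); beta[:s] = [3, -2, 1.5, 2.5, -1]
        y = X @ beta + rng.normal(scale=0.5, size=n)

        fit = LassoCV(cv=5, random_state=0).fit(X, y)
        selected = np.flatnonzero(fit.coef_)

        print("chosen penalty:", round(fit.alpha_, 4))
        print("selected variables:", selected[:10], "... (true actives are 0-4)")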

  4. Clusters of primordial black holes and reionization problem

    SciTech Connect

    Belotsky, K. M. Kirillov, A. A. Rubin, S. G.

    2015-05-15

    Clusters of primordial black holes may cause the formation of quasars in the early Universe. In turn, radiation from these quasars may lead to the reionization of the Universe. However, the evaporation of primordial black holes via Hawking's mechanism may also contribute to the ionization of matter. The possibility of matter ionization via the evaporation of primordial black holes with allowance for existing constraints on their density is discussed. The contribution to ionization from the evaporation of primordial black holes characterized by their preset mass spectrum can roughly be estimated at about 10⁻³.

  5. Numerical methods for high-dimensional probability density function equations

    NASA Astrophysics Data System (ADS)

    Cho, H.; Venturi, D.; Karniadakis, G. E.

    2016-01-01

    In this paper we address the problem of computing the numerical solution to kinetic partial differential equations involving many phase variables. These types of equations arise naturally in many different areas of mathematical physics, e.g., in particle systems (Liouville and Boltzmann equations), stochastic dynamical systems (Fokker-Planck and Dostupov-Pugachev equations), random wave theory (Malakhov-Saichev equations) and coarse-grained stochastic systems (Mori-Zwanzig equations). We propose three different classes of new algorithms addressing high-dimensionality: The first one is based on separated series expansions resulting in a sequence of low-dimensional problems that can be solved recursively and in parallel by using alternating direction methods. The second class of algorithms relies on truncation of interaction in low-orders that resembles the Bogoliubov-Born-Green-Kirkwood-Yvon (BBGKY) framework of kinetic gas theory and it yields a hierarchy of coupled probability density function equations. The third class of algorithms is based on high-dimensional model representations, e.g., the ANOVA method and probabilistic collocation methods. A common feature of all these approaches is that they are reducible to the problem of computing the solution to high-dimensional equations via a sequence of low-dimensional problems. The effectiveness of the new algorithms is demonstrated in numerical examples involving nonlinear stochastic dynamical systems and partial differential equations, with up to 120 variables.

  6. Feature extraction and classification algorithms for high dimensional data

    NASA Technical Reports Server (NTRS)

    Lee, Chulhee; Landgrebe, David

    1993-01-01

    Feature extraction and classification algorithms for high dimensional data are investigated. Developments with regard to sensors for Earth observation are moving in the direction of providing much higher dimensional multispectral imagery than is now possible. In analyzing such high dimensional data, processing time becomes an important factor. With large increases in dimensionality and the number of classes, processing time will increase significantly. To address this problem, a multistage classification scheme is proposed which reduces the processing time substantially by eliminating unlikely classes from further consideration at each stage. Several truncation criteria are developed and the relationship between thresholds and the error caused by the truncation is investigated. Next an approach to feature extraction for classification is proposed based directly on the decision boundaries. It is shown that all the features needed for classification can be extracted from decision boundaries. A characteristic of the proposed method arises by noting that only a portion of the decision boundary is effective in discriminating between classes, and the concept of the effective decision boundary is introduced. The proposed feature extraction algorithm has several desirable properties: it predicts the minimum number of features necessary to achieve the same classification accuracy as in the original space for a given pattern recognition problem; and it finds the necessary feature vectors. The proposed algorithm does not deteriorate under the circumstances of equal means or equal covariances as some previous algorithms do. In addition, the decision boundary feature extraction algorithm can be used both for parametric and non-parametric classifiers. Finally, some problems encountered in analyzing high dimensional data are studied and possible solutions are proposed. First, the increased importance of the second order statistics in analyzing high dimensional data is recognized

  7. An Extended Membrane System with Active Membranes to Solve Automatic Fuzzy Clustering Problems.

    PubMed

    Peng, Hong; Wang, Jun; Shi, Peng; Pérez-Jiménez, Mario J; Riscos-Núñez, Agustín

    2016-05-01

    This paper focuses on the automatic fuzzy clustering problem and proposes a novel automatic fuzzy clustering method that employs an extended membrane system with active membranes, designed as its computing framework. The extended membrane system has a dynamic membrane structure; since membranes can evolve, it is particularly suitable for processing the automatic fuzzy clustering problem. A modification of a differential evolution (DE) mechanism was developed as evolution rules for objects according to the membrane structure and object communication mechanisms. Under the control of both the object evolution-communication mechanism and the membrane evolution mechanism, the extended membrane system can effectively determine the most appropriate number of clusters as well as the corresponding optimal cluster centers. The proposed method was evaluated over 13 benchmark problems and was compared with four state-of-the-art automatic clustering methods, two recently developed clustering methods and six classification techniques. The comparison results demonstrate the superiority of the proposed method in terms of effectiveness and robustness. PMID:26790484
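
    For readers unfamiliar with the underlying objective, the snippet below is a minimal plain fuzzy c-means iteration for a fixed number of clusters, written with NumPy. It is only a baseline for orientation: the paper's membrane system with DE-based evolution rules also searches over the number of clusters, which this sketch does not attempt.

        # Minimal plain fuzzy c-means for a *fixed* number of clusters.
        import numpy as np

        def fuzzy_cmeans(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
            rng = np.random.default_rng(seed)
            U = rng.random((len(X), c))
            U /= U.sum(axis=1, keepdims=True)          # fuzzy memberships, rows sum to 1
            for _ in range(n_iter):
                Um = U ** m
                centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
                d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
                # Standard FCM membership update: u_ij proportional to d_ij^(-2/(m-1)).
                U_new = 1.0 / (d ** (2 / (m - 1)) *
                               np.sum(d ** (-2 / (m - 1)), axis=1, keepdims=True))
                if np.abs(U_new - U).max() < tol:
                    U = U_new
                    break
                U = U_new
            return centers, U

        rng = np.random.default_rng(1)
        X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(6, 1, (200, 2))])
        centers, U = fuzzy_cmeans(X, c=2)
        print("cluster centers:\n", centers.round(2))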

  8. Sparse representation approaches for the classification of high-dimensional biological data

    PubMed Central

    2013-01-01

    Background High-throughput genomic and proteomic data have important applications in medicine including prevention, diagnosis, treatment, and prognosis of diseases, and molecular biology, for example pathway identification. Many of such applications can be formulated to classification and dimension reduction problems in machine learning. There are computationally challenging issues with regards to accurately classifying such data, and which due to dimensionality, noise and redundancy, to name a few. The principle of sparse representation has been applied to analyzing high-dimensional biological data within the frameworks of clustering, classification, and dimension reduction approaches. However, the existing sparse representation methods are inefficient. The kernel extensions are not well addressed either. Moreover, the sparse representation techniques have not been comprehensively studied yet in bioinformatics. Results In this paper, a Bayesian treatment is presented on sparse representations. Various sparse coding and dictionary learning models are discussed. We propose fast parallel active-set optimization algorithm for each model. Kernel versions are devised based on their dimension-free property. These models are applied for classifying high-dimensional biological data. Conclusions In our experiment, we compared our models with other methods on both accuracy and computing time. It is shown that our models can achieve satisfactory accuracy, and their performance are very efficient. PMID:24565287
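
    A compact sketch of the generic sparse-representation classification (SRC) principle that underlies such methods: code a test sample sparsely over the matrix of training samples and assign the class whose atoms explain it with the smallest residual. The example below uses orthogonal matching pursuit on synthetic data; it does not reproduce the Bayesian or kernel models proposed in the paper, and all sizes and thresholds are illustrative.

        # Sparse-representation classification (SRC) sketch with OMP: classify by
        # the class whose training atoms give the smallest reconstruction residual.
        import numpy as np
        from sklearn.linear_model import OrthogonalMatchingPursuit

        rng = np.random.default_rng(0)
        d, n_per_class = 200, 40                       # feature dimension, samples/class
        means = [np.zeros(d), np.r_[np.ones(20), np.zeros(d - 20)] * 2.0]
        train = np.vstack([rng.normal(mu, 1.0, (n_per_class, d)) for mu in means])
        train_y = np.repeat([0, 1], n_per_class)
        D = (train / np.linalg.norm(train, axis=1, keepdims=True)).T   # d x n dictionary

        def src_predict(x, n_nonzero=10):
            omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_nonzero,
                                            fit_intercept=False).fit(D, x)
            a = omp.coef_
            residuals = []
            for c in (0, 1):
                a_c = np.where(train_y == c, a, 0.0)   # keep only class-c coefficients
                residuals.append(np.linalg.norm(x - D @ a_c))
            return int(np.argmin(residuals))

        test = rng.normal(means[1], 1.0, size=d)       # a sample drawn from class 1
        print("predicted class:", src_predict(test))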

  9. Problem-Solving Environments (PSEs) to Support Innovation Clustering

    NASA Technical Reports Server (NTRS)

    Gill, Zann

    1999-01-01

    This paper argues that there is need for high level concepts to inform the development of Problem-Solving Environment (PSE) capability. A traditional approach to PSE implementation is to: (1) assemble a collection of tools; (2) integrate the tools; and (3) assume that collaborative work begins after the PSE is assembled. I argue for the need to start from the opposite premise, that promoting human collaboration and observing that process comes first, followed by the development of supporting tools, and finally evolution of PSE capability through input from collaborating project teams.

  10. Identifying the number of population clusters with structure: problems and solutions.

    PubMed

    Gilbert, Kimberly J

    2016-05-01

    The program structure has been used extensively to understand and visualize population genetic structure. It is one of the most commonly used clustering algorithms, cited over 11 500 times in Web of Science since its introduction in 2000. The method estimates ancestry proportions to assign individuals to clusters, and post hoc analyses of results may indicate the most likely number of clusters, or populations, on the landscape. However, as has been shown in this issue of Molecular Ecology Resources by Puechmaille (), when sampling is uneven across populations or across hierarchical levels of population structure, these post hoc analyses can be inaccurate and identify an incorrect number of population clusters. To solve this problem, Puechmaille () presents strategies for subsampling and new analysis methods that are robust to uneven sampling to improve inferences of the number of population clusters. PMID:27062588

  11. Automated high-dimensional flow cytometric data analysis

    PubMed Central

    Pyne, Saumyadipta; Hu, Xinli; Wang, Kui; Rossin, Elizabeth; Lin, Tsung-I; Maier, Lisa M.; Baecher-Allan, Clare; McLachlan, Geoffrey J.; Tamayo, Pablo; Hafler, David A.; De Jager, Philip L.; Mesirov, Jill P.

    2009-01-01

    Flow cytometric analysis allows rapid single cell interrogation of surface and intracellular determinants by measuring fluorescence intensity of fluorophore-conjugated reagents. The availability of new platforms, allowing detection of increasing numbers of cell surface markers, has challenged the traditional technique of identifying cell populations by manual gating and resulted in a growing need for the development of automated, high-dimensional analytical methods. We present a direct multivariate finite mixture modeling approach, using skew and heavy-tailed distributions, to address the complexities of flow cytometric analysis and to deal with high-dimensional cytometric data without the need for projection or transformation. We demonstrate its ability to detect rare populations, to model robustly in the presence of outliers and skew, and to perform the critical task of matching cell populations across samples that enables downstream analysis. This advance will facilitate the application of flow cytometry to new, complex biological and clinical problems. PMID:19443687

  12. HIGH DIMENSIONAL COVARIANCE MATRIX ESTIMATION IN APPROXIMATE FACTOR MODELS

    PubMed Central

    Fan, Jianqing; Liao, Yuan; Mincheva, Martina

    2012-01-01

    The variance-covariance matrix plays a central role in the inferential theories of high dimensional factor models in finance and economics. Popular regularization methods that directly exploit sparsity are not applicable to many financial problems. Classical methods of estimating covariance matrices are based on strict factor models, assuming independent idiosyncratic components. This assumption, however, is restrictive in practical applications. By assuming a sparse error covariance matrix, we allow the presence of cross-sectional correlation even after taking out common factors, which enables us to combine the merits of both methods. We estimate the sparse covariance using the adaptive thresholding technique as in Cai and Liu (2011), taking into account the fact that direct observations of the idiosyncratic components are unavailable. The impact of high dimensionality on covariance matrix estimation based on the factor structure is then studied. PMID:22661790
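
    A NumPy sketch of the general recipe described, hedged as an illustration: remove K principal-component "factors" from the sample covariance, soft-threshold the remaining (idiosyncratic) covariance, and recombine. The threshold level and data are illustrative; they are not the adaptive thresholding rule analyzed in the paper.

        # Covariance estimation under an approximate factor structure:
        # low-rank factor part + soft-thresholded residual covariance.
        import numpy as np

        rng = np.random.default_rng(0)
        n, p, K = 300, 100, 3
        B = rng.normal(size=(p, K))                    # factor loadings
        F = rng.normal(size=(n, K))                    # common factors
        X = F @ B.T + rng.normal(scale=1.0, size=(n, p))

        S = np.cov(X, rowvar=False)                    # p x p sample covariance
        vals, vecs = np.linalg.eigh(S)
        vals, vecs = vals[::-1], vecs[:, ::-1]         # sort eigenpairs descending

        low_rank = (vecs[:, :K] * vals[:K]) @ vecs[:, :K].T      # factor part
        R = S - low_rank                                          # residual covariance

        tau = np.sqrt(np.log(p) / n)                   # illustrative threshold level
        R_thr = np.sign(R) * np.maximum(np.abs(R) - tau, 0.0)
        np.fill_diagonal(R_thr, np.diag(R))            # keep the diagonal un-thresholded

        Sigma_hat = low_rank + R_thr
        off_diag = R_thr[~np.eye(p, dtype=bool)]
        print("fraction of off-diagonal residual entries set to zero:",
              round(float((off_diag == 0).mean()), 3))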

  13. ANISOTROPIC THERMAL CONDUCTION AND THE COOLING FLOW PROBLEM IN GALAXY CLUSTERS

    SciTech Connect

    Parrish, Ian J.; Sharma, Prateek; Quataert, Eliot

    2009-09-20

    We examine the long-standing cooling flow problem in galaxy clusters with three-dimensional magnetohydrodynamics simulations of isolated clusters including radiative cooling and anisotropic thermal conduction along magnetic field lines. The central regions of the intracluster medium (ICM) can have cooling timescales of ~200 Myr or shorter; in order to prevent a cooling catastrophe the ICM must be heated by some mechanism such as active galactic nucleus feedback or thermal conduction from the thermal reservoir at large radii. The cores of galaxy clusters are linearly unstable to the heat-flux-driven buoyancy instability (HBI), which significantly changes the thermodynamics of the cluster core. The HBI is a convective, buoyancy-driven instability that rearranges the magnetic field to be preferentially perpendicular to the temperature gradient. For a wide range of parameters, our simulations demonstrate that in the presence of the HBI, the effective radial thermal conductivity is reduced to ≲10% of the full Spitzer conductivity. With this suppression of conductive heating, the cooling catastrophe occurs on a timescale comparable to the central cooling time of the cluster. Thermal conduction alone is thus unlikely to stabilize clusters with low central entropies and short central cooling timescales. High central entropy clusters have sufficiently long cooling times that conduction can help stave off the cooling catastrophe for cosmologically interesting timescales.

  14. Role of peculiar velocity of galaxy clusters in gravitational clustering of cosmological many body problem

    NASA Astrophysics Data System (ADS)

    Masood, Tabasum

    2016-07-01

    The distribution of galaxies in the universe can be well understood by correlation function analysis. The lowest-order two-point autocorrelation function has remained a successful tool for understanding the galaxy clustering phenomenon. The two-point correlation function is the probability of finding two galaxies in a given volume separated by some particular distance: given a random galaxy in a location, the correlation function describes the probability that another galaxy will be found within a given distance. The correlation function is an important tool for theoretical models of physical cosmology because it provides a means of testing models which make different assumptions about the contents of the universe. The correlation function is one of the ways to characterize the distribution of galaxies in space; it can be obtained from observations and can be extracted from numerical N-body experiments. It is also a natural quantity in the theoretical dynamical description of gravitating systems. These correlations can answer many interesting questions about the evolution and the distribution of galaxies.

  15. High dimensional cohomology of discrete groups.

    PubMed

    Brown, K S

    1976-06-01

    For a large class of discrete groups Gamma, relations are established between the high dimensional cohomology of Gamma and the cohomology of the normalizers of the finite subgroups of Gamma. The results are stated in terms of a generalization of Tate cohomology recently constructed by F. T. Farrell. As an illustration of these results, it is shown that one can recover a cohomology calculation of Lee and Szczarba, which they used to calculate the odd torsion in K(3)(Z). PMID:16592322

  16. High dimensional cohomology of discrete groups

    PubMed Central

    Brown, Kenneth S.

    1976-01-01

    For a large class of discrete groups Γ, relations are established between the high dimensional cohomology of Γ and the cohomology of the normalizers of the finite subgroups of Γ. The results are stated in terms of a generalization of Tate cohomology recently constructed by F. T. Farrell. As an illustration of these results, it is shown that one can recover a cohomology calculation of Lee and Szczarba, which they used to calculate the odd torsion in K3(Z). PMID:16592322

  17. Clustering Qualitative Data Based on Binary Equivalence Relations: Neighborhood Search Heuristics for the Clique Partitioning Problem

    ERIC Educational Resources Information Center

    Brusco, Michael J.; Kohn, Hans-Friedrich

    2009-01-01

    The clique partitioning problem (CPP) requires the establishment of an equivalence relation for the vertices of a graph such that the sum of the edge costs associated with the relation is minimized. The CPP has important applications for the social sciences because it provides a framework for clustering objects measured on a collection of nominal…

  18. Locating landmarks on high-dimensional free energy surfaces.

    PubMed

    Chen, Ming; Yu, Tang-Qing; Tuckerman, Mark E

    2015-03-17

    Coarse graining of complex systems possessing many degrees of freedom can often be a useful approach for analyzing and understanding key features of these systems in terms of just a few variables. The relevant energy landscape in a coarse-grained description is the free energy surface as a function of the coarse-grained variables, which, despite the dimensional reduction, can still be an object of high dimension. Consequently, navigating and exploring this high-dimensional free energy surface is a nontrivial task. In this paper, we use techniques from multiscale modeling, stochastic optimization, and machine learning to devise a strategy for locating minima and saddle points (termed "landmarks") on a high-dimensional free energy surface "on the fly" and without requiring prior knowledge of or an explicit form for the surface. In addition, we propose a compact graph representation of the landmarks and connections between them, and we show that the graph nodes can be subsequently analyzed and clustered based on key attributes that elucidate important properties of the system. Finally, we show that knowledge of landmark locations allows for the efficient determination of their relative free energies via enhanced sampling techniques. PMID:25737545

  19. Using cluster analysis to identify patterns in students' responses to contextually different conceptual problems

    NASA Astrophysics Data System (ADS)

    Stewart, John; Miller, Mayo; Audo, Christine; Stewart, Gay

    2012-12-01

    This study examined the evolution of student responses to seven contextually different versions of two Force Concept Inventory questions in an introductory physics course at the University of Arkansas. The consistency in answering the closely related questions evolved little over the seven-question exam. A model for the state of student knowledge involving the probability of selecting one of the multiple-choice answers was developed. Criteria for using clustering algorithms to extract model parameters were explored and it was found that the overlap between the probability distributions of the model vectors was an important parameter in characterizing the cluster models. The course data were then clustered and the extracted model showed that students largely fit into two groups both pre- and postinstruction: one that answered all questions correctly with high probability and one that selected the distracter representing the same misconception with high probability. For the course studied, 14% of the students were left with persistent misconceptions post instruction on a static force problem and 30% on a dynamic Newton’s third law problem. These students selected the answer representing the predominant misconception slightly more consistently postinstruction, indicating that the course studied had been ineffective at moving this subgroup of students nearer a Newtonian force concept and had instead moved them slightly farther away from a correct conceptual understanding of these two problems. The consistency in answering pairs of problems with varied physical contexts is shown to be an important supplementary statistic to the score on the problems and suggests that the inclusion of such problem pairs in future conceptual inventories would be efficacious. Multiple, contextually varied questions further probe the structure of students’ knowledge. To allow working instructors to make use of the additional insight gained from cluster analysis, it is our hope that the

  20. Technical Report: Scalable Parallel Algorithms for High Dimensional Numerical Integration

    SciTech Connect

    Masalma, Yahya; Jiao, Yu

    2010-10-01

    We implemented a scalable parallel quasi-Monte Carlo numerical high-dimensional integration for tera-scale data points. The implemented algorithm uses Sobol quasi-random sequences to generate random samples. The Sobol sequence was used to avoid clustering effects in the generated samples and to produce low-discrepancy samples that cover the entire integration domain. The performance of the algorithm was tested, and the obtained results demonstrate the scalability and accuracy of the implemented algorithms. The implemented algorithm could be used in different applications where a huge data volume is generated and numerical integration is required. We suggest using a hybrid MPI and OpenMP programming model to improve the performance of the algorithms; if the mixed model is used, attention should be paid to scalability and accuracy.
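
    A serial sketch of the quasi-Monte Carlo step using SciPy's scrambled Sobol generator on a simple separable test integral, shown only to make the idea concrete; the report's MPI-parallel, tera-scale implementation is not reproduced here, and the integrand is an assumption chosen so that the exact answer is known.

        # Serial QMC sketch with a scrambled Sobol sequence (scipy.stats.qmc).
        import numpy as np
        from scipy.stats import qmc

        dim = 10
        sampler = qmc.Sobol(d=dim, scramble=True, seed=0)
        points = sampler.random_base2(m=14)       # 2**14 low-discrepancy points in [0,1)^dim

        # Integrand: product over dimensions of (3/2)*sqrt(x_i); exact integral is 1.
        f = np.prod(1.5 * np.sqrt(points), axis=1)
        qmc_estimate = f.mean()

        # Plain Monte Carlo with the same budget, for comparison.
        mc_points = np.random.default_rng(0).random((2 ** 14, dim))
        mc_estimate = np.prod(1.5 * np.sqrt(mc_points), axis=1).mean()

        print("exact = 1.0, QMC =", round(float(qmc_estimate), 5),
              ", plain MC =", round(float(mc_estimate), 5))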

  1. Performance of Extended Local Clustering Organization (LCO) for Large Scale Job-Shop Scheduling Problem (JSP)

    NASA Astrophysics Data System (ADS)

    Konno, Yohko; Suzuki, Keiji

    This paper describes the development of a general-purpose solution algorithm for large-scale problems using “Local Clustering Organization (LCO)” as a new approach to the Job-shop Scheduling Problem (JSP). Building on earlier work showing that LCO scales effectively to large scheduling instances, we examine how to solve the JSP while retaining the stability that leads to better solutions. To improve solution performance for the JSP, the optimization process carried out by LCO is examined, and the scheduling solution structure is extended to a new structure based on machine division. A solving method that introduces effective local clustering into this solution structure is proposed as an extended LCO. The extended LCO algorithm improves the scheduling evaluation efficiently through clustered parallel search that extends over plural machines. Applying the extended LCO to problems of various scales shows that it minimizes makespan and improves the stability of performance.

  2. Mode Estimation for High Dimensional Discrete Tree Graphical Models

    PubMed Central

    Chen, Chao; Liu, Han; Metaxas, Dimitris N.; Zhao, Tianqi

    2014-01-01

    This paper studies the following problem: given samples from a high dimensional discrete distribution, we want to estimate the leading (δ, ρ)-modes of the underlying distributions. A point is defined to be a (δ, ρ)-mode if it is a local optimum of the density within a δ-neighborhood under metric ρ. As we increase the “scale” parameter δ, the neighborhood size increases and the total number of modes monotonically decreases. The sequence of the (δ, ρ)-modes reveal intrinsic topographical information of the underlying distributions. Though the mode finding problem is generally intractable in high dimensions, this paper unveils that, if the distribution can be approximated well by a tree graphical model, mode characterization is significantly easier. An efficient algorithm with provable theoretical guarantees is proposed and is applied to applications like data analysis and multiple predictions. PMID:25620859

  3. A cluster-analytic study of substance problems and mental health among street youths.

    PubMed

    Adlaf, E M; Zdanowicz, Y M

    1999-11-01

    Based on a cluster analysis of 211 street youths aged 13-24 years interviewed in 1992 in Toronto, Ontario, Canada, we describe the configuration of mental health and substance use outcomes. Eight clusters were suggested: Entrepreneurs (n = 19) were frequently involved in delinquent activity and were highly entrenched in the street lifestyle; Drifters (n = 35) had infrequent social contact, displayed lower than average family dysfunction, and were not highly entrenched in the street lifestyle; Partiers (n = 40) were distinguished by their recreational motivation for alcohol and drug use and their below average entrenchment in the street lifestyle; Retreatists (n = 32) were distinguished by their high coping motivation for substance use; Fringers (n = 48) were involved marginally in the street lifestyle and showed lower than average family dysfunction; Transcenders (n = 21), despite above average physical and sexual abuse, reported below average mental health or substance use problems; Vulnerables (n = 12) were characterized by high family dysfunction (including physical and sexual abuse), elevated mental health outcomes, and use of alcohol and other drugs motivated by coping and escapism; Sex Workers (n = 4) were highly entrenched in the street lifestyle and reported frequent commercial sexual work, above average sexual abuse, and extensive use of crack cocaine. The results showed that distress, self-esteem, psychotic thoughts, attempted suicide, alcohol problems, drug problems, dual substance problems, and dual disorders varied significantly among the eight clusters. Overall, the findings suggest the need for differential programming. The data showed that risk factors, mental health, and substance use outcomes vary among this population. Also, for some the web of mental health and substance use problems is inseparable. PMID:10548440

  4. GLOBALLY ADAPTIVE QUANTILE REGRESSION WITH ULTRA-HIGH DIMENSIONAL DATA

    PubMed Central

    Zheng, Qi; Peng, Limin; He, Xuming

    2015-01-01

    Quantile regression has become a valuable tool to analyze heterogeneous covariate-response associations that are often encountered in practice. The development of quantile regression methodology for high dimensional covariates primarily focuses on examination of model sparsity at a single or multiple quantile levels, which are typically prespecified ad hoc by the users. The resulting models may be sensitive to the specific choices of the quantile levels, leading to difficulties in interpretation and erosion of confidence in the results. In this article, we propose a new penalization framework for quantile regression in the high dimensional setting. We employ adaptive L1 penalties, and more importantly, propose a uniform selector of the tuning parameter for a set of quantile levels to avoid some of the potential problems with model selection at individual quantile levels. Our proposed approach achieves consistent shrinkage of regression quantile estimates across a continuous range of quantile levels, enhancing the flexibility and robustness of the existing penalized quantile regression methods. Our theoretical results include the oracle rate of uniform convergence and weak convergence of the parameter estimators. We also use numerical studies to confirm our theoretical findings and illustrate the practical utility of our proposal. PMID:26604424
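
    For orientation, the snippet below fits L1-penalized quantile regression at several quantile levels with scikit-learn's QuantileRegressor on synthetic heteroscedastic data. Reusing a single penalty across the levels only mimics the idea of uniform shrinkage; the paper's uniform tuning-parameter selector and its theory are not implemented here.

        # L1-penalized quantile regression at several quantile levels.
        import numpy as np
        from sklearn.linear_model import QuantileRegressor

        rng = np.random.default_rng(0)
        n, p = 200, 50
        X = rng.normal(size=(n, p))
        # Heteroscedastic response: only the first two predictors matter.
        y = 1.0 * X[:, 0] + 0.5 * X[:, 1] + (1 + 0.5 * np.abs(X[:, 0])) * rng.normal(size=n)

        alpha = 0.05                               # illustrative penalty, reused across levels
        for tau in (0.25, 0.5, 0.75):
            qr = QuantileRegressor(quantile=tau, alpha=alpha, solver="highs").fit(X, y)
            active = np.flatnonzero(np.abs(qr.coef_) > 1e-8)
            print(f"tau = {tau}: {len(active)} active coefficients, first few -> {active[:5]}")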

  5. Spatially Weighted Principal Component Regression for High-dimensional Prediction

    PubMed Central

    Shen, Dan; Zhu, Hongtu

    2015-01-01

    We consider the problem of using high dimensional data residing on graphs to predict a low-dimensional outcome variable, such as disease status. Examples of data include time series and genetic data measured on linear graphs and imaging data measured on triangulated graphs (or lattices), among many others. Many of these data have two key features including spatial smoothness and intrinsically low dimensional structure. We propose a simple solution based on a general statistical framework, called spatially weighted principal component regression (SWPCR). In SWPCR, we introduce two sets of weights including importance score weights for the selection of individual features at each node and spatial weights for the incorporation of the neighboring pattern on the graph. We integrate the importance score weights with the spatial weights in order to recover the low dimensional structure of high dimensional data. We demonstrate the utility of our methods through extensive simulations and a real data analysis based on Alzheimer’s disease neuroimaging initiative data. PMID:26213452

  6. An empirical demonstration of the problem of cluster dissimilarity from different clustering methods in a single sample.

    PubMed

    Saltstone, R; Fraboni, M

    1990-11-01

    This study utilized the four most commonly employed clustering techniques (CLINK, SLINK, UPGMA, and Ward's) to illustrate the dissimilarity of cluster group membership (based upon short-form MMPI scale scores and a measure of alcohol dependency) between partitions in a sample of 113 impaired driving offenders. Results, examined with the Rand index of cluster comparison, demonstrated that cluster group membership can be so different between alternative clustering methods as to equal chance assignment. Cautions are given with regard to the use of cluster analysis for other than exploratory work. In particular, psychologists are cautioned against attempting to use cluster analysis based upon personality inventory scores (which can never be wholly reliable or discrete) for patient classification. PMID:2286695

  7. A Selective Overview of Variable Selection in High Dimensional Feature Space.

    PubMed

    Fan, Jianqing; Lv, Jinchi

    2010-01-01

    High dimensional statistical problems arise from diverse fields of scientific research and technological development. Variable selection plays a pivotal role in contemporary statistical learning and scientific discoveries. The traditional idea of best subset selection methods, which can be regarded as a specific form of penalized likelihood, is computationally too expensive for many modern statistical applications. Other forms of penalized likelihood methods have been successfully developed over the last decade to cope with high dimensionality. They have been widely applied for simultaneously selecting important variables and estimating their effects in high dimensional statistical inference. In this article, we present a brief account of the recent developments of theory, methods, and implementations for high dimensional variable selection. What limits of the dimensionality such methods can handle, what the role of penalty functions is, and what the statistical properties are rapidly drive the advances of the field. The properties of non-concave penalized likelihood and its roles in high dimensional statistical modeling are emphasized. We also review some recent advances in ultra-high dimensional variable selection, with emphasis on independence screening and two-scale methods. PMID:21572976

  8. A Selective Overview of Variable Selection in High Dimensional Feature Space

    PubMed Central

    Fan, Jianqing

    2010-01-01

    High dimensional statistical problems arise from diverse fields of scientific research and technological development. Variable selection plays a pivotal role in contemporary statistical learning and scientific discoveries. The traditional idea of best subset selection methods, which can be regarded as a specific form of penalized likelihood, is computationally too expensive for many modern statistical applications. Other forms of penalized likelihood methods have been successfully developed over the last decade to cope with high dimensionality. They have been widely applied for simultaneously selecting important variables and estimating their effects in high dimensional statistical inference. In this article, we present a brief account of the recent developments of theory, methods, and implementations for high dimensional variable selection. What limits of the dimensionality such methods can handle, what the role of penalty functions is, and what the statistical properties are rapidly drive the advances of the field. The properties of non-concave penalized likelihood and its roles in high dimensional statistical modeling are emphasized. We also review some recent advances in ultra-high dimensional variable selection, with emphasis on independence screening and two-scale methods. PMID:21572976

  9. BOOK REVIEW: The Gravitational Million-Body Problem: A Multidisciplinary Approach to Star Cluster Dynamics

    NASA Astrophysics Data System (ADS)

    Heggie, D.; Hut, P.

    2003-10-01

    focus on N = 10⁶ for two main reasons: first, direct numerical integrations of N-body systems are beginning to approach this threshold, and second, globular star clusters provide remarkably accurate physical instantiations of the idealized N-body problem with N = 10⁵ - 10⁶. The authors are distinguished contributors to the study of star-cluster dynamics and the gravitational N-body problem. The book contains lucid and concise descriptions of most of the important tools in the subject, with only a modest bias towards the authors' own interests. These tools include the two-body relaxation approximation, the Vlasov and Fokker-Planck equations, regularization of close encounters, conducting fluid models, Hill's approximation, Heggie's law for binary star evolution, symplectic integration algorithms, Liapunov exponents, and so on. The book also provides an up-to-date description of the principal processes that drive the evolution of idealized N-body systems - two-body relaxation, mass segregation, escape, core collapse and core bounce, binary star hardening, gravothermal oscillations - as well as additional processes such as stellar collisions and tidal shocks that affect real star clusters but not idealized N-body systems. In a relatively short (300 pages plus appendices) book such as this, many topics have to be omitted. The reader who is hoping to learn about the phenomenology of star clusters will be disappointed, as the description of their properties is limited to only a page of text; there is also almost no discussion of other, equally interesting N-body systems such as galaxies (N ≈ 10⁶ - 10¹²), open clusters (N ≈ 10² - 10⁴), planetary systems, or the star clusters surrounding black holes that are found in the centres of most galaxies. All of these omissions are defensible decisions. Less defensible is the uneven set of references in the text; for example, nowhere is the reader informed that the classic predecessor to this work was Spitzer's 1987 monograph

  10. The Effects of Cumulative Violence Clusters on Young Mothers' School Participation: Examining Attention and Behavior Problems as Mediators.

    PubMed

    Kennedy, Angie C; Adams, Adrienne E

    2016-04-01

    Using a cluster analysis approach with a sample of 205 young mothers recruited from community sites in an urban Midwestern setting, we examined the effects of cumulative violence exposure (community violence exposure, witnessing intimate partner violence, physical abuse by a caregiver, and sexual victimization, all with onset prior to age 13) on school participation, as mediated by attention and behavior problems in school. We identified five clusters of cumulative exposure, and found that the HiAll cluster (high levels of exposure to all four types) consistently fared the worst, with significantly higher attention and behavior problems, and lower school participation, in comparison with the LoAll cluster (low levels of exposure to all types). Behavior problems were a significant mediator of the effects of cumulative violence exposure on school participation, but attention problems were not. PMID:25538121

  11. Graphics Processing Units and High-Dimensional Optimization

    PubMed Central

    Zhou, Hua; Lange, Kenneth; Suchard, Marc A.

    2011-01-01

    This paper discusses the potential of graphics processing units (GPUs) in high-dimensional optimization problems. A single GPU card with hundreds of arithmetic cores can be inserted in a personal computer and dramatically accelerates many statistical algorithms. To exploit these devices fully, optimization algorithms should reduce to multiple parallel tasks, each accessing a limited amount of data. These criteria favor EM and MM algorithms that separate parameters and data. To a lesser extent, block relaxation and coordinate descent and ascent also qualify. We demonstrate the utility of GPUs in nonnegative matrix factorization, PET image reconstruction, and multidimensional scaling. Speedups of 100-fold can easily be attained. Over the next decade, GPUs will fundamentally alter the landscape of computational statistics. It is time for more statisticians to get on board. PMID:21847315

  12. High-dimensional bolstered error estimation

    PubMed Central

    Sima, Chao; Braga-Neto, Ulisses M.; Dougherty, Edward R.

    2011-01-01

    Motivation: In small-sample settings, bolstered error estimation has been shown to perform better than cross-validation and competitively with bootstrap with regard to various criteria. The key issue for bolstering performance is the variance setting for the bolstering kernel. Heretofore, this variance has been determined in a non-parametric manner from the data. Although bolstering based on this variance setting works well for small feature sets, results can deteriorate for high-dimensional feature spaces. Results: This article computes an optimal kernel variance depending on the classification rule, sample size, model and feature space, both the original number and the number remaining after feature selection. A key point is that the optimal variance is robust relative to the model. This allows us to develop a method for selecting a suitable variance to use in real-world applications where the model is not known, but the other factors in determining the optimal kernel are known. Availability: Companion website at http://compbio.tgen.org/paper_supp/high_dim_bolstering Contact: edward@mail.ece.tamu.edu PMID:21914630
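
    As a rough illustration of the bolstering idea described above, the sketch below estimates a bolstered resubstitution error by Monte Carlo sampling from a spherical Gaussian kernel centered at each training point. The kernel variance is supplied by the user here; choosing it optimally is the contribution of the paper and is not reproduced. All names and the toy classifier are illustrative.

        import numpy as np

        def bolstered_resub_error(X, y, predict, sigma, n_mc=100, rng=None):
            """Monte Carlo bolstered resubstitution error with a spherical Gaussian kernel.

            X: (n, p) training data, y: labels, predict: callable mapping (m, p) -> labels,
            sigma: bolstering kernel standard deviation (the quantity the paper optimizes).
            """
            rng = np.random.default_rng(rng)
            n, p = X.shape
            errors = 0.0
            for i in range(n):
                samples = X[i] + sigma * rng.standard_normal((n_mc, p))
                errors += np.mean(predict(samples) != y[i])
            return errors / n

        # Usage with a nearest-centroid (linear) classifier trained on the same small sample.
        rng = np.random.default_rng(1)
        X = np.vstack([rng.normal(0, 1, (20, 10)), rng.normal(1, 1, (20, 10))])
        y = np.array([0] * 20 + [1] * 20)
        mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
        predict = lambda Z: (np.linalg.norm(Z - mu1, axis=1) < np.linalg.norm(Z - mu0, axis=1)).astype(int)
        print(bolstered_resub_error(X, y, predict, sigma=0.5))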

  13. Solution of relativistic quantum optics problems using clusters of graphical processing units

    SciTech Connect

    Gordon, D.F.; Hafizi, B.; Helle, M.H.

    2014-06-15

    Numerical solution of relativistic quantum optics problems requires high performance computing due to the rapid oscillations in a relativistic wavefunction. Clusters of graphical processing units are used to accelerate the computation of a time dependent relativistic wavefunction in an arbitrary external potential. The stationary states in a Coulomb potential and uniform magnetic field are determined analytically and numerically, so that they can be used as initial conditions in fully time dependent calculations. Relativistic energy levels in extreme magnetic fields are recovered as a means of validation. The relativistic ionization rate is computed for an ion illuminated by a laser field near the usual barrier suppression threshold, and the ionizing wavefunction is displayed.

  14. Anomaly Detection in Large Sets of High-Dimensional Symbol Sequences

    NASA Technical Reports Server (NTRS)

    Budalakoti, Suratna; Srivastava, Ashok N.; Akella, Ram; Turkov, Eugene

    2006-01-01

    This paper addresses the problem of detecting and describing anomalies in large sets of high-dimensional symbol sequences. The approach taken uses unsupervised clustering of sequences using the normalized longest common subsequence (LCS) as a similarity measure, followed by detailed analysis of outliers to detect anomalies. As the LCS measure is expensive to compute, the first part of the paper discusses existing algorithms, such as the Hunt-Szymanski algorithm, that have low time-complexity. We then discuss why these algorithms often do not work well in practice and present a new hybrid algorithm for computing the LCS that, in our tests, outperforms the Hunt-Szymanski algorithm by a factor of five. The second part of the paper presents new algorithms for outlier analysis that provide comprehensible indicators as to why a particular sequence was deemed to be an outlier. The algorithms provide a coherent description to an analyst of the anomalies in the sequence, compared to more normal sequences. The algorithms we present are general and domain-independent, so we discuss applications in related areas such as anomaly detection.
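
    The similarity measure at the heart of this approach can be sketched with the textbook O(nm) dynamic program for the longest common subsequence, rather than the paper's faster hybrid algorithm; the normalization shown (dividing by the longer sequence length) is one common convention and the sequences are invented for the example.

        def lcs_length(a, b):
            """Classic O(len(a)*len(b)) dynamic program for the longest common subsequence."""
            m, n = len(a), len(b)
            dp = [[0] * (n + 1) for _ in range(m + 1)]
            for i in range(1, m + 1):
                for j in range(1, n + 1):
                    if a[i - 1] == b[j - 1]:
                        dp[i][j] = dp[i - 1][j - 1] + 1
                    else:
                        dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
            return dp[m][n]

        def normalized_lcs(a, b):
            """LCS length normalized to [0, 1]; here we divide by the longer sequence length."""
            if not a and not b:
                return 1.0
            return lcs_length(a, b) / max(len(a), len(b))

        # Two symbol sequences with similar prefixes and divergent tails.
        s1 = list("ABCDEFXYZ")
        s2 = list("ABCDEFQQQQQ")
        print(normalized_lcs(s1, s2))   # 6 shared symbols / 11 -> about 0.55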

  15. Optimal control problem for the three-sector economic model of a cluster

    NASA Astrophysics Data System (ADS)

    Murzabekov, Zainel; Aipanov, Shamshi; Usubalieva, Saltanat

    2016-08-01

    The problem of optimal control for the three-sector economic model of a cluster is considered. The task is to determine the optimal distribution of investment and manpower in moving the system from a given initial state to a desired final state. To solve the optimal control problem with finite-horizon planning, fixed trajectory endpoints, and box constraints, a method of Lagrange multipliers of a special type is used. This approach allows the desired control to be represented as a synthesized control depending on the state of the system and the current time. The results of numerical calculations for an instance of the three-sector model of the economy show the effectiveness of the proposed method.

  16. Solving the inverse Ising problem by mean-field methods in a clustered phase space with many states

    NASA Astrophysics Data System (ADS)

    Decelle, Aurélien; Ricci-Tersenghi, Federico

    2016-07-01

    In this work we explain how to properly use mean-field methods to solve the inverse Ising problem when the phase space is clustered, that is, many states are present. The clustering of the phase space can occur for many reasons, e.g., when a system undergoes a phase transition, but also when data are collected in different regimes (e.g., quiescent and spiking regimes in neural networks). Mean-field methods for the inverse Ising problem are typically used without taking into account the possible clustered structure of the input configurations and may lead to very poor inference (e.g., in the low-temperature phase of the Curie-Weiss model). In this work we explain how to modify mean-field approaches when the phase space is clustered and we illustrate the effectiveness of our method on different clustered structures (low-temperature phases of Curie-Weiss and Hopfield models).

  17. Solving the inverse Ising problem by mean-field methods in a clustered phase space with many states.

    PubMed

    Decelle, Aurélien; Ricci-Tersenghi, Federico

    2016-07-01

    In this work we explain how to properly use mean-field methods to solve the inverse Ising problem when the phase space is clustered, that is, many states are present. The clustering of the phase space can occur for many reasons, e.g., when a system undergoes a phase transition, but also when data are collected in different regimes (e.g., quiescent and spiking regimes in neural networks). Mean-field methods for the inverse Ising problem are typically used without taking into account the possible clustered structure of the input configurations and may lead to very poor inference (e.g., in the low-temperature phase of the Curie-Weiss model). In this work we explain how to modify mean-field approaches when the phase space is clustered and we illustrate the effectiveness of our method on different clustered structures (low-temperature phases of Curie-Weiss and Hopfield models). PMID:27575082
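
    To make the failure mode concrete, the sketch below applies the standard naive mean-field (nMF) coupling estimate J ≈ -C^(-1) (off-diagonal), once to a pooled two-state sample and once within each cluster of configurations. Averaging the per-cluster estimates is only a crude stand-in for the paper's state-aware correction; the point it illustrates is that correlations computed over a multi-modal pooled sample produce spurious couplings.

        import numpy as np

        def nmf_couplings(S):
            """Naive mean-field inverse Ising: J ~ -C^{-1} (off-diagonal) from spin samples S in {-1,+1}."""
            C = np.cov(S, rowvar=False)          # connected correlation matrix
            J = -np.linalg.inv(C)
            np.fill_diagonal(J, 0.0)
            return J

        def clustered_nmf_couplings(S, labels):
            """Apply the nMF formula within each cluster of configurations and average the estimates."""
            estimates = [nmf_couplings(S[labels == k]) for k in np.unique(labels)]
            return np.mean(estimates, axis=0)

        # Toy data: two "states" of independent spins with opposite magnetizations (true couplings are zero).
        rng = np.random.default_rng(0)
        S_up = np.where(rng.random((500, 8)) < 0.9, 1, -1)     # mostly +1 state
        S_dn = np.where(rng.random((500, 8)) < 0.1, 1, -1)     # mostly -1 state
        S = np.vstack([S_up, S_dn])
        labels = np.array([0] * 500 + [1] * 500)
        # The pooled estimate inflates the couplings relative to the per-state estimate.
        print(np.abs(nmf_couplings(S)).mean(), np.abs(clustered_nmf_couplings(S, labels)).mean())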

  18. Fully polynomial-time approximation scheme for a special case of a quadratic Euclidean 2-clustering problem

    NASA Astrophysics Data System (ADS)

    Kel'manov, A. V.; Khandeev, V. I.

    2016-02-01

    The strongly NP-hard problem of partitioning a finite set of points of Euclidean space into two clusters of given sizes (cardinalities), minimizing the sum (over both clusters) of the intracluster sums of squared distances from the elements of the clusters to their centers, is considered. It is assumed that the center of one of the sought clusters is fixed at a given (arbitrary) point of space (without loss of generality, the origin), while the center of the other is unknown and is defined as the mean value over all elements of that cluster. It is shown that, unless P = NP, there is no fully polynomial-time approximation scheme for this problem in general; such a scheme is then constructed for the case of a fixed space dimension.
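
    To make the objective concrete, the sketch below evaluates it directly and solves a tiny instance by exhaustive search over all subsets of the prescribed size; this is feasible only for a handful of points and is not the approximation scheme of the paper.

        import itertools
        import numpy as np

        def objective(Y, idx_B):
            """Sum of squared distances: cluster B to its own centroid, the remaining points to the origin."""
            B = Y[list(idx_B)]
            mask = np.ones(len(Y), dtype=bool); mask[list(idx_B)] = False
            A = Y[mask]
            return ((B - B.mean(axis=0)) ** 2).sum() + (A ** 2).sum()

        def brute_force_2clustering(Y, m):
            """Exact search over all subsets of size m; only feasible for tiny instances,
            but it makes the (strongly NP-hard) partition objective concrete."""
            best = min(itertools.combinations(range(len(Y)), m), key=lambda idx: objective(Y, idx))
            return best, objective(Y, best)

        rng = np.random.default_rng(0)
        Y = np.vstack([rng.normal(0, 0.3, (6, 2)), rng.normal(3, 0.3, (6, 2))])  # one cloud near the origin
        print(brute_force_2clustering(Y, m=6))   # the far cloud should form the free-center cluster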

  19. High dimensional decision dilemmas in climate models

    NASA Astrophysics Data System (ADS)

    Bracco, A.; Neelin, J. D.; Luo, H.; McWilliams, J. C.; Meyerson, J. E.

    2013-05-01

    An important source of uncertainty in climate models is linked to the calibration of model parameters. Interest in systematic and automated parameter optimization procedures stems from the desire to improve the model climatology and to quantify the average sensitivity associated with potential changes in the climate system. Neelin et al. (2010) used a quadratic metamodel to objectively calibrate an atmospheric circulation model (AGCM) around four adjustable parameters. The metamodel accurately estimates global spatial averages of common fields of climatic interest, from precipitation, to low and high level winds, from temperature at various levels to sea level pressure and geopotential height, while providing a computationally cheap strategy to explore the influence of parameter settings. Here, guided by the metamodel, the ambiguities or dilemmas related to the decision making process in relation to model sensitivity and optimization are examined. Simulations of current climate are subject to considerable regional-scale biases. Those biases may vary substantially depending on the climate variable considered, and/or on the performance metric adopted. Common dilemmas are associated with model revisions yielding improvement in one field or regional pattern or season, but degradation in another, or improvement in the model climatology but degradation in the interannual variability representation. Challenges are posed to the modeler by the high dimensionality of the model output fields and by the large number of adjustable parameters. The use of the metamodel in the optimization strategy helps visualize trade-offs at a regional level, e.g. how mismatches between sensitivity and error spatial fields yield regional errors under minimization of global objective functions.

  20. High dimensional decision dilemmas in climate models

    NASA Astrophysics Data System (ADS)

    Bracco, A.; Neelin, J. D.; Luo, H.; McWilliams, J. C.; Meyerson, J. E.

    2013-10-01

    An important source of uncertainty in climate models is linked to the calibration of model parameters. Interest in systematic and automated parameter optimization procedures stems from the desire to improve the model climatology and to quantify the average sensitivity associated with potential changes in the climate system. Building upon the smoothness of the response of an atmospheric circulation model (AGCM) to changes of four adjustable parameters, Neelin et al. (2010) used a quadratic metamodel to objectively calibrate the AGCM. The metamodel accurately estimates global spatial averages of common fields of climatic interest, from precipitation, to low and high level winds, from temperature at various levels to sea level pressure and geopotential height, while providing a computationally cheap strategy to explore the influence of parameter settings. Here, guided by the metamodel, the ambiguities or dilemmas related to the decision making process in relation to model sensitivity and optimization are examined. Simulations of current climate are subject to considerable regional-scale biases. Those biases may vary substantially depending on the climate variable considered, and/or on the performance metric adopted. Common dilemmas are associated with model revisions yielding improvement in one field or regional pattern or season, but degradation in another, or improvement in the model climatology but degradation in the interannual variability representation. Challenges are posed to the modeler by the high dimensionality of the model output fields and by the large number of adjustable parameters. The use of the metamodel in the optimization strategy helps visualize trade-offs at a regional level, e.g., how mismatches between sensitivity and error spatial fields yield regional errors under minimization of global objective functions.
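
    The quadratic-metamodel idea can be sketched as an ordinary least-squares fit of a second-order polynomial in the adjustable parameters to a scalar model diagnostic. The toy "model" below stands in for an expensive AGCM run; it is purely illustrative and not the Neelin et al. metamodel itself.

        import numpy as np
        from itertools import combinations

        def quadratic_design(P):
            """Columns: intercept, linear terms, squares, and pairwise interactions of the parameters."""
            n, d = P.shape
            cols = [np.ones(n)] + [P[:, i] for i in range(d)] + [P[:, i] ** 2 for i in range(d)]
            cols += [P[:, i] * P[:, j] for i, j in combinations(range(d), 2)]
            return np.column_stack(cols)

        def fit_metamodel(P, y):
            """Least-squares fit of a quadratic metamodel to a scalar model output y at parameter settings P."""
            coef, *_ = np.linalg.lstsq(quadratic_design(P), y, rcond=None)
            return coef

        def predict(coef, P):
            return quadratic_design(P) @ coef

        # Toy stand-in for an AGCM diagnostic over 4 adjustable parameters.
        rng = np.random.default_rng(0)
        P = rng.uniform(-1, 1, (60, 4))
        y = 2.0 + P[:, 0] - 0.5 * P[:, 1] ** 2 + 0.3 * P[:, 2] * P[:, 3] + 0.01 * rng.standard_normal(60)
        coef = fit_metamodel(P, y)
        print(np.max(np.abs(predict(coef, P) - y)))   # small residual: the quadratic surface captures the response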

  1. Toward a nonlinear ensemble filter for high-dimensional systems

    NASA Astrophysics Data System (ADS)

    Bengtsson, Thomas; Snyder, Chris; Nychka, Doug

    2003-12-01

    Many geophysical problems are characterized by high-dimensional, nonlinear systems and pose difficult challenges for real-time data assimilation (updating) and forecasting. The present work builds on the ensemble Kalman filter (EnsKF), with the goal of producing ensemble filtering techniques applicable to non-Gaussian densities and high-dimensional systems. Three filtering algorithms, based on representing the prior density as a Gaussian mixture, are presented. The first, referred to as a mixture ensemble Kalman filter (XEnsF), models local covariance structures adaptively using nearest neighbors. The XEnsF is effective in a three-dimensional system, but the required ensemble grows rapidly with the dimension and, even in a 40-dimensional system, we find the XEnsF to be unstable and inferior to the EnsKF for all computationally feasible ensemble sizes. A second algorithm, the local-local ensemble filter (LLEnsF), combines localizations in physical as well as phase space, allowing the update step in high-dimensional systems to be decomposed into a sequence of lower-dimensional updates tractable by the XEnsF. Given the same prior forecasts in a 40-dimensional system, the LLEnsF update produces more accurate state estimates than the EnsKF if the forecast distributions are sufficiently non-Gaussian. Cycling the LLEnsF for long times, however, produces results inferior to the EnsKF because the LLEnsF ignores spatial continuity or smoothness between local state estimates. To address this weakness of the LLEnsF, we consider ways of enforcing spatial smoothness by conditioning the local updates on the prior estimates outside the localization in physical space. These considerations yield a third algorithm, which is a hybrid of the LLEnsF and the EnsKF. The hybrid uses information from the EnsKF to ensure spatial continuity of local updates and outperforms the EnsKF by 5.7% in RMS error in the 40-dimensional system.
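
    For reference, the baseline that all three filters build on is the stochastic (perturbed-observation) EnKF analysis step. The sketch below is a minimal numpy implementation of that baseline, not of the XEnsF, LLEnsF, or hybrid filters; the toy state and observation operator are invented.

        import numpy as np

        def enkf_update(Xf, y_obs, H, R, rng=None):
            """Stochastic (perturbed-observation) EnKF analysis step.

            Xf: (n_ens, n_state) forecast ensemble; y_obs: (n_obs,) observation;
            H: (n_obs, n_state) observation operator; R: (n_obs, n_obs) observation-error covariance.
            """
            rng = np.random.default_rng(rng)
            n_ens = Xf.shape[0]
            Pf = np.cov(Xf, rowvar=False)                         # sample forecast covariance
            K = Pf @ H.T @ np.linalg.inv(H @ Pf @ H.T + R)        # Kalman gain
            perturbed = y_obs + rng.multivariate_normal(np.zeros(len(y_obs)), R, size=n_ens)
            return Xf + (perturbed - Xf @ H.T) @ K.T              # analysis ensemble

        # Usage: a 40-variable state with every other component observed.
        rng = np.random.default_rng(0)
        truth = rng.standard_normal(40)
        forecast_mean = truth + 0.7 * rng.standard_normal(40)          # biased forecast
        Xf = forecast_mean + 0.7 * rng.standard_normal((50, 40))       # 50-member ensemble
        H = np.eye(40)[::2]
        R = 0.1 * np.eye(20)
        y_obs = H @ truth + rng.multivariate_normal(np.zeros(20), R)
        Xa = enkf_update(Xf, y_obs, H, R, rng=1)
        # The analysis mean is typically closer to the truth, mainly in the observed components.
        print(np.linalg.norm(Xf.mean(0) - truth), np.linalg.norm(Xa.mean(0) - truth))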

  2. Visualization of High-Dimensionality Data Using Virtual Reality

    NASA Astrophysics Data System (ADS)

    Djorgovski, S. G.; Donalek, C.; Davidoff, S.; Lombeyda, S.

    2015-12-01

    An effective visualization of complex and high-dimensionality data sets is now a critical bottleneck on the path from data to discovery in all fields. Visual pattern recognition is the bridge between human intuition and understanding, and the quantitative content of the data and the relationships present there (correlations, outliers, clustering, etc.). We are developing a novel platform for visualization of complex, multi-dimensional data, using immersive virtual reality (VR), that leverages the recent rapid developments in the availability of commodity hardware and development software. VR immersion has been shown to significantly increase effective visual perception and intuition compared to traditional flat-screen tools. This makes it easier to perceive higher dimensional spaces, giving an advantage for visual exploration of complex data compared to traditional visualization methods. Immersive VR also offers a natural way for a collaborative visual exploration of data, with multiple users interacting with each other and with their data in the same perceptive data space.

  3. Engineering two-photon high-dimensional states through quantum interference

    PubMed Central

    Zhang, Yingwen; Roux, Filippus S.; Konrad, Thomas; Agnew, Megan; Leach, Jonathan; Forbes, Andrew

    2016-01-01

    Many protocols in quantum science, for example, linear optical quantum computing, require access to large-scale entangled quantum states. Such systems can be realized through many-particle qubits, but this approach often suffers from scalability problems. An alternative strategy is to consider a lesser number of particles that exist in high-dimensional states. The spatial modes of light are one such candidate that provides access to high-dimensional quantum states, and thus they increase the storage and processing potential of quantum information systems. We demonstrate the controlled engineering of two-photon high-dimensional states entangled in their orbital angular momentum through Hong-Ou-Mandel interference. We prepare a large range of high-dimensional entangled states and implement precise quantum state filtering. We characterize the full quantum state before and after the filter, and are thus able to determine that only the antisymmetric component of the initial state remains. This work paves the way for high-dimensional processing and communication of multiphoton quantum states, for example, in teleportation beyond qubits. PMID:26933685

  4. Engineering two-photon high-dimensional states through quantum interference.

    PubMed

    Zhang, Yingwen; Roux, Filippus S; Konrad, Thomas; Agnew, Megan; Leach, Jonathan; Forbes, Andrew

    2016-02-01

    Many protocols in quantum science, for example, linear optical quantum computing, require access to large-scale entangled quantum states. Such systems can be realized through many-particle qubits, but this approach often suffers from scalability problems. An alternative strategy is to consider a lesser number of particles that exist in high-dimensional states. The spatial modes of light are one such candidate that provides access to high-dimensional quantum states, and thus they increase the storage and processing potential of quantum information systems. We demonstrate the controlled engineering of two-photon high-dimensional states entangled in their orbital angular momentum through Hong-Ou-Mandel interference. We prepare a large range of high-dimensional entangled states and implement precise quantum state filtering. We characterize the full quantum state before and after the filter, and are thus able to determine that only the antisymmetric component of the initial state remains. This work paves the way for high-dimensional processing and communication of multiphoton quantum states, for example, in teleportation beyond qubits. PMID:26933685

  5. Collaborative Care Outcomes for Pediatric Behavioral Health Problems: A Cluster Randomized Trial

    PubMed Central

    Campo, John; Kilbourne, Amy M.; Hart, Jonathan; Sakolsky, Dara; Wisniewski, Stephen

    2014-01-01

    OBJECTIVE: To assess the efficacy of collaborative care for behavior problems, attention-deficit/hyperactivity disorder (ADHD), and anxiety in pediatric primary care (Doctor Office Collaborative Care; DOCC). METHODS: Children and their caregivers participated from 8 pediatric practices that were cluster randomized to DOCC (n = 160) or enhanced usual care (EUC; n = 161). In DOCC, a care manager delivered a personalized, evidence-based intervention. EUC patients received psychoeducation and a facilitated specialty care referral. Care processes measures were collected after the 6-month intervention period. Family outcome measures included the Vanderbilt ADHD Diagnostic Parent Rating Scale, Parenting Stress Index-Short Form, Individualized Goal Attainment Ratings, and Clinical Global Impression-Improvement Scale. Most measures were collected at baseline, and 6-, 12-, and 18-month assessments. Provider outcome measures examined perceived treatment change, efficacy, and obstacles, and practice climate. RESULTS: DOCC (versus EUC) was associated with higher rates of treatment initiation (99.4% vs 54.2%; P < .001) and completion (76.6% vs 11.6%, P < .001), improvement in behavior problems, hyperactivity, and internalizing problems (P < .05 to .01), and parental stress (P < .05–.001), remission in behavior and internalizing problems (P < .01, .05), goal improvement (P < .05 to .001), treatment response (P < .05), and consumer satisfaction (P < .05). DOCC pediatricians reported greater perceived practice change, efficacy, and skill use to treat ADHD (P < .05 to .01). CONCLUSIONS: Implementing a collaborative care intervention for behavior problems in community pediatric practices is feasible and broadly effective, supporting the utility of integrated behavioral health care services. PMID:24664093

  6. Inverse regression-based uncertainty quantification algorithms for high-dimensional models: Theory and practice

    NASA Astrophysics Data System (ADS)

    Li, Weixuan; Lin, Guang; Li, Bing

    2016-09-01

    Many uncertainty quantification (UQ) approaches suffer from the curse of dimensionality, that is, their computational costs become intractable for problems involving a large number of uncertainty parameters. In these situations, classic Monte Carlo often remains the method of choice because its convergence rate O(n^(-1/2)), where n is the required number of model simulations, does not depend on the dimension of the problem. However, many high-dimensional UQ problems are intrinsically low-dimensional, because the variation of the quantity of interest (QoI) is often caused by only a few latent parameters varying within a low-dimensional subspace, known as the sufficient dimension reduction (SDR) subspace in the statistics literature. Motivated by this observation, we propose two inverse regression-based UQ algorithms (IRUQ) for high-dimensional problems. Both algorithms use inverse regression to convert the original high-dimensional problem to a low-dimensional one, which is then efficiently solved by building a response surface for the reduced model, for example via the polynomial chaos expansion. The first algorithm, for situations where an exact SDR subspace exists, is proved to converge at rate O(n^(-1)), hence much faster than MC. The second algorithm, which does not require an exact SDR, employs the reduced model as a control variate to reduce the error of the MC estimate. The accuracy gain could still be significant, depending on how well the reduced model approximates the original high-dimensional one. IRUQ also provides several additional practical advantages: it is non-intrusive; it does not require computing the high-dimensional gradient of the QoI; and it reports an error bar so the user knows how reliable the result is.
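
    Sliced inverse regression (SIR) is one standard inverse-regression estimator of the SDR subspace and gives a feel for the dimension-reduction step described above; it is not necessarily the exact estimator used by IRUQ, and the toy "simulator" is invented for the example.

        import numpy as np

        def sliced_inverse_regression(X, y, n_slices=10, n_dirs=2):
            """SIR: whiten X, average it within slices of sorted y, and take the leading
            eigenvectors of the between-slice covariance, mapped back to the original scale."""
            n, p = X.shape
            mu, cov = X.mean(axis=0), np.cov(X, rowvar=False)
            w, V = np.linalg.eigh(cov)
            cov_inv_sqrt = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
            Z = (X - mu) @ cov_inv_sqrt
            order = np.argsort(y)
            slice_means, weights = [], []
            for chunk in np.array_split(order, n_slices):
                slice_means.append(Z[chunk].mean(axis=0))
                weights.append(len(chunk) / n)
            M = sum(wt * np.outer(m, m) for wt, m in zip(weights, slice_means))
            evals, evecs = np.linalg.eigh(M)
            dirs = cov_inv_sqrt @ evecs[:, ::-1][:, :n_dirs]      # back to original coordinates
            return dirs / np.linalg.norm(dirs, axis=0)

        # Toy model: a 20-parameter "simulator" whose output depends on a single direction.
        rng = np.random.default_rng(0)
        X = rng.standard_normal((2000, 20))
        beta = np.zeros(20); beta[0] = 1.0
        y = np.sin(X @ beta) + 0.05 * rng.standard_normal(2000)
        print(np.round(sliced_inverse_regression(X, y, n_dirs=1).ravel(), 2))  # close to +/- e_1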

  7. Extremely high-dimensional feature selection via feature generating samplings.

    PubMed

    Li, Shutao; Wei, Dan

    2014-06-01

    To select informative features in extremely high-dimensional problems, in this paper, a sampling scheme is proposed to enhance the efficiency of recently developed feature generating machines (FGMs). Note that in FGMs, O(m log r) time is required to order the features by their scores; the entire computational cost of feature ordering becomes unbearable when m is very large, for example, m > 10^11, where m is the feature dimensionality and r is the size of the selected feature subset. To solve this problem, in this paper, we propose a feature generating sampling method, which can reduce this computational complexity to O(G_s log(G) + G(G + log(G))) while preserving the most informative features in a feature buffer, where G_s is the maximum number of nonzero features for each instance and G is the buffer size. Moreover, we show that the proposed sampling scheme can be regarded as a birth-death process, based on random process theory, which guarantees that most of the informative features are included for feature selection. Empirical studies on real-world datasets show the effectiveness of the proposed sampling method. PMID:23864272

  8. Avoiding common pitfalls when clustering biological data.

    PubMed

    Ronan, Tom; Qi, Zhijie; Naegle, Kristen M

    2016-01-01

    Clustering is an unsupervised learning method, which groups data points based on similarity, and is used to reveal the underlying structure of data. This computational approach is essential to understanding and visualizing the complex data that are acquired in high-throughput multidimensional biological experiments. Clustering enables researchers to make biological inferences for further experiments. Although a powerful technique, inappropriate application can lead biological researchers to waste resources and time in experimental follow-up. We review common pitfalls identified from the published molecular biology literature and present methods to avoid them. Commonly encountered pitfalls relate to the high-dimensional nature of biological data from high-throughput experiments, the failure to consider more than one clustering method for a given problem, and the difficulty in determining whether clustering has produced meaningful results. We present concrete examples of problems and solutions (clustering results) in the form of toy problems and real biological data for these issues. We also discuss ensemble clustering as an easy-to-implement method that enables the exploration of multiple clustering solutions and improves robustness of clustering solutions. Increased awareness of common clustering pitfalls will help researchers avoid overinterpreting or misinterpreting the results and missing valuable insights when clustering biological data. PMID:27303057

  9. Experiences modeling ocean circulation problems on a 30 node commodity cluster with 3840 GPU processor cores.

    NASA Astrophysics Data System (ADS)

    Hill, C.

    2008-12-01

    Low cost graphics cards today use many, relatively simple, compute cores to deliver memory bandwidth of more than 100 GB/s and theoretical floating point performance of more than 500 GFlop/s. Right now this performance is, however, only accessible to highly parallel algorithm implementations that (i) can use a hundred or more, 32-bit floating point, concurrently executing cores, (ii) can work with graphics memory that resides on the graphics card side of the graphics bus and (iii) can be partially expressed in a language that can be compiled by a graphics programming tool. In this talk we describe our experiences implementing a complete, but relatively simple, time dependent shallow-water equations simulation targeting a cluster of 30 computers each hosting one graphics card. The implementation takes into account the considerations (i), (ii) and (iii) listed previously. We code our algorithm as a series of numerical kernels. Each kernel is designed to be executed by multiple threads of a single process. Kernels are passed memory blocks to compute over, which can be persistent blocks of memory on a graphics card. Each kernel is individually implemented using the NVidia CUDA language but driven from a higher level supervisory code that is almost identical to a standard model driver. The supervisory code controls the overall simulation timestepping, but is written to minimize data transfer between main memory and graphics memory (a massive performance bottleneck on current systems). Using the recipe outlined we can boost the performance of our cluster by nearly an order of magnitude, relative to the same algorithm executing only on the cluster CPUs. Achieving this performance boost requires that many threads are available to each graphics processor for execution within each numerical kernel and that the simulation's working set of data can fit into the graphics card memory. As we describe, this puts interesting upper and lower bounds on the problem sizes

  10. High-dimensional mode analyzers for spatial quantum entanglement

    SciTech Connect

    Oemrawsingh, S. S. R.; Jong, J. A. de; Ma, X.; Aiello, A.; Eliel, E. R.; Hooft, G. W. 't; Woerdman, J. P.

    2006-03-15

    By analyzing entangled photon states in terms of high-dimensional spatial mode superpositions, it becomes feasible to expose high-dimensional entanglement, and even the nonlocality of twin photons. To this end, a proper analyzer should be designed that is capable of handling a large number of spatial modes, while still being convenient to use in an experiment. We compare two variants of a high-dimensional spatial mode analyzer on the basis of classical and quantum considerations. These analyzers have been tested in classical optical experiments.

  11. Statistical mechanics of complex neural systems and high dimensional data

    NASA Astrophysics Data System (ADS)

    Advani, Madhu; Lahiri, Subhaneil; Ganguli, Surya

    2013-03-01

    Recent experimental advances in neuroscience have opened new vistas into the immense complexity of neuronal networks. This proliferation of data challenges us on two parallel fronts. First, how can we form adequate theoretical frameworks for understanding how dynamical network processes cooperate across widely disparate spatiotemporal scales to solve important computational problems? Second, how can we extract meaningful models of neuronal systems from high dimensional datasets? To aid in these challenges, we give a pedagogical review of a collection of ideas and theoretical methods arising at the intersection of statistical physics, computer science and neurobiology. We introduce the interrelated replica and cavity methods, which originated in statistical physics as powerful ways to quantitatively analyze large highly heterogeneous systems of many interacting degrees of freedom. We also introduce the closely related notion of message passing in graphical models, which originated in computer science as a distributed algorithm capable of solving large inference and optimization problems involving many coupled variables. We then show how both the statistical physics and computer science perspectives can be applied in a wide diversity of contexts to problems arising in theoretical neuroscience and data analysis. Along the way we discuss spin glasses, learning theory, illusions of structure in noise, random matrices, dimensionality reduction and compressed sensing, all within the unified formalism of the replica method. Moreover, we review recent conceptual connections between message passing in graphical models, and neural computation and learning. Overall, these ideas illustrate how statistical physics and computer science might provide a lens through which we can uncover emergent computational functions buried deep within the dynamical complexities of neuronal networks.

  12. Alternative single-reference coupled cluster approaches for multireference problems: The simpler, the better

    NASA Astrophysics Data System (ADS)

    Evangelista, Francesco A.

    2011-06-01

    We report a general implementation of alternative formulations of single-reference coupled cluster theory (extended, unitary, and variational) with arbitrary-order truncation of the cluster operator. These methods are applied to compute the energy of Ne and the equilibrium properties of HF and C2. Potential energy curves for the dissociation of HF and the BeH2 model computed with the extended, variational, and unitary coupled cluster approaches are compared to those obtained from the multireference coupled cluster approach of Mukherjee et al. [J. Chem. Phys. 110, 6171 (1999)] and the internally contracted multireference coupled cluster approach [F. A. Evangelista and J. Gauss, J. Chem. Phys. 134, 114102 (2011), 10.1063/1.3559149]. In the case of Ne, HF, and C2, the alternative coupled cluster approaches yield almost identical bond length, harmonic vibrational frequency, and anharmonic constant, which are more accurate than those from traditional coupled cluster theory. For potential energy curves, the alternative coupled cluster methods are found to be more accurate than traditional coupled cluster theory, but are three to ten times less accurate than multireference coupled cluster approaches. The most challenging benchmark, the BeH2 model, highlights the strong dependence of the alternative coupled cluster theories on the choice of the Fermi vacuum. When evaluated by the accuracy to cost ratio, the alternative coupled cluster methods are not competitive with respect to traditional CC theory, in other words, the simplest theory is found to be the most effective one.

  13. Optimization of High-Dimensional Functions through Hypercube Evaluation

    PubMed Central

    Abiyev, Rahib H.; Tunay, Mustafa

    2015-01-01

    A novel learning algorithm for solving global numerical optimization problems is proposed. The proposed learning algorithm is an intensive stochastic search method based on evaluation and optimization of a hypercube and is called the hypercube optimization (HO) algorithm. The HO algorithm comprises an initialization and evaluation process, a displacement-shrink process, and a searching space process. The initialization and evaluation process generates an initial solution and evaluates the solutions in the given hypercube. The displacement-shrink process determines displacement and evaluates objective functions using new points, and the searching space process determines the next hypercube using certain rules and evaluates the new solutions. The algorithms for these processes have been designed and presented in the paper. The designed HO algorithm is tested on specific benchmark functions. Simulations of the HO algorithm have been performed for optimization of functions of 1000, 5000, or even 10,000 dimensions. The comparative simulation results with other approaches demonstrate that the proposed algorithm is a potential candidate for optimization of both low and high dimensional functions. PMID:26339237
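
    A deliberately simplified caricature of the evaluate / displace / shrink loop is sketched below: sample points inside the current hypercube, re-center on the best point found, and shrink the box. The published HO algorithm has additional rules and reportedly scales to thousands of dimensions; this naive version does not, and is demonstrated on a 20-dimensional sphere function only to show the loop structure.

        import numpy as np

        def hypercube_search(f, center, radius, n_points=300, n_iter=300, shrink=0.98, rng=0):
            """Sample inside the current hypercube, move the center to the best point, shrink the box."""
            rng = np.random.default_rng(rng)
            center = np.asarray(center, dtype=float)
            best_x, best_f = center.copy(), f(center)
            for _ in range(n_iter):
                pts = center + rng.uniform(-radius, radius, (n_points, center.size))
                vals = np.array([f(p) for p in pts])
                i = int(np.argmin(vals))
                if vals[i] < best_f:
                    best_x, best_f = pts[i].copy(), vals[i]
                    center = pts[i]                 # displacement toward the better region
                radius *= shrink                    # shrink the searching space
            return best_x, best_f

        sphere = lambda x: float(np.dot(x, x))
        x0 = np.full(20, 3.0)                       # f(x0) = 180
        x_best, f_best = hypercube_search(sphere, center=x0, radius=2.0)
        print(f_best)                               # well below the starting value of 180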

  14. HYPOTHESIS TESTING FOR HIGH-DIMENSIONAL SPARSE BINARY REGRESSION

    PubMed Central

    Mukherjee, Rajarshi; Pillai, Natesh S.; Lin, Xihong

    2015-01-01

    In this paper, we study the detection boundary for minimax hypothesis testing in the context of high-dimensional, sparse binary regression models. Motivated by genetic sequencing association studies for rare variant effects, we investigate the complexity of the hypothesis testing problem when the design matrix is sparse. We observe a new phenomenon in the behavior of detection boundary which does not occur in the case of Gaussian linear regression. We derive the detection boundary as a function of two components: a design matrix sparsity index and signal strength, each of which is a function of the sparsity of the alternative. For any alternative, if the design matrix sparsity index is too high, any test is asymptotically powerless irrespective of the magnitude of signal strength. For binary design matrices with the sparsity index that is not too high, our results are parallel to those in the Gaussian case. In this context, we derive detection boundaries for both dense and sparse regimes. For the dense regime, we show that the generalized likelihood ratio is rate optimal; for the sparse regime, we propose an extended Higher Criticism Test and show it is rate optimal and sharp. We illustrate the finite sample properties of the theoretical results using simulation studies. PMID:26246645

  15. High-dimensional genomic data bias correction and data integration using MANCIE

    PubMed Central

    Zang, Chongzhi; Wang, Tao; Deng, Ke; Li, Bo; Hu, Sheng'en; Qin, Qian; Xiao, Tengfei; Zhang, Shihua; Meyer, Clifford A.; He, Housheng Hansen; Brown, Myles; Liu, Jun S.; Xie, Yang; Liu, X. Shirley

    2016-01-01

    High-dimensional genomic data analysis is challenging due to noise and biases in high-throughput experiments. We present a computational method matrix analysis and normalization by concordant information enhancement (MANCIE) for bias correction and data integration of distinct genomic profiles on the same samples. MANCIE uses a Bayesian-supported principal component analysis-based approach to adjust the data so as to achieve better consistency between sample-wise distances in the different profiles. MANCIE can improve tissue-specific clustering in ENCODE data, prognostic prediction in Molecular Taxonomy of Breast Cancer International Consortium and The Cancer Genome Atlas data, copy number and expression agreement in Cancer Cell Line Encyclopedia data, and has broad applications in cross-platform, high-dimensional data integration. PMID:27072482

  16. High-dimensional genomic data bias correction and data integration using MANCIE.

    PubMed

    Zang, Chongzhi; Wang, Tao; Deng, Ke; Li, Bo; Hu, Sheng'en; Qin, Qian; Xiao, Tengfei; Zhang, Shihua; Meyer, Clifford A; He, Housheng Hansen; Brown, Myles; Liu, Jun S; Xie, Yang; Liu, X Shirley

    2016-01-01

    High-dimensional genomic data analysis is challenging due to noise and biases in high-throughput experiments. We present a computational method matrix analysis and normalization by concordant information enhancement (MANCIE) for bias correction and data integration of distinct genomic profiles on the same samples. MANCIE uses a Bayesian-supported principal component analysis-based approach to adjust the data so as to achieve better consistency between sample-wise distances in the different profiles. MANCIE can improve tissue-specific clustering in ENCODE data, prognostic prediction in Molecular Taxonomy of Breast Cancer International Consortium and The Cancer Genome Atlas data, copy number and expression agreement in Cancer Cell Line Encyclopedia data, and has broad applications in cross-platform, high-dimensional data integration. PMID:27072482

  17. Visual Exploration of High-Dimensional Data through Subspace Analysis and Dynamic Projections

    SciTech Connect

    Liu, S.; Wang, B.; Thiagarajan, Jayaraman J.; Bremer, Peer -Timo; Pascucci, Valerio

    2015-06-01

    We introduce a novel interactive framework for visualizing and exploring high-dimensional datasets based on subspace analysis and dynamic projections. We assume the high-dimensional dataset can be represented by a mixture of low-dimensional linear subspaces with mixed dimensions, and provide a method to reliably estimate the intrinsic dimension and linear basis of each subspace extracted from the subspace clustering. Subsequently, we use these bases to define unique 2D linear projections as viewpoints from which to visualize the data. To understand the relationships among the different projections and to discover hidden patterns, we connect these projections through dynamic projections that create smooth animated transitions between pairs of projections. We introduce the view transition graph, which provides flexible navigation among these projections to facilitate an intuitive exploration. Finally, we provide detailed comparisons with related systems, and use real-world examples to demonstrate the novelty and usability of our proposed framework.

  18. Weighted Distance Functions Improve Analysis of High-Dimensional Data: Application to Molecular Dynamics Simulations.

    PubMed

    Blöchliger, Nicolas; Caflisch, Amedeo; Vitalis, Andreas

    2015-11-10

    Data mining techniques depend strongly on how the data are represented and how distance between samples is measured. High-dimensional data often contain a large number of irrelevant dimensions (features) for a given query. These features act as noise and obfuscate relevant information. Unsupervised approaches to mine such data require distance measures that can account for feature relevance. Molecular dynamics simulations produce high-dimensional data sets describing molecules observed in time. Here, we propose to globally or locally weight simulation features based on effective rates. This emphasizes, in a data-driven manner, slow degrees of freedom that often report on the metastable states sampled by the molecular system. We couple this idea to several unsupervised learning protocols. Our approach unmasks slow side chain dynamics within the native state of a miniprotein and reveals additional metastable conformations of a protein. The approach can be combined with most algorithms for clustering or dimensionality reduction. PMID:26574336
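
    A crude proxy for the rate-based weighting described above is to weight each feature by its lag-1 autocorrelation along the trajectory, so that slow degrees of freedom dominate the distance. The sketch below is only that proxy, with invented data, not the paper's effective-rate estimator.

        import numpy as np

        def slowness_weights(traj):
            """Per-feature weights from a time series (n_frames, n_features): slowly varying
            features (high lag-1 autocorrelation) receive larger weights."""
            x = traj - traj.mean(axis=0)
            num = (x[1:] * x[:-1]).sum(axis=0)
            den = (x ** 2).sum(axis=0) + 1e-12
            autocorr = np.clip(num / den, 0.0, 1.0)
            return autocorr / autocorr.sum()

        def weighted_distance(a, b, w):
            """Weighted Euclidean distance sqrt(sum_i w_i (a_i - b_i)^2)."""
            return np.sqrt(np.sum(w * (a - b) ** 2))

        # Toy trajectory: feature 0 is a slow drift, feature 1 is fast noise.
        rng = np.random.default_rng(0)
        t = np.arange(2000)
        traj = np.column_stack([np.sin(t / 300.0) + 0.05 * rng.standard_normal(2000),
                                rng.standard_normal(2000)])
        w = slowness_weights(traj)
        print(w)                                    # the weight concentrates on the slow feature
        print(weighted_distance(traj[0], traj[1000], w))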

  19. Bit-table based biclustering and frequent closed itemset mining in high-dimensional binary data.

    PubMed

    Király, András; Gyenesei, Attila; Abonyi, János

    2014-01-01

    During the last decade various algorithms have been developed and proposed for discovering overlapping clusters in high-dimensional data. The two most prominent application fields in this research, proposed independently, are frequent itemset mining (developed for market basket data) and biclustering (applied to gene expression data analysis). The common limitation of both methodologies is their limited applicability to very large binary data sets. In this paper we propose a novel and efficient method to find both frequent closed itemsets and biclusters in high-dimensional binary data. The method is based on simple but very powerful matrix and vector multiplication approaches that ensure that all patterns can be discovered in a fast manner. The proposed algorithm has been implemented in the commonly used MATLAB environment and is freely available to researchers. PMID:24616651

  20. ClusterSculptor: Software for Expert-Steered Classification of Single Particle Mass Spectra

    SciTech Connect

    Zelenyuk, Alla; Imre, Dan G.; Nam, Eun Ju; Han, Yiping; Mueller, Klaus

    2008-08-01

    To take full advantage of the vast amount of highly detailed data acquired by single particle mass spectrometers requires that the data be organized according to some rules that have the potential to be insightful. Most commonly, statistical tools are used to cluster the individual particle mass spectra on the basis of their similarity. Cluster analysis is a powerful strategy for the exploration of high-dimensional data in the absence of a-priori hypotheses or data classification models, and the results of cluster analysis can then be used to form such models. More often than not, when examining the data clustering results we find that many clusters contain particles of different types and that many particles of one type end up in a number of separate clusters. Our experience with cluster analysis shows that we have a vast amount of non-compiled knowledge and intuition that should be brought to bear in this effort. We present new software, called ClusterSculptor, that provides a comprehensive and intuitive framework to aid scientists in data classification. ClusterSculptor uses k-means as the overall clustering engine, but allows its parameters to be tuned interactively, based on a non-distorted compact visual presentation of the inherent characteristics of the data in high-dimensional space. ClusterSculptor provides all the tools necessary for a high-dimensional activity we call cluster sculpting. ClusterSculptor is designed to be coupled to SpectraMiner, our data mining and visualization software package. The data are first visualized with SpectraMiner, and identified problems are exported to ClusterSculptor, where the user steers the reclassification and recombination of clusters of tens of thousands of particle mass spectra in real time. The resulting sculpted clusters can then be imported back into SpectraMiner. Here we demonstrate greatly improved single-particle chemical speciation in an example application of this new tool to a number of atmospheric particle types.

  1. Convex Clustering: An Attractive Alternative to Hierarchical Clustering

    PubMed Central

    Chen, Gary K.; Chi, Eric C.; Ranola, John Michael O.; Lange, Kenneth

    2015-01-01

    The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering operates on multiple scales simultaneously. This is essential, for instance, in transcriptome data, where one may be interested in making qualitative inferences about how lower-order relationships like gene modules lead to higher-order relationships like pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. The current paper derives and tests a novel proximal distance algorithm for minimizing the objective function of convex clustering. The algorithm separates parameters, accommodates missing data, and supports prior information on relationships. Our program CONVEXCLUSTER incorporating the algorithm is implemented on ATI and nVidia graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems. CONVEXCLUSTER can be freely downloaded from the UCLA Human Genetics web site at http://www.genetics.ucla.edu/software/ PMID:25965340
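
    For orientation, convex clustering minimizes 0.5*sum_i ||x_i - u_i||^2 + gamma * sum_{i<j} w_ij ||u_i - u_j|| over per-point centroids u_i. The sketch below only evaluates this objective for two candidate solutions to show how the fusion penalty drives the solution path from singleton clusters toward a single cluster; it is not the proximal distance algorithm or the CONVEXCLUSTER software.

        import numpy as np

        def convex_clustering_objective(X, U, gamma, W=None):
            """0.5 * sum_i ||x_i - u_i||^2 + gamma * sum_{i<j} w_ij ||u_i - u_j||_2."""
            n = len(X)
            if W is None:
                W = np.ones((n, n))
            fit = 0.5 * np.sum((X - U) ** 2)
            fuse = sum(W[i, j] * np.linalg.norm(U[i] - U[j])
                       for i in range(n) for j in range(i + 1, n))
            return fit + gamma * fuse

        # Two candidate solutions: no fusion (U = X) versus full fusion (all centroids at the grand mean).
        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(0, 0.2, (10, 5)), rng.normal(2, 0.2, (10, 5))])
        U_none, U_full = X.copy(), np.tile(X.mean(axis=0), (len(X), 1))
        for gamma in (0.01, 1.0):
            print(gamma,
                  convex_clustering_objective(X, U_none, gamma),
                  convex_clustering_objective(X, U_full, gamma))
        # Small gamma favors U = X (singleton clusters); large gamma favors fully fused centroids.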

  2. An Overview of Air Pollution Problem in Megacities and City Clusters in China

    NASA Astrophysics Data System (ADS)

    Tang, X.

    2007-05-01

    China has experienced rapid economic growth in the last twenty years. City clusters, which consist of one or several mega cities in close vicinity and many satellite cities and towns, are playing a leading role in Chinese economic growth, owing to their collective economic capacity and interdependency. However, along with the economic boom, population growth, and increased energy consumption, air quality has been degrading over the past two decades. Air pollution in those areas is characterized by the concurrent occurrence of high concentrations of multiple primary pollutants, leading to a complex secondary pollution problem. After decades-long efforts to control air pollution, both the government and the scientific community have realized that controlling regional scale air pollution requires regional efforts. Field experiments covering regions like the Pearl River Delta and Beijing with its surrounding areas are critical to understand the chemical and physical processes leading to the formation of regional scale air pollution. In order to formulate policy suggestions for air quality attainment during the 2008 Beijing Olympic Games and to propose objectives for air quality attainment in 2010 in Beijing, CAREBEIJING (Campaigns of Air Quality Research in Beijing and Surrounding Region) was organized by Peking University in 2006 to characterize the current air pollution situation of the region, and to identify the transport and transformation processes through which the surrounding area affects air quality in Beijing. With the same purpose of understanding the chemical and physical processes at the regional scale, fall and summer campaigns were carried out in the Pearl River Delta in 2004 and 2006. More than 16 domestic and foreign institutions were involved in these campaigns. The background, current status, problems, and some results of these campaigns will be introduced in this presentation.

  3. A Structure-Based Distance Metric for High-Dimensional Space Exploration with Multi-Dimensional Scaling

    SciTech Connect

    Lee, Hyun Jung; McDonnell, Kevin T.; Zelenyuk, Alla; Imre, D.; Mueller, Klaus

    2014-03-01

    Although the Euclidean distance does well in measuring data distances within high-dimensional clusters, it does poorly when it comes to gauging inter-cluster distances. This significantly impacts the quality of global, low-dimensional space embedding procedures such as the popular multi-dimensional scaling (MDS) where one can often observe non-intuitive layouts. We were inspired by the perceptual processes evoked in the method of parallel coordinates which enables users to visually aggregate the data by the patterns the polylines exhibit across the dimension axes. We call the path of such a polyline its structure and suggest a metric that captures this structure directly in high-dimensional space. This allows us to better gauge the distances of spatially distant data constellations and so achieve data aggregations in MDS plots that are more cognizant of existing high-dimensional structure similarities. Our MDS plots also exhibit similar visual relationships as the method of parallel coordinates which is often used alongside to visualize the high-dimensional data in raw form. We then cast our metric into a bi-scale framework which distinguishes far-distances from near-distances. The coarser scale uses the structural similarity metric to separate data aggregates obtained by prior classification or clustering, while the finer scale employs the appropriate Euclidean distance.

  4. Choosing ℓp norms in high-dimensional spaces based on hub analysis

    PubMed Central

    Flexer, Arthur; Schnitzer, Dominik

    2015-01-01

    The hubness phenomenon is a recently discovered aspect of the curse of dimensionality. Hub objects have a small distance to an exceptionally large number of data points while anti-hubs lie far from all other data points. A closely related problem is the concentration of distances in high-dimensional spaces. Previous work has already advocated the use of fractional ℓp norms instead of the ubiquitous Euclidean norm to avoid the negative effects of distance concentration. However, which exact fractional norm to use is a largely unsolved problem. The contribution of this work is an empirical analysis of the relation of different ℓp norms and hubness. We propose an unsupervised approach for choosing an ℓp norm which minimizes hubs while simultaneously maximizing nearest neighbor classification. Our approach is evaluated on seven high-dimensional data sets and compared to three approaches that re-scale distances to avoid hubness. PMID:26640321
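
    The machinery behind this analysis can be sketched by computing pairwise fractional l_p distances and then the skewness of the k-occurrence distribution, a common hubness measure; the paper's unsupervised norm-selection criterion is not reproduced, and the data here are synthetic.

        import numpy as np
        from scipy.stats import skew

        def lp_distances(X, p):
            """All pairwise Minkowski/fractional l_p 'distances' (p < 1 gives a fractional quasi-norm)."""
            diff = np.abs(X[:, None, :] - X[None, :, :])
            return (diff ** p).sum(axis=-1) ** (1.0 / p)

        def hubness(D, k=10):
            """Skewness of the k-occurrence distribution: how often each point appears among
            the k nearest neighbors of the others. Larger skew means stronger hubness."""
            n = D.shape[0]
            D = D + np.diag(np.full(n, np.inf))          # ignore self-distances
            knn = np.argsort(D, axis=1)[:, :k]
            counts = np.bincount(knn.ravel(), minlength=n)
            return skew(counts)

        rng = np.random.default_rng(0)
        X = rng.standard_normal((300, 100))              # i.i.d. high-dimensional data
        for p in (0.5, 1.0, 2.0):
            # Compare how the k-occurrence skew varies with the choice of norm.
            print(p, round(hubness(lp_distances(X, p)), 2))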

  5. A quasi-Newton acceleration for high-dimensional optimization algorithms

    PubMed Central

    Alexander, David; Lange, Kenneth

    2010-01-01

    In many statistical problems, maximum likelihood estimation by an EM or MM algorithm suffers from excruciatingly slow convergence. This tendency limits the application of these algorithms to modern high-dimensional problems in data mining, genomics, and imaging. Unfortunately, most existing acceleration techniques are ill-suited to complicated models involving large numbers of parameters. The squared iterative methods (SQUAREM) recently proposed by Varadhan and Roland constitute one notable exception. This paper presents a new quasi-Newton acceleration scheme that requires only modest increments in computation per iteration and overall storage and rivals or surpasses the performance of SQUAREM on several representative test problems. PMID:21359052

  6. Autonomous mental development in high dimensional context and action spaces.

    PubMed

    Joshi, Ameet; Weng, Juyang

    2003-01-01

    Autonomous Mental Development (AMD) of robots opened a new paradigm for developing machine intelligence using neural-network-type techniques, and it fundamentally changed the way an intelligent machine is developed, from manual to autonomous. The work presented here is part of the SAIL (Self-Organizing Autonomous Incremental Learner) project, which deals with the autonomous development of a humanoid robot with vision, audition, manipulation and locomotion. The major issue addressed here is the challenge of a high dimensional action space (5-10 dimensions) in addition to the high dimensional context space (hundreds to thousands and beyond) typically required by an AMD machine. This is the first work that studies a high dimensional (numeric) action space in conjunction with a high dimensional perception (context state) space, under the AMD mode. Two new learning algorithms, Direct Update on Direction Cosines (DUDC) and High-Dimensional Conjugate Gradient Search (HCGS), are developed, implemented and tested. The convergence properties of both algorithms and their targeted applications are discussed. Autonomous learning of speech production under reinforcement learning is studied as an example. PMID:12850025

  7. Harnessing high-dimensional hyperentanglement through a biphoton frequency comb

    NASA Astrophysics Data System (ADS)

    Xie, Zhenda; Zhong, Tian; Shrestha, Sajan; Xu, Xinan; Liang, Junlin; Gong, Yan-Xiao; Bienfang, Joshua C.; Restelli, Alessandro; Shapiro, Jeffrey H.; Wong, Franco N. C.; Wei Wong, Chee

    2015-08-01

    Quantum entanglement is a fundamental resource for secure information processing and communications, and hyperentanglement or high-dimensional entanglement has been separately proposed for its high data capacity and error resilience. The continuous-variable nature of the energy-time entanglement makes it an ideal candidate for efficient high-dimensional coding with minimal limitations. Here, we demonstrate the first simultaneous high-dimensional hyperentanglement using a biphoton frequency comb to harness the full potential in both the energy and time domain. Long-postulated Hong-Ou-Mandel quantum revival is exhibited, with up to 19 time-bins and 96.5% visibilities. We further witness the high-dimensional energy-time entanglement through Franson revivals, observed periodically at integer time-bins, with 97.8% visibility. This qudit state is observed to simultaneously violate the generalized Bell inequality by up to 10.95 standard deviations while observing recurrent Clauser-Horne-Shimony-Holt S-parameters up to 2.76. Our biphoton frequency comb provides a platform for photon-efficient quantum communications towards the ultimate channel capacity through energy-time-polarization high-dimensional encoding.

  8. Optimally splitting cases for training and testing high dimensional classifiers

    PubMed Central

    2011-01-01

    Background We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set. How does this proportion impact the mean squared error (MSE) of the prediction accuracy estimate? Results We develop a non-parametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study for the purpose of better understanding the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive component parts. Conclusions By applying these approaches to a number of synthetic and real microarray datasets we show that for linear classifiers the optimal proportion depends on the overall number of samples available and the degree of differential expression between the classes. The optimal proportion was found to depend on the full dataset size (n) and classification accuracy, with higher accuracy and smaller n resulting in more cases assigned to the training set. The commonly used strategy of allocating two-thirds of cases for training was close to optimal for reasonably sized datasets (n ≥ 100) with strong signals (i.e., 85% or greater full dataset accuracy). In general, we recommend use of our nonparametric resampling approach for determining the optimal split. This approach can be applied to any dataset, using any predictor development method, to determine the best split. PMID:21477282

  9. An Effective Parameter Screening Strategy for High Dimensional Watershed Models

    NASA Astrophysics Data System (ADS)

    Khare, Y. P.; Martinez, C. J.; Munoz-Carpena, R.

    2014-12-01

    Watershed simulation models can assess the impacts of natural and anthropogenic disturbances on natural systems. These models have become important tools for tackling a range of water resources problems through their implementation in the formulation and evaluation of Best Management Practices, Total Maximum Daily Loads, and Basin Management Action Plans. For accurate applications of watershed models they need to be thoroughly evaluated through global uncertainty and sensitivity analyses (UA/SA). However, due to the high dimensionality of these models such evaluation becomes extremely time- and resource-consuming. Parameter screening, the qualitative separation of important parameters, has been suggested as an essential step before applying rigorous evaluation techniques such as the Sobol' and Fourier Amplitude Sensitivity Test (FAST) methods in the UA/SA framework. The method of elementary effects (EE) (Morris, 1991) is one of the most widely used screening methodologies. Some of the common parameter sampling strategies for EE, e.g. Optimized Trajectories [OT] (Campolongo et al., 2007) and Modified Optimized Trajectories [MOT] (Ruano et al., 2012), suffer from inconsistencies in the generated parameter distributions, infeasible sample generation time, etc. In this work, we have formulated a new parameter sampling strategy - Sampling for Uniformity (SU) - for parameter screening which is based on the principles of the uniformity of the generated parameter distributions and the spread of the parameter sample. A rigorous multi-criteria evaluation (time, distribution, spread and screening efficiency) of OT, MOT, and SU indicated that SU is superior to other sampling strategies. Comparison of the EE-based parameter importance rankings with those of Sobol' helped to quantify the qualitativeness of the EE parameter screening approach, reinforcing the fact that one should use EE only to reduce the resource burden required by FAST/Sobol' analyses but not to replace it.
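
    The elementary-effects screening that this work builds on can be sketched as follows: build one-at-a-time trajectories in the unit hypercube and collect EE_i = (f(x + delta*e_i) - f(x)) / delta for each parameter, then rank parameters by the mean absolute effect mu*. The trajectories below are plain random OAT paths, not the OT, MOT, or SU sampling strategies compared in the abstract, and the toy "watershed model" is invented.

        import numpy as np

        def morris_elementary_effects(f, n_params, n_traj=20, n_levels=4, rng=0):
            """Method of Morris (simplified): random OAT trajectories on a 4-level grid in [0, 1]^d.
            Returns mu* (mean |EE|, used for screening) and sigma (nonlinearity/interactions)."""
            rng = np.random.default_rng(rng)
            delta = n_levels / (2.0 * (n_levels - 1))
            effects = [[] for _ in range(n_params)]
            grid = np.arange(n_levels // 2) / (n_levels - 1)          # admissible starting levels
            for _ in range(n_traj):
                x = rng.choice(grid, size=n_params)
                order = rng.permutation(n_params)
                fx = f(x)
                for i in order:
                    x_new = x.copy()
                    x_new[i] = x[i] + delta                            # stays inside [0, 1]
                    f_new = f(x_new)
                    effects[i].append((f_new - fx) / delta)
                    x, fx = x_new, f_new
            ee = [np.asarray(e) for e in effects]
            mu_star = np.array([np.mean(np.abs(e)) for e in ee])
            sigma = np.array([np.std(e) for e in ee])
            return mu_star, sigma

        # Toy "watershed model": two influential parameters, one interaction, several inert ones.
        def model(x):
            return 4 * x[0] + 2 * x[1] ** 2 + 3 * x[0] * x[2] + 0.0 * x[3:].sum()

        mu_star, sigma = morris_elementary_effects(model, n_params=8)
        print(np.round(mu_star, 2))     # clearly nonzero for parameters 0-2, zero for the inert ones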

  10. Understanding 3D human torso shape via manifold clustering

    NASA Astrophysics Data System (ADS)

    Li, Sheng; Li, Peng; Fu, Yun

    2013-05-01

    Discovering the variations in human torso shape plays a key role in many design-oriented applications, such as suit designing. With recent advances in 3D surface imaging technologies, people can obtain 3D human torso data that provide more information than traditional measurements. However, how to find different human shapes from 3D torso data is still an open problem. In this paper, we propose to use a spectral clustering approach on the torso manifold to address this problem. We first represent high-dimensional torso data in a low-dimensional space using a manifold learning algorithm. Then the spectral clustering method is performed to obtain several disjoint clusters. Experimental results show that the clusters discovered by our approach can describe the discrepancies in both genders and human shapes, and our approach achieves better performance than the compared clustering method.
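
    A hedged sketch of the two-stage pipeline described above, using off-the-shelf components (Isomap for the manifold embedding and scikit-learn's spectral clustering); the particular learner, neighborhood size, and cluster count are illustrative choices, not those of the paper.

```python
import numpy as np
from sklearn.manifold import Isomap
from sklearn.cluster import SpectralClustering

def cluster_torso_shapes(X, n_components=3, n_clusters=4):
    """X: (n_samples, n_features) high-dimensional torso descriptors."""
    # Stage 1: embed the data on a low-dimensional manifold.
    Z = Isomap(n_neighbors=10, n_components=n_components).fit_transform(X)
    # Stage 2: spectral clustering in the embedded space.
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="nearest_neighbors",
                                random_state=0).fit_predict(Z)
    return Z, labels
```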

  11. A general abundance problem for all self-enrichment scenarios for the origin of multiple populations in globular clusters

    NASA Astrophysics Data System (ADS)

    Bastian, Nate; Cabrera-Ziri, Ivan; Salaris, Maurizio

    2015-05-01

    A number of stellar sources have been advocated as the origin of the enriched material required to explain the abundance anomalies seen in ancient globular clusters (GCs). Most studies to date have compared the yields from potential sources [asymptotic giant branch stars (AGBs), fast rotating massive stars (FRMS), high-mass interacting binaries (IBs), and very massive stars (VMS)] with observations of specific elements that are observed to vary from star-to-star in GCs, focusing on extreme GCs such as NGC 2808, which display large He variations. However, a consistency check between the results of fitting extreme cases and the requirements of more typical clusters has rarely been done. Such a check is particularly timely given the constraints on He abundances in GCs now available. Here, we show that all of the popular enrichment sources fail to reproduce the observed trends in GCs, focusing primarily on Na, O and He. In particular, we show that any model that can fit clusters like NGC 2808 will necessarily fail (by construction) to fit more typical clusters like 47 Tuc or NGC 288. All sources severely overproduce He for most clusters. Additionally, given the large differences in He spreads between clusters, but similar spreads observed in Na-O, only sources with large degrees of stochasticity in the resulting yields will be able to fit the observations. We conclude that no enrichment source put forward so far (AGBs, FRMS, IBs, VMS - or combinations thereof) is consistent with the observations of GCs. Finally, the observed trends of increasing [N/Fe] and He spread with increasing cluster mass cannot be resolved within a self-enrichment framework without further exacerbating the mass-budget problem.

  12. Hypergraph-based anomaly detection of high-dimensional co-occurrences.

    PubMed

    Silva, Jorge; Willett, Rebecca

    2009-03-01

    This paper addresses the problem of detecting anomalous multivariate co-occurrences using a limited number of unlabeled training observations. A novel method based on a hypergraph representation of the data is proposed to deal with this very high-dimensional problem. Hypergraphs constitute an important extension of graphs which allow edges to connect more than two vertices simultaneously. A variational Expectation-Maximization algorithm for detecting anomalies directly on the hypergraph domain without any feature selection or dimensionality reduction is presented. The resulting estimate can be used to calculate a measure of anomalousness based on the False Discovery Rate. The algorithm has O(np) computational complexity, where n is the number of training observations and p is the number of potential participants in each co-occurrence event. This efficiency makes the method ideally suited for very high-dimensional settings, and the method requires no tuning, bandwidth, or regularization parameters. The proposed approach is validated on both high-dimensional synthetic data and the Enron email database, where p > 75,000, and it is shown that it can outperform other state-of-the-art methods. PMID:19147882

  13. Querying Patterns in High-Dimensional Heterogenous Datasets

    ERIC Educational Resources Information Center

    Singh, Vishwakarma

    2012-01-01

    The recent technological advancements have led to the availability of a plethora of heterogenous datasets, e.g., images tagged with geo-location and descriptive keywords. An object in these datasets is described by a set of high-dimensional feature vectors. For example, a keyword-tagged image is represented by a color-histogram and a…

  14. High-Dimensional Statistical Learning: Roots, Justifications, and Potential Machineries

    PubMed Central

    Zollanvari, Amin

    2015-01-01

    High-dimensional data generally refer to data in which the number of variables is larger than the sample size. Analyzing such datasets poses great challenges for classical statistical learning because the finite-sample performance of methods developed within classical statistical learning does not live up to classical asymptotic premises in which the sample size unboundedly grows for a fixed dimensionality of observations. Much work has been done in developing mathematical–statistical techniques for analyzing high-dimensional data. Despite remarkable progress in this field, many practitioners still utilize classical methods for analyzing such datasets. This state of affairs can be attributed, in part, to a lack of knowledge and, in part, to the ready-to-use computational and statistical software packages that are well developed for classical techniques. Moreover, many scientists working in a specific field of high-dimensional statistical learning are either not aware of other existing machineries in the field or are not willing to try them out. The primary goal in this work is to bring together various machineries of high-dimensional analysis, give an overview of the important results, and present the operating conditions upon which they are grounded. When appropriate, readers are referred to relevant review articles for more information on a specific subject. PMID:27081307

  15. High-Dimensional Statistical Learning: Roots, Justifications, and Potential Machineries.

    PubMed

    Zollanvari, Amin

    2015-01-01

    High-dimensional data generally refer to data in which the number of variables is larger than the sample size. Analyzing such datasets poses great challenges for classical statistical learning because the finite-sample performance of methods developed within classical statistical learning does not live up to classical asymptotic premises in which the sample size unboundedly grows for a fixed dimensionality of observations. Much work has been done in developing mathematical-statistical techniques for analyzing high-dimensional data. Despite remarkable progress in this field, many practitioners still utilize classical methods for analyzing such datasets. This state of affairs can be attributed, in part, to a lack of knowledge and, in part, to the ready-to-use computational and statistical software packages that are well developed for classical techniques. Moreover, many scientists working in a specific field of high-dimensional statistical learning are either not aware of other existing machineries in the field or are not willing to try them out. The primary goal in this work is to bring together various machineries of high-dimensional analysis, give an overview of the important results, and present the operating conditions upon which they are grounded. When appropriate, readers are referred to relevant review articles for more information on a specific subject. PMID:27081307

  16. Partially supervised speaker clustering.

    PubMed

    Tang, Hao; Chu, Stephen Mingyu; Hasegawa-Johnson, Mark; Huang, Thomas S

    2012-05-01

    Content-based multimedia indexing, retrieval, and processing as well as multimedia databases demand the structuring of the media content (image, audio, video, text, etc.), one significant goal being to associate the identity of the content to the individual segments of the signals. In this paper, we specifically address the problem of speaker clustering, the task of assigning every speech utterance in an audio stream to its speaker. We offer a complete treatment to the idea of partially supervised speaker clustering, which refers to the use of our prior knowledge of speakers in general to assist the unsupervised speaker clustering process. By means of an independent training data set, we encode the prior knowledge at the various stages of the speaker clustering pipeline via 1) learning a speaker-discriminative acoustic feature transformation, 2) learning a universal speaker prior model, and 3) learning a discriminative speaker subspace, or equivalently, a speaker-discriminative distance metric. We study the directional scattering property of the Gaussian mixture model (GMM) mean supervector representation of utterances in the high-dimensional space, and advocate exploiting this property by using the cosine distance metric instead of the euclidean distance metric for speaker clustering in the GMM mean supervector space. We propose to perform discriminant analysis based on the cosine distance metric, which leads to a novel distance metric learning algorithm—linear spherical discriminant analysis (LSDA). We show that the proposed LSDA formulation can be systematically solved within the elegant graph embedding general dimensionality reduction framework. Our speaker clustering experiments on the GALE database clearly indicate that 1) our speaker clustering methods based on the GMM mean supervector representation and vector-based distance metrics outperform traditional speaker clustering methods based on the “bag of acoustic features” representation and statistical
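
    The point about distance metrics can be illustrated with a small sketch: clustering utterance supervectors with a cosine rather than Euclidean distance. The supervectors are assumed to be given as a matrix; the discriminative LSDA projection proposed in the paper is not reproduced.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_supervectors(S, n_speakers, metric="cosine"):
    """S: (n_utterances, d) GMM mean supervectors; metric: "cosine" or "euclidean"."""
    d = pdist(S, metric=metric)          # pairwise distances between utterances
    Z = linkage(d, method="average")     # agglomerative clustering
    return fcluster(Z, t=n_speakers, criterion="maxclust")
```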

  17. Regularization Methods for High-Dimensional Instrumental Variables Regression With an Application to Genetical Genomics

    PubMed Central

    Lin, Wei; Feng, Rui; Li, Hongzhe

    2014-01-01

    In genetical genomics studies, it is important to jointly analyze gene expression data and genetic variants in exploring their associations with complex traits, where the dimensionality of gene expressions and genetic variants can both be much larger than the sample size. Motivated by such modern applications, we consider the problem of variable selection and estimation in high-dimensional sparse instrumental variables models. To overcome the difficulty of high dimensionality and unknown optimal instruments, we propose a two-stage regularization framework for identifying and estimating important covariate effects while selecting and estimating optimal instruments. The methodology extends the classical two-stage least squares estimator to high dimensions by exploiting sparsity using sparsity-inducing penalty functions in both stages. The resulting procedure is efficiently implemented by coordinate descent optimization. For the representative L1 regularization and a class of concave regularization methods, we establish estimation, prediction, and model selection properties of the two-stage regularized estimators in the high-dimensional setting where the dimensionalities of covariates and instruments are both allowed to grow exponentially with the sample size. The practical performance of the proposed method is evaluated by simulation studies and its usefulness is illustrated by an analysis of mouse obesity data. Supplementary materials for this article are available online. PMID:26392642
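
    A rough sketch of a two-stage, L1-regularized instrumental-variables fit in the spirit of the framework above (not the authors' implementation): a Lasso of each covariate on the instruments in the first stage, followed by a Lasso of the outcome on the fitted covariates in the second stage.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def two_stage_sparse_iv(Z, X, y):
    """Z: instruments (n, q), X: endogenous covariates (n, p), y: outcome (n,)."""
    # Stage 1: sparse regression of each covariate on the instruments.
    X_hat = np.column_stack([
        LassoCV(cv=5).fit(Z, X[:, j]).predict(Z) for j in range(X.shape[1])
    ])
    # Stage 2: sparse regression of the outcome on the fitted covariates.
    stage2 = LassoCV(cv=5).fit(X_hat, y)
    return stage2.coef_
```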

  18. An adaptive high-dimensional stochastic model representation technique for the solution of stochastic partial differential equations

    SciTech Connect

    Ma Xiang; Zabaras, Nicholas

    2010-05-20

    A computational methodology is developed to address the solution of high-dimensional stochastic problems. It utilizes the high-dimensional model representation (HDMR) technique in the stochastic space to represent the model output as a finite hierarchical correlated function expansion in terms of the stochastic inputs starting from lower-order to higher-order component functions. HDMR is efficient at capturing the high-dimensional input-output relationship such that the behavior of many physical systems can be modeled to good accuracy by only the first few lower-order terms. An adaptive version of HDMR is also developed to automatically detect the important dimensions and construct higher-order terms using only the important dimensions. The newly developed adaptive sparse grid collocation (ASGC) method is incorporated into HDMR to solve the resulting sub-problems. By integrating HDMR and ASGC, it is computationally possible to construct a low-dimensional stochastic reduced-order model of the high-dimensional stochastic problem and easily perform various statistical analyses on the output. Several numerical examples involving elementary mathematical functions and fluid mechanics problems are considered to illustrate the proposed method. The cases examined show that the method provides accurate results for stochastic dimensionality as high as 500 even with large input variability. The efficiency of the proposed method is examined by comparing with Monte Carlo (MC) simulation.
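
    A bare-bones first-order cut-HDMR surrogate illustrates the hierarchical expansion being described; the adaptive dimension detection and the sparse-grid collocation solver are omitted, and the reference point and per-dimension grids are user-supplied assumptions.

```python
import numpy as np

def hdmr_first_order(f, x_ref, grids):
    """Build f(x) ~ f0 + sum_i f_i(x_i) from 1-D sweeps through the cut point x_ref.
    grids: list of 1-D arrays of sample points, one per input dimension."""
    f0 = f(np.asarray(x_ref, dtype=float))
    components = []
    for i, g in enumerate(grids):
        vals = []
        for xi in g:
            x = np.array(x_ref, dtype=float)
            x[i] = xi
            vals.append(f(x) - f0)           # first-order component on the cut line
        components.append((np.asarray(g, dtype=float), np.array(vals)))

    def surrogate(x):
        out = f0
        for i, (g, vals) in enumerate(components):
            out += np.interp(x[i], g, vals)  # interpolate each 1-D component
        return out

    return surrogate
```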

  19. High-dimensional statistical inference: From vector to matrix

    NASA Astrophysics Data System (ADS)

    Zhang, Anru

    Statistical inference for sparse signals or low-rank matrices in high-dimensional settings is of significant interest in a range of contemporary applications. It has attracted significant recent attention in many fields including statistics, applied mathematics and electrical engineering. In this thesis, we consider several problems, including sparse signal recovery (compressed sensing under restricted isometry) and low-rank matrix recovery (matrix recovery via rank-one projections and structured matrix completion). The first part of the thesis discusses compressed sensing and affine rank minimization in both noiseless and noisy cases and establishes sharp restricted isometry conditions for sparse signal and low-rank matrix recovery. The analysis relies on a key technical tool which represents points in a polytope by convex combinations of sparse vectors. The technique is elementary yet leads to sharp results. It is shown that, in compressed sensing, $\delta_k^A < 1/3$, $\delta_k^A + \theta_{k,k}^A < 1$, or $\delta_{tk}^A < \sqrt{(t-1)/t}$ for any given constant $t \ge 4/3$ guarantee the exact recovery of all k-sparse signals in the noiseless case through the constrained $\ell_1$ minimization, and similarly in affine rank minimization $\delta_r^M < 1/3$, $\delta_r^M + \theta_{r,r}^M < 1$, or $\delta_{tr}^M < \sqrt{(t-1)/t}$ ensure the exact reconstruction of all matrices with rank at most r in the noiseless case via the constrained nuclear norm minimization. Moreover, for any $\epsilon > 0$, $\delta_k^A < 1/3 + \epsilon$, $\delta_k^A + \theta_{k,k}^A < 1 + \epsilon$, or $\delta_{tk}^A < \sqrt{(t-1)/t} + \epsilon$ are not sufficient to guarantee the exact recovery of all k-sparse signals for large k. A similar result also holds for matrix recovery. In addition, the conditions $\delta_k^A < 1/3$, $\delta_k^A + \theta_{k,k}^A < 1$, $\delta_{tk}^A < \sqrt{(t-1)/t}$ and $\delta_r^M < 1/3$, $\delta_r^M + \theta_{r,r}^M < 1$, $\delta_{tr}^M < \sqrt{(t-1)/t}$ are also shown to be sufficient respectively for stable recovery of approximately sparse signals and low-rank matrices in the noisy case.
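
    For reference, the constrained ℓ1 and nuclear norm programs referred to above have the standard form below (with ε = 0 in the noiseless case); this is the generic formulation rather than anything specific to the thesis.

```latex
\begin{align*}
\hat{x} &= \arg\min_{x \in \mathbb{R}^p} \|x\|_1
  \quad \text{subject to} \quad \|Ax - y\|_2 \le \varepsilon, \\
\hat{M} &= \arg\min_{M \in \mathbb{R}^{m \times n}} \|M\|_*
  \quad \text{subject to} \quad \|\mathcal{A}(M) - b\|_2 \le \varepsilon .
\end{align*}
```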

  20. Ensemble of sparse classifiers for high-dimensional biological data.

    PubMed

    Kim, Sunghan; Scalzo, Fabien; Telesca, Donatello; Hu, Xiao

    2015-01-01

    Biological data are often high in dimension while the number of samples is small. In such cases, the performance of classification can be improved by reducing the dimension of data, which is referred to as feature selection. Recently, a novel feature selection method has been proposed utilising the sparsity of high-dimensional biological data where a small subset of features accounts for most variance of the dataset. In this study we propose a new classification method for high-dimensional biological data, which performs both feature selection and classification within a single framework. Our proposed method utilises a sparse linear solution technique and the bootstrap aggregating algorithm. We tested its performance on four public mass spectrometry cancer datasets along with two other conventional classification techniques such as Support Vector Machines and Adaptive Boosting. The results demonstrate that our proposed method performs more accurate classification across various cancer datasets than those conventional classification techniques. PMID:26510301
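
    A hedged sketch of the general recipe (bootstrap aggregating over sparse linear classifiers); the L1-penalized logistic regression used here stands in for the paper's sparse linear solver, and the ensemble size and penalty strength are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

class BaggedSparseClassifier:
    """Bagging over L1-penalized linear classifiers; assumes integer class labels."""

    def __init__(self, n_estimators=25, C=0.1, seed=0):
        self.n_estimators, self.C, self.seed = n_estimators, C, seed

    def fit(self, X, y):
        rng = np.random.RandomState(self.seed)
        self.models_ = []
        for _ in range(self.n_estimators):
            Xb, yb = resample(X, y, random_state=rng.randint(1 << 30))  # bootstrap sample
            m = LogisticRegression(penalty="l1", solver="liblinear", C=self.C)
            self.models_.append(m.fit(Xb, yb))  # each member selects features via L1
        return self

    def predict(self, X):
        votes = np.stack([m.predict(X) for m in self.models_])
        # majority vote over bootstrap members (labels assumed to be 0, 1, ...)
        return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```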

  1. Censored Rank Independence Screening for High-dimensional Survival Data

    PubMed Central

    Song, Rui; Lu, Wenbin; Ma, Shuangge; Jeng, X. Jessie

    2014-01-01

    Summary In modern statistical applications, the dimension of covariates can be much larger than the sample size. In the context of linear models, correlation screening (Fan and Lv, 2008) has been shown to reduce the dimension of such data effectively while achieving the sure screening property, i.e., all of the active variables can be retained with high probability. However, screening based on the Pearson correlation does not perform well when applied to contaminated covariates and/or censored outcomes. In this paper, we study censored rank independence screening of high-dimensional survival data. The proposed method is robust to predictors that contain outliers, works for a general class of survival models, and enjoys the sure screening property. Simulations and an analysis of real data demonstrate that the proposed method performs competitively on survival data sets of moderate size and high-dimensional predictors, even when these are contaminated. PMID:25663709

  2. Why neurons mix: high dimensionality for higher cognition.

    PubMed

    Fusi, Stefano; Miller, Earl K; Rigotti, Mattia

    2016-04-01

    Neurons often respond to diverse combinations of task-relevant variables. This form of mixed selectivity plays an important computational role which is related to the dimensionality of the neural representations: high-dimensional representations with mixed selectivity allow a simple linear readout to generate a huge number of different potential responses. In contrast, neural representations based on highly specialized neurons are low dimensional and they preclude a linear readout from generating several responses that depend on multiple task-relevant variables. Here we review the conceptual and theoretical framework that explains the importance of mixed selectivity and the experimental evidence that recorded neural representations are high-dimensional. We end by discussing the implications for the design of future experiments. PMID:26851755

  3. Some Unsolved Problems, Questions, and Applications of the Brightsen Nucleon Cluster Model

    NASA Astrophysics Data System (ADS)

    Smarandache, Florentin

    2010-10-01

    The Brightsen Model is opposite to the Standard Model, and it was built on John Wheeler's Resonating Group Structure Model and on Linus Pauling's Close-Packed Spheron Model. Among the Brightsen Model's predictions and applications we cite the fact that it derives the average number of prompt neutrons per fission event, it provides a theoretical way for understanding the low temperature / low energy reactions and for approaching the artificially induced fission, it predicts that forces within nucleon clusters are stronger than forces between such clusters within isotopes; it predicts the unmatter entities inside nuclei that result from stable and neutral union of matter and antimatter, and so on. But these predictions have to be tested in the future at the new CERN laboratory.

  4. Quantum Teleportation of High-dimensional Atomic Momenta State

    NASA Astrophysics Data System (ADS)

    Qurban, Misbah; Abbas, Tasawar; Rameez-ul-Islam; Ikram, Manzoor

    2016-06-01

    Atomic momenta states of the neutral atoms are known to be decoherence resistant and therefore present a viable solution for most of the quantum information tasks including the quantum teleportation. We present a systematic protocol for the teleportation of high-dimensional quantized momenta atomic states to the field state inside the cavities by applying standard cavity QED techniques. The proposal can be executed under prevailing experimental scenario.

  5. The primordial and evolutionary abundance variations in globular-cluster stars: a problem with two unknowns

    NASA Astrophysics Data System (ADS)

    Denissenkov, P. A.; VandenBerg, D. A.; Hartwick, F. D. A.; Herwig, F.; Weiss, A.; Paxton, B.

    2015-04-01

    We demonstrate that among the potential sources of the primordial abundance variations of the proton-capture elements in globular-cluster stars proposed so far, such as the hot-bottom burning in massive asymptotic giant branch stars and H burning in the convective cores of supermassive and fast-rotating massive main-sequence (MS) stars, only the supermassive MS stars with M > 10^4 M⊙ can explain all the observed abundance correlations without any fine-tuning of model parameters. We use our assumed chemical composition for the pristine gas in M13 (NGC 6205) and its mixtures with 50 and 90 per cent of the material partially processed in H burning in the 6 × 10^4 M⊙ MS model star as the initial compositions for the normal, intermediate, and extreme populations of low-mass stars in this globular cluster, as suggested by its O-Na anticorrelation. We evolve these stars from the zero-age MS to the red giant branch (RGB) tip with the thermohaline and parametric prescriptions for the RGB extra mixing. We find that the ^3He-driven thermohaline convection cannot explain the evolutionary decline of [C/Fe] in M13 RGB stars, which, on the other hand, is well reproduced with the universal values for the mixing depth and rate calibrated using the observed decrease of [C/Fe] with M_V in the globular cluster NGC 5466 that does not have the primordial abundance variations.

  6. TreeSOM: Cluster analysis in the self-organizing map.

    PubMed

    Samsonova, Elena V; Kok, Joost N; Ijzerman, Ad P

    2006-01-01

    Clustering problems arise in various domains of science and engineering. A large number of methods have been developed to date. The Kohonen self-organizing map (SOM) is a popular tool that maps a high-dimensional space onto a small number of dimensions by placing similar elements close together, forming clusters. Cluster analysis is often left to the user. In this paper we present the method TreeSOM and a set of tools to perform unsupervised SOM cluster analysis, determine cluster confidence and visualize the result as a tree facilitating comparison with existing hierarchical classifiers. We also introduce a distance measure for cluster trees that allows one to select a SOM with the most confident clusters. PMID:16781116
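
    A compact from-scratch SOM, included to make the mapping step concrete; the cluster-tree construction, confidence estimation, and visualization that constitute TreeSOM are not reproduced, and the grid size and learning schedule are illustrative.

```python
import numpy as np

def train_som(X, grid=(6, 6), n_iter=2000, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small 2-D self-organizing map on data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    H, W = grid
    weights = rng.normal(size=(H, W, X.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(H), np.arange(W), indexing="ij"), axis=-1)
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        # best-matching unit for this sample
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(dists.argmin(), dists.shape)
        lr = lr0 * np.exp(-t / n_iter)          # decaying learning rate
        sigma = sigma0 * np.exp(-t / n_iter)    # shrinking neighborhood
        # neighborhood-weighted update of all units toward the sample
        g = np.exp(-np.sum((coords - np.array(bmu)) ** 2, axis=-1) / (2 * sigma ** 2))
        weights += lr * g[..., None] * (x - weights)
    return weights

def map_samples(X, weights):
    """Assign each sample to the index of its best-matching unit."""
    flat = weights.reshape(-1, weights.shape[-1])
    return np.argmin(((X[:, None, :] - flat[None]) ** 2).sum(-1), axis=1)
```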

  7. High dimensional model representation method for fuzzy structural dynamics

    NASA Astrophysics Data System (ADS)

    Adhikari, S.; Chowdhury, R.; Friswell, M. I.

    2011-03-01

    Uncertainty propagation in multi-parameter complex structures poses significant computational challenges. This paper investigates the possibility of using the High Dimensional Model Representation (HDMR) approach when uncertain system parameters are modeled using fuzzy variables. In particular, the application of HDMR is proposed for fuzzy finite element analysis of linear dynamical systems. The HDMR expansion is an efficient formulation for high-dimensional mapping in complex systems if the higher order variable correlations are weak, thereby permitting the input-output relationship behavior to be captured by low-order terms. The computational effort to determine the expansion functions using the α-cut method scales polynomially with the number of variables rather than exponentially. This logic is based on the fundamental assumption underlying the HDMR representation that only low-order correlations among the input variables are likely to have significant impacts upon the outputs for most high-dimensional complex systems. The proposed method is first illustrated for multi-parameter nonlinear mathematical test functions with fuzzy variables. The method is then integrated with the commercial finite element software ADINA. Modal analysis of a simplified aircraft wing with fuzzy parameters has been used to illustrate the generality of the proposed approach. In the numerical examples, triangular membership functions have been used and the results have been validated against direct Monte Carlo simulations. It is shown that using the proposed HDMR approach, the number of finite element function calls can be reduced without significantly compromising the accuracy.

  8. A latent modeling approach to genotype-phenotype relationships: maternal problem behavior clusters, prenatal smoking, and MAOA genotype.

    PubMed

    McGrath, L M; Mustanski, B; Metzger, A; Pine, D S; Kistner-Griffin, E; Cook, E; Wakschlag, L S

    2012-08-01

    This study illustrates the application of a latent modeling approach to genotype-phenotype relationships and gene × environment interactions, using a novel, multidimensional model of adult female problem behavior, including maternal prenatal smoking. The gene of interest is the monoamine oxidase A (MAOA) gene which has been well studied in relation to antisocial behavior. Participants were adult women (N = 192) who were sampled from a prospective pregnancy cohort of non-Hispanic, white individuals recruited from a neighborhood health clinic. Structural equation modeling was used to model a female problem behavior phenotype, which included conduct problems, substance use, impulsive-sensation seeking, interpersonal aggression, and prenatal smoking. All of the female problem behavior dimensions clustered together strongly, with the exception of prenatal smoking. A main effect of MAOA genotype and a MAOA × physical maltreatment interaction were detected with the Conduct Problems factor. Our phenotypic model showed that prenatal smoking is not simply a marker of other maternal problem behaviors. The risk variant in the MAOA main effect and interaction analyses was the high activity MAOA genotype, which is discrepant from consensus findings in male samples. This result contributes to an emerging literature on sex-specific interaction effects for MAOA. PMID:22610759

  9. Improving clustering by imposing network information

    PubMed Central

    Gerber, Susanne; Horenko, Illia

    2015-01-01

    Cluster analysis is one of the most popular data analysis tools in a wide range of applied disciplines. We propose and justify a computationally efficient and straightforward-to-implement way of imposing the available information from networks/graphs (a priori available in many application areas) on a broad family of clustering methods. The introduced approach is illustrated on the problem of a noninvasive unsupervised brain signal classification. This task is faced with several challenging difficulties such as nonstationary noisy signals and a small sample size, combined with a high-dimensional feature space and huge noise-to-signal ratios. Applying this approach results in an exact unsupervised classification of very short signals, opening new possibilities for clustering methods in the area of a noninvasive brain-computer interface. PMID:26601225

  10. An integral formula adapted to different boundary conditions for arbitrarily high-dimensional nonlinear Klein-Gordon equations with its applications

    NASA Astrophysics Data System (ADS)

    Wu, Xinyuan; Liu, Changying

    2016-02-01

    In this paper, we are concerned with the initial boundary value problem of arbitrarily high-dimensional Klein-Gordon equations, posed on a bounded domain Ω ⊂ ℝ^d for d ≥ 1 and equipped with the requirement of boundary conditions. We derive and analyze an integral formula which is proved to be adapted to different boundary conditions for general Klein-Gordon equations in arbitrarily high-dimensional spaces. The formula gives a closed-form solution to arbitrarily high-dimensional homogeneous linear Klein-Gordon equations, which is totally different from the well-known D'Alembert, Poisson, and Kirchhoff formulas. Some applications are included as well.

  11. Clustering as a tool of reinforced rejecting in pattern recognition problem

    NASA Astrophysics Data System (ADS)

    Ciecierski, Jakub; Dybisz, Bartlomiej; Homenda, Wladyslaw; Jastrzebska, Agnieszka

    2016-06-01

    In this paper the pattern recognition problem with a rejecting option is discussed. The problem is aimed at classifying patterns from given classes (native patterns) and rejecting ones not belonging to these classes (foreign patterns). In practice the characteristics of the native patterns are given, while no information about foreign ones is known. A rejecting tool is aimed at enclosing native patterns in compact geometrical figures and excluding foreign ones from them.

  12. Detecting unstable periodic orbits in high-dimensional chaotic systems from time series: reconstruction meeting with adaptation.

    PubMed

    Ma, Huanfei; Lin, Wei; Lai, Ying-Cheng

    2013-05-01

    Detecting unstable periodic orbits (UPOs) in chaotic systems based solely on time series is a fundamental but extremely challenging problem in nonlinear dynamics. Previous approaches were applicable but mostly for low-dimensional chaotic systems. We develop a framework, integrating approximation theory of neural networks and adaptive synchronization, to address the problem of time-series-based detection of UPOs in high-dimensional chaotic systems. An example of finding UPOs from the classic Mackey-Glass equation is presented. PMID:23767476

  13. Distance phenomena in high-dimensional chemical descriptor spaces: consequences for similarity-based approaches.

    PubMed

    Rupp, Matthias; Schneider, Petra; Schneider, Gisbert

    2009-11-15

    Measuring the (dis)similarity of molecules is important for many cheminformatics applications like compound ranking, clustering, and property prediction. In this work, we focus on real-valued vector representations of molecules (as opposed to the binary spaces of fingerprints). We demonstrate the influence which the choice of (dis)similarity measure can have on results, and provide recommendations for such choices. We review the mathematical concepts used to measure (dis)similarity in vector spaces, namely norms, metrics, inner products, and similarity coefficients, as well as the relationships between them, employing (dis)similarity measures commonly used in cheminformatics as examples. We present several phenomena (empty space phenomenon, sphere volume related phenomena, distance concentration) in high-dimensional descriptor spaces which are not encountered in two and three dimensions. These phenomena are theoretically characterized and illustrated on both artificial and real (bioactivity) data. PMID:19266481
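
    The distance-concentration phenomenon mentioned above is easy to reproduce numerically: as the dimensionality of random descriptor vectors grows, the relative contrast between the nearest and farthest neighbour of a query point shrinks. A small demonstration under uniform random data:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(1000, d))        # random "descriptor" vectors
    q = rng.uniform(size=d)                # query point
    dist = np.linalg.norm(X - q, axis=1)
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:5d}  relative contrast={contrast:.3f}")  # shrinks as d grows
```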

  14. Exploring High-Dimensional Data Space: Identifying Optimal Process Conditions in Photovoltaics: Preprint

    SciTech Connect

    Suh, C.; Glynn, S.; Scharf, J.; Contreras, M. A.; Noufi, R.; Jones, W. B.; Biagioni, D.

    2011-07-01

    We demonstrate how advanced exploratory data analysis coupled to data-mining techniques can be used to scrutinize the high-dimensional data space of photovoltaics in the context of thin films of Al-doped ZnO (AZO), which are essential materials as a transparent conducting oxide (TCO) layer in CuIn_xGa_{1-x}Se_2 (CIGS) solar cells. AZO data space, wherein each sample is synthesized from a different process history and assessed with various characterizations, is transformed, reorganized, and visualized in order to extract optimal process conditions. The data-analysis methods used include parallel coordinates, diffusion maps, and hierarchical agglomerative clustering algorithms combined with diffusion map embedding.

  15. Exploring High-Dimensional Data Space: Identifying Optimal Process Conditions in Photovoltaics

    SciTech Connect

    Suh, C.; Biagioni, D.; Glynn, S.; Scharf, J.; Contreras, M. A.; Noufi, R.; Jones, W. B.

    2011-01-01

    We demonstrate how advanced exploratory data analysis coupled to data-mining techniques can be used to scrutinize the high-dimensional data space of photovoltaics in the context of thin films of Al-doped ZnO (AZO), which are essential materials as a transparent conducting oxide (TCO) layer in CuIn_xGa_{1-x}Se_2 (CIGS) solar cells. AZO data space, wherein each sample is synthesized from a different process history and assessed with various characterizations, is transformed, reorganized, and visualized in order to extract optimal process conditions. The data-analysis methods used include parallel coordinates, diffusion maps, and hierarchical agglomerative clustering algorithms combined with diffusion map embedding.

  16. Random Projection for Fast and Efficient Multivariate Correlation Analysis of High-Dimensional Data: A New Approach

    PubMed Central

    Grellmann, Claudia; Neumann, Jane; Bitzer, Sebastian; Kovacs, Peter; Tönjes, Anke; Westlye, Lars T.; Andreassen, Ole A.; Stumvoll, Michael; Villringer, Arno; Horstmann, Annette

    2016-01-01

    In recent years, the advent of great technological advances has produced a wealth of very high-dimensional data, and combining high-dimensional information from multiple sources is becoming increasingly important in an expanding range of scientific disciplines. Partial Least Squares Correlation (PLSC) is a frequently used method for multivariate multimodal data integration. It is, however, computationally expensive in applications involving large numbers of variables, as required, for example, in genetic neuroimaging. To handle high-dimensional problems, dimension reduction might be implemented as a pre-processing step. We propose a new approach that incorporates Random Projection (RP) for dimensionality reduction into PLSC to efficiently solve high-dimensional multimodal problems like genotype-phenotype associations. We name our new method PLSC-RP. Using simulated and experimental data sets containing whole genome SNP measures as genotypes and whole brain neuroimaging measures as phenotypes, we demonstrate that PLSC-RP is drastically faster than traditional PLSC while providing statistically equivalent results. We also provide evidence that dimensionality reduction using RP is data type independent. Therefore, PLSC-RP opens up a wide range of possible applications. It can be used for any integrative analysis that combines information from multiple sources. PMID:27375677
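
    A hedged sketch of the PLSC-RP idea: reduce each data block with Gaussian random projections, then run a PLS-style singular value decomposition of the cross-covariance matrix. The projection dimension and number of components are illustrative, and this simplified PLSC (an SVD of the cross-covariance) may differ in detail from the authors' variant.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

def plsc_rp(X, Y, n_components=5, dim=500, seed=0):
    """X: (n, p) genotypes, Y: (n, q) phenotypes, with p and q very large."""
    # Random projection of each block to a manageable dimension.
    Xr = GaussianRandomProjection(n_components=dim, random_state=seed).fit_transform(X)
    Yr = GaussianRandomProjection(n_components=dim, random_state=seed + 1).fit_transform(Y)
    Xr -= Xr.mean(0)
    Yr -= Yr.mean(0)
    # PLS correlation step: SVD of the cross-covariance matrix of the two blocks.
    U, s, Vt = np.linalg.svd(Xr.T @ Yr, full_matrices=False)
    return Xr @ U[:, :n_components], Yr @ Vt[:n_components].T, s[:n_components]
```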

  17. TimeSeer: Scagnostics for high-dimensional time series.

    PubMed

    Dang, Tuan Nhon; Anand, Anushka; Wilkinson, Leland

    2013-03-01

    We introduce a method (Scagnostic time series) and an application (TimeSeer) for organizing multivariate time series and for guiding interactive exploration through high-dimensional data. The method is based on nine characterizations of the 2D distributions of orthogonal pairwise projections on a set of points in multidimensional euclidean space. These characterizations include measures, such as, density, skewness, shape, outliers, and texture. Working directly with these Scagnostic measures, we can locate anomalous or interesting subseries for further analysis. Our application is designed to handle the types of doubly multivariate data series that are often found in security, financial, social, and other sectors. PMID:23307611

  18. Hawking radiation of a high-dimensional rotating black hole

    NASA Astrophysics Data System (ADS)

    Ren, Zhao; Lichun, Zhang; Huaifan, Li; Yueqin, Wu

    2010-01-01

    We extend the classical Damour-Ruffini method and discuss the Hawking radiation spectrum of a high-dimensional rotating black hole using a tortoise coordinate transformation defined by taking the reaction of the radiation to the spacetime into consideration. Under the condition that the energy and angular momentum are conserved, taking the self-gravitation action into account, we derive Hawking radiation spectra which satisfy the unitary principle in quantum mechanics. It is shown that the process in which the black hole radiates particles with energy ω is a continuous tunneling process. We provide a theoretical basis for further studying the physical mechanism of black-hole radiation.

  19. An Adaptive ANOVA-based PCKF for High-Dimensional Nonlinear Inverse Modeling

    SciTech Connect

    LI, Weixuan; Lin, Guang; Zhang, Dongxiao

    2014-02-01

    The probabilistic collocation-based Kalman filter (PCKF) is a recently developed approach for solving inverse problems. It resembles the ensemble Kalman filter (EnKF) in every aspect—except that it represents and propagates model uncertainty by polynomial chaos expansion (PCE) instead of an ensemble of model realizations. Previous studies have shown PCKF is a more efficient alternative to EnKF for many data assimilation problems. However, the accuracy and efficiency of PCKF depends on an appropriate truncation of the PCE series. Having more polynomial chaos bases in the expansion helps to capture uncertainty more accurately but increases computational cost. Bases selection is particularly important for high-dimensional stochastic problems because the number of polynomial chaos bases required to represent model uncertainty grows dramatically as the number of input parameters (random dimensions) increases. In classic PCKF algorithms, the PCE bases are pre-set based on users’ experience. Also, for sequential data assimilation problems, the bases kept in PCE expression remain unchanged in different Kalman filter loops, which could limit the accuracy and computational efficiency of classic PCKF algorithms. To address this issue, we present a new algorithm that adaptively selects PCE bases for different problems and automatically adjusts the number of bases in different Kalman filter loops. The algorithm is based on adaptive functional ANOVA (analysis of variance) decomposition, which approximates a high-dimensional function with the summation of a set of low-dimensional functions. Thus, instead of expanding the original model into PCE, we implement the PCE expansion on these low-dimensional functions, which is much less costly. We also propose a new adaptive criterion for ANOVA that is more suited for solving inverse problems. The new algorithm is tested with different examples and demonstrated great effectiveness in comparison with non-adaptive PCKF and En

  20. An adaptive ANOVA-based PCKF for high-dimensional nonlinear inverse modeling

    SciTech Connect

    Li, Weixuan; Lin, Guang; Zhang, Dongxiao

    2014-02-01

    The probabilistic collocation-based Kalman filter (PCKF) is a recently developed approach for solving inverse problems. It resembles the ensemble Kalman filter (EnKF) in every aspect—except that it represents and propagates model uncertainty by polynomial chaos expansion (PCE) instead of an ensemble of model realizations. Previous studies have shown PCKF is a more efficient alternative to EnKF for many data assimilation problems. However, the accuracy and efficiency of PCKF depends on an appropriate truncation of the PCE series. Having more polynomial chaos basis functions in the expansion helps to capture uncertainty more accurately but increases computational cost. Selection of basis functions is particularly important for high-dimensional stochastic problems because the number of polynomial chaos basis functions required to represent model uncertainty grows dramatically as the number of input parameters (random dimensions) increases. In classic PCKF algorithms, the PCE basis functions are pre-set based on users' experience. Also, for sequential data assimilation problems, the basis functions kept in PCE expression remain unchanged in different Kalman filter loops, which could limit the accuracy and computational efficiency of classic PCKF algorithms. To address this issue, we present a new algorithm that adaptively selects PCE basis functions for different problems and automatically adjusts the number of basis functions in different Kalman filter loops. The algorithm is based on adaptive functional ANOVA (analysis of variance) decomposition, which approximates a high-dimensional function with the summation of a set of low-dimensional functions. Thus, instead of expanding the original model into PCE, we implement the PCE expansion on these low-dimensional functions, which is much less costly. We also propose a new adaptive criterion for ANOVA that is more suited for solving inverse problems. The new algorithm was tested with different examples and demonstrated

  1. Node Detection Using High-Dimensional Fuzzy Parcellation Applied to the Insular Cortex.

    PubMed

    Vercelli, Ugo; Diano, Matteo; Costa, Tommaso; Nani, Andrea; Duca, Sergio; Geminiani, Giuliano; Vercelli, Alessandro; Cauda, Franco

    2016-01-01

    Several functional connectivity approaches require the definition of a set of regions of interest (ROIs) that act as network nodes. Different methods have been developed to define these nodes and to derive their functional and effective connections, most of which are rather complex. Here we aim to propose a relatively simple "one-step" border detection and ROI estimation procedure employing the fuzzy c-mean clustering algorithm. To test this procedure and to explore insular connectivity beyond the two/three-region model currently proposed in the literature, we parcellated the insular cortex of 20 healthy right-handed volunteers scanned in a resting state. By employing a high-dimensional functional connectivity-based clustering process, we confirmed the two patterns of connectivity previously described. This method revealed a complex pattern of functional connectivity where the two previously detected insular clusters are subdivided into several other networks, some of which are not commonly associated with the insular cortex, such as the default mode network and parts of the dorsal attentional network. Furthermore, the detection of nodes was reliable, as demonstrated by the confirmative analysis performed on a replication group of subjects. PMID:26881093

  2. Node Detection Using High-Dimensional Fuzzy Parcellation Applied to the Insular Cortex

    PubMed Central

    Vercelli, Ugo; Diano, Matteo; Costa, Tommaso; Nani, Andrea; Duca, Sergio; Geminiani, Giuliano; Vercelli, Alessandro; Cauda, Franco

    2016-01-01

    Several functional connectivity approaches require the definition of a set of regions of interest (ROIs) that act as network nodes. Different methods have been developed to define these nodes and to derive their functional and effective connections, most of which are rather complex. Here we aim to propose a relatively simple “one-step” border detection and ROI estimation procedure employing the fuzzy c-mean clustering algorithm. To test this procedure and to explore insular connectivity beyond the two/three-region model currently proposed in the literature, we parcellated the insular cortex of 20 healthy right-handed volunteers scanned in a resting state. By employing a high-dimensional functional connectivity-based clustering process, we confirmed the two patterns of connectivity previously described. This method revealed a complex pattern of functional connectivity where the two previously detected insular clusters are subdivided into several other networks, some of which are not commonly associated with the insular cortex, such as the default mode network and parts of the dorsal attentional network. Furthermore, the detection of nodes was reliable, as demonstrated by the confirmative analysis performed on a replication group of subjects. PMID:26881093
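
    The fuzzy c-mean clustering step named in both records can be written from scratch in a few lines; this sketch operates on generic feature vectors (e.g. voxel-wise connectivity profiles) and omits the one-step border detection of the paper. The fuzzifier m and the cluster count are assumptions.

```python
import numpy as np

def fuzzy_c_means(X, c=3, m=2.0, n_iter=100, seed=0, tol=1e-5):
    """X: (n_samples, n_features). Returns membership matrix U (n, c) and centers (c, d)."""
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=len(X))          # random fuzzy memberships (rows sum to 1)
    for _ in range(n_iter):
        Um = U ** m
        centers = (Um.T @ X) / Um.sum(0)[:, None]       # weighted cluster centers
        d = np.linalg.norm(X[:, None, :] - centers[None], axis=-1) + 1e-12
        # standard membership update: u_ij proportional to d_ij^(-2/(m-1))
        U_new = 1.0 / (d ** (2 / (m - 1)) *
                       np.sum(d ** (-2 / (m - 1)), axis=1, keepdims=True))
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, centers
```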

  3. Clustering Analysis for Graphs with Multiweighted Edges: A Unified Approach to the Threshold Problem.

    ERIC Educational Resources Information Center

    Goetschel, Roy, Jr.

    1987-01-01

    Multivalent relations, inferred as relationships with an added dimension of discernment, are realized as weighted graphs with multivalued edges. A unified treatment of the threshold problem is discussed and a reliability measure is produced to judge various partitions. (Author/EM)

  4. High dimensional biological data retrieval optimization with NoSQL technology

    PubMed Central

    2014-01-01

    Background High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, queries against relational databases for hundreds of different patient gene expression records suffer from poor performance. Non-relational data models, such as the key-value model implemented in NoSQL databases, hold promise to be more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. Results In this paper we introduce a new data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase on query performance on MongoDB. Conclusions The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new data

  5. Reconstructing high-dimensional two-photon entangled states via compressive sensing

    PubMed Central

    Tonolini, Francesco; Chan, Susan; Agnew, Megan; Lindsay, Alan; Leach, Jonathan

    2014-01-01

    Accurately establishing the state of large-scale quantum systems is an important tool in quantum information science; however, the large number of unknown parameters hinders the rapid characterisation of such states, and reconstruction procedures can become prohibitively time-consuming. Compressive sensing, a procedure for solving inverse problems by incorporating prior knowledge about the form of the solution, provides an attractive alternative to the problem of high-dimensional quantum state characterisation. Using a modified version of compressive sensing that incorporates the principles of singular value thresholding, we reconstruct the density matrix of a high-dimensional two-photon entangled system. The dimension of each photon is equal to d = 17, corresponding to a system of 83521 unknown real parameters. Accurate reconstruction is achieved with approximately 2500 measurements, only 3% of the total number of unknown parameters in the state. The algorithm we develop is fast, computationally inexpensive, and applicable to a wide range of quantum states, thus demonstrating compressive sensing as an effective technique for measuring the state of large-scale quantum systems. PMID:25306850
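
    The singular value thresholding principle mentioned above reduces to a simple operator: soft-threshold the singular values of a matrix. A minimal sketch (the full measurement model and iteration schedule of the reconstruction are omitted):

```python
import numpy as np

def svt(M, tau):
    """Shrink the singular values of M by tau (soft-thresholding), promoting low rank."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt

# One projected-gradient-style step toward a low-rank estimate could look like
# (hypothetical helper names): rho = svt(rho - step * data_misfit_gradient(rho), tau)
```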

  6. Similarity-dissimilarity plot for visualization of high dimensional data in biomedical pattern classification.

    PubMed

    Arif, Muhammad

    2012-06-01

    In pattern classification problems, feature extraction is an important step. Quality of features in discriminating different classes plays an important role in pattern classification problems. In real life, pattern classification may require a high dimensional feature space and it is impossible to visualize the feature space if the dimension of the feature space is greater than four. In this paper, we have proposed a Similarity-Dissimilarity plot which can project a high dimensional space to a two dimensional space while retaining important characteristics required to assess the discrimination quality of the features. The similarity-dissimilarity plot can reveal information about the amount of overlap of features of different classes. Separable data points of different classes will also be visible on the plot, which can be classified correctly using an appropriate classifier. Hence, approximate classification accuracy can be predicted. Moreover, it is possible to see with which class the misclassified data points will be confused by the classifier. Outlier data points can also be located on the similarity-dissimilarity plot. Various examples of synthetic data are used to highlight important characteristics of the proposed plot. Some real life examples from biomedical data are also used for the analysis. The proposed plot is independent of the number of dimensions of the feature space. PMID:20734222

  7. New data assimilation system DNDAS for high-dimensional models

    NASA Astrophysics Data System (ADS)

    Qun-bo, Huang; Xiao-qun, Cao; Meng-bin, Zhu; Wei-min, Zhang; Bai-nian, Liu

    2016-05-01

    The tangent linear (TL) and adjoint (AD) models have posed great difficulties for the development of variational data assimilation systems. It might be impossible to develop them perfectly without great effort, either by hand or by automatic differentiation tools. In order to break these limitations, a new data assimilation system, the dual-number data assimilation system (DNDAS), is designed based on dual-number automatic differentiation principles. We investigate the performance of DNDAS with two different optimization schemes and subsequently give a discussion on whether DNDAS is appropriate for high-dimensional forecast models. The new data assimilation system can avoid the complicated reverse integration of the adjoint model, and it only needs the forward integration in the dual-number space to obtain the cost function and its gradient vector concurrently. To verify the correctness and effectiveness of DNDAS, we implemented DNDAS on a simple ordinary differential equation model and the Lorenz-63 model with different optimization methods. We then concentrate on the adaptability of DNDAS to the Lorenz-96 model with high-dimensional state variables. The results indicate that whether the system is simple or nonlinear, DNDAS can accurately reconstruct the initial condition for the forecast model and has a strong anti-noise characteristic. Given adequate computing resources, the quasi-Newton optimization method performs better than the conjugate gradient method in DNDAS. Project supported by the National Natural Science Foundation of China (Grant Nos. 41475094 and 41375113).
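
    The dual-number principle that DNDAS exploits can be shown with a toy forward-mode class: evaluating a function on dual numbers returns the value and the directional derivative in a single forward pass, with no adjoint integration. The cost function here is a made-up scalar example.

```python
class Dual:
    """Toy dual number a + b*eps, where eps**2 = 0, for forward-mode differentiation."""
    def __init__(self, val, eps=0.0):
        self.val, self.eps = val, eps
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.eps + o.eps)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val * o.val, self.val * o.eps + self.eps * o.val)
    __rmul__ = __mul__

def cost(x):                       # a toy scalar cost function
    return 3 * x * x + 2 * x + 1

x = Dual(2.0, 1.0)                 # seed the derivative direction
J = cost(x)
print(J.val, J.eps)                # value 17.0 and derivative 14.0 in one forward pass
```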

  8. Likelihood-Free Inference in High-Dimensional Models.

    PubMed

    Kousathanas, Athanasios; Leuenberger, Christoph; Helfer, Jonas; Quinodoz, Mathieu; Foll, Matthieu; Wegmann, Daniel

    2016-06-01

    Methods that bypass analytical evaluations of the likelihood function have become an indispensable tool for statistical inference in many fields of science. These so-called likelihood-free methods rely on accepting and rejecting simulations based on summary statistics, which limits them to low-dimensional models for which the value of the likelihood is large enough to result in manageable acceptance rates. To get around these issues, we introduce a novel, likelihood-free Markov chain Monte Carlo (MCMC) method combining two key innovations: updating only one parameter per iteration and accepting or rejecting this update based on subsets of statistics approximately sufficient for this parameter. This increases acceptance rates dramatically, rendering this approach suitable even for models of very high dimensionality. We further derive that for linear models, a one-dimensional combination of statistics per parameter is sufficient and can be found empirically with simulations. Finally, we demonstrate that our method readily scales to models of very high dimensionality, using toy models as well as by jointly inferring the effective population size, the distribution of fitness effects (DFE) of segregating mutations, and selection coefficients for each locus from data of a recent experiment on the evolution of drug resistance in influenza. PMID:27052569

  9. Power Enhancement in High Dimensional Cross-Sectional Tests

    PubMed Central

    Fan, Jianqing; Liao, Yuan; Yao, Jiawei

    2016-01-01

    We propose a novel technique to boost the power of testing a high-dimensional vector H_0 : θ = 0 against sparse alternatives where the null hypothesis is violated only by a couple of components. Existing tests based on quadratic forms such as the Wald statistic often suffer from low powers due to the accumulation of errors in estimating high-dimensional parameters. More powerful tests for sparse alternatives such as thresholding and extreme-value tests, on the other hand, require either stringent conditions or bootstrap to derive the null distribution and often suffer from size distortions due to the slow convergence. Based on a screening technique, we introduce a “power enhancement component”, which is zero under the null hypothesis with high probability, but diverges quickly under sparse alternatives. The proposed test statistic combines the power enhancement component with an asymptotically pivotal statistic, and strengthens the power under sparse alternatives. The null distribution does not require stringent regularity conditions, and is completely determined by that of the pivotal statistic. As specific applications, the proposed methods are applied to testing the factor pricing models and validating the cross-sectional independence in panel data models. PMID:26778846

  10. Asymptotic Stability of High-dimensional Zakharov-Kuznetsov Solitons

    NASA Astrophysics Data System (ADS)

    Côte, Raphaël; Muñoz, Claudio; Pilod, Didier; Simpson, Gideon

    2016-05-01

    We prove that solitons (or solitary waves) of the Zakharov-Kuznetsov (ZK) equation, a physically relevant high dimensional generalization of the Korteweg-de Vries (KdV) equation appearing in Plasma Physics, and having mixed KdV and nonlinear Schrödinger (NLS) dynamics, are strongly asymptotically stable in the energy space. We also prove that the sum of well-arranged solitons is stable in the same space. Orbital stability of ZK solitons is well-known since the work of de Bouard [Proc R Soc Edinburgh 126:89-112, 1996]. Our proofs follow the ideas of Martel [SIAM J Math Anal 157:759-781, 2006] and Martel and Merle [Math Ann 341:391-427, 2008], applied for generalized KdV equations in one dimension. In particular, we extend to the high dimensional case several monotonicity properties for suitable half-portions of mass and energy; we also prove a new Liouville type property that characterizes ZK solitons, and a key Virial identity for the linear and nonlinear part of the ZK dynamics, obtained independently of the mixed KdV-NLS dynamics. This last Virial identity relies on a simple sign condition which is numerically tested for the two and three dimensional cases with no additional spectral assumptions required. Possible extensions to higher dimensions and different nonlinearities could be obtained after a suitable local well-posedness theory in the energy space, and the verification of a corresponding sign condition.

  11. Sample size requirements for training high-dimensional risk predictors

    PubMed Central

    Dobbin, Kevin K.; Song, Xiao

    2013-01-01

    A common objective of biomarker studies is to develop a predictor of patient survival outcome. Determining the number of samples required to train a predictor from survival data is important for designing such studies. Existing sample size methods for training studies use parametric models for the high-dimensional data and cannot handle a right-censored dependent variable. We present a new training sample size method that is non-parametric with respect to the high-dimensional vectors, and is developed for a right-censored response. The method can be applied to any prediction algorithm that satisfies a set of conditions. The sample size is chosen so that the expected performance of the predictor is within a user-defined tolerance of optimal. The central method is based on a pilot dataset. To quantify uncertainty, a method to construct a confidence interval for the tolerance is developed. Adequacy of the size of the pilot dataset is discussed. An alternative model-based version of our method for estimating the tolerance when no adequate pilot dataset is available is presented. The model-based method requires a covariance matrix be specified, but we show that the identity covariance matrix provides adequate sample size when the user specifies three key quantities. Application of the sample size method to two microarray datasets is discussed. PMID:23873895

  12. TripAdvisor^{N-D}: A Tourism-Inspired High-Dimensional Space Exploration Framework with Overview and Detail.

    PubMed

    Nam, Julia EunJu; Mueller, Klaus

    2013-02-01

    Gaining a true appreciation of high-dimensional space remains difficult since all of the existing high-dimensional space exploration techniques serialize the space travel in some way. This is not so foreign to us since we, when traveling, also experience the world in a serial fashion. But we typically have access to a map to help with positioning, orientation, navigation, and trip planning. Here, we propose a multivariate data exploration tool that compares high-dimensional space navigation with a sightseeing trip. It decomposes this activity into five major tasks: 1) Identify the sights: use a map to identify the sights of interest and their location; 2) Plan the trip: connect the sights of interest along a specifiable path; 3) Go on the trip: travel along the route; 4) Hop off the bus: experience the location, look around, zoom into detail; and 5) Orient and localize: regain bearings in the map. We describe intuitive and interactive tools for all of these tasks, both global navigation within the map and local exploration of the data distributions. For the latter, we describe a polygonal touchpad interface which enables users to smoothly tilt the projection plane in high-dimensional space to produce multivariate scatterplots that best convey the data relationships under investigation. Motion parallax and illustrative motion trails aid in the perception of these transient patterns. We describe the use of our system within two applications: 1) the exploratory discovery of data configurations that best fit a personal preference in the presence of tradeoffs and 2) interactive cluster analysis via cluster sculpting in N-D. PMID:22350201

  13. GX-Means: A model-based divide and merge algorithm for geospatial image clustering

    SciTech Connect

    Vatsavai, Raju; Symons, Christopher T; Chandola, Varun; Jun, Goo

    2011-01-01

    One of the practical issues in clustering is the specification of the appropriate number of clusters, which is not obvious when analyzing geospatial datasets, partly because they are huge (both in size and spatial extent) and high dimensional. In this paper we present a computationally efficient model-based split and merge clustering algorithm that incrementally finds model parameters and the number of clusters. Additionally, we attempt to provide insights into this problem and other data mining challenges that are encountered when clustering geospatial data. The basic algorithm we present is similar to the G-means and X-means algorithms; however, our proposed approach avoids certain limitations of these well-known clustering algorithms that are pertinent when dealing with geospatial data. We compare the performance of our approach with the G-means and X-means algorithms. Experimental evaluation on simulated data and on multispectral and hyperspectral remotely sensed image data demonstrates the effectiveness of our algorithm.
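    The divide step described above can be sketched generically. The code below is a simplified, hedged illustration in the spirit of X-means/G-means style splitting, using BIC to decide whether a cluster should be split into two Gaussian components; it is not the GX-means algorithm from the paper, and the function name, thresholds, and toy data are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def split_by_bic(X, k_init=1, max_k=20, random_state=0):
    """Greedy divide step: keep splitting a cluster into two Gaussian
    components while BIC improves (generic sketch, not GX-means itself)."""
    gmm = GaussianMixture(n_components=k_init, random_state=random_state).fit(X)
    labels = gmm.predict(X)
    k = k_init
    changed = True
    while changed and k < max_k:
        changed = False
        for c in np.unique(labels):
            pts = X[labels == c]
            if len(pts) < 10:
                continue
            one = GaussianMixture(1, random_state=random_state).fit(pts)
            two = GaussianMixture(2, random_state=random_state).fit(pts)
            if two.bic(pts) < one.bic(pts):        # lower BIC = better model
                sub = two.predict(pts)
                labels[labels == c] = np.where(sub == 0, c, k)
                k += 1
                changed = True
    return labels, k

# Toy usage on simulated data with three well-separated groups
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, size=(100, 2)) for m in (0, 3, 6)])
labels, k = split_by_bic(X)
print("estimated number of clusters:", k)
```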

  14. The Problem of Hipparcos Distances to Open Clusters. Report 1; Constraints from Multicolor Main-Sequence Fitting

    NASA Technical Reports Server (NTRS)

    Pinsonneault, Marc H.; Stauffer, John; Soderblom, David R.; King, Jeremy R.; Hanson, Robert B.

    1998-01-01

    Parallax data from the Hipparcos mission allow the direct distance to open clusters to be compared with the distance inferred from main-sequence (MS) fitting. There are surprising differences between the two distance measurements, indicating either the need for changes in the cluster compositions or reddening, underlying problems with the technique of MS fitting, or systematic errors in the Hipparcos parallaxes at the 1 mas level. We examine the different possibilities, focusing on MS fitting in both metallicity-sensitive B-V and metallicity-insensitive V-I for five well-studied systems (the Hyades, Pleiades, alpha Per, Praesepe, and Coma Ber). The Hipparcos distances to the Hyades and alpha Per are within 1 sigma of the MS-fitting distance in B-V and V-I, while the Hipparcos distances to Coma Ber and the Pleiades are in disagreement with the MS-fitting distance at more than the 3 sigma level. There are two Hipparcos measurements of the distance to Praesepe; one is in good agreement with the MS-fitting distance and the other disagrees at the 2 sigma level. The distance estimates from the different colors are in conflict with one another for Coma but in agreement for the Pleiades. Changes in the relative cluster metal abundances, age-related effects, helium, and reddening are shown to be unlikely to explain the puzzling behavior of the Pleiades. We present evidence for spatially dependent systematic errors at the 1 mas level in the parallaxes of Pleiades stars. The implications of this result are discussed.

  15. Sparse subspace clustering: algorithm, theory, and applications.

    PubMed

    Elhamifar, Ehsan; Vidal, René

    2013-11-01

    Many real-world problems deal with collections of high-dimensional data, such as images, videos, text, and web documents, DNA microarray data, and more. Often, such high-dimensional data lie close to low-dimensional structures corresponding to several classes or categories to which the data belong. In this paper, we propose and study an algorithm, called sparse subspace clustering, to cluster data points that lie in a union of low-dimensional subspaces. The key idea is that, among the infinitely many possible representations of a data point in terms of other points, a sparse representation corresponds to selecting a few points from the same subspace. This motivates solving a sparse optimization program whose solution is used in a spectral clustering framework to infer the clustering of the data into subspaces. Since solving the sparse optimization program is in general NP-hard, we consider a convex relaxation and show that, under appropriate conditions on the arrangement of the subspaces and the distribution of the data, the proposed minimization program succeeds in recovering the desired sparse representations. The proposed algorithm is efficient and can handle data points near the intersections of subspaces. Another key advantage of the proposed algorithm with respect to the state of the art is that it can deal directly with data nuisances, such as noise, sparse outlying entries, and missing entries, by incorporating the model of the data into the sparse optimization program. We demonstrate the effectiveness of the proposed algorithm through experiments on synthetic data as well as the two real-world problems of motion segmentation and face clustering. PMID:24051734
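    A compact way to see the pipeline described above is the following hedged Python sketch: each point is regressed on all other points with an l1 penalty, the absolute coefficients are symmetrized into an affinity matrix, and spectral clustering is applied to that affinity. It is a simplified stand-in for the paper's optimization program (which additionally models noise, sparse outlying entries, and missing data), and the regularization value and toy data are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.cluster import SpectralClustering

def sparse_subspace_clustering(X, n_clusters, alpha=0.01):
    """Simplified SSC sketch: l1-regularized self-expression followed by
    spectral clustering on the symmetrized coefficient magnitudes."""
    n = X.shape[0]
    C = np.zeros((n, n))
    for i in range(n):
        others = np.delete(np.arange(n), i)
        # Lasso enforces x_i ~ sum_j c_j x_j with only a few non-zero c_j
        model = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
        model.fit(X[others].T, X[i])
        C[i, others] = model.coef_
    W = np.abs(C) + np.abs(C).T                    # symmetric affinity matrix
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                            random_state=0)
    return sc.fit_predict(W)

# Toy example: points drawn from two 1-D subspaces (lines) in R^3 plus noise
rng = np.random.default_rng(2)
d1, d2 = rng.normal(size=3), rng.normal(size=3)
X = np.vstack([np.outer(rng.normal(size=40), d1),
               np.outer(rng.normal(size=40), d2)]) + 0.01 * rng.normal(size=(80, 3))
print(sparse_subspace_clustering(X, n_clusters=2)[:10])
```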

  16. Preventing mental health problems in children: the Families in Mind population-based cluster randomised controlled trial

    PubMed Central

    2012-01-01

    Background Externalising and internalising problems affect one in seven school-aged children and are the single strongest predictor of mental health problems into early adolescence. As the burden of mental health problems persists globally, childhood prevention of mental health problems is paramount. Prevention can be offered to all children (universal) or to children at risk of developing mental health problems (targeted). The relative effectiveness and costs of a targeted only versus combined universal and targeted approach are unknown. This study aims to determine the effectiveness, costs and uptake of two approaches to early childhood prevention of mental health problems, i.e., a Combined universal-targeted approach versus a Targeted only approach, in comparison to current primary care services (Usual care). Methods/design Three-armed, population-level cluster randomised trial (2010–2014) within the universal, well child Maternal Child Health system, attended by more than 80% of families in Victoria, Australia at infant age eight months. Participants were families of eight-month-old children from nine participating local government areas. Randomised to one of three groups: Combined, Targeted or Usual care. The interventions comprise (a) the Combined universal and targeted program where all families are offered the universal Toddlers Without Tears group parenting program followed by the targeted Family Check-Up one-on-one program or (b) the Targeted Family Check-Up program. The Family Check-Up program is only offered to children at risk of behavioural problems. Participants will be analysed according to the trial arm to which they were randomised, using logistic and linear regression models to compare primary and secondary outcomes. An economic evaluation (cost consequences analysis) will compare incremental costs to all incremental outcomes from a societal perspective. Discussion This trial will inform public health policy by making recommendations about the

  17. A Spiking Neural Network Model of Model-Free Reinforcement Learning with High-Dimensional Sensory Input and Perceptual Ambiguity

    PubMed Central

    Nakano, Takashi; Otsuka, Makoto; Yoshimoto, Junichiro; Doya, Kenji

    2015-01-01

    A theoretical framework of reinforcement learning plays an important role in understanding action selection in animals. Spiking neural networks provide a theoretically grounded means to test computational hypotheses on neurally plausible algorithms of reinforcement learning through numerical simulation. However, most of these models cannot handle observations that are noisy or that occurred in the past, even though these are inevitable and constraining features of learning in real environments. This class of problems is formally known as partially observable reinforcement learning (PORL), a generalization of reinforcement learning to partially observable domains. In addition, observations in the real world tend to be rich and high-dimensional. In this work, we use a spiking neural network model to approximate the free energy of a restricted Boltzmann machine and apply it to the solution of PORL problems with high-dimensional observations. Our spiking network model solves maze tasks with perceptually ambiguous high-dimensional observations without knowledge of the true environment. An extended model with working memory also solves history-dependent tasks. The way spiking neural networks handle PORL problems may provide a glimpse into the underlying laws of neural information processing which can only be discovered through such a top-down approach. PMID:25734662

  18. Reasoning Exercises in Assisted Living: a cluster randomized trial to improve reasoning and everyday problem solving

    PubMed Central

    Williams, Kristine; Herman, Ruth; Bontempo, Daniel

    2014-01-01

    Purpose of the study Assisted living (AL) residents are at risk for cognitive and functional declines that eventually reduce their ability to care for themselves, thereby triggering nursing home placement. In developing a method to slow this decline, the efficacy of Reasoning Exercises in Assisted Living (REAL), a cognitive training intervention that teaches everyday reasoning and problem-solving skills to AL residents, was tested. Design and methods At thirteen randomized Midwestern facilities, AL residents whose Mini Mental State Examination scores ranged from 19–29 either were trained in REAL or a vitamin education attention control program or received no treatment at all. For 3 weeks, treated groups received personal training in their respective programs. Results Scores on the Every Day Problems Test for Cognitively Challenged Elders (EPCCE) and on the Direct Assessment of Functional Status (DAFS) showed significant increases only for the REAL group. For EPCCE, change from baseline immediately postintervention was +3.10 (P<0.01), and there was significant retention at the 3-month follow-up (d=2.71; P<0.01). For DAFS, change from baseline immediately postintervention was +3.52 (P<0.001), although retention was not as strong. Neither the attention nor the no-treatment control groups had significant gains immediately postintervention or at follow-up assessments. Post hoc across-group comparison of baseline change also highlights the benefits of REAL training. For EPCCE, the magnitude of gain was significantly larger in the REAL group versus the no-treatment control group immediately postintervention (d=3.82; P<0.01) and at the 3-month follow-up (d=3.80; P<0.01). For DAFS, gain magnitude immediately postintervention for REAL was significantly greater compared with in the attention control group (d=4.73; P<0.01). Implications REAL improves skills in everyday problem solving, which may allow AL residents to maintain self-care and extend AL residency. This benefit

  19. Statistical validation of high-dimensional models of growing networks

    NASA Astrophysics Data System (ADS)

    Medo, Matúš

    2014-03-01

    The abundance of models of complex networks and the current insufficient validation standards make it difficult to judge which models are strongly supported by data and which are not. We focus here on likelihood maximization methods for models of growing networks with many parameters and compare their performance on artificial and real datasets. While high dimensionality of the parameter space harms the performance of direct likelihood maximization on artificial data, this can be improved by introducing a suitable penalization term. Likelihood maximization on real data shows that the presented approach is able to discriminate among available network models. To make large-scale datasets accessible to this kind of analysis, we propose a subset sampling technique and show that it yields substantial model evidence in a fraction of time necessary for the analysis of the complete data.

  20. High-dimensional quantum key distribution using dispersive optics

    NASA Astrophysics Data System (ADS)

    Mower, Jacob; Zhang, Zheshen; Desjardins, Pierre; Lee, Catherine; Shapiro, Jeffrey H.; Englund, Dirk

    2013-06-01

    We propose a high-dimensional quantum key distribution (QKD) protocol that employs temporal correlations of entangled photons. The security of the protocol relies on measurements by Alice and Bob in one of two conjugate bases, implemented using dispersive optics. We show that this dispersion-based approach is secure against collective attacks. The protocol, which represents a QKD analog of pulse position modulation, is compatible with standard fiber telecommunications channels and wavelength division multiplexers. We describe several physical implementations to enhance the transmission rate and describe a heralded qudit source that is easy to implement and enables secret-key generation at >4 bits per character of distilled key across over 200 km of fiber.

  1. Algorithmic tools for mining high-dimensional cytometry data

    PubMed Central

    Chester, Cariad; Maecker, Holden T.

    2015-01-01

    The advent of mass cytometry has led to an unprecedented increase in the number of analytes measured in individual cells, thereby increasing the complexity and information content of cytometric data. While this technology is ideally suited to detailed examination of the immune system, the applicability of the different methods for analyzing such complex data is less clear. Conventional data analysis by ‘manual’ gating of cells in biaxial dotplots is often subjective, time consuming, and neglectful of much of the information contained in a high-dimensional cytometric dataset. Algorithmic data mining promises to eliminate these concerns, and several such tools have recently been applied to mass cytometry data. Herein, we review computational data mining tools that have been used to analyze mass cytometry data, outline their differences, and comment on their strengths and limitations. This review will help immunologists identify suitable algorithmic tools for their particular projects. PMID:26188071

  2. High-dimensional quantum nature of ghost angular Young's diffraction

    SciTech Connect

    Chen Lixiang; Leach, Jonathan; Jack, Barry; Padgett, Miles J.; Franke-Arnold, Sonja; She Weilong

    2010-09-15

    We propose a technique to characterize the dimensionality of entangled sources affected by any environment, including phase and amplitude masks or atmospheric turbulence. We illustrate this technique on the example of angular ghost diffraction using the orbital angular momentum (OAM) spectrum generated by a nonlocal double slit. We realize a nonlocal angular double slit by placing single angular slits in the paths of the signal and idler modes of the entangled light field generated by parametric down-conversion. Based on the observed OAM spectrum and the measured Shannon dimensionality spectrum of the possible quantum channels that contribute to Young's ghost diffraction, we calculate the associated dimensionality D{sub total}. The measured D{sub total} ranges between 1 and 2.74 depending on the opening angle of the angular slits. The ability to quantify the nature of high-dimensional entanglement is vital when considering quantum information protocols.

  3. Future of High-Dimensional Data-Driven Exoplanet Science

    NASA Astrophysics Data System (ADS)

    Ford, Eric B.

    2016-03-01

    The detection and characterization of exoplanets has come a long way since the 1990’s. For example, instruments specifically designed for Doppler planet surveys feature environmental controls to minimize instrumental effects and advanced calibration systems. Combining these instruments with powerful telescopes, astronomers have detected thousands of exoplanets. The application of Bayesian algorithms has improved the quality and reliability with which astronomers characterize the mass and orbits of exoplanets. Thanks to continued improvements in instrumentation, now the detection of extrasolar low-mass planets is limited primarily by stellar activity, rather than observational uncertainties. This presents a new set of challenges which will require cross-disciplinary research to combine improved statistical algorithms with an astrophysical understanding of stellar activity and the details of astronomical instrumentation. I describe these challenges and outline the roles of parameter estimation over high-dimensional parameter spaces, marginalizing over uncertainties in stellar astrophysics and machine learning for the next generation of Doppler planet searches.

  4. Parsimonious description for predicting high-dimensional dynamics

    PubMed Central

    Hirata, Yoshito; Takeuchi, Tomoya; Horai, Shunsuke; Suzuki, Hideyuki; Aihara, Kazuyuki

    2015-01-01

    When we observe a system, we often cannot observe all of its variables and may have access to only a limited set of measurements. Under such circumstances, delay coordinates, vectors made of successive measurements, are useful to reconstruct the states of the whole system. Although the method of delay coordinates is theoretically supported for high-dimensional dynamical systems, in practice there is a limitation because the calculation of higher-dimensional delay coordinates becomes more expensive. Here, we propose a parsimonious description of virtually infinite-dimensional delay coordinates by evaluating their distances with exponentially decaying weights. This description enables us to predict the future values of the measurements faster because we can reuse the calculated distances, and more accurately because the description naturally reduces the bias of the classical delay coordinates toward the stable directions. We demonstrate the proposed method with toy models of the atmosphere and real datasets related to renewable energy. PMID:26510518
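    The distance recursion that makes this reuse possible can be sketched as follows. This is a hedged toy illustration of exponentially weighted delay-coordinate distances combined with nearest-neighbor prediction, not the authors' code; the decay rate, neighbor count, and test signal are arbitrary choices.

```python
import numpy as np

def weighted_delay_distances(x, lam=0.9):
    """Pairwise squared distances between exponentially weighted delay vectors,
    d2[s, t] = sum_{k>=0} lam**k * (x[s-k] - x[t-k])**2, built by the recursion
    d2[s, t] = (x[s] - x[t])**2 + lam * d2[s-1, t-1], so earlier results are
    reused instead of being recomputed from scratch."""
    n = len(x)
    d2 = np.zeros((n, n))
    for s in range(n):
        for t in range(n):
            prev = d2[s - 1, t - 1] if s > 0 and t > 0 else 0.0
            d2[s, t] = (x[s] - x[t]) ** 2 + lam * prev
    return d2

def predict_next(x, lam=0.9, k_neighbors=5):
    """Nearest-neighbor prediction of the next value from the past states most
    similar to the latest state under the weighted delay-coordinate distance."""
    d2 = weighted_delay_distances(x, lam)
    last = len(x) - 1
    cand = np.argsort(d2[last, :last])[:k_neighbors]   # states with known successors
    return float(np.mean(x[cand + 1]))

# Toy usage: predict the next value of a noisy sine wave
t = np.linspace(0, 20, 300)
x = np.sin(t) + 0.05 * np.random.default_rng(3).normal(size=t.size)
print(predict_next(x))
```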

  5. Building high dimensional imaging database for content based image search

    NASA Astrophysics Data System (ADS)

    Sun, Qinpei; Sun, Jianyong; Ling, Tonghui; Wang, Mingqing; Yang, Yuanyuan; Zhang, Jianguo

    2016-03-01

    In medical imaging informatics, content-based image retrieval (CBIR) techniques are employed to aid radiologists in the retrieval of images with similar image contents. CBIR uses visual contents, normally called image features, to search images from large-scale image databases according to users' requests in the form of a query image. However, most current CBIR systems require a distance computation over image feature vectors to perform a query, and these distance computations can be time consuming when the number of image features grows large, which limits the usability of the systems. In this presentation, we propose a novel framework which uses a high dimensional database to index the image features to improve the accuracy and retrieval speed of CBIR in an integrated RIS/PACS.

  6. High dimensional reflectance analysis of soil organic matter

    NASA Technical Reports Server (NTRS)

    Henderson, T. L.; Baumgardner, M. F.; Franzmeier, D. P.; Stott, D. E.; Coster, D. C.

    1992-01-01

    Recent breakthroughs in remote-sensing technology have led to the development of high spectral resolution imaging sensors for observation of earth surface features. This research was conducted to evaluate the effects of organic matter content and composition on narrowband soil reflectance across the visible and reflective infrared spectral ranges. Organic matter from four Indiana agricultural soils, ranging in organic C content from 0.99 to 1.72 percent, was extracted, fractionated, and purified. Six components of each soil were isolated and prepared for spectral analysis. Reflectance was measured in 210 narrow bands in the 400- to 2500-nm wavelength range. Statistical analysis of reflectance values indicated the potential of high dimensional reflectance data in specific visible, near-infrared, and middle-infrared bands to provide information about soil organic C content, but not organic matter composition. These bands also responded significantly to Fe- and Mn-oxide content.

  7. Modeling for Process Control: High-Dimensional Systems

    SciTech Connect

    Lev S. Tsimring

    2008-09-15

    Many other technologically important systems (among them, powders and other granular systems) are intrinsically nonlinear. This project is focused on building dynamical models for granular systems as a prototype for nonlinear high-dimensional systems exhibiting complex non-equilibrium phenomena. Granular materials present a unique opportunity to study these issues in a technologically important and yet fundamentally interesting setting. Granular systems exhibit a rich variety of regimes from gas-like to solid-like depending on the external excitation. Based on a combination of rigorous asymptotic analysis, available experimental data, and nonlinear signal processing tools, we developed a multi-scale approach to the modeling of granular systems, from a detailed description of grain-grain interactions on the micro-scale to continuous modeling of large-scale granular flows with important geophysical applications.

  8. Spectral feature design in high dimensional multispectral data

    NASA Technical Reports Server (NTRS)

    Chen, Chih-Chien Thomas; Landgrebe, David A.

    1988-01-01

    The High resolution Imaging Spectrometer (HIRIS) is designed to acquire images simultaneously in 192 spectral bands in the 0.4 to 2.5 micrometers wavelength region. It will make possible the collection of essentially continuous reflectance spectra at a spectral resolution sufficient to extract significantly enhanced amounts of information from return signals as compared to existing systems. The advantages of such high dimensional data come at a cost of increased system and data complexity. For example, since the finer the spectral resolution, the higher the data rate, it becomes impractical to design the sensor to be operated continuously. It is essential to find new ways to preprocess the data which reduce the data rate while at the same time maintaining the information content of the high dimensional signal produced. Four spectral feature design techniques are developed from the Weighted Karhunen-Loeve Transforms: (1) non-overlapping band feature selection algorithm; (2) overlapping band feature selection algorithm; (3) Walsh function approach; and (4) infinite clipped optimal function approach. The infinite clipped optimal function approach is chosen since the features are easiest to find and their classification performance is the best. After the preprocessed data has been received at the ground station, canonical analysis is further used to find the best set of features under the criterion that maximal class separability is achieved. Both 100 dimensional vegetation data and 200 dimensional soil data were used to test the spectral feature design system. It was shown that the infinite clipped versions of the first 16 optimal features had excellent classification performance. The overall probability of correct classification is over 90 percent while providing for a reduced downlink data rate by a factor of 10.

  9. Enhanced, targeted sampling of high-dimensional free-energy landscapes using variationally enhanced sampling, with an application to chignolin.

    PubMed

    Shaffer, Patrick; Valsson, Omar; Parrinello, Michele

    2016-02-01

    The capabilities of molecular simulations have been greatly extended by a number of widely used enhanced sampling methods that facilitate escaping from metastable states and crossing large barriers. Despite these developments there are still many problems which remain out of reach for these methods which has led to a vigorous effort in this area. One of the most important problems that remains unsolved is sampling high-dimensional free-energy landscapes and systems that are not easily described by a small number of collective variables. In this work we demonstrate a new way to compute free-energy landscapes of high dimensionality based on the previously introduced variationally enhanced sampling, and we apply it to the miniprotein chignolin. PMID:26787868

  10. Enhanced, targeted sampling of high-dimensional free-energy landscapes using variationally enhanced sampling, with an application to chignolin

    PubMed Central

    Shaffer, Patrick; Valsson, Omar; Parrinello, Michele

    2016-01-01

    The capabilities of molecular simulations have been greatly extended by a number of widely used enhanced sampling methods that facilitate escaping from metastable states and crossing large barriers. Despite these developments there are still many problems which remain out of reach for these methods which has led to a vigorous effort in this area. One of the most important problems that remains unsolved is sampling high-dimensional free-energy landscapes and systems that are not easily described by a small number of collective variables. In this work we demonstrate a new way to compute free-energy landscapes of high dimensionality based on the previously introduced variationally enhanced sampling, and we apply it to the miniprotein chignolin. PMID:26787868

  11. Smart sampling and incremental function learning for very large high dimensional data.

    PubMed

    Loyola R, Diego G; Pedergnana, Mattia; Gimeno García, Sebastián

    2016-06-01

    Very large high dimensional data are common nowadays and they impose new challenges to data-driven and data-intensive algorithms. Computational Intelligence techniques have the potential to provide powerful tools for addressing these challenges, but the current literature focuses mainly on handling scalability issues related to data volume in terms of sample size for classification tasks. This work presents a systematic and comprehensive approach for optimally handling regression tasks with very large high dimensional data. The proposed approach is based on smart sampling techniques for minimizing the number of samples to be generated by using an iterative approach that creates new sample sets until the input and output space of the function to be approximated are optimally covered. Incremental function learning takes place in each sampling iteration: the new samples are used to fine-tune the regression results of the function learning algorithm. The accuracy and confidence levels of the resulting approximation function are assessed using the probably approximately correct computation framework. The smart sampling and incremental function learning techniques can be easily used in practical applications and scale well in the case of extremely large data. The feasibility and good results of the proposed techniques are demonstrated using benchmark functions as well as functions from real-world problems. PMID:26476936
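    As a rough schematic of the sample-then-refit loop described above, the following Python sketch adds a batch of random samples per iteration and refits a regressor until a held-out error tolerance is met. It is only a generic stand-in: the paper's method additionally optimizes where new samples are placed and assesses accuracy within a probably approximately correct framework, and the network size, tolerance, and test function below are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def target(x):
    """Benchmark-style test function to be approximated (illustrative only)."""
    return np.sin(3 * x[:, 0]) * np.cos(2 * x[:, 1]) + 0.5 * x[:, 2]

def iterative_sampling_fit(dim=3, batch=200, tol=0.05, max_iter=10, seed=4):
    """Add a batch of samples per iteration and refit (warm-started) until the
    held-out RMSE drops below `tol` (generic sketch of the sampling loop)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1, 1, size=(batch, dim))
    y = target(X)
    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                         warm_start=True, random_state=seed)
    for it in range(max_iter):
        model.fit(X, y)                               # incremental refit
        X_test = rng.uniform(-1, 1, size=(batch, dim))
        err = np.sqrt(np.mean((model.predict(X_test) - target(X_test)) ** 2))
        print(f"iteration {it}: {len(X)} samples, RMSE = {err:.4f}")
        if err < tol:
            break
        # the held-out batch becomes new training data for the next refit
        X, y = np.vstack([X, X_test]), np.concatenate([y, target(X_test)])
    return model

iterative_sampling_fit()
```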

  12. A method for analysis of phenotypic change for phenotypes described by high-dimensional data.

    PubMed

    Collyer, M L; Sekora, D J; Adams, D C

    2015-10-01

    The analysis of phenotypic change is important for several evolutionary biology disciplines, including phenotypic plasticity, evolutionary developmental biology, morphological evolution, physiological evolution, evolutionary ecology and behavioral evolution. It is common for researchers in these disciplines to work with multivariate phenotypic data. When phenotypic variables exceed the number of research subjects--data called 'high-dimensional data'--researchers are confronted with analytical challenges. Parametric tests that require high observation to variable ratios present a paradox for researchers, as eliminating variables potentially reduces effect sizes for comparative analyses, yet test statistics require more observations than variables. This problem is exacerbated with data that describe 'multidimensional' phenotypes, whereby a description of phenotype requires high-dimensional data. For example, landmark-based geometric morphometric data use the Cartesian coordinates of (potentially) many anatomical landmarks to describe organismal shape. Collectively such shape variables describe organism shape, although the analysis of each variable, independently, offers little benefit for addressing biological questions. Here we present a nonparametric method of evaluating effect size that is not constrained by the number of phenotypic variables, and motivate its use with example analyses of phenotypic change using geometric morphometric data. Our examples contrast different characterizations of body shape for a desert fish species, associated with measuring and comparing sexual dimorphism between two populations. We demonstrate that using more phenotypic variables can increase effect sizes, and allow for stronger inferences. PMID:25204302

  13. A Dynamical Clustering Model of Brain Connectivity Inspired by the N -Body Problem.

    PubMed

    Prasad, Gautam; Burkart, Josh; Joshi, Shantanu H; Nir, Talia M; Toga, Arthur W; Thompson, Paul M

    2013-01-01

    We present a method for studying brain connectivity by simulating a dynamical evolution of the nodes of the network. The nodes are treated as particles, and evolved under a simulated force analogous to gravitational acceleration in the well-known N -body problem. The particle nodes correspond to regions of the cortex. The locations of particles are defined as the centers of the respective regions on the cortex and their masses are proportional to each region's volume. The force of attraction is modeled on the gravitational force, and explicitly made proportional to the elements of a connectivity matrix derived from diffusion imaging data. We present experimental results of the simulation on a population of 110 subjects from the Alzheimer's Disease Neuroimaging Initiative (ADNI), consisting of healthy elderly controls, early mild cognitively impaired (eMCI), late MCI (LMCI), and Alzheimer's disease (AD) patients. Results show significant differences in the dynamic properties of connectivity networks in healthy controls, compared to eMCI as well as AD patients. PMID:25340177
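    The simulation step described above can be sketched as follows. This is a hedged, simplified illustration of evolving nodes as particles under a gravity-like force scaled by the connectivity matrix; the integration scheme, damping, softening term, and toy connectivity matrix are illustrative choices, not the parameters used in the study.

```python
import numpy as np

def simulate_connectivity_nbody(pos, mass, conn, steps=200, dt=0.01,
                                damping=0.9, eps=1e-3):
    """Evolve region nodes as particles under an attraction F_ij proportional to
    conn_ij * m_i * m_j / r_ij**2, directed toward node j (simplified sketch)."""
    pos = pos.copy()
    vel = np.zeros_like(pos)
    n = len(pos)
    for _ in range(steps):
        acc = np.zeros_like(pos)
        for i in range(n):
            diff = pos - pos[i]                          # vectors toward all others
            dist = np.linalg.norm(diff, axis=1) + eps    # softened distances
            f = (conn[i] * mass * mass[i] / dist ** 2)[:, None] * diff / dist[:, None]
            acc[i] = f.sum(axis=0) / mass[i]
        vel = damping * (vel + dt * acc)                 # damped velocity update
        pos = pos + dt * vel
    return pos

# Toy usage: 6 nodes, two strongly connected triads contract into two clusters
rng = np.random.default_rng(5)
pos = rng.uniform(-1, 1, size=(6, 3))
mass = np.ones(6)
conn = np.zeros((6, 6))
conn[:3, :3] = conn[3:, 3:] = 1.0
np.fill_diagonal(conn, 0.0)
print(simulate_connectivity_nbody(pos, mass, conn))
```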

  14. A Dynamical Clustering Model of Brain Connectivity Inspired by the N -Body Problem

    PubMed Central

    Prasad, Gautam; Burkart, Josh; Joshi, Shantanu H.; Nir, Talia M.; Toga, Arthur W.; Thompson, Paul M.

    2014-01-01

    We present a method for studying brain connectivity by simulating a dynamical evolution of the nodes of the network. The nodes are treated as particles, and evolved under a simulated force analogous to gravitational acceleration in the well-known N -body problem. The particle nodes correspond to regions of the cortex. The locations of particles are defined as the centers of the respective regions on the cortex and their masses are proportional to each region’s volume. The force of attraction is modeled on the gravitational force, and explicitly made proportional to the elements of a connectivity matrix derived from diffusion imaging data. We present experimental results of the simulation on a population of 110 subjects from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), consisting of healthy elderly controls, early mild cognitively impaired (eMCI), late MCI (LMCI), and Alzheimer’s disease (AD) patients. Results show significant differences in the dynamic properties of connectivity networks in healthy controls, compared to eMCI as well as AD patients. PMID:25340177

  15. Mining Approximate Order Preserving Clusters in the Presence of Noise

    PubMed Central

    Zhang, Mengsheng; Wang, Wei; Liu, Jinze

    2010-01-01

    Subspace clustering has attracted great attention due to its capability of finding salient patterns in high dimensional data. Order preserving subspace clusters have been proven to be important in high throughput gene expression analysis, since functionally related genes are often co-expressed under a set of experimental conditions. Such co-expression patterns can be represented by consistent orderings of attributes. Existing order preserving cluster models require that all objects in a cluster have an identical attribute order without deviation. However, real data are noisy due to measurement technology limitations and experimental variability, which prevents these strict models from revealing true clusters corrupted by noise. In this paper, we study the problem of revealing the order preserving clusters in the presence of noise. We propose a noise-tolerant model called approximate order preserving cluster (AOPC). Instead of requiring that all objects in a cluster have an identical attribute order, we require that (1) at least a certain fraction of the objects have identical attribute order; (2) other objects in the cluster may deviate from the consensus order by up to a certain fraction of attributes. We also propose an algorithm to mine AOPC. Experiments on gene expression data demonstrate the efficiency and effectiveness of our algorithm. PMID:20689652

  16. immunoClust--An automated analysis pipeline for the identification of immunophenotypic signatures in high-dimensional cytometric datasets.

    PubMed

    Sörensen, Till; Baumgart, Sabine; Durek, Pawel; Grützkau, Andreas; Häupl, Thomas

    2015-07-01

    Multiparametric fluorescence and mass cytometry offer new perspectives to disclose and to monitor the high diversity of cell populations in the peripheral blood for biomarker research. While high-end cytometric devices are currently available to detect theoretically up to 120 individual parameters at the single cell level, software tools are needed to analyze these complex datasets automatically in acceptable time and without operator bias or knowledge. We developed an automated analysis pipeline, immunoClust, for uncompensated fluorescence and mass cytometry data, which consists of two parts. First, cell events of each sample are grouped into individual clusters. Subsequently, a classification algorithm assorts these cell event clusters into populations comparable between different samples. The clustering of cell events is designed for datasets with large event counts in high dimensions as a global unsupervised method that is sensitive enough to identify rare cell types even next to large populations. Both parts use model-based clustering with an iterative expectation maximization algorithm and the integrated classification likelihood to obtain the clusters. A detailed description of both algorithms is presented. Testing and validation were performed using 1) blood cell samples of defined composition that were depleted of particular cell subsets by magnetic cell sorting, 2) datasets of the FlowCAP III challenges to identify populations of rare cell types and 3) high-dimensional fluorescence and mass-cytometry datasets for comparison with conventional manual gating procedures. In conclusion, the immunoClust-algorithm is a promising tool to standardize and automate the analysis of high-dimensional cytometric datasets. As a prerequisite for interpretation of such data, it will support our efforts in developing immunological biomarkers for chronic inflammatory disorders and therapy recommendations in personalized medicine. immunoClust is implemented as an R-package and is
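    The core ingredients named above (model-based clustering fitted by expectation maximization and selected with the integrated classification likelihood) can be sketched schematically. immunoClust itself is an R package with a more elaborate iterative scheme, so the Python snippet below is only a loose analogue; the ICL formula shown, the covariance type, and the simulated data are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def icl(gmm, X):
    """Integrated completed likelihood: BIC plus an entropy penalty on the
    posterior cluster assignments (lower is better; schematic version)."""
    resp = gmm.predict_proba(X)
    entropy = -np.sum(resp * np.log(resp + 1e-12))
    return gmm.bic(X) + 2.0 * entropy

def model_based_clustering(X, k_max=10, random_state=0):
    """Fit EM-based Gaussian mixtures for k = 1..k_max and keep the model
    with the best ICL (loose analogue of model-based cluster selection)."""
    best, best_icl = None, np.inf
    for k in range(1, k_max + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              random_state=random_state).fit(X)
        score = icl(gmm, X)
        if score < best_icl:
            best, best_icl = gmm, score
    return best, best.predict(X)

# Toy usage on simulated data with three populations in five dimensions
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(m, 0.4, size=(300, 5)) for m in (0, 2, 5)])
model, labels = model_based_clustering(X, k_max=6)
print("selected number of populations:", model.n_components)
```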

  17. A Hyperspherical Adaptive Sparse-Grid Method for High-Dimensional Discontinuity Detection

    SciTech Connect

    Zhang, Guannan; Webster, Clayton G.; Gunzburger, Max D.; Burkardt, John V.

    2015-06-24

    This study proposes and analyzes a hyperspherical adaptive hierarchical sparse-grid method for detecting jump discontinuities of functions in high-dimensional spaces. The method is motivated by the theoretical and computational inefficiencies of well-known adaptive sparse-grid methods for discontinuity detection. Our novel approach constructs a function representation of the discontinuity hypersurface of an N-dimensional discontinuous quantity of interest, by virtue of a hyperspherical transformation. Then, a sparse-grid approximation of the transformed function is built in the hyperspherical coordinate system, whose value at each point is estimated by solving a one-dimensional discontinuity detection problem. Due to the smoothness of the hypersurface, the new technique can identify jump discontinuities with significantly reduced computational cost, compared to existing methods. In addition, hierarchical acceleration techniques are also incorporated to further reduce the overall complexity. Rigorous complexity analyses of the new method are provided as are several numerical examples that illustrate the effectiveness of the approach.

  18. High dimensional data analysis using multivariate generalized spatial quantiles

    PubMed Central

    Mukhopadhyay, Nitai D.; Chatterjee, Snigdhansu

    2015-01-01

    High dimensional data routinely arises in image analysis, genetic experiments, network analysis, and various other research areas. Many such datasets do not correspond to well-studied probability distributions, and in several applications the data-cloud prominently displays non-symmetric and non-convex shape features. We propose using spatial quantiles and their generalizations, in particular, the projection quantile, for describing, analyzing and conducting inference with multivariate data. Minimal assumptions are made about the nature and shape characteristics of the underlying probability distribution, and we do not require the sample size to be as high as the data-dimension. We present theoretical properties of the generalized spatial quantiles, and an algorithm to compute them quickly. Our quantiles may be used to obtain multidimensional confidence or credible regions that are not required to conform to a pre-determined shape. We also propose a new notion of multidimensional order statistics, which may be used to obtain multidimensional outliers. Many of the features revealed using a generalized spatial quantile-based analysis would be missed if the data was shoehorned into a well-known probabilistic configuration. PMID:26617421

  19. Predicting Viral Infection From High-Dimensional Biomarker Trajectories

    PubMed Central

    Chen, Minhua; Zaas, Aimee; Woods, Christopher; Ginsburg, Geoffrey S.; Lucas, Joseph; Dunson, David; Carin, Lawrence

    2013-01-01

    There is often interest in predicting an individual’s latent health status based on high-dimensional biomarkers that vary over time. Motivated by time-course gene expression array data that we have collected in two influenza challenge studies performed with healthy human volunteers, we develop a novel time-aligned Bayesian dynamic factor analysis methodology. The time course trajectories in the gene expressions are related to a relatively low-dimensional vector of latent factors, which vary dynamically starting at the latent initiation time of infection. Using a nonparametric cure rate model for the latent initiation times, we allow selection of the genes in the viral response pathway, variability among individuals in infection times, and a subset of individuals who are not infected. As we demonstrate using held-out data, this statistical framework allows accurate predictions of infected individuals in advance of the development of clinical symptoms, without labeled data and even when the number of biomarkers vastly exceeds the number of individuals under study. Biological interpretation of several of the inferred pathways (factors) is provided. PMID:23704802

  20. An efficient chemical kinetics solver using high dimensional model representation

    SciTech Connect

    Shorter, J.A.; Ip, P.C.; Rabitz, H.A.

    1999-09-09

    A high dimensional model representation (HDMR) technique is introduced to capture the input-output behavior of chemical kinetic models. The HDMR expresses the output chemical species concentrations as a rapidly convergent hierarchical correlated function expansion in the input variables. In this paper, the input variables are taken as the species concentrations at time t{sub i} and the output is the concentrations at time t{sub i} + {delta}, where {delta} can be much larger than conventional integration time steps. A specially designed set of model runs is performed to determine the correlated functions making up the HDMR. The resultant HDMR can be used to (1) identify the key input variables acting independently or cooperatively on the output, and (2) create a high speed fully equivalent operational model (FEOM) serving to replace the original kinetic model and its differential equation solver. A demonstration of the HDMR technique is presented for stratospheric chemical kinetics. The FEOM proved to give accurate and stable chemical concentrations out to long times of many years. In addition, the FEOM was found to be orders of magnitude faster than a conventional stiff equation solver. This computational acceleration should have significance in many chemical kinetic applications.
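    The hierarchical correlated function expansion referred to above is the standard HDMR form, written below in generic notation (not the paper's exact notation):

```latex
% High dimensional model representation (HDMR) of an n-variable model output
f(x_1, \dots, x_n) = f_0
  + \sum_{i=1}^{n} f_i(x_i)
  + \sum_{1 \le i < j \le n} f_{ij}(x_i, x_j)
  + \cdots
  + f_{1 2 \cdots n}(x_1, \dots, x_n),
% where f_0 is the mean response and the low-order component functions usually
% dominate, which is what makes the expansion rapidly convergent in practice.
```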

  1. High-dimensional quantum cryptography with twisted light

    NASA Astrophysics Data System (ADS)

    Mirhosseini, Mohammad; Magaña-Loaiza, Omar S.; O'Sullivan, Malcolm N.; Rodenburg, Brandon; Malik, Mehul; Lavery, Martin P. J.; Padgett, Miles J.; Gauthier, Daniel J.; Boyd, Robert W.

    2015-03-01

    Quantum key distribution (QKD) systems often rely on polarization of light for encoding, thus limiting the amount of information that can be sent per photon and placing tight bounds on the error rates that such a system can tolerate. Here we describe a proof-of-principle experiment that indicates the feasibility of high-dimensional QKD based on the transverse structure of the light field allowing for the transfer of more than 1 bit per photon. Our implementation uses the orbital angular momentum (OAM) of photons and the corresponding mutually unbiased basis of angular position (ANG). Our experiment uses a digital micro-mirror device for the rapid generation of OAM and ANG modes at 4 kHz, and a mode sorter capable of sorting single photons based on their OAM and ANG content with a separation efficiency of 93%. Through the use of a seven-dimensional alphabet encoded in the OAM and ANG bases, we achieve a channel capacity of 2.05 bits per sifted photon. Our experiment demonstrates that, in addition to having an increased information capacity, multilevel QKD systems based on spatial-mode encoding can be more resilient against intercept-resend eavesdropping attacks.

  2. The Effectiveness of the BITSEA as a Tool to Early Detect Psychosocial Problems in Toddlers, a Cluster Randomized Trial

    PubMed Central

    Kruizinga, Ingrid; Jansen, Wilma; van Sprang, Nicolien C.; Carter, Alice S.; Raat, Hein

    2015-01-01

    Objective Effective early detection tools are needed in child health care to detect psychosocial problems among young children. This study aimed to evaluate the effectiveness of the Brief Infant-Toddler Social and Emotional Assessment (BITSEA), in reducing psychosocial problems at one year follow-up, compared to care as usual. Method Well-child centers in Rotterdam, the Netherlands, were allocated in a cluster randomized controlled trial to the intervention condition (BITSEA—15 centers), or to the control condition (‘care-as-usual’- 16 centers). Parents of 2610 2-year-old children (1,207 intervention; 1,403 control) provided informed consent and completed the baseline and 1-year follow-up questionnaire. Multilevel regression analyses were used to evaluate the effect of condition on psychosocial problems and health related quality of life (i.e. respectively Child Behavior Checklist and Infant-Toddler Quality of Life). The number of (pursuits of) referrals and acceptability of the BITSEA were also evaluated. Results Children in the intervention condition scored more favourably on the CBCL at follow-up than children in the control condition: B = -2.43 (95% confidence interval [95%CI] = -3.53;-1.33 p<0.001). There were no differences between conditions regarding ITQOL. Child health professionals reported referring fewer children in the intervention condition (n = 56, 5.7%), compared to the control condition (n = 95, 7.9%; p<0.05). There was no intervention effect on parents’ reported number of referrals pursued. It took less time to complete (parents) or work with (child health professional) the BITSEA, compared to care as usual. In the control condition, 84.2% of the parents felt (very) well prepared for the well-child visit, compared to 77.9% in the intervention condition (p<0.001). Conclusion The results support the use of the BITSEA as a tool for child health professionals in the early detection of psychosocial problems in 2-year-olds. We recommend future

  3. Local polynomial chaos expansion for linear differential equations with high dimensional random inputs

    SciTech Connect

    Chen, Yi; Jakeman, John; Gittelson, Claude; Xiu, Dongbin

    2015-01-08

    In this paper we present a localized polynomial chaos expansion for partial differential equations (PDEs) with random inputs. In particular, we focus on time-independent linear stochastic problems with high dimensional random inputs, where the traditional polynomial chaos methods, and most of the existing methods, incur prohibitively high simulation cost. Furthermore, the local polynomial chaos method employs a domain decomposition technique to approximate the stochastic solution locally. In each subdomain, a subdomain problem is solved independently and, more importantly, in a much lower dimensional random space. In a postprocessing stage, accurate samples of the original stochastic problems are obtained from the samples of the local solutions by enforcing the correct stochastic structure of the random inputs and the coupling conditions at the interfaces of the subdomains. Overall, the method is able to solve stochastic PDEs in very large dimensions by solving a collection of low dimensional local problems and can be highly efficient. In our paper we present the general mathematical framework of the methodology and use numerical examples to demonstrate the properties of the method.

  4. Genuinely high-dimensional nonlocality optimized by complementary measurements

    NASA Astrophysics Data System (ADS)

    Lim, James; Ryu, Junghee; Yoo, Seokwon; Lee, Changhyoup; Bang, Jeongho; Lee, Jinhyoung

    2010-10-01

    Qubits exhibit extreme nonlocality when their state is maximally entangled and this is observed by mutually unbiased local measurements. This criterion does not hold for the Bell inequalities of high-dimensional systems (qudits), recently proposed by Collins-Gisin-Linden-Massar-Popescu and Son-Lee-Kim. Taking an alternative approach, called the quantum-to-classical approach, we derive a series of Bell inequalities for qudits that satisfy the criterion as for the qubits. In the derivation each d-dimensional subsystem is assumed to be measured by one of d possible measurements with d being a prime integer. By applying to two qubits (d=2), we find that a derived inequality is reduced to the Clauser-Horne-Shimony-Holt inequality when the degree of nonlocality is optimized over all the possible states and local observables. Further applying to two and three qutrits (d=3), we find Bell inequalities that are violated for the three-dimensionally entangled states but are not violated by any two-dimensionally entangled states. In other words, the inequalities discriminate three-dimensional (3D) entanglement from two-dimensional (2D) entanglement and in this sense they are genuinely 3D. In addition, for the two qutrits we give a quantitative description of the relations among the three degrees of complementarity, entanglement and nonlocality. It is shown that the degree of complementarity jumps abruptly to very close to its maximum as nonlocality starts appearing. These characteristics imply that complementarity plays a more significant role in the present inequality compared with the previously proposed inequality.

  5. Cluster and SOHO - A joint endeavor by ESA and NASA to address problems in solar, heliospheric, and space plasma physics

    NASA Technical Reports Server (NTRS)

    Schmidt, Rudolf; Domingo, Vicente; Shawhan, Stanley D.; Bohlin, David

    1988-01-01

    The NASA/ESA Solar-Terrestrial Science Program, which consists of the four-spacecraft cluster mission and the Solar and Heliospheric Observatory (SOHO), is examined. It is expected that the SOHO spacecraft will be launched in 1995 to study solar interior structure and the physical processes associated with the solar corona. The SOHO design, operation, data, and ground segment are discussed. The Cluster mission is designed to study small-scale structures in the earth's plasma environment. The Soviet Union is expected to contribute two additional spacecraft, which will be similar to Cluster in instrumentation and design. The capabilities, mission strategy, spacecraft design, payload, and ground segment of Cluster are discussed.

  6. A comprehensive analysis of earthquake damage patterns using high dimensional model representation feature selection

    NASA Astrophysics Data System (ADS)

    Taşkin Kaya, Gülşen

    2013-10-01

    High dimensional model representation (HDMR) is a tool for capturing input-output relationships in high-dimensional systems for many problems in science and engineering. The HDMR method is developed to improve the efficiency of deducing high-dimensional behaviors. The method is formed by a particular organization of low-dimensional component functions, in which each function is the contribution of one or more input variables to the output variables.

  7. High dimensional spatial modeling of extremes with applications to United States Rainfalls

    NASA Astrophysics Data System (ADS)

    Zhou, Jie

    2007-12-01

    Spatial statistical models are used to predict unobserved variables based on observed variables and to estimate unknown model parameters. Extreme value theory (EVT) is used to study large or small observations from a random phenomenon. Both spatial statistics and extreme value theory have been studied in many areas such as agriculture, finance, industry and environmental science. This dissertation proposes two spatial statistical models which concentrate on non-Gaussian probability densities with general spatial covariance structures. The two models are also applied in analyzing United States rainfalls and, especially, rainfall extremes. When the data set is not too large, the first model is used. The model constructs a generalized linear mixed model (GLMM) which can be considered as an extension of Diggle's model-based geostatistical approach (Diggle et al. 1998). The approach improves conventional kriging with a form of generalized linear mixed structure. As for high dimensional problems, two different methods are established to improve the computational efficiency of Markov Chain Monte Carlo (MCMC) implementation. The first method is based on spectral representation of spatial dependence structures which provides good approximations on each MCMC iteration. The other method embeds high dimensional covariance matrices in matrices with block circulant structures. The eigenvalues and eigenvectors of block circulant matrices can be calculated exactly by Fast Fourier Transforms (FFT). The computational efficiency is gained by transforming the posterior matrices into lower dimensional matrices. This method gives an exact update on each MCMC iteration. Future predictions are also made by keeping spatial dependence structures fixed and using the relationship between present days and future days provided by some Global Climate Model (GCM). The predictions are refined by sampling techniques. Both ways of handling high dimensional covariance matrices are novel to analyze large
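    The second computational device mentioned above, diagonalizing circulant structure with the FFT, can be illustrated in one dimension with the classical circulant-embedding sampler for a stationary Gaussian process. The sketch below is a hedged, simplified analogue of the block-circulant construction used for spatial fields; the exponential covariance and grid size are arbitrary, and it assumes the circulant embedding is non-negative definite.

```python
import numpy as np

def circulant_embedding_sample(cov_fn, n, rng):
    """Sample a stationary Gaussian process on n grid points by embedding its
    Toeplitz covariance in a circulant matrix of size 2n, whose eigenvalues are
    obtained exactly with one FFT of the first row (1-D sketch)."""
    m = 2 * n
    lags = np.minimum(np.arange(m), m - np.arange(m))   # circulant first row lags
    row = cov_fn(lags.astype(float))
    eig = np.fft.fft(row).real                  # circulant eigenvalues via FFT
    eig = np.clip(eig, 0.0, None)               # guard against tiny negative values
    z = rng.normal(size=m) + 1j * rng.normal(size=m)
    field = np.fft.fft(np.sqrt(eig / m) * z)    # correlated complex sample
    return field.real[:n]                       # one real sample of length n

# Toy usage: exponential covariance exp(-|h| / 10) on 500 grid points
rng = np.random.default_rng(7)
sample = circulant_embedding_sample(lambda h: np.exp(-h / 10.0), 500, rng)
print(sample[:5])
```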

  8. Fast and accurate probability density estimation in large high dimensional astronomical datasets

    NASA Astrophysics Data System (ADS)

    Gupta, Pramod; Connolly, Andrew J.; Gardner, Jeffrey P.

    2015-01-01

    Astronomical surveys will generate measurements of hundreds of attributes (e.g. color, size, shape) on hundreds of millions of sources. Analyzing these large, high dimensional data sets will require efficient algorithms for data analysis. An example is probability density estimation, which is at the heart of many classification problems such as the separation of stars and quasars based on their colors. Popular density estimation techniques use binning or kernel density estimation. Kernel density estimation has a small memory footprint but often requires large computational resources. Binning has small computational requirements, but binning is usually implemented with multi-dimensional arrays, which leads to memory requirements that scale exponentially with the number of dimensions. Hence neither technique scales well to large data sets in high dimensions. We present an alternative approach of binning implemented with hash tables (BASH tables). This approach uses the sparseness of data in the high dimensional space to ensure that the memory requirements are small. However, hashing requires some extra computation, so a priori it is not clear if the reduction in memory requirements will lead to increased computational requirements. Through an implementation of BASH tables in C++, we show that the additional computational requirements of hashing are negligible. Hence this approach has small memory and computational requirements. We apply our density estimation technique to photometric selection of quasars using non-parametric Bayesian classification and show that the accuracy of the classification is the same as the accuracy of earlier approaches. Since the BASH table approach is one to three orders of magnitude faster than the earlier approaches, it may be useful in various other applications of density estimation in astrostatistics.
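    The memory argument above is easy to see in code: a hash table keyed by the tuple of bin indices stores counts only for occupied cells, so storage grows with the number of occupied bins rather than with the number of bins per dimension raised to the number of dimensions. The Python sketch below is a hedged illustration of this binning idea, not the authors' C++ BASH-table implementation; the class name, bin width, and toy data are illustrative.

```python
import numpy as np
from collections import defaultdict

class HashBinnedDensity:
    """Density estimation by binning with a hash table: only occupied bins are
    stored, so memory grows with the number of occupied cells rather than with
    bins_per_dim ** n_dims (sketch of the hash-binning idea)."""
    def __init__(self, bin_width):
        self.bin_width = bin_width
        self.counts = defaultdict(int)
        self.n = 0

    def _key(self, x):
        return tuple(np.floor(np.asarray(x) / self.bin_width).astype(int))

    def fit(self, X):
        for x in X:
            self.counts[self._key(x)] += 1
        self.n += len(X)
        return self

    def density(self, x):
        cell_volume = self.bin_width ** len(x)
        return self.counts.get(self._key(x), 0) / (self.n * cell_volume)

# Toy usage: 6-dimensional Gaussian "colors", density queried at two points
rng = np.random.default_rng(8)
X = rng.normal(size=(100_000, 6))
est = HashBinnedDensity(bin_width=0.5).fit(X)
print(est.density(np.zeros(6)), est.density(np.full(6, 4.0)))
```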

  9. Multiscale hierarchical support vector clustering

    NASA Astrophysics Data System (ADS)

    Hansen, Michael Saas; Holm, David Alberg; Sjöstrand, Karl; Ley, Carsten Dan; Rowland, Ian John; Larsen, Rasmus

    2008-03-01

    Clustering is the preferred choice of method in many applications, and support vector clustering (SVC) has proven efficient for clustering noisy and high-dimensional data sets. A method for multiscale support vector clustering is demonstrated, using the recently emerged method for fast calculation of the entire regularization path of the support vector domain description. The method is illustrated on artificially generated examples, and applied for detecting blood vessels from high resolution time series of magnetic resonance imaging data. The obtained results are robust while the need for parameter estimation is reduced, compared to support vector clustering.

  10. Möbius transformational high dimensional model representation on multi-way arrays

    NASA Astrophysics Data System (ADS)

    Özay, Evrim Korkmaz

    2012-09-01

    Transformational High Dimensional Model Representation has previously been applied to continuous structures with various transformations. This work is novel both in the type of transformation it uses and in where it is applied: Möbius Transformational High Dimensional Model Representation is applied to multi-way arrays, and by using a truncation approximant together with the inverse transformation, an approximation to the original multi-way array is obtained.

  11. The problem of the structure (state of helium) in small He{sub N}-CO clusters

    SciTech Connect

    Potapov, A. V.; Panfilov, V. A.; Surin, L. A.; Dumesh, B. S.

    2010-11-15

    A second-order perturbation theory, developed for calculating the energy levels of the He-CO binary complex, is applied to small He{sub N}-CO clusters with N = 2-4, the helium atoms being considered as a single bound object. The interaction potential between the CO molecule and He{sub N} is represented as a linear expansion in Legendre polynomials, in which the free rotation limit is chosen as the zero approximation and the angular dependence of the interaction is considered as a small perturbation. By fitting calculated rotational transitions to experimental values it was possible to determine the optimal parameters of the potential and to achieve good agreement (to within less than 1%) between calculated and experimental energy levels. As a result, the shape of the angular anisotropy of the interaction potential is obtained for various clusters. It turns out that the minimum of the potential energy is smoothly shifted from an angle between the axes of the CO molecule and the cluster of {theta} = 100{sup o} in He-CO to {theta} = 180{sup o} (the oxygen end) in He{sub 3}-CO and He{sub 4}-CO clusters. Under the assumption that the distribution of helium atoms with respect to the cluster axis is cylindrically symmetric, the structure of the cluster can be represented as a pyramid with the CO molecule at the vertex.
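
    Written out, the "linear expansion in Legendre polynomials" mentioned above has the standard atom-molecule form (a sketch of the notation only, not the authors' fitted coefficients; R denotes the distance between the CO centre of mass and the helium moiety and {theta} the angle defined in the abstract):

    ```latex
    V(R,\theta) \;=\; \sum_{\lambda=0}^{\lambda_{\max}} V_\lambda(R)\, P_\lambda(\cos\theta),
    \qquad
    \hat{H} \;=\; \underbrace{\hat{H}_{\mathrm{free\;rot}} + V_0(R)}_{\text{zeroth order}}
          \;+\; \underbrace{\sum_{\lambda \ge 1} V_\lambda(R)\, P_\lambda(\cos\theta)}_{\text{small perturbation}}
    ```

    The radial coefficients V_lambda(R) are the parameters adjusted so that the calculated rotational transitions match the measured ones.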

  12. Lorentzian sparsity based spectroscopic reconstruction for fast high-dimensional magnetic resonance spectroscopy

    NASA Astrophysics Data System (ADS)

    Jiang, Boyu; Hu, Xiaoping; Gao, Hao

    2016-01-01

    Two-dimensional magnetic resonance spectroscopy (2D MRS) is challenging, even with state-of-the-art compressive sensing methods such as the L1-sparsity method. In this work, using the prior that a 2D MRS spectrum can be regarded as a series of Lorentzian functions, we aim to develop a robust Lorentzian-sparsity based spectroscopy reconstruction method for high-dimensional MRS. The proposed method sparsifies 2D MRS in Lorentzian functions. Instead of thousands of pixel-wise variables, this Lorentzian-sparsity method significantly reduces the number of unknowns to several geometric variables, such as the center, magnitude and shape parameters of each Lorentzian function. The spectroscopy reconstruction is formulated as a nonlinear and nonconvex optimization problem, and a simulated annealing algorithm is developed to solve it. The proposed method was compared with the inverse FFT method and the L1-sparsity method under various undersampling factors. While the FFT and L1 results contained severe artifacts, the Lorentzian-sparsity results provided significantly improved spectra. In summary, a new 2D MRS reconstruction method is proposed using Lorentzian sparsity, with significantly improved reconstruction quality in comparison with the standard inverse FFT method and the state-of-the-art L1-sparsity method.
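
    As a toy stand-in for the reconstruction described above (the actual problem is two-dimensional and works from undersampled data), the sketch below recovers the geometric parameters of a single Lorentzian line from noisy 1D samples with SciPy's dual_annealing; the data, bounds and annealing settings are illustrative.

    ```python
    import numpy as np
    from scipy.optimize import dual_annealing

    def lorentzian(f, center, width, amplitude):
        """Standard Lorentzian line shape."""
        return amplitude * width**2 / ((f - center) ** 2 + width**2)

    # Synthetic "measured" spectrum: one Lorentzian peak plus noise.
    rng = np.random.default_rng(0)
    freqs = np.linspace(-5.0, 5.0, 64)
    truth = (1.2, 0.4, 3.0)                                  # center, width, amplitude
    data = lorentzian(freqs, *truth) + 0.05 * rng.normal(size=freqs.size)

    def misfit(params):
        """Least-squares data misfit in the three geometric unknowns."""
        return np.sum((lorentzian(freqs, *params) - data) ** 2)

    bounds = [(-5.0, 5.0), (0.05, 2.0), (0.0, 10.0)]
    result = dual_annealing(misfit, bounds, seed=0, maxiter=300)
    print("recovered (center, width, amplitude):", np.round(result.x, 3))
    ```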

  13. SPReM: Sparse Projection Regression Model For High-dimensional Linear Regression *

    PubMed Central

    Sun, Qiang; Zhu, Hongtu; Liu, Yufeng; Ibrahim, Joseph G.

    2014-01-01

    The aim of this paper is to develop a sparse projection regression modeling (SPReM) framework to perform multivariate regression modeling with a large number of responses and a multivariate covariate of interest. We propose two novel heritability ratios to simultaneously perform dimension reduction, response selection, estimation, and testing, while explicitly accounting for correlations among multivariate responses. Our SPReM is devised to specifically address the low statistical power issue of many standard statistical approaches, such as Hotelling's T^2 test statistic or a mass univariate analysis, for high-dimensional data. We formulate the estimation problem of SPReM as a novel sparse unit rank projection (SURP) problem and propose a fast optimization algorithm for SURP. Furthermore, we extend SURP to the sparse multi-rank projection (SMURP) by adopting a sequential SURP approximation. Theoretically, we have systematically investigated the convergence properties of SURP and the convergence rate of SURP estimates. Our simulation results and real data analysis have shown that SPReM outperforms other state-of-the-art methods. PMID:26527844

  14. Effects of a Research-Based Intervention to Improve Seventh-Grade Students' Proportional Problem Solving: A Cluster Randomized Trial

    ERIC Educational Resources Information Center

    Jitendra, Asha K.; Harwell, Michael R.; Dupuis, Danielle N.; Karl, Stacy R.; Lein, Amy E.; Simonson, Gregory; Slater, Susan C.

    2015-01-01

    This experimental study evaluated the effectiveness of a research-based intervention, schema-based instruction (SBI), on students' proportional problem solving. SBI emphasizes the underlying mathematical structure of problems, uses schematic diagrams to represent information in the problem text, provides explicit problem-solving and metacognitive…

  15. The Problem of Hipparcos Distances to Open Clusters. II. Constraints from Nearby Field Stars

    NASA Technical Reports Server (NTRS)

    Soderblom, David R.; King, Jeremy R.; Hanson, Robert B.; Jones, Burton F.; Fischer, Debra; Stauffer, John R.; Pinsonneault, Marc H.

    1998-01-01

    This paper examines the discrepancy between distances to nearby open clusters as determined by parallaxes from Hipparcos compared to traditional main-sequence fitting. The biggest difference is seen for the Pleiades, and our hypothesis is that if the Hipparcos distance to the Pleiades is correct, then similar subluminous zero-age main-sequence (ZAMS) stars should exist elsewhere, including in the immediate solar neighborhood. We examine a color-magnitude diagram of very young and nearby solar-type stars and show that none of them lie below the traditional ZAMS, despite the fact that the Hipparcos Pleiades parallax would place its members 0.3 mag below that ZAMS. We also present analyses and observations of solar-type stars that do lie below the ZAMS, and we show that they are subluminous because of low metallicity and that they have the kinematics of old stars.

  16. High-dimensional analysis of the murine myeloid cell system.

    PubMed

    Becher, Burkhard; Schlitzer, Andreas; Chen, Jinmiao; Mair, Florian; Sumatoh, Hermi R; Teng, Karen Wei Weng; Low, Donovan; Ruedl, Christiane; Riccardi-Castagnoli, Paola; Poidinger, Michael; Greter, Melanie; Ginhoux, Florent; Newell, Evan W

    2014-12-01

    Advances in cell-fate mapping have revealed the complexity in phenotype, ontogeny and tissue distribution of the mammalian myeloid system. To capture this phenotypic diversity, we developed a 38-antibody panel for mass cytometry and used dimensionality reduction with machine learning-aided cluster analysis to build a composite of murine (mouse) myeloid cells in the steady state across lymphoid and nonlymphoid tissues. In addition to identifying all previously described myeloid populations, higher-order analysis allowed objective delineation of otherwise ambiguous subsets, including monocyte-macrophage intermediates and an array of granulocyte variants. Using mice that cannot sense granulocyte-macrophage colony-stimulating factor (GM-CSF; Csf2rb(-/-)), which have discrete alterations in myeloid development, we confirmed differences in barrier tissue dendritic cells, lung macrophages and eosinophils. The methodology further identified unexpected variations in the monocyte and innate lymphoid cell compartments, confirming that this approach is a powerful tool for unambiguous and unbiased characterization of the myeloid system. PMID:25306126

  17. Finite-key analysis of a practical decoy-state high-dimensional quantum key distribution

    NASA Astrophysics Data System (ADS)

    Bao, Haize; Bao, Wansu; Wang, Yang; Zhou, Chun; Chen, Ruike

    2016-05-01

    Compared with two-level quantum key distribution (QKD), high-dimensional QKD enables two distant parties to share a secret key at a higher rate. We provide a finite-key security analysis for the recently proposed practical high-dimensional decoy-state QKD protocol based on time-energy entanglement. We employ two methods to estimate the statistical fluctuation of the postselection probability and give a tighter bound on the secure-key capacity. By numerical evaluation, we show the finite-key effect on the secure-key capacity in different conditions. Moreover, our approach could be used to optimize parameters in practical implementations of high-dimensional QKD.

  18. Linear stability theory as an early warning sign for transitions in high dimensional complex systems

    NASA Astrophysics Data System (ADS)

    Piovani, Duccio; Grujić, Jelena; Jeldtoft Jensen, Henrik

    2016-07-01

    We analyse in detail a new approach to the monitoring and forecasting of the onset of transitions in high dimensional complex systems by application to the Tangled Nature model of evolutionary ecology and high dimensional replicator systems with a stochastic element. A high dimensional stability matrix is derived in the mean field approximation to the stochastic dynamics. This allows us to determine the stability spectrum about the observed quasi-stable configurations. From overlap of the instantaneous configuration vector of the full stochastic system with the eigenvectors of the unstable directions of the deterministic mean field approximation, we are able to construct a good early-warning indicator of the transitions occurring intermittently.
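
    A generic sketch of such an indicator, assuming one has a mean-field drift function and a quasi-stable configuration in hand: the stability matrix is built by finite differences, and the overlap of the instantaneous fluctuation with the unstable eigen-directions serves as the early-warning signal. The toy drift is a placeholder, not the Tangled Nature or replicator dynamics.

    ```python
    import numpy as np

    def stability_matrix(drift, x_star, eps=1e-6):
        """Finite-difference Jacobian of the mean-field drift at x_star."""
        n = x_star.size
        J = np.empty((n, n))
        f0 = drift(x_star)
        for k in range(n):
            dx = np.zeros(n)
            dx[k] = eps
            J[:, k] = (drift(x_star + dx) - f0) / eps
        return J

    def early_warning(drift, x_star, x_now):
        """Overlap of the current fluctuation with the unstable eigen-directions."""
        eigvals, eigvecs = np.linalg.eig(stability_matrix(drift, x_star))
        unstable = eigvecs[:, eigvals.real > 0]      # directions with positive growth rate
        if unstable.shape[1] == 0:
            return 0.0
        delta = x_now - x_star
        delta = delta / (np.linalg.norm(delta) + 1e-12)
        # Size of the projection onto the unstable subspace (a heuristic in [0, ~1]).
        return float(np.linalg.norm(unstable.conj().T @ delta))

    # Toy drift with one unstable and one stable direction.
    drift = lambda x: np.array([0.5 * x[0], -1.0 * x[1]])
    print(early_warning(drift, np.zeros(2), np.array([0.10, 0.05])))
    ```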

  19. Characterization and application of microearthquake clusters to problems of scaling, fault zone dynamics, and seismic monitoring at Parkfield, California

    SciTech Connect

    Nadeau, R.M.

    1995-10-01

    This document contains information about the characterization and application of microearthquake clusters and fault zone dynamics. Topics discussed include: Seismological studies; fault-zone dynamics; periodic recurrence; scaling of microearthquakes to large earthquakes; implications of fault mechanics and seismic hazards; and wave propagation and temporal changes.

  20. A hyper-spherical adaptive sparse-grid method for high-dimensional discontinuity detection

    SciTech Connect

    Zhang, Guannan; Webster, Clayton G; Gunzburger, Max D; Burkardt, John V

    2014-03-01

    This work proposes and analyzes a hyper-spherical adaptive hierarchical sparse-grid method for detecting jump discontinuities of functions in high-dimensional spaces. The method is motivated by the theoretical and computational inefficiencies of well-known adaptive sparse-grid methods for discontinuity detection. Our novel approach constructs a function representation of the discontinuity hyper-surface of an N-dimensional discontinuous quantity of interest, by virtue of a hyper-spherical transformation. Then, a sparse-grid approximation of the transformed function is built in the hyper-spherical coordinate system, whose value at each point is estimated by solving a one-dimensional discontinuity detection problem. Due to the smoothness of the hyper-surface, the new technique can identify jump discontinuities with significantly reduced computational cost, compared to existing methods. Moreover, hierarchical acceleration techniques are also incorporated to further reduce the overall complexity. Rigorous error estimates and complexity analyses of the new method are provided as are several numerical examples that illustrate the effectiveness of the approach.
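
    A stripped-down sketch of the core idea, assuming the discontinuous quantity of interest is available as a cheap indicator function: each direction on the sphere defines a one-dimensional detection problem for the jump radius (solved here by plain bisection); the adaptive hierarchical sparse grid over the angular variables, which is the contribution of the paper, is omitted.

    ```python
    import numpy as np

    def jump_radius(f, center, direction, r_max=2.0, tol=1e-4):
        """Locate the jump of a discontinuous function f along one ray by bisection."""
        lo, hi = 0.0, r_max
        inside = f(center)                   # value on the inner side of the surface
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if f(center + mid * direction) == inside:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

    def sample_surface(f, center, n_dirs=500, dim=4, seed=0):
        """Radius of the discontinuity surface along random directions on the sphere."""
        rng = np.random.default_rng(seed)
        dirs = rng.normal(size=(n_dirs, dim))
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
        return np.array([jump_radius(f, center, d) for d in dirs])

    # Example: the discontinuity surface of this indicator is the unit sphere in 4-D.
    f = lambda x: 1.0 if np.linalg.norm(x) < 1.0 else 0.0
    radii = sample_surface(f, np.zeros(4))
    print(radii.mean(), radii.std())         # close to 1.0 and 0.0
    ```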

  1. A decision-theory approach to interpretable set analysis for high-dimensional data.

    PubMed

    Boca, Simina M; Bravo, Héctor Corrada; Caffo, Brian; Leek, Jeffrey T; Parmigiani, Giovanni

    2013-09-01

    A key problem in high-dimensional significance analysis is to find pre-defined sets that show enrichment for a statistical signal of interest; the classic example is the enrichment of gene sets for differentially expressed genes. Here, we propose a new decision-theory approach to the analysis of gene sets which focuses on estimating the fraction of non-null variables in a set. We introduce the idea of "atoms," non-overlapping sets based on the original pre-defined set annotations. Our approach focuses on finding the union of atoms that minimizes a weighted average of the number of false discoveries and missed discoveries. We introduce a new false discovery rate for sets, called the atomic false discovery rate (afdr), and prove that the optimal estimator in our decision-theory framework is to threshold the afdr. These results provide a coherent and interpretable framework for the analysis of sets that addresses the key issues of overlapping annotations and difficulty in interpreting p values in both competitive and self-contained tests. We illustrate our method and compare it to a popular existing method using simulated examples, as well as gene-set and brain ROI data analyses. PMID:23909925
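
    A toy illustration of the flavour of the decision rule (thresholding an atom-level false discovery quantity), not the paper's estimator: assume each variable carries a posterior probability of being null and each atom is a non-overlapping block of variables.

    ```python
    import numpy as np

    def select_atoms(null_prob, atom_ids, afdr_threshold=0.2):
        """Keep atoms whose average posterior null probability (a stand-in for an
        atomic false discovery rate) falls below the threshold."""
        selected = []
        for atom in np.unique(atom_ids):
            members = null_prob[atom_ids == atom]
            atom_fdr = members.mean()        # expected fraction of nulls in the atom
            if atom_fdr <= afdr_threshold:
                selected.append(int(atom))
        return selected

    # Toy input: 9 variables grouped into 3 non-overlapping atoms.
    null_prob = np.array([0.05, 0.10, 0.08, 0.90, 0.85, 0.95, 0.30, 0.10, 0.20])
    atom_ids  = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
    print(select_atoms(null_prob, atom_ids))   # -> [0, 2]
    ```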

  2. Defining and Evaluating Classification Algorithm for High-Dimensional Data Based on Latent Topics

    PubMed Central

    Luo, Le; Li, Li

    2014-01-01

    Automatic text categorization is one of the key techniques in information retrieval and the data mining field. The classification is usually time-consuming when the training dataset is large and high-dimensional. Many methods have been proposed to solve this problem, but few can achieve satisfactory efficiency. In this paper, we present a method which combines the Latent Dirichlet Allocation (LDA) algorithm and the Support Vector Machine (SVM). LDA is first used to generate a reduced-dimensional representation of topics as features in the vector space model (VSM). It reduces the number of features dramatically but keeps the necessary semantic information. The Support Vector Machine (SVM) is then employed to classify the data based on the generated features. We evaluate the algorithm on the 20 Newsgroups and Reuters-21578 datasets, respectively. The experimental results show that classification based on our proposed LDA+SVM model achieves high performance in terms of precision, recall and F1 measure, and does so within a much shorter time frame. Our process improves greatly upon the previous work in this field and displays strong potential to achieve a streamlined classification process for a wide range of applications. PMID:24416136
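
    A hedged sketch of an LDA-then-SVM pipeline on 20 Newsgroups with scikit-learn; the vocabulary size, number of topics and SVM settings are illustrative choices, not the parameters reported in the paper.

    ```python
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC
    from sklearn.metrics import f1_score

    # Standard train/test split of the 20 Newsgroups corpus (downloaded on first use).
    train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
    test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

    # LDA compresses bag-of-words vectors into a low-dimensional topic representation,
    # which the linear SVM then classifies.
    model = make_pipeline(
        CountVectorizer(max_features=20000, stop_words="english"),
        LatentDirichletAllocation(n_components=100, random_state=0),
        LinearSVC(C=1.0),
    )
    model.fit(train.data, train.target)
    pred = model.predict(test.data)
    print("macro F1:", round(f1_score(test.target, pred, average="macro"), 3))
    ```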

  3. Sparse grid discontinuous Galerkin methods for high-dimensional elliptic equations

    NASA Astrophysics Data System (ADS)

    Wang, Zixuan; Tang, Qi; Guo, Wei; Cheng, Yingda

    2016-06-01

    This paper constitutes our initial effort in developing sparse grid discontinuous Galerkin (DG) methods for high-dimensional partial differential equations (PDEs). Over the past few decades, DG methods have gained popularity in many applications due to their distinctive features. However, they are often deemed too costly because of the large degrees of freedom of the approximation space, which are the main bottleneck for simulations in high dimensions. In this paper, we develop sparse grid DG methods for elliptic equations with the aim of breaking the curse of dimensionality. Using a hierarchical basis representation, we construct a sparse finite element approximation space, reducing the degrees of freedom from the standard O(h^{-d}) to O(h^{-1} |log_2 h|^{d-1}) for d-dimensional problems, where h is the uniform mesh size in each dimension. Our method, based on the interior penalty (IP) DG framework, can achieve accuracy of O(h^k |log_2 h|^{d-1}) in the energy norm, where k is the degree of polynomials used. Error estimates are provided and confirmed by numerical tests in multi-dimensions.

  4. Beyond Sub-Gaussian Measurements: High-Dimensional Structured Estimation with Sub-Exponential Designs

    PubMed Central

    Sivakumar, Vidyashankar; Banerjee, Arindam; Ravikumar, Pradeep

    2016-01-01

    We consider the problem of high-dimensional structured estimation with norm-regularized estimators, such as Lasso, when the design matrix and noise are drawn from sub-exponential distributions. Existing results only consider sub-Gaussian designs and noise, and both the sample complexity and non-asymptotic estimation error have been shown to depend on the Gaussian width of suitable sets. In contrast, for the sub-exponential setting, we show that the sample complexity and the estimation error will depend on the exponential width of the corresponding sets, and the analysis holds for any norm. Further, using generic chaining, we show that the exponential width for any set will be at most log p times the Gaussian width of the set, yielding Gaussian width based results even for the sub-exponential case. Further, for certain popular estimators, viz. Lasso and Group Lasso, using a VC-dimension based analysis, we show that the sample complexity will in fact be the same order as Gaussian designs. Our general analysis and results are the first in the sub-exponential setting, and are readily applicable to special sub-exponential families such as log-concave and extreme-value distributions. PMID:27563230

  5. Individual-based models for adaptive diversification in high-dimensional phenotype spaces.

    PubMed

    Ispolatov, Iaroslav; Madhok, Vaibhav; Doebeli, Michael

    2016-02-01

    Most theories of evolutionary diversification are based on equilibrium assumptions: they are either based on optimality arguments involving static fitness landscapes, or they assume that populations first evolve to an equilibrium state before diversification occurs, as exemplified by the concept of evolutionary branching points in adaptive dynamics theory. Recent results indicate that adaptive dynamics may often not converge to equilibrium points and instead generate complicated trajectories if evolution takes place in high-dimensional phenotype spaces. Even though some analytical results on diversification in complex phenotype spaces are available, to study this problem in general we need to reconstruct individual-based models from the adaptive dynamics generating the non-equilibrium dynamics. Here we first provide a method to construct individual-based models such that they faithfully reproduce the given adaptive dynamics attractor without diversification. We then show that a propensity to diversify can be introduced by adding Gaussian competition terms that generate frequency dependence while still preserving the same adaptive dynamics. For sufficiently strong competition, the disruptive selection generated by frequency-dependence overcomes the directional evolution along the selection gradient and leads to diversification in phenotypic directions that are orthogonal to the selection gradient. PMID:26598329

  6. A Hyperspherical Adaptive Sparse-Grid Method for High-Dimensional Discontinuity Detection

    DOE PAGESBeta

    Zhang, Guannan; Webster, Clayton G.; Gunzburger, Max D.; Burkardt, John V.

    2015-06-24

    This study proposes and analyzes a hyperspherical adaptive hierarchical sparse-grid method for detecting jump discontinuities of functions in high-dimensional spaces. The method is motivated by the theoretical and computational inefficiencies of well-known adaptive sparse-grid methods for discontinuity detection. Our novel approach constructs a function representation of the discontinuity hypersurface of an N-dimensional discontinuous quantity of interest, by virtue of a hyperspherical transformation. Then, a sparse-grid approximation of the transformed function is built in the hyperspherical coordinate system, whose value at each point is estimated by solving a one-dimensional discontinuity detection problem. Due to the smoothness of the hypersurface, the new technique can identify jump discontinuities with significantly reduced computational cost, compared to existing methods. In addition, hierarchical acceleration techniques are also incorporated to further reduce the overall complexity. Rigorous complexity analyses of the new method are provided as are several numerical examples that illustrate the effectiveness of the approach.

  7. DETECTING DYNAMIC AND GENETIC EFFECTS ON BRAIN STRUCTURE USING HIGH-DIMENSIONAL CORTICAL PATTERN MATCHING.

    PubMed

    Thompson, Paul M; Hayashi, Kiralee M; de Zubicaray, Greig; Janke, Andrew L; Rose, Stephen E; Semple, James; Doddrell, David M; Cannon, Tyrone D; Toga, Arthur W

    2002-01-01

    We briefly describe a set of algorithms to detect and visualize effects of disease and genetic factors on the brain. Extreme variations in cortical anatomy, even among normal subjects, complicate the detection and mapping of systematic effects on brain structure in human populations. We tackle this problem in two stages. First, we develop a cortical pattern matching approach, based on metrically covariant partial differential equations (PDEs), to associate corresponding regions of cortex in an MRI brain image database (N=102 scans). Second, these high-dimensional deformation maps are used to transfer within-subject cortical signals, including measures of gray matter distribution, shape asymmetries, and degenerative rates, to a common anatomic template for statistical analysis. We illustrate these techniques in two applications: (1) mapping dynamic patterns of gray matter loss in longitudinally scanned Alzheimer's disease patients; and (2) mapping genetic influences on brain structure. We extend statistics used widely in behavioral genetics to cortical manifolds. Specifically, we introduce methods based on h-squared distributed random fields to map hereditary influences on brain structure in human populations. PMID:19759832

  8. Approximating high-dimensional dynamics by barycentric coordinates with linear programming

    SciTech Connect

    Hirata, Yoshito Aihara, Kazuyuki; Suzuki, Hideyuki; Shiro, Masanori; Takahashi, Nozomu; Mas, Paloma

    2015-01-15

    The increasing development of novel methods and techniques facilitates the measurement of high-dimensional time series but challenges our ability to provide accurate modeling and predictions. The use of a general mathematical model requires the inclusion of many parameters, which are difficult to fit from the relatively short high-dimensional time series observed. Here, we propose a novel method to accurately model a high-dimensional time series. Our method extends barycentric coordinates to high-dimensional phase space by employing linear programming, while allowing for approximation errors explicitly. The extension helps to produce free-running time-series predictions that preserve typical topological, dynamical, and/or geometric characteristics of the underlying attractors more accurately than the widely used radial basis function model. The method can be broadly applied, from helping to improve weather forecasting, to creating electronic instruments that sound more natural, and to comprehensively understanding complex biological data.
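
    A minimal sketch of the barycentric step, assuming a library of previously observed states and their successors: convex weights are obtained from a linear program that minimizes the L1 approximation error of the current state, and the same weights are carried forward to predict the next state. The paper's full free-running prediction scheme is more elaborate.

    ```python
    import numpy as np
    from scipy.optimize import linprog

    def barycentric_weights(library, x):
        """Convex weights w (w >= 0, sum w = 1) minimizing the L1 error |library.T @ w - x|."""
        n, d = library.shape
        # Decision variables: [w_1..w_n, e_1..e_d]; objective = sum of error slacks e.
        c = np.concatenate([np.zeros(n), np.ones(d)])
        #   library.T @ w - x <= e   and   -(library.T @ w - x) <= e
        A_ub = np.block([[library.T, -np.eye(d)],
                         [-library.T, -np.eye(d)]])
        b_ub = np.concatenate([x, -x])
        A_eq = np.concatenate([np.ones(n), np.zeros(d)]).reshape(1, -1)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, None)] * (n + d), method="highs")
        return res.x[:n]

    def predict_next(library_now, library_next, x_now):
        """Carry the barycentric weights of the current state forward in time."""
        w = barycentric_weights(library_now, x_now)
        return library_next.T @ w

    rng = np.random.default_rng(0)
    lib_now = rng.normal(size=(30, 5))
    lib_next = 0.9 * lib_now                     # toy dynamics: uniform contraction
    x = 0.5 * (lib_now[0] + lib_now[1])
    print(np.allclose(predict_next(lib_now, lib_next, x), 0.9 * x, atol=1e-6))
    ```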

  9. Snapshot advantage: a review of the light collection improvement for parallel high-dimensional measurement systems

    PubMed Central

    Hagen, Nathan; Kester, Robert T.; Gao, Liang; Tkaczyk, Tomasz S.

    2012-01-01

    The snapshot advantage is a large increase in light collection efficiency available to high-dimensional measurement systems that avoid filtering and scanning. After discussing this advantage in the context of imaging spectrometry, where the greatest effort towards developing snapshot systems has been made, we describe the types of measurements where it is applicable. We then generalize it to the larger context of high-dimensional measurements, where the advantage increases geometrically with measurement dimensionality. PMID:22791926

  10. Nonlocality of high-dimensional two-photon orbital angular momentum states

    SciTech Connect

    Aiello, A.; Oemrawsingh, S. S. R.; Eliel, E. R.; Woerdman, J. P.

    2005-11-15

    We propose an interferometric method to investigate the nonlocality of high-dimensional two-photon orbital angular momentum states generated by spontaneous parametric down conversion. We incorporate two half-integer spiral phase plates and a variable-reflectivity output beam splitter into a Mach-Zehnder interferometer to build an orbital angular momentum analyzer. This setup enables testing the nonlocality of high-dimensional two-photon states by repeated use of the Clauser-Horne-Shimony-Holt inequality.

  11. Modeling change from large-scale high-dimensional spatio-temporal array data

    NASA Astrophysics Data System (ADS)

    Lu, Meng; Pebesma, Edzer

    2014-05-01

    The massive data that come from Earth observation satellites and other sensors provide significant information for modeling global change. At the same time, the high dimensionality of the data has brought challenges in data acquisition, management, effective querying and processing. In addition, the output of earth system modeling tends to be data intensive and needs methodologies for storage, validation, analysis and visualization, e.g. as maps. An important proportion of earth system observations and simulated data can be represented as multi-dimensional array data, which have received increasing attention in big data management and spatio-temporal analysis. Case studies will be developed in the natural sciences, such as climate change, hydrological modeling and sediment dynamics, for which addressing big data problems is necessary. Multi-dimensional array-based database management and analytics systems such as Rasdaman, SciDB, and R will be applied to these cases. From these studies we hope to learn the strengths and weaknesses of these systems, how they might work together and how the semantics of array operations differ, by addressing the problems associated with big data. Research questions include:
    • How can we reduce dimensions spatially and temporally, or thematically?
    • How can we extend existing GIS functions to work on multidimensional arrays?
    • How can we combine data sets of different dimensionality or different resolutions?
    • Can map algebra be extended to an intelligible array algebra?
    • What are effective semantics for array programming of dynamic data driven applications?
    • In which sense are space and time special, as dimensions, compared to other properties?
    • How can we make the analysis of multi-spectral, multi-temporal and multi-sensor earth observation data easy?

  12. Gaussian processes with built-in dimensionality reduction: Applications to high-dimensional uncertainty propagation

    NASA Astrophysics Data System (ADS)

    Tripathy, Rohit; Bilionis, Ilias; Gonzalez, Marcial

    2016-09-01

    Uncertainty quantification (UQ) tasks, such as model calibration, uncertainty propagation, and optimization under uncertainty, typically require several thousand evaluations of the underlying computer codes. To cope with the cost of simulations, one replaces the real response surface with a cheap surrogate based, e.g., on polynomial chaos expansions, neural networks, support vector machines, or Gaussian processes (GP). However, the number of simulations required to learn a generic multivariate response grows exponentially as the input dimension increases. This curse of dimensionality can only be addressed if the response exhibits some special structure that can be discovered and exploited. A wide range of physical responses exhibit a special structure known as an active subspace (AS). An AS is a linear manifold of the stochastic space characterized by maximal response variation. The idea is that one should first identify this low dimensional manifold, project the high-dimensional input onto it, and then link the projection to the output. If the dimensionality of the AS is low enough, then learning the link function is a much easier problem than the original problem of learning a high-dimensional function. The classic approach to discovering the AS requires gradient information, a fact that severely limits its applicability. Furthermore, and partly because of its reliance on gradients, it is not able to handle noisy observations. The latter is an essential trait if one wants to be able to propagate uncertainty through stochastic simulators, e.g., through molecular dynamics codes. In this work, we develop a probabilistic version of AS which is gradient-free and robust to observational noise. Our approach relies on a novel Gaussian process regression with built-in dimensionality reduction. In particular, the AS is represented as an orthogonal projection matrix that serves as yet another covariance function hyper-parameter to be estimated from the data. To train the
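
    As a crude, gradient-free illustration of linking the response to a low-dimensional linear projection of the input (not the paper's probabilistic active-subspace estimator, in which the projection matrix is a covariance hyper-parameter), the sketch below fits a GP on a one-dimensional projection and picks the direction among random candidates by the GP's log marginal likelihood.

    ```python
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    def fit_projected_gp(X, y, n_candidates=25, seed=0):
        """Return (direction, GP) for the 1-D projection with the best marginal likelihood."""
        rng = np.random.default_rng(seed)
        best = (-np.inf, None, None)
        for _ in range(n_candidates):
            w = rng.normal(size=X.shape[1])
            w /= np.linalg.norm(w)
            z = (X @ w).reshape(-1, 1)           # project inputs onto the candidate direction
            gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
            gp.fit(z, y)
            if gp.log_marginal_likelihood_value_ > best[0]:
                best = (gp.log_marginal_likelihood_value_, w, gp)
        return best[1], best[2]

    # Toy response that truly depends on a single hidden direction of a 10-D input.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 10))
    w_true = np.ones(10) / np.sqrt(10)
    y = np.sin(2.0 * (X @ w_true)) + 0.05 * rng.normal(size=200)

    w_hat, gp = fit_projected_gp(X, y)
    print("alignment with the true direction:", round(abs(w_hat @ w_true), 2))
    ```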

  13. A Personalized Collaborative Recommendation Approach Based on Clustering of Customers

    NASA Astrophysics Data System (ADS)

    Wang, Pu

    Collaborative filtering has been known to be the most successful recommender technique in recommendation systems. Collaborative methods recommend items based on aggregated user ratings of those items, and these techniques do not depend on the availability of textual descriptions. They share the common goal of assisting in the user's search for items of interest, and thus attempt to address the key research problem of information overload. Collaborative filtering systems can deal with large numbers of customers and with many different products. However, the set of ratings is typically sparse, such that any two customers will most likely have only a few co-rated products. The high-dimensional sparsity of the rating matrix and the problem of scalability result in low-quality recommendations. In this paper, a personalized collaborative recommendation approach based on clustering of customers is presented. The method uses clustering to form customer centers. The personalized collaborative filtering approach based on clustering of customers can alleviate the scalability problem in collaborative recommendations.
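
    A minimal sketch of the approach described, assuming a small dense rating matrix with zeros marking unrated items: k-means forms the customer centers, and a missing rating is predicted from the members of the active customer's cluster.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    def fit_customer_clusters(ratings, n_clusters=2, seed=0):
        """Cluster customers on their rating vectors (0 = not rated)."""
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        return km, km.fit_predict(ratings)

    def predict_rating(labels, ratings, user, item):
        """Mean rating of the item among the user's cluster mates who rated it."""
        members = ratings[labels == labels[user]]
        rated = members[:, item][members[:, item] > 0]
        return rated.mean() if rated.size else 0.0

    # Toy 6-customer x 5-item rating matrix (rows: customers, 0 = missing).
    R = np.array([[5, 4, 0, 1, 0],
                  [4, 5, 1, 0, 1],
                  [5, 5, 0, 1, 1],
                  [1, 0, 5, 4, 5],
                  [0, 1, 4, 5, 4],
                  [1, 1, 5, 5, 0]], dtype=float)

    km, labels = fit_customer_clusters(R)
    print(predict_rating(labels, R, user=0, item=2))  # borrowed from similar customers
    ```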

  14. Machine learning etudes in astrophysics: selection functions for mock cluster catalogs

    SciTech Connect

    Hajian, Amir; Alvarez, Marcelo A.; Bond, J. Richard

    2015-01-01

    Making mock simulated catalogs is an important component of astrophysical data analysis. Selection criteria for observed astronomical objects are often too complicated to be derived from first principles. However, the existence of an observed group of objects is a well-suited problem for machine learning classification. In this paper we use one-class classifiers to learn the properties of an observed catalog of clusters of galaxies from ROSAT and to pick clusters from mock simulations that resemble the observed ROSAT catalog. We show how this method can be used to study the cross-correlations of thermal Sunyaev-Zel'dovich signals with number density maps of X-ray selected cluster catalogs. The method reduces the bias due to hand-tuning the selection function and is readily scalable to large catalogs with a high-dimensional space of astrophysical features.
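
    A hedged sketch of the selection step with a one-class SVM: the classifier learns the support of the observed catalog in feature space and is then applied to the mock clusters. The feature columns and their distributions below are placeholders, not the properties of the ROSAT catalog.

    ```python
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import OneClassSVM
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    # Placeholder features per cluster, e.g. (log mass, redshift, log X-ray flux).
    observed = np.column_stack([rng.normal(14.5, 0.3, 300),
                                rng.uniform(0.0, 0.3, 300),
                                rng.normal(-11.5, 0.4, 300)])
    mock = np.column_stack([rng.normal(14.0, 0.6, 5000),
                            rng.uniform(0.0, 1.0, 5000),
                            rng.normal(-12.5, 1.0, 5000)])

    # Learn the support of the observed distribution, then keep only mock clusters
    # that the one-class classifier deems consistent with it.
    selector = make_pipeline(StandardScaler(), OneClassSVM(nu=0.05, gamma="scale"))
    selector.fit(observed)
    keep = selector.predict(mock) == 1
    print(f"selected {keep.sum()} of {len(mock)} mock clusters")
    ```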

  15. A rough set based rational clustering framework for determining correlated genes.

    PubMed

    Jeyaswamidoss, Jeba Emilyn; Thangaraj, Kesavan; Ramar, Kadarkarai; Chitra, Muthusamy

    2016-06-01

    Cluster analysis plays a foremost role in identifying groups of genes that show similar behavior under a set of experimental conditions. Several clustering algorithms have been proposed for identifying gene behaviors and understanding their significance. The principal aim of this work is to develop an intelligent rough clustering technique, which will efficiently remove the irrelevant dimensions in a high-dimensional space and obtain appropriate meaningful clusters. This paper proposes a novel biclustering technique that is based on rough set theory. The proposed algorithm uses the correlation coefficient as a similarity measure to simultaneously cluster both the rows and columns of a gene expression data matrix, and the mean squared residue to generate the initial biclusters. Furthermore, the biclusters are refined to form the lower and upper boundaries by determining the membership of the genes in the clusters using the mean squared residue. The algorithm is illustrated with yeast gene expression data, and the experiments demonstrate the effectiveness of the method. The main advantage is that it overcomes the problem of selecting initial clusters and also the restriction that one object belongs to only one cluster, by allowing biclusters to overlap. PMID:27352972
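
    The mean squared residue used above to seed and refine the biclusters is the classic Cheng-Church score; a minimal NumPy version is sketched below (the correlation-based similarity and the rough-set lower/upper approximations are not shown).

    ```python
    import numpy as np

    def mean_squared_residue(expr, rows, cols):
        """Cheng-Church mean squared residue of the bicluster expr[rows][:, cols].

        Low values indicate that the selected genes behave coherently (additively)
        across the selected conditions."""
        sub = expr[np.ix_(rows, cols)]
        row_mean = sub.mean(axis=1, keepdims=True)
        col_mean = sub.mean(axis=0, keepdims=True)
        residue = sub - row_mean - col_mean + sub.mean()
        return float((residue ** 2).mean())

    # A perfectly additive (coherent) toy bicluster has residue 0.
    expr = np.add.outer(np.array([0.0, 1.0, 2.0]), np.array([10.0, 20.0, 30.0, 40.0]))
    print(mean_squared_residue(expr, rows=[0, 1, 2], cols=[0, 1, 2, 3]))
    ```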

  16. High-dimensional propensity score algorithm in comparative effectiveness research with time-varying interventions.

    PubMed

    Neugebauer, Romain; Schmittdiel, Julie A; Zhu, Zheng; Rassen, Jeremy A; Seeger, John D; Schneeweiss, Sebastian

    2015-02-28

    The high-dimensional propensity score (hdPS) algorithm was proposed for automation of confounding adjustment in problems involving large healthcare databases. It has been evaluated in comparative effectiveness research (CER) with point treatments to handle baseline confounding through matching or covariance adjustment on the hdPS. In observational studies with time-varying interventions, such hdPS approaches are often inadequate to handle time-dependent confounding and selection bias. Inverse probability weighting (IPW) estimation to fit marginal structural models can adequately handle these biases under the fundamental assumption of no unmeasured confounders. Upholding of this assumption relies on the selection of an adequate set of covariates for bias adjustment. We describe the application and performance of the hdPS algorithm to improve covariate selection in CER with time-varying interventions based on IPW estimation and explore stabilization of the resulting estimates using Super Learning. The evaluation is based on both the analysis of electronic health records data in a real-world CER study of adults with type 2 diabetes and a simulation study. This report (i) establishes the feasibility of IPW estimation with the hdPS algorithm based on large electronic health records databases, (ii) demonstrates little impact on inferences when supplementing the set of expert-selected covariates using the hdPS algorithm in a setting with extensive background knowledge, (iii) supports the application of the hdPS algorithm in discovery settings with little background knowledge or limited data availability, and (iv) motivates the application of Super Learning to stabilize effect estimates based on the hdPS algorithm. PMID:25488047

  17. Reduced basis ANOVA methods for partial differential equations with high-dimensional random inputs

    NASA Astrophysics Data System (ADS)

    Liao, Qifeng; Lin, Guang

    2016-07-01

    In this paper we present a reduced basis ANOVA approach for partial differential equations (PDEs) with random inputs. The ANOVA method combined with stochastic collocation methods provides model reduction in high-dimensional parameter space through decomposing high-dimensional inputs into unions of low-dimensional inputs. In this work, to further reduce the computational cost, we investigate spatial low-rank structures in the ANOVA-collocation method, and develop efficient spatial model reduction techniques using hierarchically generated reduced bases. We present a general mathematical framework of the methodology, validate its accuracy and demonstrate its efficiency with numerical experiments.

  18. Hamiltonian structure of the Vlasov-Einstein system and the problem of stability for spherical relativistic star clusters

    SciTech Connect

    Kandrup, H.E.; Morrison, P.J.

    1992-11-01

    The Hamiltonian formulation of the Vlasov-Einstein system, which is appropriate for collisionless, self-gravitating systems like clusters of stars that are so dense that gravity must be described by the Einstein equation, is presented. In particular, it is demonstrated explicitly in the context of a 3 + 1 splitting that, for spherically symmetric configurations, the Vlasov-Einstein system can be viewed as a Hamiltonian system, where the dynamics is generated by a noncanonical Poisson bracket, with the Hamiltonian generating the evolution of the distribution function f (a noncanonical variable) being the conserved ADM mass-energy H{sub ADM}. An explicit expression is derived for the energy {delta}({sup 2})H{sub ADM} associated with an arbitrary phase space preserving perturbation of an arbitrary spherical equilibrium, and it is shown that the equilibrium must be linearly stable if {delta}({sup 2})H{sub ADM} is positive semi-definite. Insight into the Hamiltonian reformulation is provided by a description of general finite degree of freedom systems.

  20. Gravitational clustering of galaxies: Derivation of two-point galaxy correlation function using statistical mechanics of cosmological many-body problem

    NASA Astrophysics Data System (ADS)

    Ahmad, Farooq; Malik, Manzoor A.; Bhat, M. Maqbool

    2016-07-01

    We derive the spatial pair correlation function in gravitational clustering for extended structures of galaxies (e.g. galaxies with halos) by using the statistical mechanics of the cosmological many-body problem. Our results indicate that in the limit of point masses (ɛ=0) the two-point correlation function varies as the inverse square of the relative separation of two galaxies. The effect of the softening parameter ɛ on the pair correlation function is also studied, and the results indicate that the two-point correlation function is affected by the softening parameter when the distance between galaxies is small. However, for larger distances between galaxies, the two-point correlation function is not affected at all. The correlation length r_0 derived by our method depends on the random dispersion velocities <v^2>^{1/2} and the mean number density n-bar, which is in agreement with N-body simulations and observations. Further, our results are applicable to clusters of galaxies for their correlation functions, and we apply our results to obtain the correlation length r_0 for such systems, which again agrees with the data of N-body simulations and observations.

  1. Reading fluency and speech perception speed of beginning readers with persistent reading problems: the perception of initial stop consonants and consonant clusters

    PubMed Central

    van der Leij, Aryan; Blok, Henk; de Jong, Peter F.

    2010-01-01

    This study investigated the role of speech perception accuracy and speed in fluent word decoding of reading disabled (RD) children. A same-different phoneme discrimination task with natural speech tested the perception of single consonants and consonant clusters by young but persistent RD children. RD children were slower than chronological age (CA) controls in recognizing identical sounds, suggesting less distinct phonemic categories. In addition, after controlling for phonetic similarity, Tallal's (Brain Lang 9:182–198, 1980) fast-transitions account of RD children's speech perception problems was contrasted with Studdert-Kennedy's (Read Writ Interdiscip J 15:5–14, 2002) similarity explanation. Results showed no specific RD deficit in perceiving fast transitions. Both phonetic similarity and fast transitions influenced accurate speech perception for RD children as well as CA controls. PMID:20652455

  2. Average Transient Lifetime and Lyapunov Dimension for Transient Chaos in a High-Dimensional System

    NASA Astrophysics Data System (ADS)

    Chen, Hong; Tang, Jian-Xin; Tang, Shao-Yan; Xiang, Hong; Chen, Xin

    2001-11-01

    The average transient lifetime of a chaotic transient versus the Lyapunov dimension of a chaotic saddle is studied for high-dimensional nonlinear dynamical systems. Typically the average lifetime depends upon not only the system parameter but also the Lyapunov dimension of the chaotic saddle. The numerical example uses the delayed feedback differential equation.
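
    The Lyapunov dimension referred to here is usually computed as the Kaplan-Yorke dimension of the ordered Lyapunov spectrum of the saddle; a small helper is sketched below with an illustrative (not paper-specific) spectrum.

    ```python
    import numpy as np

    def kaplan_yorke_dimension(lyapunov_exponents):
        """Kaplan-Yorke (Lyapunov) dimension from a spectrum of exponents."""
        lam = np.sort(np.asarray(lyapunov_exponents, dtype=float))[::-1]
        cumsum = np.cumsum(lam)
        # Largest j (0-based) with a non-negative partial sum of the ordered exponents.
        if not np.any(cumsum >= 0):
            return 0.0
        j = int(np.max(np.nonzero(cumsum >= 0)[0]))
        if j == len(lam) - 1:
            return float(len(lam))
        return j + 1 + cumsum[j] / abs(lam[j + 1])

    # Illustrative spectrum for a chaotic saddle of a higher-dimensional flow.
    print(kaplan_yorke_dimension([0.9, 0.1, 0.0, -0.8, -2.5]))   # 4 + 0.2/2.5 = 4.08
    ```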

  3. Controlling chaos in a high dimensional system with periodic parametric perturbations

    SciTech Connect

    Mirus, K.A.; Sprott, J.C.

    1998-10-01

    The effect of applying a periodic perturbation to an accessible parameter of a high-dimensional (coupled-Lorenz) chaotic system is examined. Numerical results indicate that perturbation frequencies near the natural frequencies of the unstable periodic orbits of the chaotic system can result in limit cycles or significantly reduced dimension for relatively small perturbations.
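
    A minimal sketch of this kind of setup, assuming two diffusively coupled Lorenz systems whose Rayleigh parameter is modulated periodically; the coupling strength, perturbation amplitude and frequency are illustrative values, not those used in the paper.

    ```python
    import numpy as np
    from scipy.integrate import solve_ivp

    SIGMA, B, R0 = 10.0, 8.0 / 3.0, 28.0
    COUPLING, EPS, OMEGA = 1.0, 0.03, 8.0   # coupling strength, perturbation amplitude/frequency

    def coupled_lorenz(t, state):
        """Two Lorenz systems coupled in x, with r(t) = R0 * (1 + EPS * sin(OMEGA * t))."""
        x1, y1, z1, x2, y2, z2 = state
        r = R0 * (1.0 + EPS * np.sin(OMEGA * t))
        dx1 = SIGMA * (y1 - x1) + COUPLING * (x2 - x1)
        dy1 = r * x1 - y1 - x1 * z1
        dz1 = x1 * y1 - B * z1
        dx2 = SIGMA * (y2 - x2) + COUPLING * (x1 - x2)
        dy2 = r * x2 - y2 - x2 * z2
        dz2 = x2 * y2 - B * z2
        return [dx1, dy1, dz1, dx2, dy2, dz2]

    sol = solve_ivp(coupled_lorenz, (0.0, 100.0), [1, 1, 1, -1, 2, 5], max_step=0.01)
    # Inspect the late-time behaviour, e.g. whether the trajectory has collapsed onto
    # a low-dimensional limit cycle; here we simply print the final state.
    print(np.round(sol.y[:, -1], 3))
    ```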

  4. Salient Region Detection via High-Dimensional Color Transform and Local Spatial Support.

    PubMed

    Kim, Jiwhan; Han, Dongyoon; Tai, Yu-Wing; Kim, Junmo

    2016-01-01

    In this paper, we introduce a novel approach to automatically detect salient regions in an image. Our approach consists of global and local features, which complement each other to compute a saliency map. The first key idea of our work is to create a saliency map of an image by using a linear combination of colors in a high-dimensional color space. This is based on the observation that salient regions often have distinctive colors compared with backgrounds in human perception; human perception, however, is complicated and highly nonlinear. By mapping the low-dimensional red, green, and blue colors to a feature vector in a high-dimensional color space, we show that we can construct an accurate saliency map by finding the optimal linear combination of color coefficients in the high-dimensional color space. To further improve the performance of our saliency estimation, our second key idea is to utilize relative location and color contrast between superpixels as features and to resolve the saliency estimation from a trimap via a learning-based algorithm. The additional local features and learning-based algorithm complement the global estimation from the high-dimensional color transform-based algorithm. The experimental results on three benchmark datasets show that our approach is effective in comparison with the previous state-of-the-art saliency estimation methods. PMID:26529764
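
    A hedged sketch of the global step only: a high-dimensional per-pixel color feature (RGB, CIELab and HSV channels stacked) and a least-squares linear combination fitted to the definite foreground/background pixels of a trimap. The local spatial support and the learning-based refinement of the paper are omitted, and the toy image and trimap are assumptions made for the example.

    ```python
    import numpy as np
    from skimage import color

    def color_features(img):
        """Stack several color-space channels into one high-dimensional feature per pixel."""
        feats = [img, color.rgb2lab(img) / 100.0, color.rgb2hsv(img)]
        return np.concatenate([f.reshape(-1, f.shape[2]) for f in feats], axis=1)

    def saliency_from_trimap(img, trimap):
        """Least-squares color weights separating foreground (1) from background (0) seeds."""
        F = color_features(img)
        labels = trimap.reshape(-1)
        known = labels != 0.5                    # 0.5 marks "unknown" pixels
        w, *_ = np.linalg.lstsq(F[known], labels[known], rcond=None)
        sal = F @ w
        return (sal - sal.min()) / (sal.max() - sal.min() + 1e-12)

    # Toy image: a reddish square on a bluish background, with a rough trimap.
    img = np.zeros((64, 64, 3))
    img[:] = (0.1, 0.2, 0.8)
    img[20:44, 20:44] = (0.9, 0.2, 0.1)
    trimap = np.full((64, 64), 0.5)
    trimap[28:36, 28:36] = 1.0                   # definite foreground seeds
    trimap[:8, :8] = 0.0                         # definite background seeds
    print(saliency_from_trimap(img, trimap).reshape(64, 64)[32, 32] > 0.5)
    ```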

  5. High-Dimensional Exploratory Item Factor Analysis by a Metropolis-Hastings Robbins-Monro Algorithm

    ERIC Educational Resources Information Center

    Cai, Li

    2010-01-01

    A Metropolis-Hastings Robbins-Monro (MH-RM) algorithm for high-dimensional maximum marginal likelihood exploratory item factor analysis is proposed. The sequence of estimates from the MH-RM algorithm converges with probability one to the maximum likelihood solution. Details on the computer implementation of this algorithm are provided. The…

  6. Coupled cluster investigation on the thermochemistry of dimethyl sulphide, dimethyl disulphide and their dissociation products: the problem of the enthalpy of formation of atomic sulphur

    NASA Astrophysics Data System (ADS)

    Denis, Pablo A.

    2014-04-01

    By means of coupled cluster theory and correlation consistent basis sets we investigated the thermochemistry of dimethyl sulphide (DMS), dimethyl disulphide (DMDS) and four closely related sulphur-containing molecules: CH3SS, CH3S, CH3SH and CH3CH2SH. For the four closed-shell molecules studied, their enthalpies of formation (EOFs) were derived using bomb calorimetry. We found that the deviation of the EOF with respect to experiment was 0.96, 0.65, 1.24 and 1.29 kcal/mol, for CH3SH, CH3CH2SH, DMS and DMDS, respectively, when ΔHf,0 = 65.6 kcal/mol was utilised (JANAF value). However, if the recently proposed ΔHf,0 = 66.2 kcal/mol was used to estimate EOF, the errors dropped to 0.36, 0.05, 0.64 and 0.09 kcal/mol, respectively. In contrast, for the CH3SS radical, a better agreement with experiment was obtained if the 65.6 kcal/mol value was used. To compare with experiment avoiding the problem of the ΔHf,0 (S), we determined the CH3-S and CH3-SS bond dissociation energies (BDEs) in CH3S and CH3SS. At the coupled cluster with singles doubles and perturbative triples correction level of theory, these values are 48.0 and 71.4 kcal/mol, respectively. The latter BDEs are 1.5 and 1.2 kcal/mol larger than the experimental values. The agreement can be considered to be acceptable if we take into consideration that these two radicals present important challenges when determining their EOFs. It is our hope that this work stimulates new studies which help elucidate the problem of the EOF of atomic sulphur.

  7. Variable Selection for Sparse High-Dimensional Nonlinear Regression Models by Combining Nonnegative Garrote and Sure Independence Screening

    PubMed Central

    Xue, Hongqi; Wu, Yichao; Wu, Hulin

    2013-01-01

    In many regression problems, the relations between the covariates and the response may be nonlinear. Motivated by the application of reconstructing a gene regulatory network, we consider a sparse high-dimensional additive model with the additive components being some known nonlinear functions with unknown parameters. To identify the subset of important covariates, we propose a new method for simultaneous variable selection and parameter estimation by iteratively combining a large-scale variable screening (the nonlinear independence screening, NLIS) and a moderate-scale model selection (the nonnegative garrote, NNG) for the nonlinear additive regressions. We have shown that the NLIS procedure possesses the sure screening property and it is able to handle problems with non-polynomial dimensionality; and for finite dimension problems, the NNG for the nonlinear additive regressions has selection consistency for the unimportant covariates and also estimation consistency for the parameter estimates of the important covariates. The proposed method is applied to simulated data and a real data example for identifying gene regulations to illustrate its numerical performance. PMID:25170239

  8. The effectiveness of the Screening Inventory of Psychosocial Problems (SIPP) in cancer patients treated with radiotherapy: design of a cluster randomised controlled trial

    PubMed Central

    2009-01-01

    Background The Screening Inventory of Psychosocial Problems (SIPP) is a short, validated self-reported questionnaire to identify psychosocial problems in Dutch cancer patients. The one-page 24-item questionnaire assesses physical complaints, psychological complaints and social and sexual problems. Very little is known about the effects of using the SIPP in consultation settings. Our study aims are to test the hypotheses that using the SIPP (a) may contribute to adequate referral to relevant psychosocial caregivers, (b) should facilitate communication between radiotherapists and cancer patients about psychosocial distress and (c) may prevent underdiagnosis of early symptoms reflecting psychosocial problems. This paper presents the design of a cluster randomised controlled trial (CRCT) evaluating the effectiveness of using the SIPP in cancer patients treated with radiotherapy. Methods/Design A CRCT is developed using a Solomon four-group design (two intervention and two control groups) to evaluate the effects of using the SIPP. Radiotherapists, instead of cancer patients, are randomly allocated to the experimental or control groups. Within these groups, all included cancer patients are randomised into two subgroups: with and without pre-measurement. Self-reported assessments are conducted at four times: a pre-test at baseline before the first consultation and a post-test directly following the first consultation, and three and 12 months after baseline measurement. The primary outcome measures are the number and types of referrals of cancer patients with psychosocial problems to relevant (psychosocial) caregivers. The secondary outcome measures are patients' satisfaction with the radiotherapist-patient communication, psychosocial distress and quality of life. Furthermore, a process evaluation will be carried out. Data of the effect-evaluation will be analysed according to the intention-to-treat principle and data regarding the types of referrals to health care

  9. Survey on granularity clustering.

    PubMed

    Ding, Shifei; Du, Mingjing; Zhu, Hong

    2015-12-01

    With the rapid development of uncertain artificial intelligence and the arrival of the big data era, conventional clustering analysis and granular computing fail to satisfy the requirements of intelligent information processing in this new setting. There is an essential relationship between granular computing and clustering analysis, so some researchers have tried to combine granular computing with clustering analysis. Using the idea of granularity, researchers have extended work in clustering analysis and searched for the best clustering results with the help of the basic theories and methods of granular computing. The granularity clustering methods that have been proposed and studied have attracted more and more attention. This paper first summarizes the background of granularity clustering and the intrinsic connection between granular computing and clustering analysis, and then reviews the research status and the various methods of granularity clustering. Finally, we analyze existing problems and propose directions for further research. PMID:26557926

  10. Simple, Scalable Proteomic Imaging for High-Dimensional Profiling of Intact Systems.

    PubMed

    Murray, Evan; Cho, Jae Hun; Goodwin, Daniel; Ku, Taeyun; Swaney, Justin; Kim, Sung-Yon; Choi, Heejin; Park, Young-Gyun; Park, Jeong-Yoon; Hubbert, Austin; McCue, Margaret; Vassallo, Sara; Bakh, Naveed; Frosch, Matthew P; Wedeen, Van J; Seung, H Sebastian; Chung, Kwanghun

    2015-12-01

    Combined measurement of diverse molecular and anatomical traits that span multiple levels remains a major challenge in biology. Here, we introduce a simple method that enables proteomic imaging for scalable, integrated, high-dimensional phenotyping of both animal tissues and human clinical samples. This method, termed SWITCH, uniformly secures tissue architecture, native biomolecules, and antigenicity across an entire system by synchronizing the tissue preservation reaction. The heat- and chemical-resistant nature of the resulting framework permits multiple rounds (>20) of relabeling. We have performed 22 rounds of labeling of a single tissue with precise co-registration of multiple datasets. Furthermore, SWITCH synchronizes labeling reactions to improve probe penetration depth and uniformity of staining. With SWITCH, we performed combinatorial protein expression profiling of the human cortex and also interrogated the geometric structure of the fiber pathways in mouse brains. Such integrated high-dimensional information may accelerate our understanding of biological systems at multiple levels. PMID:26638076