Clustering high dimensional data using RIA
Aziz, Nazrina
2015-05-15
Clustering may simply represent a convenient method for organizing a large data set so that it can easily be understood and information can efficiently be retrieved. However, identifying clusters in high-dimensional data sets is a difficult task because of the curse of dimensionality. Another challenge in clustering is that some traditional dissimilarity functions cannot capture the pattern dissimilarity among objects. In this article, we use an alternative dissimilarity measurement called the Robust Influence Angle (RIA) in the partitioning method. RIA is developed using the eigenstructure of the covariance matrix and robust principal component scores. We observe that it can obtain clusters easily and hence avoids the curse of dimensionality. It is also able to cluster large data sets with mixed numeric and categorical values.
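The abstract does not give the RIA formula. As a rough illustration of an angle-based dissimilarity computed from principal component scores, a plain-PCA stand-in (the authors' robust estimator is not specified here) might look like:

```python
import numpy as np

def angle_dissimilarity(X, n_components=2):
    """Illustrative angle-based dissimilarity built from principal
    component scores. This is a plain-PCA stand-in for the RIA idea,
    not the published robust construction."""
    Xc = X - X.mean(axis=0)                      # center the data
    # eigenstructure of the covariance matrix via SVD of centered data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T            # principal component scores
    norms = np.linalg.norm(scores, axis=1, keepdims=True)
    unit = scores / np.clip(norms, 1e-12, None)  # unit-length score vectors
    cos = np.clip(unit @ unit.T, -1.0, 1.0)
    return np.arccos(cos)                        # pairwise angles in radians

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 50))                     # 6 points in 50 dimensions
D = angle_dissimilarity(X)                       # symmetric (6, 6) matrix
```

Any partitioning method (e.g. k-medoids) can then consume `D` as a precomputed dissimilarity matrix.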
NASA Technical Reports Server (NTRS)
Srivastava, Ashok, N.; Akella, Ram; Diev, Vesselin; Kumaresan, Sakthi Preethi; McIntosh, Dawn M.; Pontikakis, Emmanuel D.; Xu, Zuobing; Zhang, Yi
2006-01-01
This paper describes the results of a significant research and development effort conducted at NASA Ames Research Center to develop new text mining techniques to discover anomalies in free-text reports regarding system health and safety of two aerospace systems. We discuss two problems of significant importance in the aviation industry. The first problem is that of automatic anomaly discovery about an aerospace system through the analysis of tens of thousands of free-text problem reports that are written about the system. The second problem that we address is that of automatic discovery of recurring anomalies, i.e., anomalies that may be described in different ways by different authors, at varying times and under varying conditions, but that are truly about the same part of the system. The intent of recurring anomaly identification is to determine project or system weaknesses or high-risk issues. The discovery of recurring anomalies is a key goal in building safe, reliable, and cost-effective aerospace systems. We address the anomaly discovery problem on thousands of free-text reports using two strategies: (1) as an unsupervised learning problem where an algorithm takes free-text reports as input and automatically groups them into different bins, where each bin corresponds to a different unknown anomaly category; and (2) as a supervised learning problem where the algorithm classifies the free-text reports into one of a number of known anomaly categories. We then discuss the application of these methods to the problem of discovering recurring anomalies. In fact, the special nature of recurring anomalies (very small cluster sizes) requires incorporating new methods and measures to enhance the original approach for anomaly detection.
Adaptive dimension reduction for clustering high dimensional data
Ding, Chris; He, Xiaofeng; Zha, Hongyuan; Simon, Horst
2002-10-01
It is well-known that for high dimensional data clustering, standard algorithms such as EM and K-means are often trapped in local minima. Many initialization methods have been proposed to tackle this problem, but with only limited success. In this paper, the authors propose a new approach that resolves this problem by repeated dimension reductions, such that K-means or EM is performed only in very low dimensions. Cluster membership is utilized as a bridge between the reduced dimensional subspace and the original space, providing flexibility and ease of implementation. Clustering analysis performed on highly overlapped Gaussians, DNA gene expression profiles and internet newsgroups demonstrates the effectiveness of the proposed algorithm.
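The repeated-reduction loop described above can be sketched as alternating a k-means assignment step in a low-dimensional projection with a re-fit of that projection from the current cluster structure. This is an illustration of the idea only, not the authors' exact algorithm; the LDA-like subspace update from centered cluster means is an assumption:

```python
import numpy as np

def adaptive_kmeans(X, k=3, dim=2, n_iter=10, seed=0):
    """Sketch: alternate k-means in a low-dimensional subspace with
    re-estimation of that subspace from the cluster means (an LDA-like
    between-cluster step). Illustrative, not the published algorithm."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    labels = rng.integers(0, k, size=n)                 # random start
    P = np.linalg.qr(rng.normal(size=(p, dim)))[0]      # random subspace
    for _ in range(n_iter):
        Y = X @ P                                       # reduce dimensions
        centers = np.array([Y[labels == j].mean(axis=0)
                            if (labels == j).any() else Y[rng.integers(n)]
                            for j in range(k)])
        d2 = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)                      # assignment step
        # subspace spanned by the centered cluster means in full space
        M = np.array([X[labels == j].mean(axis=0)
                      if (labels == j).any() else X[rng.integers(n)]
                      for j in range(k)])
        _, _, Vt = np.linalg.svd(M - X.mean(axis=0), full_matrices=False)
        P = Vt[:dim].T                                  # re-fit projection
    return labels
```

Cluster membership is what carries information between the reduced subspace and the full space, as the abstract notes.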
Pairwise Variable Selection for High-dimensional Model-based Clustering
Guo, Jian; Levina, Elizaveta; Michailidis, George
2009-01-01
Variable selection for clustering is an important and challenging problem in high-dimensional data analysis. Existing variable selection methods for model-based clustering select informative variables in a “one-in-all-out” manner; that is, a variable is selected if at least one pair of clusters is separable by this variable and removed if it cannot separate any of the clusters. In many applications, however, it is of interest to further establish exactly which clusters are separable by each informative variable. To address this question, we propose a pairwise variable selection method for high-dimensional model-based clustering. The method is based on a new pairwise penalty. Results on simulated and real data show that the new method performs better than alternative approaches which use ℓ1 and ℓ∞ penalties and offers better interpretation. PMID:19912170
Semi-supervised high-dimensional clustering by tight wavelet frames
NASA Astrophysics Data System (ADS)
Dong, Bin; Hao, Ning
2015-08-01
High-dimensional clustering arises frequently in many areas of the natural sciences, technical disciplines and social media. In this paper, we consider the problem of binary clustering of high-dimensional data, i.e. classification of a data set into 2 classes. We assume that the correct (or mostly correct) classification of a small portion of the given data is known. Based on such partial classification, we design optimization models that complete the clustering of the entire data set using the recently introduced tight wavelet frames on graphs [1]. Numerical experiments of the proposed models applied to some real data sets are conducted. In particular, the performance of the models on some very high-dimensional data sets is examined, and combinations of the models with some existing dimension reduction techniques are also considered.
Visualization of high-dimensional clusters using nonlinear magnification
Keahey, T.A.
1998-12-31
This paper describes a cluster visualization system used for data-mining fraud detection. The system can simultaneously show 6 dimensions of data, and a unique technique of 3D nonlinear magnification allows individual clusters of data points to be magnified while still maintaining a view of the global context. The author first describes the fraud detection problem, along with the data which is to be visualized. Then he describes general characteristics of the visualization system, and shows how nonlinear magnification can be used in this system. Finally he concludes and describes options for further work.
Srinivasan, Thenmozhi; Palanisamy, Balasubramanie
2015-01-01
Techniques for clustering high-dimensional data are emerging in response to the challenges of noisy, poor-quality data. This paper develops a method to cluster data using high-dimensional similarity-based PCM (SPCM) with ant colony optimization intelligence, which is effective in clustering nonspatial data without requiring knowledge of the cluster number from the user. The PCM becomes similarity-based through the use of the mountain method with it. Although this clustering is efficient, it is further checked for optimization using an ant colony algorithm with swarm intelligence. A scalable clustering technique is thus obtained, and the evaluation results are checked with synthetic datasets. PMID:26495413
NASA Astrophysics Data System (ADS)
Manukyan, N.; Eppstein, M. J.; Rizzo, D. M.
2011-12-01
data to demonstrate how the proposed methods facilitate automatic identification and visualization of clusters in real-world, high-dimensional biogeochemical data with complex relationships. The proposed methods are quite general and are applicable to a wide range of geophysical problems. [1] Pearce, A., Rizzo, D., and Mouser, P., "Subsurface characterization of groundwater contaminated by landfill leachate using microbial community profile data and a nonparametric decision-making process", Water Resources Research, 47:W06511, 11 pp, 2011. [2] Mouser, P., Rizzo, D., Druschel, G., Morales, S, O'Grady, P., Hayden, N., Stevens, L., "Enhanced detection of groundwater contamination from a leaking waste disposal site by microbial community profiles", Water Resources Research, 46:W12506, 12 pp., 2010.
Banerjee, Arindam; Ghosh, Joydeep
2004-05-01
Competitive learning mechanisms for clustering, in general, suffer from poor performance for very high-dimensional (>1000) data because of "curse of dimensionality" effects. In applications such as document clustering, it is customary to normalize the high-dimensional input vectors to unit length, and it is sometimes also desirable to obtain balanced clusters, i.e., clusters of comparable sizes. The spherical kmeans (spkmeans) algorithm, which normalizes the cluster centers as well as the inputs, has been successfully used to cluster normalized text documents in 2000+ dimensional space. Unfortunately, like regular kmeans and its soft expectation-maximization-based version, spkmeans tends to generate extremely imbalanced clusters in high-dimensional spaces when the desired number of clusters is large (tens or more). This paper first shows that the spkmeans algorithm can be derived from a certain maximum likelihood formulation using a mixture of von Mises-Fisher distributions as the generative model, and in fact, it can be considered as a batch-mode version of (normalized) competitive learning. The proposed generative model is then adapted in a principled way to yield three frequency-sensitive competitive learning variants that are applicable to static data and produce high-quality and well-balanced clusters for high-dimensional data. Like kmeans, each iteration is linear in the number of data points and in the number of clusters for all three algorithms. A frequency-sensitive algorithm to cluster streaming data is also proposed. Experimental results on clustering of high-dimensional text data sets are provided to show the effectiveness and applicability of the proposed techniques. Index Terms: Balanced clustering, expectation maximization (EM), frequency-sensitive competitive learning (FSCL), high-dimensional clustering, kmeans, normalized data, scalable clustering, streaming data, text clustering.
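The baseline spkmeans step described above (normalize inputs and centers, assign by cosine similarity) can be sketched as follows; the frequency-sensitive variants that are the paper's contribution are not reproduced here:

```python
import numpy as np

def spkmeans(X, k, n_iter=20, seed=0):
    """Minimal spherical k-means sketch: inputs and cluster centers are
    normalized to unit length, and assignment maximizes cosine
    similarity rather than minimizing Euclidean distance."""
    rng = np.random.default_rng(seed)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-length inputs
    centers = X[rng.choice(len(X), k, replace=False)]  # init from the data
    for _ in range(n_iter):
        labels = (X @ centers.T).argmax(axis=1)        # cosine assignment
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.sum(axis=0)
                centers[j] = c / np.linalg.norm(c)     # re-normalized center
    return labels, centers
```

A frequency-sensitive variant would additionally penalize large clusters during assignment, which is how the balanced versions in the paper avoid the imbalance problem.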
Visualization of high-dimensional clusters using nonlinear magnification
NASA Astrophysics Data System (ADS)
Keahey, T. A.
1999-03-01
This paper describes a visualization system which has been used as part of a data-mining effort to detect fraud and abuse within state medicare programs. The data-mining process generates a set of N attributes for each medicare provider and beneficiary in the state; these attributes can be numeric, categorical, or derived from the scoring process of the data-mining routines. The attribute list can be considered as an N-dimensional space, which is subsequently partitioned into some fixed number of cluster partitions. The sparse nature of the clustered space provides room for the simultaneous visualization of more than 3 dimensions; examples in the paper will show 6-dimensional visualization. This ability to view higher dimensional data allows the data-mining researcher to compare the clustering effectiveness of the different attributes. Transparency based rendering is also used in conjunction with filtering techniques to provide selective rendering of only those data which are of greatest interest. Nonlinear magnification techniques are used to stretch the N-dimensional space to allow focus on one or more regions of interest while still allowing a view of the global context. The magnification can either be applied globally, or in a constrained fashion to expand individual clusters within the space.
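The focus-plus-context stretching described above can be illustrated with a generic 1-D fisheye-style magnifier; this functional form is a common textbook profile and is not necessarily the exact transform used in the paper:

```python
def fisheye(x, focus, d=3.0):
    """Generic fisheye-style magnification of a coordinate x in [0, 1]
    around a focus point; d controls the distortion strength.
    (Illustrative stand-in, not the paper's exact transform.)"""
    def g(t):
        # classic 1-D profile: expands t near 0, compresses t near 1
        return ((d + 1) * t) / (d * t + 1)
    if x >= focus:
        span = 1.0 - focus
        return focus if span == 0 else focus + span * g((x - focus) / span)
    return focus - focus * g((focus - x) / focus)
```

Applying such a transform per axis stretches the space near the focus while the endpoints stay fixed, preserving the global context.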
Integrative clustering methods for high-dimensional molecular data
Chalise, Prabhakar; Koestler, Devin C.; Bimali, Milan; Yu, Qing; Fridley, Brooke L.
2014-01-01
High-throughput ‘omic’ data, such as gene expression, DNA methylation, and DNA copy number, have played an instrumental role in furthering our understanding of the molecular basis of human health and disease. Because cells with similar morphological characteristics can exhibit entirely different molecular profiles, and because these discrepancies might further our understanding of patient-level variability in clinical outcomes, there is significant interest in the use of high-throughput ‘omic’ data for the identification of novel molecular subtypes of a disease. While numerous clustering methods have been proposed for identifying molecular subtypes, most were developed for single ‘omic’ data types and may not be appropriate when more than one ‘omic’ data type is collected on study subjects. Given that complex diseases, such as cancer, arise as a result of genomic, epigenomic, transcriptomic, and proteomic alterations, integrative clustering methods for the simultaneous clustering of multiple ‘omic’ data types have great potential to aid in molecular subtype discovery. Traditionally, ad hoc manual data integration has been performed using the results obtained from the clustering of individual ‘omic’ data types on the same set of patient samples. However, such methods often result in inconsistent assignment of subjects to the molecular cancer subtypes. Recently, several methods that offer a rigorous framework for the simultaneous integration of multiple ‘omic’ data types in a single comprehensive analysis have been proposed in the literature. In this paper, we present a systematic review of existing integrative clustering methods. PMID:25243110
High dimensional data clustering by partitioning the hypergraphs using dense subgraph partition
NASA Astrophysics Data System (ADS)
Sun, Xili; Tian, Shoucai; Lu, Yonggang
2015-12-01
Due to the curse of dimensionality, traditional clustering methods usually fail to produce meaningful results for high dimensional data. Hypergraph partition is believed to be a promising method for dealing with this challenge. In this paper, we first construct a graph G from the data by defining an adjacency relationship between the data points using Shared Reverse k Nearest Neighbors (SRNN). Then a hypergraph is created from the graph G by defining the hyperedges to be all the maximal cliques in the graph G. After the hypergraph is produced, a powerful hypergraph partitioning method called dense subgraph partition (DSP), combined with the k-medoids method, is used to produce the final clustering results. The proposed method is evaluated on several real high-dimensional datasets, and the experimental results show that the proposed method can improve the clustering results of high dimensional data compared with applying the k-medoids method directly to the original data.
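The first step above, building an adjacency from shared reverse k-nearest neighbors, can be sketched as follows. The linking criterion here (sharing at least one reverse k-NN) is one plausible reading; the paper's exact SRNN definition may differ:

```python
import numpy as np

def srnn_adjacency(X, k=3, min_shared=1):
    """Sketch of a Shared Reverse k-Nearest-Neighbor (SRNN) adjacency:
    link two points when they share at least `min_shared` reverse
    k-nearest neighbors. Illustrative reading of the construction."""
    n = len(X)
    d = ((X[:, None] - X[None]) ** 2).sum(-1).astype(float)
    np.fill_diagonal(d, np.inf)
    knn = np.argsort(d, axis=1)[:, :k]            # forward kNN lists
    rnn = [set() for _ in range(n)]
    for i in range(n):
        for j in knn[i]:
            rnn[j].add(i)                         # i is a reverse NN of j
    A = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(i + 1, n):
            if len(rnn[i] & rnn[j]) >= min_shared:
                A[i, j] = A[j, i] = True
    return A
```

The maximal cliques of this graph would then become the hyperedges fed to the DSP partitioner.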
Variational Bayesian strategies for high-dimensional, stochastic design problems
NASA Astrophysics Data System (ADS)
Koutsourelakis, P. S.
2016-03-01
This paper is concerned with a lesser-studied problem in the context of model-based uncertainty quantification (UQ): that of optimization/design/control under uncertainty. The solution of such problems is hindered not only by the usual difficulties encountered in UQ tasks (e.g. the high computational cost of each forward simulation, the large number of random variables) but also by the need to solve a nonlinear optimization problem involving large numbers of design variables and potentially constraints. We propose a framework that is suitable for a class of such problems and is based on the idea of recasting them as probabilistic inference tasks. To that end, we propose a Variational Bayesian (VB) formulation and an iterative VB-Expectation-Maximization scheme that is capable of identifying a local maximum as well as a low-dimensional set of directions in the design space along which the objective exhibits the largest sensitivity. We demonstrate the validity of the proposed approach in the context of two numerical examples involving thousands of random and design variables. In all cases considered the cost of the computations in terms of calls to the forward model was of the order of 100 or less. The accuracy of the approximations provided is assessed by information-theoretic metrics.
Nuclear Potential Clustering As a New Tool to Detect Patterns in High Dimensional Datasets
NASA Astrophysics Data System (ADS)
Tonkova, V.; Paulus, D.; Neeb, H.
2013-02-01
We present a new approach for the clustering of high dimensional data without prior assumptions about the structure of the underlying distribution. The proposed algorithm is based on a concept adapted from nuclear physics. To partition the data, we model the dynamic behaviour of nucleons interacting in an N-dimensional space. An adaptive nuclear potential, comprising a short-range attractive term (strong interaction) and a long-range repulsive term (Coulomb force), is assigned to each data point. By modelling the dynamics, nucleons that are densely distributed in space fuse to build nuclei (clusters), whereas single point clusters repel each other. The formation of clusters is completed when the system reaches the state of minimal potential energy. The data are then grouped according to the particles' final effective potential energy level. The performance of the algorithm is tested with several synthetic datasets, showing that the proposed method can robustly identify clusters even when complex configurations are present. Furthermore, quantitative MRI data from 43 multiple sclerosis patients were analyzed, showing a reasonable splitting into subgroups according to the individual patients' disease grade. The good performance of the algorithm on such highly correlated non-spherical datasets, which are typical for MRI derived image features, shows that Nuclear Potential Clustering is a valuable tool for automated data analysis, not only in the MRI domain.
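The energy landscape driving the dynamics above can be illustrated with a simple pairwise potential, a short-range attraction plus a long-range Coulomb-like repulsion. The functional forms and constants below are stand-ins; the paper's adaptive potential is not specified in the abstract:

```python
import numpy as np

def total_potential(pos, a=1.0, b=0.1, r0=1.0):
    """Illustrative total energy for the nucleon analogy: sum over all
    pairs of a short-range attractive term and a Coulomb-like
    repulsion. Stand-in forms, not the paper's adaptive potential."""
    diff = pos[:, None, :] - pos[None, :, :]
    r = np.sqrt((diff ** 2).sum(-1))
    iu = np.triu_indices(len(pos), k=1)          # count each pair once
    r = r[iu]
    return float(np.sum(-a * np.exp(-r / r0) + b / r))
```

A dynamics or gradient-descent loop over this energy would drive nearby points to fuse into "nuclei" while distant singletons repel, with the final minimum-energy configuration defining the clusters.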
NASA Astrophysics Data System (ADS)
Li, Enying; Wang, Hu; Li, Guangyao
2012-09-01
High-dimensional model representation (HDMR) is a general set of metamodel assessment and analysis tools for improving the efficiency of analyzing high dimensional underlying system behavior. Compared with currently popular modeling methods, such as Kriging (KG), radial basis functions (RBF), and the moving least squares approximation method (MLS), the distinctive characteristic of HDMR is to decouple the input variables. Therefore, a high dimensional problem can be transformed into a low-dimensional function, a middle-dimensional function, or a combination of middle-dimensional functions. Although HDMR is a feasible method for high dimensional problems, the computational cost is still a bottleneck for complex engineering problems. To improve the efficiency of the HDMR method further, the purpose of this study is to use an intelligent sampling method for the HDMR. Because the HDMR cannot be integrated with the sampling method directly, a projection-based intelligent method is suggested. Compared with the popular HDMR methods, the construction procedure for the HDMR-based model is optimized. To validate the performance of the suggested method, multiple mathematical test functions are given to illustrate the modeling principles, procedures, and the efficiency and accuracy of HDMR models for problems with a wide scope of dimensionalities.
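The variable-decoupling idea can be seen in the standard first-order cut-HDMR expansion, f(x) ≈ f0 + Σᵢ fᵢ(xᵢ), where each component is built by varying one input at a time about a reference (cut) point. A minimal sketch (the paper's intelligent sampling refinement is not reproduced):

```python
import numpy as np

def cut_hdmr_first_order(f, x_ref, grids):
    """First-order cut-HDMR sketch: f(x) ~ f0 + sum_i f_i(x_i), with
    each 1-D component tabulated by varying one input about the cut
    point x_ref. `grids` holds one 1-D sample array per variable."""
    f0 = f(x_ref)
    components = []
    for i, grid in enumerate(grids):
        vals = []
        for xi in grid:
            x = x_ref.copy()
            x[i] = xi
            vals.append(f(x) - f0)               # f_i sampled on the grid
        components.append((np.asarray(grid), np.asarray(vals)))
    def surrogate(x):
        s = f0
        for i, (g, v) in enumerate(components):
            s += np.interp(x[i], g, v)           # 1-D interpolation
        return s
    return surrogate

# an additive function is captured exactly by the first-order expansion
f = lambda x: 2.0 * x[0] + x[1] ** 2
x_ref = np.zeros(2)
grids = [np.linspace(-1, 1, 11), np.linspace(-1, 1, 11)]
s = cut_hdmr_first_order(f, x_ref, grids)
```

Only one-dimensional sweeps of the expensive model are needed, which is the source of the efficiency gain the abstract refers to.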
Clustering High-Dimensional Landmark-based Two-dimensional Shape Data‡
Huang, Chao; Styner, Martin; Zhu, Hongtu
2015-01-01
An important goal in image analysis is to cluster and recognize objects of interest according to the shapes of their boundaries. Clustering such objects faces at least four major challenges including a curved shape space, a high-dimensional feature space, a complex spatial correlation structure, and shape variation associated with some covariates (e.g., age or gender). The aim of this paper is to develop a penalized model-based clustering framework to cluster landmark-based planar shape data, while explicitly addressing these challenges. Specifically, a mixture of offset-normal shape factor analyzers (MOSFA) is proposed with mixing proportions defined through a regression model (e.g., logistic) and an offset-normal shape distribution in each component for data in the curved shape space. A latent factor analysis model is introduced to explicitly model the complex spatial correlation. A penalized likelihood approach with both adaptive pairwise fusion Lasso penalty function and L2 penalty function is used to automatically realize variable selection via thresholding and deliver a sparse solution. Our real data analysis has confirmed the excellent finite-sample performance of MOSFA in revealing meaningful clusters in the corpus callosum shape data obtained from the Attention Deficit Hyperactivity Disorder-200 (ADHD-200) study. PMID:26604425
CHARACTERIZATION OF DISCONTINUITIES IN HIGH-DIMENSIONAL STOCHASTIC PROBLEMS ON ADAPTIVE SPARSE GRIDS
Jakeman, John D; Archibald, Richard K; Xiu, Dongbin
2011-01-01
In this paper we present a set of efficient algorithms for the detection and identification of discontinuities in high dimensional space. The method is based on an extension of polynomial annihilation for edge detection in low dimensions. Compared to earlier work, the present method poses significant improvements for high dimensional problems. The core of the algorithms relies on adaptive refinement of sparse grids. It is demonstrated that in the commonly encountered cases where a discontinuity resides on a small subset of the dimensions, the present method becomes optimal, in the sense that the total number of points required for function evaluations depends linearly on the dimensionality of the space. The details of the algorithms are presented and various numerical examples are utilized to demonstrate the efficacy of the method.
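The low-dimensional building block, flagging a jump because low-order polynomial differences blow up across it, can be illustrated in 1-D with divided differences. This is a much-simplified stand-in for the paper's polynomial annihilation operator:

```python
import numpy as np

def divided_diff_indicator(xs, fs):
    """Crude 1-D discontinuity indicator: the change in first divided
    differences, scaled by the local spacing, is O(1/h) at a jump and
    small where f is smooth. Simplified stand-in for polynomial
    annihilation edge detection."""
    ind = np.zeros(len(xs))
    for i in range(1, len(xs) - 1):
        d1 = (fs[i] - fs[i - 1]) / (xs[i] - xs[i - 1])
        d2 = (fs[i + 1] - fs[i]) / (xs[i + 1] - xs[i])
        ind[i] = abs(d2 - d1) / (xs[i + 1] - xs[i - 1])
    return ind

xs = np.linspace(0.0, 1.0, 21)
fs = np.where(xs < 0.52, 0.0, 1.0)     # step discontinuity near x = 0.52
ind = divided_diff_indicator(xs, fs)
```

The adaptive sparse-grid machinery in the paper applies this kind of indicator dimension by dimension, refining only where the indicator is large.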
Cardiac motion estimation by using high-dimensional features and K-means clustering method
NASA Astrophysics Data System (ADS)
Oubel, Estanislao; Hero, Alfred O.; Frangi, Alejandro F.
2006-03-01
Tagged Magnetic Resonance Imaging (MRI) is currently the reference modality for myocardial motion and strain analysis. Mutual Information (MI) based non-rigid registration has proven to be an accurate method to retrieve cardiac motion and overcome many drawbacks of previous approaches. In a previous work [1], we used Wavelet-based Attribute Vectors (WAVs) instead of pixel intensity to measure similarity between frames. Since the curse of dimensionality forbids the use of histograms to estimate the MI of high-dimensional features, k-Nearest Neighbor Graphs (kNNG) were applied to calculate α-MI. Results showed that cardiac motion estimation was feasible with that approach. In this paper, the K-means clustering method is applied to compute MI from the same set of WAVs. The proposed method was applied to four tagged MRI sequences, and the resulting displacements were compared with manual measurements made by two observers. Results show that more accurate motion estimation is obtained compared with the use of pixel intensity.
NASA Astrophysics Data System (ADS)
Mohammad khaninezhad, M.; Jafarpour, B.
2012-12-01
Data limitation and heterogeneity of the geologic formations introduce significant uncertainty in predicting the related flow and transport processes in these environments. Fluid flow and displacement behavior in subsurface systems is mainly controlled by the structural connectivity models that create preferential flow pathways (or barriers). The connectivity of extreme geologic features strongly constrains the evolution of the related flow and transport processes in subsurface formations. Therefore, characterization of the geologic continuity and facies connectivity is critical for reliable prediction of the flow and transport behavior. The goal of this study is to develop a robust and geologically consistent framework for solving large-scale nonlinear subsurface characterization inverse problems under uncertainty about geologic continuity and structural connectivity. We formulate a novel inverse modeling approach by adopting a sparse reconstruction perspective, which involves two major components: 1) sparse description of hydraulic property distribution under significant uncertainty in structural connectivity and 2) formulation of an effective sparsity-promoting inversion method that is robust against prior model uncertainty. To account for the significant variability in the structural connectivity, we use, as prior, multiple distinct connectivity models. For sparse/compact representation of high-dimensional hydraulic property maps, we investigate two methods. In one approach, we apply principal component analysis (PCA) to each prior connectivity model individually and combine the resulting leading components from each model to form a diverse geologic dictionary. Alternatively, we combine many realizations of the hydraulic properties from different prior connectivity models and use them to generate a diverse training dataset. We use the training dataset with a sparsifying transform, such as K-SVD, to construct a sparse geologic dictionary that is robust to
'SGoFicance Trace': assessing significance in high dimensional testing problems.
de Uña-Alvarez, Jacobo; Carvajal-Rodriguez, Antonio
2010-01-01
Recently, an exact binomial test called SGoF (Sequential Goodness-of-Fit) has been introduced as a new method for handling high dimensional testing problems. SGoF looks for statistical significance when comparing the amount of null hypotheses individually rejected at level γ = 0.05 with the expected amount under the intersection null, and then proceeds to declare a number of effects accordingly. SGoF detects an increasing proportion of true effects with the number of tests, unlike other methods for which the opposite is true. It is worth mentioning that the choice γ = 0.05 is not essential to the SGoF procedure, and more power may be reached at other values of γ depending on the situation. In this paper we enhance the possibilities of SGoF by letting the γ vary on the whole interval (0,1). In this way, we introduce the 'SGoFicance Trace' (from SGoF's significance trace), a graphical complement to SGoF which can help to make decisions in multiple-testing problems. A script has been written for the computation in R of the SGoFicance Trace. This script is available from the web site http://webs.uvigo.es/acraaj/SGoFicance.htm.
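The core SGoF comparison described above, testing whether the count of p-values below γ exceeds its expectation under the intersection null, can be sketched with a one-sided binomial test. This illustrates the idea only; the published SGoF procedure includes further steps for declaring effects:

```python
import math

def sgof_excess(pvalues, gamma=0.05, alpha=0.05):
    """Sketch of the SGoF idea: count p-values at or below gamma and
    compare with the Binomial(n, gamma) expectation under the
    intersection null via a one-sided exact binomial test."""
    n = len(pvalues)
    r = sum(p <= gamma for p in pvalues)
    # one-sided tail probability P(X >= r), X ~ Binomial(n, gamma)
    tail = sum(math.comb(n, k) * gamma**k * (1 - gamma)**(n - k)
               for k in range(r, n + 1))
    significant = tail <= alpha
    return r, tail, significant
```

The 'SGoFicance Trace' of the paper amounts to sweeping `gamma` over (0, 1) and plotting the resulting significance, so a single value of γ = 0.05 is not privileged.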
NASA Astrophysics Data System (ADS)
Choo, Jaegul; Lee, Hanseung; Liu, Zhicheng; Stasko, John; Park, Haesun
2013-01-01
Many modern data sets, such as text and image data, can be represented in high-dimensional vector spaces and have benefited from advanced computational methods. Visual analytics approaches have contributed greatly to data understanding and analysis due to their capability of leveraging humans' ability for quick visual perception. However, visual analytics targeting large-scale data such as text and image data has been challenging due to the limited screen space, in terms of both the number of data points and the number of features that can be represented. Among the various computational methods supporting visual analytics, dimension reduction and clustering have played essential roles by reducing these numbers in an intelligent way to visually manageable sizes. Given the numerous dimension reduction and clustering methods available, however, the choice of algorithms and their parameters becomes difficult. In this paper, we present an interactive visual testbed system for dimension reduction and clustering in large-scale high-dimensional data analysis. The testbed system enables users to apply various dimension reduction and clustering methods with different settings, visually compare the results from different algorithmic methods to obtain rich knowledge about the data and tasks at hand, and eventually choose the most appropriate path for a collection of algorithms and parameters. Using various data sets such as documents, images, and others that are already encoded as vectors, we demonstrate how the testbed system can support these tasks.
NASA Astrophysics Data System (ADS)
Nakano, S.; Higuchi, T.
2012-04-01
The particle filter (PF) is one of the ensemble-based algorithms for data assimilation. The PF obtains an approximation of the posterior PDF of a state by resampling with replacement from a prior ensemble. The procedure of the PF does not assume linearity or Gaussianity, so it can be applied to general nonlinear problems. However, in order to obtain appropriate results for high-dimensional problems, the PF requires an enormous number of ensemble members. Since the PF must calculate the time integral for each particle at each time step, the large ensemble size results in prohibitive computational cost. Various methods exist for reducing the number of particles. In contrast, we employ a straightforward approach to overcome this problem; that is, we use a massively parallel computer to achieve a sufficiently large ensemble size. Since the time integral in the PF can readily be parallelized, we can notably improve the computational efficiency using a parallel computer. However, if we naively implement the PF on a distributed computing system, we encounter another difficulty: many data transfers occur randomly between different nodes of the distributed computing system. Such data transfers can be reduced by dividing the ensemble into small subsets (groups). If we limit the resampling within each of the subsets, the data transfers can be done efficiently in parallel. If the ensemble is divided into small subsets, the risk of local sample impoverishment within each of the subsets is enhanced. However, if we change the grouping at each time step, the information held by a node can be propagated to all of the nodes after a finite number of time steps and local sample impoverishment can be avoided. In the present study, we compare the above method, based on local resampling within each group, with the naive implementation of the PF based on global resampling of the whole ensemble. The global resampling enables us to achieve a slightly better
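The group-local resampling step described above can be sketched as follows: each group resamples with replacement only among its own members, so no particle crosses a group (node) boundary. The multinomial scheme and equal group sizes are simplifying assumptions:

```python
import random

def local_resample(weights, n_groups=4, seed=0):
    """Sketch of resampling restricted to subsets of the ensemble:
    particles are resampled with replacement only within their own
    group, keeping inter-node data transfers local. Returns, for each
    particle slot, the index of its parent particle."""
    rng = random.Random(seed)
    n = len(weights)
    size = n // n_groups                     # assume equal group sizes
    parents = []
    for g in range(n_groups):
        lo, hi = g * size, (g + 1) * size
        w = weights[lo:hi]
        total = sum(w)
        probs = [x / total for x in w]
        # multinomial resampling within the group only
        parents.extend(rng.choices(range(lo, hi), weights=probs, k=size))
    return parents
```

Permuting the particle indices between time steps before calling this corresponds to the regrouping that, per the abstract, propagates information across nodes and avoids local sample impoverishment.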
Semi-Supervised Clustering for High-Dimensional and Sparse Features
ERIC Educational Resources Information Center
Yan, Su
2010-01-01
Clustering is one of the most common data mining tasks, used frequently for data organization and analysis in various application domains. Traditional machine learning approaches to clustering are fully automated and unsupervised where class labels are unknown a priori. In real application domains, however, some "weak" form of side…
Cluster expression in fission and fusion in high-dimensional macroscopic-microscopic calculations
Iwamoto, A.; Ichikawa, T.; Moller, P.; Sierk, A. J.
2004-01-01
We discuss the relation between the fission-fusion potential-energy surfaces of very heavy nuclei and the formation process of these nuclei in cold-fusion reactions. In the potential-energy surfaces, we find a pronounced valley structure, with one valley corresponding to the cold-fusion reaction, the other to fission. As the touching point is approached in the cold-fusion entrance channel, an instability towards dynamical deformation of the projectile occurs, which enhances the fusion cross section. These two 'cluster effects' enhance the production of superheavy nuclei in cold-fusion reactions, in addition to the effect of the low compound-system excitation energy in these reactions. Heavy-ion fusion reactions have been used extensively to synthesize heavy elements beyond actinide nuclei. In order to proceed further in this direction, we need to understand the formation process more precisely, not just the decay process. The dynamics of the formation process are considerably more complex than the dynamics necessary to interpret the spontaneous-fission decay of heavy elements. However, before implementing a full dynamical description it is useful to understand the basic properties of the potential-energy landscape encountered in the initial stages of the collision. The collision process and entrance-channel landscape can conveniently be separated into two parts, namely the early-stage separated system before touching and the late-stage composite system after touching. The transition between these two stages is particularly important, but not very well understood until now. To understand better the transition between the two stages we analyze here in detail the potential energy landscape or 'collision surface' of the system both outside and inside the touching configuration of the target and projectile. In Sec. 2, we discuss calculated five-dimensional potential-energy landscapes inside touching and identify major features. In Sec. 3, we present calculated
Naim, Iftekhar; Datta, Suprakash; Rebhahn, Jonathan; Cavenaugh, James S; Mosmann, Tim R; Sharma, Gaurav
2014-05-01
We present a model-based clustering method, SWIFT (Scalable Weighted Iterative Flow-clustering Technique), for digesting high-dimensional large-sized datasets obtained via modern flow cytometry into more compact representations that are well-suited for further automated or manual analysis. Key attributes of the method include the following: (a) the analysis is conducted in the multidimensional space retaining the semantics of the data, (b) an iterative weighted sampling procedure is utilized to maintain modest computational complexity and to retain discrimination of extremely small subpopulations (hundreds of cells from datasets containing tens of millions), and (c) a splitting and merging procedure is incorporated in the algorithm to preserve distinguishability between biologically distinct populations, while still providing a significant compaction relative to the original data. This article presents a detailed algorithmic description of SWIFT, outlining the application-driven motivations for the different design choices, a discussion of computational complexity of the different steps, and results obtained with SWIFT for synthetic data and relatively simple experimental data that allow validation of the desirable attributes. A companion paper (Part 2) highlights the use of SWIFT, in combination with additional computational tools, for more challenging biological problems.
Recent Results from Application of the Implicit Particle Filter to High-dimensional Problems
NASA Astrophysics Data System (ADS)
Miller, R.; Weir, B.; Spitz, Y. H.
2012-12-01
We present our most recent results on the application of the implicit particle filter to a stochastic shallow water model of nearshore circulation. This highly nonlinear model has approximately 30,000 state variables, and, in our twin experiments, we assimilate 32 observed quantities. Applications of most particle methods to problems of this size are subject to sample impoverishment. In our implementation of the implicit particle filter, we have found that ensembles of manageable size can still retain a sufficient number of independent particles for reasonable accuracy.
A numerical algorithm for optimal feedback gains in high dimensional LQR problems
NASA Technical Reports Server (NTRS)
Banks, H. T.; Ito, K.
1986-01-01
A hybrid method for computing the feedback gains in linear quadratic regulator problems is proposed. The method, which combines the use of a Chandrasekhar type system with an iteration of the Newton-Kleinman form with variable acceleration parameter Smith schemes, is formulated so as to efficiently compute the feedback gains directly rather than solutions of an associated Riccati equation. The hybrid method is particularly appropriate when used with large dimensional systems such as those arising in approximating infinite dimensional (distributed parameter) control systems (e.g., those governed by delay-differential and partial differential equations). Computational advantages of the proposed algorithm over the standard eigenvector-based (Potter, Laub-Schur) techniques are discussed, and numerical evidence of the efficacy of our ideas is presented.
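The Newton-Kleinman iteration mentioned in this abstract reduces the algebraic Riccati equation to a sequence of Lyapunov equations. A minimal sketch in Python using SciPy's Lyapunov solver (this is the plain iteration only, not the authors' hybrid Chandrasekhar/Smith scheme; the system matrices below are our own illustrative example):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

def newton_kleinman(A, B, Q, R, K0, iters=50, tol=1e-10):
    """Newton-Kleinman iteration for the continuous-time LQR gain.

    Each step solves a Lyapunov equation for the cost matrix P of the
    current gain K, then updates K = R^{-1} B^T P. The initial gain K0
    must stabilize A - B @ K0.
    """
    K = K0
    for _ in range(iters):
        Ac = A - B @ K
        # Lyapunov equation: Ac^T P + P Ac = -(Q + K^T R K)
        P = solve_continuous_lyapunov(Ac.T, -(Q + K.T @ R @ K))
        K_new = np.linalg.solve(R, B.T @ P)
        if np.linalg.norm(K_new - K) < tol:
            return K_new, P
        K = K_new
    return K, P

# Example: double integrator with a scalar input
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
K0 = np.array([[1.0, 1.0]])  # A - B @ K0 is Hurwitz, so K0 is stabilizing
K, P = newton_kleinman(A, B, Q, R, K0)
```

At convergence P solves the Riccati equation directly, which is what the hybrid method tries to avoid forming for very large systems; the sketch is only meant to show where the Lyapunov solves enter.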
A facility for using cluster research to study environmental problems
Not Available
1991-11-01
This report begins by describing the general application of cluster-based research to environmental chemistry and the development of a Cluster Structure and Dynamics Research Facility (CSDRF). Next, four important areas of cluster research are described in more detail, including how they can impact environmental problems. These are: surface-supported clusters, water and contaminant interactions, time-resolved dynamic studies in clusters, and cluster structures and reactions. The facilities and equipment required for each area of research are then presented. The appendices contain the workshop agenda and a listing of the researchers who participated in the workshop discussions that led to this report.
MacGregor, James N
2015-10-01
Research on human performance in solving traveling salesman problems typically uses point sets as stimuli, and most models have proposed a processing stage at which stimulus dots are clustered. However, few empirical studies have investigated the effects of clustering on performance. In one recent study, researchers compared the effects of clustered, random, and regular stimuli, and concluded that clustering facilitates performance (Dry, Preiss, & Wagemans, 2012). Another study suggested that these results may have been influenced by the location rather than the degree of clustering (MacGregor, 2013). Two experiments are reported that mark an attempt to disentangle these factors. The first experiment tested several combinations of degree of clustering and cluster location, and revealed mixed evidence that clustering influences performance. In a second experiment, both factors were varied independently, showing that they interact. The results are discussed in terms of the importance of clustering effects, in particular, and perceptual factors, in general, during performance of the traveling salesman problem.
Haitian adolescent personality clusters and their problem area correlates.
McMahon, Robert C; Bryant, Vaughn E; Dévieux, Jessy G; Jean-Gilles, Michèle; Rosenberg, Rhonda; Malow, Robert M
2013-04-01
This study identified personality clusters among a community sample of adolescents of Haitian descent and related cluster subgroup membership to problems in the areas of substance abuse, mental and physical health, family and peer relationships, educational and vocational status, social skills, leisure and recreational pursuits, aggressive behavior-delinquency, and sexual risk activity. Three cluster subgroups were identified: dependent/conforming (N = 68), high pathology (N = 30), and confident/extroverted/conforming (N = 111). Although the overall sample was relatively healthy based on low average endorsement of problems across areas of expressed concern, significant physical health, mental health, relationship, educational, and HIV risk problems were identified in the MACI-identified high pathology cluster subgroup. The confident/extroverted/conforming cluster subgroup revealed few problems and appears to reflect a protective style.
ICANP2: Isoenergetic cluster algorithm for NP-complete Problems
NASA Astrophysics Data System (ADS)
Zhu, Zheng; Fang, Chao; Katzgraber, Helmut G.
NP-complete optimization problems with Boolean variables are of fundamental importance in computer science, mathematics and physics. Most notably, the minimization of general spin-glass-like Hamiltonians remains a difficult numerical task. There has been great interest in designing efficient heuristics to solve these computationally difficult problems. Inspired by the rejection-free isoenergetic cluster algorithm developed for Ising spin glasses, we present a generalized cluster update that can be applied to different NP-complete optimization problems with Boolean variables. The cluster updates allow for a widespread sampling of phase space, thus speeding up optimization. By carefully tuning the pseudo-temperature (needed to randomize the configurations) of the problem, we show that the method can efficiently tackle problems on topologies with a large site-percolation threshold. We illustrate the ICANP2 heuristic on paradigmatic optimization problems, such as the satisfiability problem and the vertex cover problem.
Solving global optimization problems on GPU cluster
NASA Astrophysics Data System (ADS)
Barkalov, Konstantin; Gergel, Victor; Lebedev, Ilya
2016-06-01
The paper contains the results of investigation of a parallel global optimization algorithm combined with a dimension reduction scheme. This allows solving multidimensional problems by means of reducing to data-independent subproblems with smaller dimension solved in parallel. The new element implemented in the research consists in using several graphic accelerators at different computing nodes. The paper also includes results of solving problems of well-known multiextremal test class GKLS on Lobachevsky supercomputer using tens of thousands of GPU cores.
The Heterogeneous P-Median Problem for Categorization Based Clustering
ERIC Educational Resources Information Center
Blanchard, Simon J.; Aloise, Daniel; DeSarbo, Wayne S.
2012-01-01
The p-median offers an alternative to centroid-based clustering algorithms for identifying unobserved categories. However, existing p-median formulations typically require data aggregation into a single proximity matrix, resulting in masked respondent heterogeneity. A proposed three-way formulation of the p-median problem explicitly considers…
Optimization of the K-means algorithm for the solution of high dimensional instances
NASA Astrophysics Data System (ADS)
Pérez, Joaquín; Pazos, Rodolfo; Olivares, Víctor; Hidalgo, Miguel; Ruiz, Jorge; Martínez, Alicia; Almanza, Nelva; González, Moisés
2016-06-01
This paper addresses the problem of clustering instances with a high number of dimensions. In particular, a new heuristic for reducing the complexity of the K-means algorithm is proposed. Traditionally, there are two approaches that deal with the clustering of instances with high dimensionality. The first executes a preprocessing step to remove those attributes of limited importance. The second, called divide and conquer, creates subsets that are clustered separately and later their results are integrated through post-processing. In contrast, this paper proposes a new solution, which consists of reducing the number of distance calculations from the objects to the centroids at the classification step. This heuristic is derived from visual observation of the clustering process of K-means, in which it was found that objects can only migrate to adjacent clusters without crossing distant clusters. Therefore, this heuristic can significantly reduce the number of distance calculations from an object to the centroids of the potential clusters to which it may be assigned. To validate the proposed heuristic, a set of experiments was designed with synthetic high dimensional instances. One of the most notable results was obtained for an instance of 25,000 objects and 200 dimensions, where execution time was reduced by up to 96.5% while the quality of the solution decreased by only 0.24% compared to the K-means algorithm.
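The adjacent-cluster idea above can be sketched directly: after one full assignment pass, each object is compared only against its current centroid and the few centroids nearest to that centroid. This is our own illustrative reading of the heuristic, not the paper's exact scheme, and the parameter names (`n_neighbors`, `centers_init`) are assumptions:

```python
import numpy as np

def kmeans_neighbor_pruned(X, k, centers_init=None, n_neighbors=3,
                           iters=20, seed=0):
    """K-means variant sketching the adjacent-cluster heuristic: after the
    first full assignment, each object is evaluated only against its current
    centroid and the n_neighbors centroids closest to it."""
    rng = np.random.default_rng(seed)
    if centers_init is None:
        centers_init = X[rng.choice(len(X), k, replace=False)]
    centers = np.asarray(centers_init, dtype=float).copy()
    # one full pass: every object against every centroid
    labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
    for _ in range(iters):
        # centroid-to-centroid distances identify each cluster's neighbors
        cdist = ((centers[:, None] - centers) ** 2).sum(-1)
        near = np.argsort(cdist, axis=1)[:, : n_neighbors + 1]  # incl. self
        for i, x in enumerate(X):
            cand = near[labels[i]]            # current + adjacent centroids only
            labels[i] = cand[np.argmin(((centers[cand] - x) ** 2).sum(-1))]
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# three well-separated blobs; seed one initial centre inside each blob
rng = np.random.default_rng(1)
blobs = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
X = np.vstack([c + 0.5 * rng.standard_normal((100, 2)) for c in blobs])
labels, centers = kmeans_neighbor_pruned(X, 3, centers_init=X[[0, 100, 200]])
```

Per iteration this computes `n_neighbors + 1` distances per object instead of `k`, which is where the claimed savings for large `k` would come from.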
Manifold learning to interpret JET high-dimensional operational space
NASA Astrophysics Data System (ADS)
Cannas, B.; Fanni, A.; Murari, A.; Pau, A.; Sias, G.; JET EFDA Contributors, the
2013-04-01
In this paper, the problem of visualization and exploration of JET high-dimensional operational space is considered. The data come from plasma discharges selected from JET campaigns from C15 (year 2005) up to C27 (year 2009). The aim is to learn the possible manifold structure embedded in the data and to create some representations of the plasma parameters on low-dimensional maps, which are understandable and which preserve the essential properties of the original data. A crucial issue for the design of such mappings is the quality of the dataset. This paper reports the details of the criteria used to properly select suitable signals downloaded from JET databases in order to obtain a dataset of reliable observations. Moreover, a statistical analysis is performed to recognize the presence of outliers. Finally data reduction, based on clustering methods, is performed to select a limited and representative number of samples for the operational space mapping. The high-dimensional operational space of JET is mapped using a widely used manifold learning method, the self-organizing maps. The results are compared with other data visualization methods. The obtained maps can be used to identify characteristic regions of the plasma scenario, making it possible to discriminate between regions with a high risk of disruption and those with a low risk of disruption.
Scalable Nearest Neighbor Algorithms for High Dimensional Data.
Muja, Marius; Lowe, David G
2014-11-01
For many computer vision and machine learning problems, large training sets are key for good performance. However, the most computationally expensive part of many computer vision and machine learning algorithms consists of finding nearest neighbor matches to high dimensional vectors that represent the training data. We propose new algorithms for approximate nearest neighbor matching and evaluate and compare them with previous algorithms. For matching high dimensional features, we find two algorithms to be the most efficient: the randomized k-d forest and a new algorithm proposed in this paper, the priority search k-means tree. We also propose a new algorithm for matching binary features by searching multiple hierarchical clustering trees and show it outperforms methods typically used in the literature. We show that the optimal nearest neighbor algorithm and its parameters depend on the data set characteristics and describe an automated configuration procedure for finding the best algorithm to search a particular data set. In order to scale to very large data sets that would otherwise not fit in the memory of a single machine, we propose a distributed nearest neighbor matching framework that can be used with any of the algorithms described in the paper. All this research has been released as an open source library called fast library for approximate nearest neighbors (FLANN), which has been incorporated into OpenCV and is now one of the most popular libraries for nearest neighbor matching. PMID:26353063
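The priority search k-means tree described here partitions points by recursive k-means. A toy sketch of such a tree (greedy descent only; FLANN's real implementation additionally keeps a priority queue over unexplored branches and is available through OpenCV, so everything below, including the parameter names, is our own simplified illustration):

```python
import numpy as np

def build_tree(X, idx, branching=4, leaf_size=16, rng=None):
    """Toy k-means tree: recursively partition the points idx by k-means
    with `branching` clusters until clusters fit in a leaf."""
    if rng is None:
        rng = np.random.default_rng(0)
    if len(idx) <= leaf_size:
        return ("leaf", idx)
    centers = X[rng.choice(idx, branching, replace=False)]
    for _ in range(10):                      # a few Lloyd iterations
        assign = np.argmin(((X[idx][:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(branching):
            if np.any(assign == j):
                centers[j] = X[idx[assign == j]].mean(axis=0)
    # final assignment against the settled centers
    assign = np.argmin(((X[idx][:, None] - centers) ** 2).sum(-1), axis=1)
    keep = [j for j in range(branching) if np.any(assign == j)]
    return ("node", centers[keep],
            [build_tree(X, idx[assign == j], branching, leaf_size, rng)
             for j in keep])

def query(tree, X, q):
    """Greedy descent: follow the nearest centre, brute-force the leaf."""
    if tree[0] == "leaf":
        idx = tree[1]
        return idx[np.argmin(((X[idx] - q) ** 2).sum(-1))]
    _, centers, children = tree
    return query(children[np.argmin(((centers - q) ** 2).sum(-1))], X, q)

rng = np.random.default_rng(42)
X = rng.standard_normal((500, 10))
tree = build_tree(X, np.arange(500))
```

Greedy descent alone returns an approximate neighbor; the priority-queue backtracking that FLANN adds is exactly what trades accuracy against search time.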
Distributed Computation of the knn Graph for Large High-Dimensional Point Sets
Plaku, Erion; Kavraki, Lydia E.
2009-01-01
High-dimensional problems arising from robot motion planning, biology, data mining, and geographic information systems often require the computation of k nearest neighbor (knn) graphs. The knn graph of a data set is obtained by connecting each point to its k closest points. As the research in the above-mentioned fields progressively addresses problems of unprecedented complexity, the demand for computing knn graphs based on arbitrary distance metrics and large high-dimensional data sets increases, exceeding resources available to a single machine. In this work we efficiently distribute the computation of knn graphs for clusters of processors with message passing. Extensions to our distributed framework include the computation of graphs based on other proximity queries, such as approximate knn or range queries. Our experiments show nearly linear speedup with over one hundred processors and indicate that similar speedup can be obtained with several hundred processors. PMID:19847318
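The decomposition that makes the knn-graph computation distributable is simple: each processor can build the neighbor lists for its own slice of the points independently. A brute-force sequential sketch of that chunking (no message passing; the names are illustrative, not the authors' API):

```python
import numpy as np

def knn_graph_chunk(X, query_idx, k):
    """k nearest neighbors (excluding self) for the points in query_idx.
    In a distributed setting each worker would handle one such chunk."""
    d = ((X[query_idx][:, None] - X) ** 2).sum(-1)
    d[np.arange(len(query_idx)), query_idx] = np.inf  # mask self-distances
    return np.argsort(d, axis=1)[:, :k]

def knn_graph(X, k, n_chunks=4):
    """Full knn graph assembled from independent chunks; here the
    'distribution' is just a sequential loop over the chunks."""
    chunks = np.array_split(np.arange(len(X)), n_chunks)
    return np.vstack([knn_graph_chunk(X, c, k) for c in chunks])

# four points on a line: nearest neighbors are unambiguous
X = np.array([[0.0], [1.0], [3.0], [7.0]])
G = knn_graph(X, 1)
```

Because each chunk only reads `X` and writes its own rows, the chunks are embarrassingly parallel; the interesting engineering in the paper is in distributing `X` itself when it no longer fits on one machine.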
Statistical Physics of High Dimensional Inference
NASA Astrophysics Data System (ADS)
Advani, Madhu; Ganguli, Surya
To model modern large-scale datasets, we need efficient algorithms to infer a set of P unknown model parameters from N noisy measurements. What are fundamental limits on the accuracy of parameter inference, given limited measurements, signal-to-noise ratios, prior information, and computational tractability requirements? How can we combine prior information with measurements to achieve these limits? Classical statistics gives incisive answers to these questions as the measurement density α = N/P → ∞. However, modern high-dimensional inference problems, in fields ranging from bio-informatics to economics, occur at finite α. We formulate and analyze high-dimensional inference analytically by applying the replica and cavity methods of statistical physics where data serves as quenched disorder and inferred parameters play the role of thermal degrees of freedom. Our analysis reveals that widely cherished Bayesian inference algorithms such as maximum likelihood and maximum a posteriori are suboptimal in the modern setting, and yields new tractable, optimal algorithms to replace them as well as novel bounds on the achievable accuracy of a large class of high-dimensional inference algorithms. Thanks to Stanford Graduate Fellowship and Mind Brain Computation IGERT grant for support.
Problem decomposition by mutual information and force-based clustering
NASA Astrophysics Data System (ADS)
Otero, Richard Edward
The scale of engineering problems has sharply increased over the last twenty years. Larger coupled systems, increasing complexity, and limited resources create a need for methods that automatically decompose problems into manageable sub-problems by discovering and leveraging problem structure. The ability to learn the coupling (inter-dependence) structure and reorganize the original problem could lead to large reductions in the time to analyze complex problems. Such decomposition methods could also provide engineering insight on the fundamental physics driving problem solution. This work advances the state of the art in engineering decomposition through the application of techniques originally developed within computer science and information theory. The work describes the current state of automatic problem decomposition in engineering and utilizes several promising ideas to advance the state of the practice. Mutual information is a novel metric for data dependence and works on both continuous and discrete data. Mutual information can measure both the linear and non-linear dependence between variables without the limitations of linear dependence measured through covariance. Mutual information is also able to handle data that does not have derivative information, unlike other metrics that require it. The value of mutual information to engineering design work is demonstrated on a planetary entry problem. This study utilizes a novel tool developed in this work for planetary entry system synthesis. A graphical method, force-based clustering, is used to discover related sub-graph structure as a function of problem structure and links ranked by their mutual information. This method does not require the stochastic use of neural networks and could be used with any link ranking method currently utilized in the field. Application of this method is demonstrated on a large, coupled low-thrust trajectory problem. Mutual information also serves as the basis for an
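The mutual-information link ranking described here is straightforward to compute for discrete data from the empirical joint distribution. A plug-in estimator in nats (a generic sketch, not the author's tool):

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information I(X;Y) in nats from paired discrete samples,
    estimated from the empirical joint distribution."""
    xv, xi = np.unique(x, return_inverse=True)
    yv, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xv), len(yv)))
    np.add.at(joint, (xi, yi), 1)          # count co-occurrences
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)  # marginals
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

# y = x**2 has zero covariance with x yet positive mutual information,
# which is the advantage over covariance-based dependence measures
x = np.array([-1, -1, 0, 0, 1, 1])
mi = mutual_information(x, x ** 2)
```

This is the abstract's point in miniature: a covariance-based metric scores the pair `(x, x**2)` as unrelated, while mutual information detects the deterministic dependence.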
Application of clustering global optimization to thin film design problems.
Lemarchand, Fabien
2014-03-10
Refinement techniques usually calculate an optimized local solution, which is strongly dependent on the initial formula used for the thin film design. In the present study, a clustering global optimization method is used which can iteratively change this initial formula, thereby progressing further than in the case of local optimization techniques. A wide panel of local solutions is found using this procedure, resulting in a large range of optical thicknesses. The efficiency of this technique is illustrated by two thin film design problems, in particular an infrared antireflection coating, and a solar-selective absorber coating. PMID:24663856
Statistical challenges of high-dimensional data
Johnstone, Iain M.; Titterington, D. Michael
2009-01-01
Modern applications of statistical theory and methods can involve extremely large datasets, often with huge numbers of measurements on each of a comparatively small number of experimental units. New methodology and accompanying theory have emerged in response: the goal of this Theme Issue is to illustrate a number of these recent developments. This overview article introduces the difficulties that arise with high-dimensional data in the context of the very familiar linear statistical model: we give a taste of what can nevertheless be achieved when the parameter vector of interest is sparse, that is, contains many zero elements. We describe other ways of identifying low-dimensional subspaces of the data space that contain all useful information. The topic of classification is then reviewed along with the problem of identifying, from within a very large set, the variables that help to classify observations. Brief mention is made of the visualization of high-dimensional data and ways to handle computational problems in Bayesian analysis are described. At appropriate points, reference is made to the other papers in the issue. PMID:19805443
A facility for using cluster research to study environmental problems. Workshop proceedings
Not Available
1991-11-01
This report begins by describing the general application of cluster based research to environmental chemistry and the development of a Cluster Structure and Dynamics Research Facility (CSDRF). Next, four important areas of cluster research are described in more detail, including how they can impact environmental problems. These are: surface-supported clusters, water and contaminant interactions, time-resolved dynamic studies in clusters, and cluster structures and reactions. These facilities and equipment required for each area of research are then presented. The appendices contain workshop agenda and a listing of the researchers who participated in the workshop discussions that led to this report.
Random rotation survival forest for high dimensional censored data.
Zhou, Lifeng; Wang, Hong; Xu, Qingsong
2016-01-01
Recently, rotation forest has been extended to regression and survival analysis problems. However, due to the intensive computation incurred by principal component analysis, rotation forest often fails when confronted with high-dimensional or big data. In this study, we extend rotation forest to high dimensional censored time-to-event data analysis by combining random subspace, bagging and rotation forest. Supported by proper statistical analysis, we show that the proposed method, random rotation survival forest, outperforms state-of-the-art survival ensembles such as random survival forest and popular regularized Cox models. PMID:27625979
An approximation polynomial-time algorithm for a sequence bi-clustering problem
NASA Astrophysics Data System (ADS)
Kel'manov, A. V.; Khamidullin, S. A.
2015-06-01
We consider a strongly NP-hard problem of partitioning a finite sequence of vectors in Euclidean space into two clusters using the criterion of the minimal sum of the squared distances from the elements of the clusters to the centers of the clusters. The center of one of the clusters is to be optimized and is determined as the mean value over all vectors in this cluster. The center of the other cluster is fixed at the origin. Moreover, the partition is such that the difference between the indices of two successive vectors in the first cluster is bounded above and below by prescribed constants. A 2-approximation polynomial-time algorithm is proposed for this problem.
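The criterion being minimized can be written down directly: the first cluster is measured against its own mean, the complement against the origin. A sketch of the objective for a candidate index set (evaluation only; neither the 2-approximation algorithm nor the index-difference constraint check, and the function name is ours):

```python
import numpy as np

def biclustering_cost(Y, first_idx):
    """Sum of squared distances criterion for a two-cluster partition of
    the sequence Y: rows first_idx form the cluster whose centre is its
    own mean; the remaining rows form the cluster centred at the origin."""
    first_idx = np.asarray(first_idx)
    M = Y[first_idx]
    mask = np.ones(len(Y), dtype=bool)
    mask[first_idx] = False
    c = M.mean(axis=0)                     # optimized centre: cluster mean
    return float(((M - c) ** 2).sum() + (Y[mask] ** 2).sum())

# toy sequence: putting the two equal vectors together is optimal
Y = np.array([[2.0], [0.0], [2.0]])
best = biclustering_cost(Y, [0, 2])
worse = biclustering_cost(Y, [0, 1])
```

The NP-hardness comes from searching over the admissible index sets; the approximation algorithm in the paper bounds how far its chosen set's cost can be from the optimum (within a factor of 2).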
Sparse High Dimensional Models in Economics.
Fan, Jianqing; Lv, Jinchi; Qi, Lei
2011-09-01
This paper reviews the literature on sparse high dimensional models and discusses some applications in economics and finance. Recent developments of theory, methods, and implementations in penalized least squares and penalized likelihood methods are highlighted. These variable selection methods are proved to be effective in high dimensional sparse modeling. The limits of dimensionality that regularization methods can handle, the role of penalty functions, and their statistical properties are detailed. Some recent advances in ultra-high dimensional sparse modeling are also briefly discussed. PMID:22022635
Bayesian Methods for High Dimensional Linear Models
Mallick, Himel; Yi, Nengjun
2013-01-01
In this article, we present a selective overview of some recent developments in Bayesian model and variable selection methods for high dimensional linear models. While most of the reviews in literature are based on conventional methods, we focus on recently developed methods, which have proven to be successful in dealing with high dimensional variable selection. First, we give a brief overview of the traditional model selection methods (viz. Mallow’s Cp, AIC, BIC, DIC), followed by a discussion on some recently developed methods (viz. EBIC, regularization), which have occupied the minds of many statisticians. Then, we review high dimensional Bayesian methods with a particular emphasis on Bayesian regularization methods, which have been used extensively in recent years. We conclude by briefly addressing the asymptotic behaviors of Bayesian variable selection methods for high dimensional linear models under different regularity conditions. PMID:24511433
Analyzing High-Dimensional Multispectral Data
NASA Technical Reports Server (NTRS)
Lee, Chulhee; Landgrebe, David A.
1993-01-01
In this paper, through a series of specific examples, we illustrate some characteristics encountered in analyzing high-dimensional multispectral data. The increased importance of the second-order statistics in analyzing high-dimensional data is illustrated, as is the shortcoming of classifiers such as the minimum distance classifier which rely on first-order variations alone. We also illustrate how inaccurate estimation of first- and second-order statistics, e.g., from use of training sets which are too small, affects the performance of a classifier. Recognizing the importance of second-order statistics on the one hand, but the increased difficulty in perceiving and comprehending information present in statistics derived from high-dimensional data on the other, we propose a method to aid visualization of high-dimensional statistics using a color coding scheme.
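The shortcoming of first-order-only classifiers is easy to demonstrate: give two classes the same mean but very different covariances. A toy experiment (our own illustration, not an example from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# identical (zero) class means, very different covariances
X0 = 0.3 * rng.standard_normal((n, 2))   # tight cloud
X1 = 3.0 * rng.standard_normal((n, 2))   # broad cloud
X = np.vstack([X0, X1])
y = np.r_[np.zeros(n), np.ones(n)]

# minimum distance classifier: nearest class mean (first-order only)
m0, m1 = X0.mean(0), X1.mean(0)
pred_md = (((X - m1) ** 2).sum(1) < ((X - m0) ** 2).sum(1)).astype(float)

# Gaussian quadratic classifier: also uses the class covariances
def loglik(Xc, S):
    Si = np.linalg.inv(S)
    return (-0.5 * np.einsum('ij,jk,ik->i', Xc, Si, Xc)
            - 0.5 * np.log(np.linalg.det(S)))

S0, S1 = np.cov(X0.T), np.cov(X1.T)
pred_q = (loglik(X - m1, S1) > loglik(X - m0, S0)).astype(float)

acc_md = (pred_md == y).mean()   # near chance: means carry no information
acc_q = (pred_q == y).mean()     # high: covariances separate the classes
```

The gap between the two accuracies is the paper's point: in high-dimensional multispectral data, discarding second-order structure can discard most of the usable class information.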
Numerical methods for high-dimensional probability density function equations
NASA Astrophysics Data System (ADS)
Cho, H.; Venturi, D.; Karniadakis, G. E.
2016-01-01
In this paper we address the problem of computing the numerical solution to kinetic partial differential equations involving many phase variables. These types of equations arise naturally in many different areas of mathematical physics, e.g., in particle systems (Liouville and Boltzmann equations), stochastic dynamical systems (Fokker-Planck and Dostupov-Pugachev equations), random wave theory (Malakhov-Saichev equations) and coarse-grained stochastic systems (Mori-Zwanzig equations). We propose three different classes of new algorithms addressing high-dimensionality: The first one is based on separated series expansions resulting in a sequence of low-dimensional problems that can be solved recursively and in parallel by using alternating direction methods. The second class of algorithms relies on truncation of interaction in low-orders that resembles the Bogoliubov-Born-Green-Kirkwood-Yvon (BBGKY) framework of kinetic gas theory and it yields a hierarchy of coupled probability density function equations. The third class of algorithms is based on high-dimensional model representations, e.g., the ANOVA method and probabilistic collocation methods. A common feature of all these approaches is that they are reducible to the problem of computing the solution to high-dimensional equations via a sequence of low-dimensional problems. The effectiveness of the new algorithms is demonstrated in numerical examples involving nonlinear stochastic dynamical systems and partial differential equations, with up to 120 variables.
James, Lisa M; Taylor, Jeanette
2007-04-01
The co-occurrence of personality disorders (PDs) and substance use disorders (SUDs) can be partially attributed to shared underlying personality traits. This study examined the role of negative emotionality (NEM) and impulsivity in 617 university students with self-reported substance use problems and Cluster B PD symptoms. Results indicated that NEM was significantly associated with drug and alcohol use problems, antisocial PD, borderline PD, and narcissistic PD. Impulsivity was significantly associated with drug use problems, antisocial PD, and histrionic PD. Only NEM mediated the relationship between alcohol use problems and symptoms of each of the Cluster B PDs while impulsivity mediated only the relationship between drug use problems and histrionic PD. These results suggest that NEM may be more relevant than impulsivity to our understanding of the co-occurrence between substance use problems and Cluster B PD features.
Clusters of primordial black holes and reionization problem
Belotsky, K. M. Kirillov, A. A. Rubin, S. G.
2015-05-15
Clusters of primordial black holes may cause the formation of quasars in the early Universe. In turn, radiation from these quasars may lead to the reionization of the Universe. However, the evaporation of primordial black holes via Hawking’s mechanism may also contribute to the ionization of matter. The possibility of matter ionization via the evaporation of primordial black holes with allowance for existing constraints on their density is discussed. The contribution to ionization from the evaporation of primordial black holes characterized by their preset mass spectrum can roughly be estimated at about 10⁻³.
Feature extraction and classification algorithms for high dimensional data
NASA Technical Reports Server (NTRS)
Lee, Chulhee; Landgrebe, David
1993-01-01
Feature extraction and classification algorithms for high dimensional data are investigated. Developments with regard to sensors for Earth observation are moving in the direction of providing much higher dimensional multispectral imagery than is now possible. In analyzing such high dimensional data, processing time becomes an important factor. With large increases in dimensionality and the number of classes, processing time will increase significantly. To address this problem, a multistage classification scheme is proposed which reduces the processing time substantially by eliminating unlikely classes from further consideration at each stage. Several truncation criteria are developed and the relationship between thresholds and the error caused by the truncation is investigated. Next an approach to feature extraction for classification is proposed based directly on the decision boundaries. It is shown that all the features needed for classification can be extracted from decision boundaries. A characteristic of the proposed method arises by noting that only a portion of the decision boundary is effective in discriminating between classes, and the concept of the effective decision boundary is introduced. The proposed feature extraction algorithm has several desirable properties: it predicts the minimum number of features necessary to achieve the same classification accuracy as in the original space for a given pattern recognition problem; and it finds the necessary feature vectors. The proposed algorithm does not deteriorate under the circumstances of equal means or equal covariances as some previous algorithms do. In addition, the decision boundary feature extraction algorithm can be used both for parametric and non-parametric classifiers. Finally, some problems encountered in analyzing high dimensional data are studied and possible solutions are proposed. First, the increased importance of the second order statistics in analyzing high dimensional data is recognized
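The truncation idea in the multistage scheme lends itself to a compact illustration. The sketch below is hypothetical (one-dimensional Gaussian per-class scores stand in for the real discriminant functions, and the threshold value is arbitrary), but it shows how unlikely classes can be eliminated before an expensive final stage:

```python
import math

# Stage 1 computes a cheap score per class; classes whose share of the total
# falls below the truncation threshold never reach the expensive stage 2.
def multistage_classify(x, class_means, threshold=0.01):
    scores = {c: math.exp(-0.5 * (x - m) ** 2) for c, m in class_means.items()}
    total = sum(scores.values())
    survivors = {c: s for c, s in scores.items() if s / total >= threshold}
    # Stage 2: full comparison restricted to the surviving classes.
    return max(survivors, key=survivors.get), len(survivors)

class_means = {c: float(c) for c in range(10)}  # ten well-separated classes
label, n_kept = multistage_classify(3.2, class_means)
print(label, n_kept)  # classes far from the observation are truncated early
```

With many classes, most of the per-class work in stage 2 is avoided, which is the source of the processing-time reduction described above.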
An Extended Membrane System with Active Membranes to Solve Automatic Fuzzy Clustering Problems.
Peng, Hong; Wang, Jun; Shi, Peng; Pérez-Jiménez, Mario J; Riscos-Núñez, Agustín
2016-05-01
This paper focuses on automatic fuzzy clustering problem and proposes a novel automatic fuzzy clustering method that employs an extended membrane system with active membranes that has been designed as its computing framework. The extended membrane system has a dynamic membrane structure; since membranes can evolve, it is particularly suitable for processing the automatic fuzzy clustering problem. A modification of a differential evolution (DE) mechanism was developed as evolution rules for objects according to membrane structure and object communication mechanisms. Under the control of both the object's evolution-communication mechanism and the membrane evolution mechanism, the extended membrane system can effectively determine the most appropriate number of clusters as well as the corresponding optimal cluster centers. The proposed method was evaluated over 13 benchmark problems and was compared with four state-of-the-art automatic clustering methods, two recently developed clustering methods and six classification techniques. The comparison results demonstrate the superiority of the proposed method in terms of effectiveness and robustness. PMID:26790484
Problem-Solving Environments (PSEs) to Support Innovation Clustering
NASA Technical Reports Server (NTRS)
Gill, Zann
1999-01-01
This paper argues that there is need for high level concepts to inform the development of Problem-Solving Environment (PSE) capability. A traditional approach to PSE implementation is to: (1) assemble a collection of tools; (2) integrate the tools; and (3) assume that collaborative work begins after the PSE is assembled. I argue for the need to start from the opposite premise, that promoting human collaboration and observing that process comes first, followed by the development of supporting tools, and finally evolution of PSE capability through input from collaborating project teams.
Identifying the number of population clusters with structure: problems and solutions.
Gilbert, Kimberly J
2016-05-01
The program structure has been used extensively to understand and visualize population genetic structure. It is one of the most commonly used clustering algorithms, cited over 11,500 times in Web of Science since its introduction in 2000. The method estimates ancestry proportions to assign individuals to clusters, and post hoc analyses of results may indicate the most likely number of clusters, or populations, on the landscape. However, as has been shown in this issue of Molecular Ecology Resources by Puechmaille (), when sampling is uneven across populations or across hierarchical levels of population structure, these post hoc analyses can be inaccurate and identify an incorrect number of population clusters. To solve this problem, Puechmaille () presents strategies for subsampling and new analysis methods that are robust to uneven sampling to improve inferences of the number of population clusters. PMID:27062588
Testing for associations with missing high-dimensional categorical covariates.
Schumi, Jennifer; DiRienzo, A Gregory; DeGruttola, Victor
2008-01-01
Understanding how long-term clinical outcomes relate to short-term response to therapy is an important topic of research with a variety of applications. In HIV, early measures of viral RNA levels are known to be a strong prognostic indicator of future viral load response. However, mutations observed in the high-dimensional viral genotype at an early time point may change this prognosis. Unfortunately, some subjects may not have a viral genetic sequence measured at the early time point, and the sequence may be missing for reasons related to the outcome. Complete-case analyses of missing data are generally biased when the assumption that data are missing completely at random is not met, and methods incorporating multiple imputation may not be well-suited for the analysis of high-dimensional data. We propose a semiparametric multiple testing approach to the problem of identifying associations between potentially missing high-dimensional covariates and response. Following the recent exposition by Tsiatis, unbiased nonparametric summary statistics are constructed by inversely weighting the complete cases according to the conditional probability of being observed, given data that is observed for each subject. Resulting summary statistics will be unbiased under the assumption of missing at random. We illustrate our approach through an application to data from a recent AIDS clinical trial, and demonstrate finite sample properties with simulations. PMID:20231909
Bayesian Analysis of High Dimensional Classification
NASA Astrophysics Data System (ADS)
Mukhopadhyay, Subhadeep; Liang, Faming
2009-12-01
Modern data mining and bioinformatics have presented an important playground for statistical learning techniques, where the number of input variables is possibly much larger than the sample size of the training data. In supervised learning, logistic regression or probit regression can be used to model a binary output and form perceptron classification rules based on Bayesian inference. In these cases, there is considerable interest in searching for sparse models in the high dimensional regression (or classification) setup. We first discuss two common challenges for analyzing high dimensional data. The first is the curse of dimensionality: the complexity of many existing algorithms scales exponentially with the dimensionality of the space, so the algorithms soon become computationally intractable and therefore inapplicable in many real applications. The second is multicollinearity among the predictors, which severely slows down the algorithms. In order to make Bayesian analysis operational in high dimension, we propose a novel hierarchical stochastic approximation Monte Carlo (HSAMC) algorithm, which overcomes the curse of dimensionality and the multicollinearity of predictors in high dimension, and possesses a self-adjusting mechanism to avoid local minima separated by high energy barriers. Models and methods are illustrated by simulations inspired by the field of genomics. Numerical results indicate that HSAMC can work as a general model selection sampler in high dimensional complex model spaces.
NASA Astrophysics Data System (ADS)
Masood, Tabasum
2016-07-01
The distribution of galaxies in the universe can be well understood through correlation function analysis. The lowest-order two-point autocorrelation function has remained a successful tool for understanding galaxy clustering phenomena. The two-point correlation function is the probability of finding two galaxies in a given volume separated by some particular distance: given a random galaxy at some location, it describes the probability that another galaxy will be found within a given distance. The correlation function is important for theoretical models of physical cosmology because it provides a means of testing models that assume different things about the contents of the universe. It is one way to characterize the distribution of galaxies in space, and it can be measured from observations or extracted from numerical N-body experiments. The correlation function is also a natural quantity in the theoretical dynamical description of gravitating systems. These correlations can answer many interesting questions about the evolution and distribution of galaxies.
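As a concrete companion to the description above, the purely illustrative sketch below estimates the correlation function with the simplest natural estimator, ξ(r) = DD(r)/RR(r) − 1, comparing pair counts in a toy catalogue against a matched uniform random catalogue; the bin edges and catalogue sizes are arbitrary choices, not tied to any survey:

```python
import random

def pair_counts(points, bins):
    """Histogram of pairwise separations into the given distance bins."""
    counts = [0] * (len(bins) - 1)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = sum((a - b) ** 2 for a, b in zip(points[i], points[j])) ** 0.5
            for k in range(len(bins) - 1):
                if bins[k] <= d < bins[k + 1]:
                    counts[k] += 1
                    break
    return counts

def two_point_xi(data, randoms, bins):
    """Natural estimator xi(r) = DD(r)/RR(r) - 1, normalised by pair totals."""
    dd = pair_counts(data, bins)
    rr = pair_counts(randoms, bins)
    n_dd = len(data) * (len(data) - 1) / 2
    n_rr = len(randoms) * (len(randoms) - 1) / 2
    return [(d / n_dd) / (r / n_rr) - 1 if r else 0.0 for d, r in zip(dd, rr)]

rng = random.Random(0)
def unit_cube(n):
    return [(rng.random(), rng.random(), rng.random()) for _ in range(n)]

data, randoms = unit_cube(200), unit_cube(200)
bins = [0.0, 0.1, 0.2, 0.3, 0.4]
print(two_point_xi(data, randoms, bins))  # near zero for an unclustered catalogue
```

A genuinely clustered catalogue would give ξ(r) significantly above zero at small separations, which is the statistic the abstract describes.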
ANISOTROPIC THERMAL CONDUCTION AND THE COOLING FLOW PROBLEM IN GALAXY CLUSTERS
Parrish, Ian J.; Sharma, Prateek; Quataert, Eliot
2009-09-20
We examine the long-standing cooling flow problem in galaxy clusters with three-dimensional magnetohydrodynamics simulations of isolated clusters including radiative cooling and anisotropic thermal conduction along magnetic field lines. The central regions of the intracluster medium (ICM) can have cooling timescales of ~200 Myr or shorter; in order to prevent a cooling catastrophe the ICM must be heated by some mechanism such as active galactic nucleus feedback or thermal conduction from the thermal reservoir at large radii. The cores of galaxy clusters are linearly unstable to the heat-flux-driven buoyancy instability (HBI), which significantly changes the thermodynamics of the cluster core. The HBI is a convective, buoyancy-driven instability that rearranges the magnetic field to be preferentially perpendicular to the temperature gradient. For a wide range of parameters, our simulations demonstrate that in the presence of the HBI, the effective radial thermal conductivity is reduced to ≲10% of the full Spitzer conductivity. With this suppression of conductive heating, the cooling catastrophe occurs on a timescale comparable to the central cooling time of the cluster. Thermal conduction alone is thus unlikely to stabilize clusters with low central entropies and short central cooling timescales. High central entropy clusters have sufficiently long cooling times that conduction can help stave off the cooling catastrophe for cosmologically interesting timescales.
ERIC Educational Resources Information Center
Raver, C. Cybele; Jones, Stephanie M.; Li-Grining, Christine; Zhai, Fuhua; Metzger, Molly W.; Solomon, Bonnie
2009-01-01
The present study evaluated the efficacy of a multicomponent, classroom-based intervention in reducing preschoolers' behavior problems. The Chicago School Readiness Project model was implemented in 35 Head Start classrooms using a clustered-randomized controlled trial design. Results indicate significant treatment effects (ds = 0.53-0.89) for…
ERIC Educational Resources Information Center
Brusco, Michael J.; Kohn, Hans-Friedrich
2009-01-01
The clique partitioning problem (CPP) requires the establishment of an equivalence relation for the vertices of a graph such that the sum of the edge costs associated with the relation is minimized. The CPP has important applications for the social sciences because it provides a framework for clustering objects measured on a collection of nominal…
Locating landmarks on high-dimensional free energy surfaces.
Chen, Ming; Yu, Tang-Qing; Tuckerman, Mark E
2015-03-17
Coarse graining of complex systems possessing many degrees of freedom can often be a useful approach for analyzing and understanding key features of these systems in terms of just a few variables. The relevant energy landscape in a coarse-grained description is the free energy surface as a function of the coarse-grained variables, which, despite the dimensional reduction, can still be an object of high dimension. Consequently, navigating and exploring this high-dimensional free energy surface is a nontrivial task. In this paper, we use techniques from multiscale modeling, stochastic optimization, and machine learning to devise a strategy for locating minima and saddle points (termed "landmarks") on a high-dimensional free energy surface "on the fly" and without requiring prior knowledge of or an explicit form for the surface. In addition, we propose a compact graph representation of the landmarks and connections between them, and we show that the graph nodes can be subsequently analyzed and clustered based on key attributes that elucidate important properties of the system. Finally, we show that knowledge of landmark locations allows for the efficient determination of their relative free energies via enhanced sampling techniques.
Fast Gibbs sampling for high-dimensional Bayesian inversion
NASA Astrophysics Data System (ADS)
Lucka, Felix
2016-11-01
Solving ill-posed inverse problems by Bayesian inference has recently attracted considerable attention. Compared to deterministic approaches, the probabilistic representation of the solution by the posterior distribution can be exploited to explore and quantify its uncertainties. In applications where the inverse solution is subject to further analysis procedures, this can be a significant advantage. Alongside theoretical progress, various new computational techniques allow us to sample very high dimensional posterior distributions: in Lucka (2012 Inverse Problems 28 125012), a Markov chain Monte Carlo posterior sampler was developed for linear inverse problems with ℓ1-type priors. In this article, we extend this single component (SC) Gibbs-type sampler to a wide range of priors used in Bayesian inversion, such as general ℓ_p^q priors with additional hard constraints. In addition to a fast computation of the conditional SC densities in an explicit, parameterized form, a fast, robust and exact sampling from these one-dimensional densities is key to obtaining an efficient algorithm. We demonstrate that a generalization of slice sampling can utilize their specific structure for this task and illustrate the performance of the resulting slice-within-Gibbs samplers with different computed examples. These new samplers allow us to perform sample-based Bayesian inference in high-dimensional scenarios with certain priors for the first time, including the inversion of computed tomography data with the popular isotropic total variation prior.
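Since slice sampling carries much of the weight in the method above, a minimal one-dimensional sketch may help. This is a generic slice sampler with stepping-out and shrinkage, not the paper's implementation, and the Laplace-like target density is an invented stand-in chosen to echo the ℓ1-type conditionals:

```python
import math
import random

def log_density(x):
    # Unnormalised log-density of a Laplace(0, 1) target, p(x) ∝ exp(-|x|).
    return -abs(x)

def slice_sample(x0, n_samples, width=1.0, seed=0):
    rng = random.Random(seed)
    samples, x = [], x0
    for _ in range(n_samples):
        # Draw an auxiliary height uniformly under the density at x.
        log_u = log_density(x) + math.log(rng.random())
        # Randomly position an interval around x and step out until both
        # ends leave the slice {x : p(x) > u}.
        left = x - width * rng.random()
        right = left + width
        while log_density(left) > log_u:
            left -= width
        while log_density(right) > log_u:
            right += width
        # Sample uniformly on the interval, shrinking it on rejection.
        while True:
            x1 = left + (right - left) * rng.random()
            if log_density(x1) > log_u:
                x = x1
                break
            if x1 < x:
                left = x1
            else:
                right = x1
        samples.append(x)
    return samples

draws = slice_sample(0.0, 5000)
print(round(sum(draws) / len(draws), 2))  # sample mean near the target's 0
```

In a slice-within-Gibbs scheme this one-dimensional update is applied to each component's conditional density in turn, which is why exact, robust one-dimensional sampling matters.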
NASA Astrophysics Data System (ADS)
Stewart, John; Miller, Mayo; Audo, Christine; Stewart, Gay
2012-12-01
This study examined the evolution of student responses to seven contextually different versions of two Force Concept Inventory questions in an introductory physics course at the University of Arkansas. The consistency in answering the closely related questions evolved little over the seven-question exam. A model for the state of student knowledge involving the probability of selecting one of the multiple-choice answers was developed. Criteria for using clustering algorithms to extract model parameters were explored and it was found that the overlap between the probability distributions of the model vectors was an important parameter in characterizing the cluster models. The course data were then clustered and the extracted model showed that students largely fit into two groups both pre- and postinstruction: one that answered all questions correctly with high probability and one that selected the distracter representing the same misconception with high probability. For the course studied, 14% of the students were left with persistent misconceptions post instruction on a static force problem and 30% on a dynamic Newton’s third law problem. These students selected the answer representing the predominant misconception slightly more consistently postinstruction, indicating that the course studied had been ineffective at moving this subgroup of students nearer a Newtonian force concept and had instead moved them slightly farther away from a correct conceptual understanding of these two problems. The consistency in answering pairs of problems with varied physical contexts is shown to be an important supplementary statistic to the score on the problems and suggests that the inclusion of such problem pairs in future conceptual inventories would be efficacious. Multiple, contextually varied questions further probe the structure of students’ knowledge. To allow working instructors to make use of the additional insight gained from cluster analysis, it is our hope that the
Particle Filters for Very High-Dimensional Systems
NASA Astrophysics Data System (ADS)
van Leeuwen, P. J.
2014-12-01
Nonlinear data assimilation for high-dimensional geophysical systems is a rapidly evolving field. Particle filters seem to be the most promising methods, as they do not require long chains of model runs to start sampling the posterior probability density function (pdf). Until very recently, developments in particle filtering were hampered by the 'curse of dimensionality', roughly meaning that the number of particles needed to avoid weight collapse grows exponentially with the dimension of the system. However, it has been realised that for particle filtering it is not the dimension of the state vector but the number of independent observations that is the problem. Furthermore, proposal densities that ensure better positioning of the particles in state space before observations are encountered lead to much better performance. Recently, particle filters have been proposed that do not suffer from weight collapse by construction. In this talk I will present several of these new filters, including the equivalent-weights particle filters, new combinations with the implicit particle filter, and filters using large-deviation theory. I will present basic ideas and applications to very high-dimensional systems, including a full climate model. Emphasis will be on the fruitful forward directions and on areas that still need attention, as we haven't solved the problem yet.
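A minimal bootstrap particle filter on a toy one-dimensional random walk illustrates the weighting and resampling steps behind the weight-collapse discussion; this generic sketch is not the equivalent-weights or implicit filter, and all model parameters are invented:

```python
import math
import random

def particle_filter(observations, n_particles=500, proc_sd=1.0, obs_sd=1.0, seed=1):
    """Bootstrap filter: propagate, weight by the likelihood, resample."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # Propagate each particle through the random-walk state model.
        particles = [x + rng.gauss(0.0, proc_sd) for x in particles]
        # Weight by the Gaussian observation likelihood (unnormalised).
        weights = [math.exp(-0.5 * ((y - x) / obs_sd) ** 2) for x in particles]
        total = sum(weights)
        weights = [w / total for w in weights]
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
        # Multinomial resampling fights weight degeneracy at each step.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates

rng = random.Random(2)
truth, obs, x = [], [], 0.0
for _ in range(50):
    x += rng.gauss(0.0, 1.0)
    truth.append(x)
    obs.append(x + rng.gauss(0.0, 1.0))
est = particle_filter(obs)
rmse = (sum((e - t) ** 2 for e, t in zip(est, truth)) / len(truth)) ** 0.5
print(round(rmse, 2))  # the filter tracks the hidden walk closely
```

With a single observed scalar per step this works well; the abstract's point is that many independent observations per step make the plain likelihood weights collapse, motivating the proposal-density and equal-weights constructions.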
Mode Estimation for High Dimensional Discrete Tree Graphical Models
Chen, Chao; Liu, Han; Metaxas, Dimitris N.; Zhao, Tianqi
2014-01-01
This paper studies the following problem: given samples from a high dimensional discrete distribution, we want to estimate the leading (δ, ρ)-modes of the underlying distributions. A point is defined to be a (δ, ρ)-mode if it is a local optimum of the density within a δ-neighborhood under metric ρ. As we increase the “scale” parameter δ, the neighborhood size increases and the total number of modes monotonically decreases. The sequence of the (δ, ρ)-modes reveal intrinsic topographical information of the underlying distributions. Though the mode finding problem is generally intractable in high dimensions, this paper unveils that, if the distribution can be approximated well by a tree graphical model, mode characterization is significantly easier. An efficient algorithm with provable theoretical guarantees is proposed and is applied to applications like data analysis and multiple predictions. PMID:25620859
NASA Astrophysics Data System (ADS)
Konno, Yohko; Suzuki, Keiji
This paper describes an approach to developing a general-purpose solution algorithm for large-scale job-shop scheduling problems (JSP) using Local Clustering Organization (LCO), a new solution method for the JSP. Building on the effective large-scale scheduling performance of standard LCO, we examine how to solve the JSP while maintaining the stability that induces better solutions. To improve solution performance, the optimization process of LCO is examined and the scheduling solution structure is extended to a new structure based on machine division. A solving method that introduces effective local clustering for this solution structure is proposed as an extended LCO. The extended LCO improves the scheduling evaluation efficiently through clustered parallel search that extends over plural machines. Results of applying the extended LCO to problems of various scales verify that it minimizes makespan and improves stability of performance.
A cluster-analytic study of substance problems and mental health among street youths.
Adlaf, E M; Zdanowicz, Y M
1999-11-01
Based on a cluster analysis of 211 street youths aged 13-24 years interviewed in 1992 in Toronto, Ontario, Canada, we describe the configuration of mental health and substance use outcomes. Eight clusters were suggested: Entrepreneurs (n = 19) were frequently involved in delinquent activity and were highly entrenched in the street lifestyle; Drifters (n = 35) had infrequent social contact, displayed lower than average family dysfunction, and were not highly entrenched in the street lifestyle; Partiers (n = 40) were distinguished by their recreational motivation for alcohol and drug use and their below average entrenchment in the street lifestyle; Retreatists (n = 32) were distinguished by their high coping motivation for substance use; Fringers (n = 48) were involved marginally in the street lifestyle and showed lower than average family dysfunction; Transcenders (n = 21), despite above average physical and sexual abuse, reported below average mental health or substance use problems; Vulnerables (n = 12) were characterized by high family dysfunction (including physical and sexual abuse), elevated mental health outcomes, and use of alcohol and other drugs motivated by coping and escapism; Sex Workers (n = 4) were highly entrenched in the street lifestyle and reported frequent commercial sexual work, above average sexual abuse, and extensive use of crack cocaine. The results showed that distress, self-esteem, psychotic thoughts, attempted suicide, alcohol problems, drug problems, dual substance problems, and dual disorders varied significantly among the eight clusters. Overall, the findings suggest the need for differential programming. The data showed that risk factors, mental health, and substance use outcomes vary among this population. Also, for some the web of mental health and substance use problems is inseparable.
A Selective Overview of Variable Selection in High Dimensional Feature Space.
Fan, Jianqing; Lv, Jinchi
2010-01-01
High dimensional statistical problems arise from diverse fields of scientific research and technological development. Variable selection plays a pivotal role in contemporary statistical learning and scientific discoveries. The traditional idea of best subset selection methods, which can be regarded as a specific form of penalized likelihood, is computationally too expensive for many modern statistical applications. Other forms of penalized likelihood methods have been successfully developed over the last decade to cope with high dimensionality. They have been widely applied for simultaneously selecting important variables and estimating their effects in high dimensional statistical inference. In this article, we present a brief account of the recent developments of theory, methods, and implementations for high dimensional variable selection. What limits of the dimensionality such methods can handle, what the role of penalty functions is, and what the statistical properties are rapidly drive the advances of the field. The properties of non-concave penalized likelihood and its roles in high dimensional statistical modeling are emphasized. We also review some recent advances in ultra-high dimensional variable selection, with emphasis on independence screening and two-scale methods. PMID:21572976
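As a small, hedged illustration of the penalized likelihood idea (the canonical lasso with an ℓ1 penalty, not any particular method from the article), cyclic coordinate descent with soft-thresholding simultaneously selects variables and estimates their effects; the toy data and penalty value are invented:

```python
import random

def soft_threshold(z, gamma):
    """Solution of the one-dimensional lasso subproblem."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """Cyclic coordinate descent for min (1/2n)||y - Xb||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual that excludes predictor j.
            r = [y[i] - sum(beta[k] * X[i][k] for k in range(p) if k != j)
                 for i in range(n)]
            rho = sum(X[i][j] * r[i] for i in range(n)) / n
            norm = sum(X[i][j] ** 2 for i in range(n)) / n
            beta[j] = soft_threshold(rho, lam) / norm
    return beta

# Toy data: only the first two of six predictors carry signal.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(100)]
y = [3 * row[0] - 2 * row[1] + rng.gauss(0, 0.1) for row in X]
beta = lasso_cd(X, y, lam=0.2)
print([round(b, 2) for b in beta])  # noise coefficients shrink toward zero
```

The soft-thresholding step is what sets irrelevant coefficients exactly to zero, which is the variable-selection behaviour the overview attributes to penalized likelihood; the non-concave penalties it emphasizes replace the fixed shrinkage `z - gamma` with a penalty-specific rule.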
Classification of high dimensional multispectral image data
NASA Technical Reports Server (NTRS)
Hoffbeck, Joseph P.; Landgrebe, David A.
1993-01-01
A method for classifying high dimensional remote sensing data is described. The technique uses a radiometric adjustment to allow a human operator to identify and label training pixels by visually comparing the remotely sensed spectra to laboratory reflectance spectra. Training pixels for materials without obvious spectral features are identified by traditional means. Features which are effective for discriminating between the classes are then derived from the original radiance data and used to classify the scene. This technique is applied to Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) data taken over Cuprite, Nevada in 1992, and the results are compared to an existing geologic map. The technique performed well despite noisy data and the fact that some of the materials in the scene lack absorption features. No adjustment for the atmosphere or other scene variables was made to the data before classification. While the experimental results compare favorably with the existing geologic map, the primary purpose of this research was to demonstrate the classification method rather than to map the geology of the Cuprite scene.
NASA Astrophysics Data System (ADS)
Chen, L. X.; Wu, Q. P.
2012-10-01
Recently, Dada et al. reported on the experimental entanglement concentration and violation of generalized Bell inequalities with orbital angular momentum (OAM) [Nat. Phys. 7, 677 (2011)]. Here we demonstrate that the high-dimensional entanglement concentration can be performed in arbitrary OAM subspaces with selectivity. Instead of violating the generalized Bell inequalities, the working principle of present entanglement concentration is visualized by the biphoton OAM Klyshko picture, and its good performance is confirmed and quantified through the experimental Shannon dimensionalities after concentration.
Graphics Processing Units and High-Dimensional Optimization
Zhou, Hua; Lange, Kenneth; Suchard, Marc A.
2011-01-01
This paper discusses the potential of graphics processing units (GPUs) in high-dimensional optimization problems. A single GPU card with hundreds of arithmetic cores can be inserted in a personal computer and dramatically accelerates many statistical algorithms. To exploit these devices fully, optimization algorithms should reduce to multiple parallel tasks, each accessing a limited amount of data. These criteria favor EM and MM algorithms that separate parameters and data. To a lesser extent block relaxation and coordinate descent and ascent also qualify. We demonstrate the utility of GPUs in nonnegative matrix factorization, PET image reconstruction, and multidimensional scaling. Speedups of 100 fold can easily be attained. Over the next decade, GPUs will fundamentally alter the landscape of computational statistics. It is time for more statisticians to get on-board. PMID:21847315
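The element-wise structure that makes these algorithms GPU-friendly is easy to see in nonnegative matrix factorization. The CPU sketch below uses the classic multiplicative (MM-type) updates; NumPy is assumed to be available, and the rank and iteration count are arbitrary:

```python
import numpy as np

def nmf(V, rank, n_iter=500, eps=1e-9, seed=0):
    """Factor V ≈ W @ H with W, H >= 0 via multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, rank))
    H = rng.random((rank, n))
    for _ in range(n_iter):
        # Each update is element-wise, so it parallelizes trivially, and it
        # never changes sign, so nonnegativity is preserved.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(1)
V = rng.random((30, 4)) @ rng.random((4, 20))  # exactly rank-4 nonnegative data
W, H = nmf(V, rank=4)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(round(float(err), 3))  # small relative reconstruction error
```

Every entry of `W` and `H` is updated independently from a few shared matrix products, which is exactly the many-small-tasks-over-limited-data pattern the paper identifies as GPU-friendly.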
Optimal Sets of Projections of High-Dimensional Data.
Lehmann, Dirk J; Theisel, Holger
2016-01-01
Finding good projections of n-dimensional datasets into a 2D visualization domain is one of the most important problems in Information Visualization. Users are interested in getting maximal insight into the data by exploring a minimal number of projections. However, if the number is too small or improper projections are used, then important data patterns might be overlooked. We propose a data-driven approach to find minimal sets of projections that uniquely show certain data patterns. For this we introduce a dissimilarity measure of data projections that discards affine transformations of projections and prevents repetitions of the same data patterns. Based on this, we provide complete data tours of at most n/2 projections. Furthermore, we propose optimal paths of projection matrices for an interactive data exploration. We illustrate our technique with a set of state-of-the-art real high-dimensional benchmark datasets.
Class prediction for high-dimensional class-imbalanced data
2010-01-01
Background The goal of class prediction studies is to develop rules to accurately predict the class membership of new samples. The rules are derived using the values of the variables available for each subject: the main characteristic of high-dimensional data is that the number of variables greatly exceeds the number of samples. Frequently the classifiers are developed using class-imbalanced data, i.e., data sets where the number of samples in each class is not equal. Standard classification methods used on class-imbalanced data often produce classifiers that do not accurately predict the minority class; the prediction is biased towards the majority class. In this paper we investigate if the high-dimensionality poses additional challenges when dealing with class-imbalanced prediction. We evaluate the performance of six types of classifiers on class-imbalanced data, using simulated data and a publicly available data set from a breast cancer gene-expression microarray study. We also investigate the effectiveness of some strategies that are available to overcome the effect of class imbalance. Results Our results show that the evaluated classifiers are highly sensitive to class imbalance and that variable selection introduces an additional bias towards classification into the majority class. Most new samples are assigned to the majority class from the training set, unless the difference between the classes is very large. As a consequence, the class-specific predictive accuracies differ considerably. When the class imbalance is not too severe, down-sizing and asymmetric bagging embedding variable selection work well, while over-sampling does not. Variable normalization can further worsen the performance of the classifiers. Conclusions Our results show that matching the prevalence of the classes in training and test set does not guarantee good performance of classifiers and that the problems related to classification with class-imbalanced data are exacerbated when
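Of the remedies evaluated, down-sizing is the simplest to sketch: train on a balanced subsample of the majority class. The nearest-centroid classifier and the one-dimensional Gaussian classes below are illustrative stand-ins, not the classifiers or data from the study:

```python
import random

def nearest_centroid_fit(X, y):
    """Mean feature vector per class."""
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    return {c: [sum(col) / len(rows) for col in zip(*rows)]
            for c, rows in by_class.items()}

def predict(centroids, x):
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], x)))

def downsample(X, y, rng):
    """Keep n_min examples of every class, n_min = smallest class size."""
    classes = sorted(set(y))
    n_min = min(y.count(c) for c in classes)
    Xb, yb = [], []
    for c in classes:
        idx = [i for i, yi in enumerate(y) if yi == c]
        for i in rng.sample(idx, n_min):
            Xb.append(X[i])
            yb.append(y[i])
    return Xb, yb

rng = random.Random(0)
# 180 majority vs 20 minority samples from overlapping Gaussian classes.
X = [[rng.gauss(0, 1)] for _ in range(180)] + [[rng.gauss(1.5, 1)] for _ in range(20)]
y = [0] * 180 + [1] * 20
Xb, yb = downsample(X, y, rng)
centroids = nearest_centroid_fit(Xb, yb)
minority_acc = sum(predict(centroids, [rng.gauss(1.5, 1)]) == 1
                   for _ in range(200)) / 200
print(minority_acc)  # balanced training keeps minority recall above chance
```

In the high-dimensional setting the paper studies, the same subsampling would be combined with variable selection done inside each balanced training set, since selection on imbalanced data was itself found to bias toward the majority class.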
NASA Astrophysics Data System (ADS)
Heggie, D.; Hut, P.
2003-10-01
The book focuses on N = 10^6 for two main reasons: first, direct numerical integrations of N-body systems are beginning to approach this threshold, and second, globular star clusters provide remarkably accurate physical instantiations of the idealized N-body problem with N = 10^5 - 10^6. The authors are distinguished contributors to the study of star-cluster dynamics and the gravitational N-body problem. The book contains lucid and concise descriptions of most of the important tools in the subject, with only a modest bias towards the authors' own interests. These tools include the two-body relaxation approximation, the Vlasov and Fokker-Planck equations, regularization of close encounters, conducting fluid models, Hill's approximation, Heggie's law for binary star evolution, symplectic integration algorithms, Liapunov exponents, and so on. The book also provides an up-to-date description of the principal processes that drive the evolution of idealized N-body systems - two-body relaxation, mass segregation, escape, core collapse and core bounce, binary star hardening, gravothermal oscillations - as well as additional processes such as stellar collisions and tidal shocks that affect real star clusters but not idealized N-body systems. In a relatively short (300 pages plus appendices) book such as this, many topics have to be omitted. The reader who is hoping to learn about the phenomenology of star clusters will be disappointed, as the description of their properties is limited to only a page of text; there is also almost no discussion of other, equally interesting N-body systems such as galaxies (N ≈ 10^6 - 10^12), open clusters (N ≈ 10^2 - 10^4), planetary systems, or the star clusters surrounding black holes that are found in the centres of most galaxies. All of these omissions are defensible decisions.
Less defensible is the uneven set of references in the text; for example, nowhere is the reader informed that the classic predecessor to this work was Spitzer's 1987 monograph.
High dimensional decision dilemmas in climate models
NASA Astrophysics Data System (ADS)
Bracco, A.; Neelin, J. D.; Luo, H.; McWilliams, J. C.; Meyerson, J. E.
2013-10-01
An important source of uncertainty in climate models is linked to the calibration of model parameters. Interest in systematic and automated parameter optimization procedures stems from the desire to improve the model climatology and to quantify the average sensitivity associated with potential changes in the climate system. Building on the smoothness of the response of an atmospheric general circulation model (AGCM) to changes in four adjustable parameters, Neelin et al. (2010) used a quadratic metamodel to objectively calibrate the AGCM. The metamodel accurately estimates global spatial averages of common fields of climatic interest, from precipitation to low- and high-level winds, and from temperature at various levels to sea level pressure and geopotential height, while providing a computationally cheap strategy to explore the influence of parameter settings. Here, guided by the metamodel, the ambiguities or dilemmas related to the decision-making process in relation to model sensitivity and optimization are examined. Simulations of current climate are subject to considerable regional-scale biases. Those biases may vary substantially depending on the climate variable considered and/or on the performance metric adopted. Common dilemmas are associated with model revisions yielding improvement in one field, regional pattern, or season, but degradation in another, or improvement in the model climatology but degradation in the representation of interannual variability. Challenges are posed to the modeler by the high dimensionality of the model output fields and by the large number of adjustable parameters. The use of the metamodel in the optimization strategy helps visualize trade-offs at a regional level, e.g., how mismatches between sensitivity and error spatial fields yield regional errors under minimization of global objective functions.
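The quadratic metamodel idea amounts to fitting a second-order response surface to a small number of model runs and then exploring or optimizing the cheap surrogate instead of the AGCM. A one-parameter sketch with a toy response function (all names and data are illustrative, not from the study):

```python
def fit_quadratic(ps, ys):
    """Least-squares fit of y ~ c0 + c1*p + c2*p^2 via the normal equations."""
    A = [[1.0, p, p * p] for p in ps]
    ATA = [[sum(A[k][i] * A[k][j] for k in range(len(A))) for j in range(3)]
           for i in range(3)]
    ATy = [sum(A[k][i] * ys[k] for k in range(len(A))) for i in range(3)]
    # Solve the 3x3 system by Gauss-Jordan elimination.
    M = [row + [b] for row, b in zip(ATA, ATy)]
    for i in range(3):
        piv = M[i][i]
        M[i] = [v / piv for v in M[i]]
        for j in range(3):
            if j != i:
                M[j] = [vj - M[j][i] * vi for vj, vi in zip(M[j], M[i])]
    return [M[i][3] for i in range(3)]

# Toy "model climatology error" sampled at a few parameter settings.
ps = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [2.0 * (p - 1.2) ** 2 + 0.3 for p in ps]   # true quadratic response
c0, c1, c2 = fit_quadratic(ps, ys)
p_opt = -c1 / (2 * c2)      # metamodel's optimal parameter setting
print(round(p_opt, 3))      # 1.2, the toy optimum
```

Once fitted, the surrogate can be evaluated millions of times at negligible cost, which is what makes the regional trade-off visualizations described above affordable.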
Kennedy, Angie C; Adams, Adrienne E
2016-04-01
Using a cluster analysis approach with a sample of 205 young mothers recruited from community sites in an urban Midwestern setting, we examined the effects of cumulative violence exposure (community violence exposure, witnessing intimate partner violence, physical abuse by a caregiver, and sexual victimization, all with onset prior to age 13) on school participation, as mediated by attention and behavior problems in school. We identified five clusters of cumulative exposure, and found that the HiAll cluster (high levels of exposure to all four types) consistently fared the worst, with significantly higher attention and behavior problems, and lower school participation, in comparison with the LoAll cluster (low levels of exposure to all types). Behavior problems were a significant mediator of the effects of cumulative violence exposure on school participation, but attention problems were not.
Solution of relativistic quantum optics problems using clusters of graphical processing units
Gordon, D.F.; Hafizi, B.; Helle, M.H.
2014-06-15
Numerical solution of relativistic quantum optics problems requires high performance computing due to the rapid oscillations in a relativistic wavefunction. Clusters of graphical processing units are used to accelerate the computation of a time dependent relativistic wavefunction in an arbitrary external potential. The stationary states in a Coulomb potential and uniform magnetic field are determined analytically and numerically, so that they can be used as initial conditions in fully time dependent calculations. Relativistic energy levels in extreme magnetic fields are recovered as a means of validation. The relativistic ionization rate is computed for an ion illuminated by a laser field near the usual barrier suppression threshold, and the ionizing wavefunction is displayed.
Anomaly Detection in Large Sets of High-Dimensional Symbol Sequences
NASA Technical Reports Server (NTRS)
Budalakoti, Suratna; Srivastava, Ashok N.; Akella, Ram; Turkov, Eugene
2006-01-01
This paper addresses the problem of detecting and describing anomalies in large sets of high-dimensional symbol sequences. The approach taken uses unsupervised clustering of sequences using the normalized longest common subsequence (LCS) as a similarity measure, followed by detailed analysis of outliers to detect anomalies. As the LCS measure is expensive to compute, the first part of the paper discusses existing algorithms, such as the Hunt-Szymanski algorithm, that have low time-complexity. We then discuss why these algorithms often do not work well in practice and present a new hybrid algorithm for computing the LCS that, in our tests, outperforms the Hunt-Szymanski algorithm by a factor of five. The second part of the paper presents new algorithms for outlier analysis that provide comprehensible indicators as to why a particular sequence was deemed to be an outlier. The algorithms provide a coherent description to an analyst of the anomalies in the sequence, compared to more normal sequences. The algorithms we present are general and domain-independent, so we discuss applications in related areas such as anomaly detection.
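The normalized LCS similarity at the core of the clustering step can be sketched with the textbook dynamic program; normalizing by the longer sequence's length is one common convention and may differ from the paper's exact choice:

```python
def lcs_length(a, b):
    """Classic O(len(a)*len(b)) dynamic program for the
    longest common subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def normalized_lcs(a, b):
    """Similarity in [0, 1]: LCS length divided by the longer
    sequence's length (one common normalization; an assumption here)."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))

print(lcs_length("ABCBDAB", "BDCABA"))   # 4 (e.g., "BCBA")
print(normalized_lcs("ABCD", "ABCD"))    # 1.0
```

The quadratic cost of this dynamic program is exactly why the paper spends its first part on faster algorithms such as Hunt-Szymanski and the proposed hybrid.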
Reinke, R.E.
1991-01-01
Clustering is the problem of finding a good organization for data. Because there are many kinds of clustering problems, and because there are many possible clusterings for any data set, clustering programs use knowledge and assumptions about individual problems to make clustering tractable. Cluster-analysis techniques allow knowledge to be expressed in the choice of a pairwise distance measure and in the choice of clustering algorithm. Conceptual clustering adds knowledge and preferences about cluster descriptions. In this study the author describes symbolic clustering, which adds representation choice to the set of ways a data analyst can use problem-specific knowledge. He develops an informal model for symbolic clustering, and uses it to suggest where and how knowledge can be expressed in clustering. A language for creating symbolic clusters, based on the model, was developed and tested on three real clustering problems. The study concludes with a discussion of the implications of the model and the results for clustering in general.
Optimal control problem for the three-sector economic model of a cluster
NASA Astrophysics Data System (ADS)
Murzabekov, Zainel; Aipanov, Shamshi; Usubalieva, Saltanat
2016-08-01
The problem of optimal control for the three-sector economic model of a cluster is considered. The task is to determine the optimal distribution of investment and manpower in moving the system from a given initial state to a desired final state. To solve the optimal control problem with finite-horizon planning, in the case of fixed trajectory endpoints and box constraints, a method of Lagrange multipliers of a special type is used. This approach allows the desired control to be represented as a synthesis control, depending on the state of the system and the current time. The results of numerical calculations for an instance of the three-sector model of the economy show the effectiveness of the proposed method.
Visualization of High-Dimensionality Data Using Virtual Reality
NASA Astrophysics Data System (ADS)
Djorgovski, S. G.; Donalek, C.; Davidoff, S.; Lombeyda, S.
2015-12-01
An effective visualization of complex and high-dimensionality data sets is now a critical bottleneck on the path from data to discovery in all fields. Visual pattern recognition is the bridge between human intuition and understanding, and the quantitative content of the data and the relationships present there (correlations, outliers, clustering, etc.). We are developing a novel platform for visualization of complex, multi-dimensional data, using immersive virtual reality (VR), that leverages the recent rapid developments in the availability of commodity hardware and development software. VR immersion has been shown to significantly increase effective visual perception and intuition compared to traditional flat-screen tools. This makes it easier to perceive higher-dimensional spaces, giving immersive VR an advantage over traditional visualization methods for the visual exploration of complex data. Immersive VR also offers a natural way for a collaborative visual exploration of data, with multiple users interacting with each other and with their data in the same perceptive data space.
Solving the inverse Ising problem by mean-field methods in a clustered phase space with many states.
Decelle, Aurélien; Ricci-Tersenghi, Federico
2016-07-01
In this work we explain how to properly use mean-field methods to solve the inverse Ising problem when the phase space is clustered, that is, many states are present. The clustering of the phase space can occur for many reasons, e.g., when a system undergoes a phase transition, but also when data are collected in different regimes (e.g., quiescent and spiking regimes in neural networks). Mean-field methods for the inverse Ising problem are typically used without taking into account the eventual clustered structure of the input configurations and may lead to very poor inference (e.g., in the low-temperature phase of the Curie-Weiss model). In this work we explain how to modify mean-field approaches when the phase space is clustered and we illustrate the effectiveness of our method on different clustered structures (low-temperature phases of Curie-Weiss and Hopfield models). PMID:27575082
DIAMONDS: high-DImensional And multi-MOdal NesteD Sampling
NASA Astrophysics Data System (ADS)
Corsaro, Enrico; De Ridder, Joris
2014-10-01
DIAMONDS (high-DImensional And multi-MOdal NesteD Sampling) provides Bayesian parameter estimation and model comparison by means of the nested sampling Monte Carlo (NSMC) algorithm, an efficient and powerful method well suited to high-dimensional and multi-modal problems; it can be used for any application involving Bayesian parameter estimation and/or model selection. Developed in C++11, DIAMONDS is structured in classes for flexibility and configurability. Any new model, likelihood, and prior PDFs can be defined and implemented upon a basic template.
Bias-corrected diagonal discriminant rules for high-dimensional classification.
Huang, Song; Tong, Tiejun; Zhao, Hongyu
2010-12-01
Diagonal discriminant rules have been successfully used for high-dimensional classification problems, but suffer from the serious drawback of biased discriminant scores. In this article, we propose improved diagonal discriminant rules with bias-corrected discriminant scores for high-dimensional classification. We show that the proposed discriminant scores dominate the standard ones under the quadratic loss function. Analytical results on why the bias-corrected rules can potentially improve the prediction accuracy are also provided. Finally, we demonstrate the improvement of the proposed rules over the original ones through extensive simulation studies and real case studies.
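For reference, the standard diagonal discriminant rule (diagonal LDA with equal priors) that the bias-corrected scores improve upon can be sketched as follows; this is the baseline rule only, not the paper's corrected version, and all names are illustrative:

```python
def dlda_fit(X, y):
    """Diagonal LDA: per-class means plus one pooled per-feature
    variance (the covariance matrix is assumed diagonal)."""
    classes = sorted(set(y))
    p = len(X[0])
    means = {}
    for c in classes:
        rows = [x for x, yi in zip(X, y) if yi == c]
        means[c] = [sum(r[j] for r in rows) / len(rows) for j in range(p)]
    # Pooled within-class variance per feature.
    n = len(X)
    var = [0.0] * p
    for x, yi in zip(X, y):
        for j in range(p):
            var[j] += (x[j] - means[yi][j]) ** 2
    var = [v / (n - len(classes)) for v in var]
    return means, var

def dlda_predict(x, means, var):
    """Assign x to the class with the smallest diagonal discriminant score."""
    def score(c):
        return sum((x[j] - means[c][j]) ** 2 / var[j] for j in range(len(x)))
    return min(means, key=score)

X = [[0.0, 0.1], [0.2, -0.1], [2.0, 2.1], [1.8, 1.9]]
y = [0, 0, 1, 1]
means, var = dlda_fit(X, y)
print(dlda_predict([0.1, 0.0], means, var))  # 0
print(dlda_predict([1.9, 2.0], means, var))  # 1
```

The bias the paper corrects arises because, with far more features than samples, the plug-in means and variances above are noisy estimates, which systematically distorts the scores.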
Engineering two-photon high-dimensional states through quantum interference.
Zhang, Yingwen; Roux, Filippus S; Konrad, Thomas; Agnew, Megan; Leach, Jonathan; Forbes, Andrew
2016-02-01
Many protocols in quantum science, for example, linear optical quantum computing, require access to large-scale entangled quantum states. Such systems can be realized through many-particle qubits, but this approach often suffers from scalability problems. An alternative strategy is to consider a lesser number of particles that exist in high-dimensional states. The spatial modes of light are one such candidate that provides access to high-dimensional quantum states, and thus they increase the storage and processing potential of quantum information systems. We demonstrate the controlled engineering of two-photon high-dimensional states entangled in their orbital angular momentum through Hong-Ou-Mandel interference. We prepare a large range of high-dimensional entangled states and implement precise quantum state filtering. We characterize the full quantum state before and after the filter, and are thus able to determine that only the antisymmetric component of the initial state remains. This work paves the way for high-dimensional processing and communication of multiphoton quantum states, for example, in teleportation beyond qubits. PMID:26933685
NASA Astrophysics Data System (ADS)
Li, Weixuan; Lin, Guang; Li, Bing
2016-09-01
Many uncertainty quantification (UQ) approaches suffer from the curse of dimensionality, that is, their computational costs become intractable for problems involving a large number of uncertainty parameters. In these situations, classic Monte Carlo (MC) sampling often remains the method of choice because its convergence rate O(n^(-1/2)), where n is the required number of model simulations, does not depend on the dimension of the problem. However, many high-dimensional UQ problems are intrinsically low-dimensional, because the variation of the quantity of interest (QoI) is often caused by only a few latent parameters varying within a low-dimensional subspace, known as the sufficient dimension reduction (SDR) subspace in the statistics literature. Motivated by this observation, we propose two inverse regression-based UQ algorithms (IRUQ) for high-dimensional problems. Both algorithms use inverse regression to convert the original high-dimensional problem to a low-dimensional one, which is then efficiently solved by building a response surface for the reduced model, for example via the polynomial chaos expansion. The first algorithm, for situations where an exact SDR subspace exists, is proved to converge at rate O(n^(-1)), hence much faster than MC. The second algorithm, which does not require an exact SDR, employs the reduced model as a control variate to reduce the error of the MC estimate. The accuracy gain could still be significant, depending on how well the reduced model approximates the original high-dimensional one. IRUQ also provides several additional practical advantages: it is non-intrusive; it does not require computing the high-dimensional gradient of the QoI; and it reports an error bar so the user knows how reliable the result is.
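The control-variate construction in the second algorithm can be illustrated with a toy one-dimensional "full model" f and a cheap surrogate g whose expectation is known exactly, as it would be for a polynomial chaos surrogate (all names and the toy functions are illustrative):

```python
import math
import random

def mc_with_control_variate(f, g, Eg, sample, n, seed=0):
    """MC estimate of E[f] using g as a control variate:
    E[f] ~ mean(f - g) + Eg, so the variance is driven by f - g."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n):
        x = sample(rng)
        acc += f(x) - g(x)
    return acc / n + Eg

f = lambda x: math.exp(x)             # "full model"; E[f] over U(0,1) = e - 1
g = lambda x: 1 + x + 0.5 * x * x     # surrogate; E[g] over U(0,1) = 1 + 1/2 + 1/6
Eg = 1 + 0.5 + 1 / 6
est = mc_with_control_variate(f, g, Eg, lambda r: r.random(), 2000)
print(abs(est - (math.e - 1)) < 0.01)  # True: residual f - g is small
```

Because only the small residual f - g is sampled, the estimator's variance shrinks in proportion to how well the reduced model tracks the full one, which is exactly the accuracy-gain argument made above.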
Collaborative Care Outcomes for Pediatric Behavioral Health Problems: A Cluster Randomized Trial
Campo, John; Kilbourne, Amy M.; Hart, Jonathan; Sakolsky, Dara; Wisniewski, Stephen
2014-01-01
OBJECTIVE: To assess the efficacy of collaborative care for behavior problems, attention-deficit/hyperactivity disorder (ADHD), and anxiety in pediatric primary care (Doctor Office Collaborative Care; DOCC). METHODS: Children and their caregivers participated from 8 pediatric practices that were cluster randomized to DOCC (n = 160) or enhanced usual care (EUC; n = 161). In DOCC, a care manager delivered a personalized, evidence-based intervention. EUC patients received psychoeducation and a facilitated specialty care referral. Care processes measures were collected after the 6-month intervention period. Family outcome measures included the Vanderbilt ADHD Diagnostic Parent Rating Scale, Parenting Stress Index-Short Form, Individualized Goal Attainment Ratings, and Clinical Global Impression-Improvement Scale. Most measures were collected at baseline, and 6-, 12-, and 18-month assessments. Provider outcome measures examined perceived treatment change, efficacy, and obstacles, and practice climate. RESULTS: DOCC (versus EUC) was associated with higher rates of treatment initiation (99.4% vs 54.2%; P < .001) and completion (76.6% vs 11.6%, P < .001), improvement in behavior problems, hyperactivity, and internalizing problems (P < .05 to .01), and parental stress (P < .05–.001), remission in behavior and internalizing problems (P < .01, .05), goal improvement (P < .05 to .001), treatment response (P < .05), and consumer satisfaction (P < .05). DOCC pediatricians reported greater perceived practice change, efficacy, and skill use to treat ADHD (P < .05 to .01). CONCLUSIONS: Implementing a collaborative care intervention for behavior problems in community pediatric practices is feasible and broadly effective, supporting the utility of integrated behavioral health care services. PMID:24664093
Avoiding common pitfalls when clustering biological data.
Ronan, Tom; Qi, Zhijie; Naegle, Kristen M
2016-01-01
Clustering is an unsupervised learning method, which groups data points based on similarity, and is used to reveal the underlying structure of data. This computational approach is essential to understanding and visualizing the complex data that are acquired in high-throughput multidimensional biological experiments. Clustering enables researchers to make biological inferences for further experiments. Although a powerful technique, inappropriate application can lead biological researchers to waste resources and time in experimental follow-up. We review common pitfalls identified from the published molecular biology literature and present methods to avoid them. Commonly encountered pitfalls relate to the high-dimensional nature of biological data from high-throughput experiments, the failure to consider more than one clustering method for a given problem, and the difficulty in determining whether clustering has produced meaningful results. We present concrete examples of problems and solutions (clustering results) in the form of toy problems and real biological data for these issues. We also discuss ensemble clustering as an easy-to-implement method that enables the exploration of multiple clustering solutions and improves robustness of clustering solutions. Increased awareness of common clustering pitfalls will help researchers avoid overinterpreting or misinterpreting the results and missing valuable insights when clustering biological data. PMID:27303057
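The ensemble-clustering idea discussed above can be sketched via a co-association matrix: run a base clusterer several times and record, for each pair of points, the fraction of runs in which they land in the same cluster. The toy k-means below and all names are illustrative, not from the review:

```python
import random

def kmeans(points, k, rng, iters=20):
    """Minimal k-means with random initialization -- illustration only."""
    centers = rng.sample(points, k)
    for _ in range(iters):
        labels = [min(range(k),
                      key=lambda c: sum((p - q) ** 2 for p, q in zip(pt, centers[c])))
                  for pt in points]
        for c in range(k):
            members = [pt for pt, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

def coassociation(points, k, runs=20, seed=0):
    """Fraction of runs in which each pair of points shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    co = [[0.0] * n for _ in range(n)]
    for _ in range(runs):
        labels = kmeans(points, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    co[i][j] += 1 / runs
    return co

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
co = coassociation(pts, 2)
print(co[0][1], co[0][2])   # 1.0 0.0 -- stable pairs agree across all runs
```

Pairs with high co-association are robustly clustered together; pairs with intermediate values flag exactly the instability the review warns against overinterpreting.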
NASA Astrophysics Data System (ADS)
Hill, C.
2008-12-01
Low cost graphics cards today use many relatively simple compute cores to deliver memory bandwidth of more than 100 GB/s and theoretical floating point performance of more than 500 GFlop/s. Right now this performance is, however, only accessible to highly parallel algorithm implementations that (i) can use a hundred or more 32-bit floating point, concurrently executing cores, (ii) can work with graphics memory that resides on the graphics card side of the graphics bus, and (iii) can be partially expressed in a language that can be compiled by a graphics programming tool. In this talk we describe our experiences implementing a complete, but relatively simple, time dependent shallow-water equations simulation targeting a cluster of 30 computers, each hosting one graphics card. The implementation takes into account the considerations (i), (ii) and (iii) listed previously. We code our algorithm as a series of numerical kernels. Each kernel is designed to be executed by multiple threads of a single process. Kernels are passed memory blocks to compute over, which can be persistent blocks of memory on a graphics card. Each kernel is individually implemented using the NVidia CUDA language but driven from a higher level supervisory code that is almost identical to a standard model driver. The supervisory code controls the overall simulation timestepping, but is written to minimize data transfer between main memory and graphics memory (a massive performance bottleneck on current systems). Using the recipe outlined we can boost the performance of our cluster by nearly an order of magnitude, relative to the same algorithm executing only on the cluster's CPUs. Achieving this performance boost requires that many threads are available to each graphics processor for execution within each numerical kernel and that the simulation's working set of data can fit into the graphics card memory. As we describe, this puts interesting upper and lower bounds on the problem sizes.
Lee, Jenny Hyunjung; McDonnell, Kevin T; Zelenyuk, Alla; Imre, Dan; Mueller, Klaus
2014-03-01
Although the Euclidean distance does well in measuring data distances within high-dimensional clusters, it does poorly when it comes to gauging intercluster distances. This significantly impacts the quality of global, low-dimensional space embedding procedures such as the popular multidimensional scaling (MDS), where one can often observe nonintuitive layouts. We were inspired by the perceptual processes evoked in the method of parallel coordinates, which enables users to visually aggregate the data by the patterns the polylines exhibit across the dimension axes. We call the path of such a polyline its structure and suggest a metric that captures this structure directly in high-dimensional space. This allows us to better gauge the distances of spatially distant data constellations and so achieve data aggregations in MDS plots that are more cognizant of existing high-dimensional structure similarities. Our biscale framework distinguishes far-distances from near-distances. The coarser scale uses the structural similarity metric to separate data aggregates obtained by prior classification or clustering, while the finer scale employs the appropriate Euclidean distance.
Hyperspherical Sparse Approximation Techniques for High-Dimensional Discontinuity Detection
Zhang, Guannan; Webster, Clayton G.; Gunzburger, Max; Burkardt, John
2016-08-04
This work proposes a hyperspherical sparse approximation framework for detecting jump discontinuities in functions in high-dimensional spaces. The need for a novel approach results from the theoretical and computational inefficiencies of well-known approaches, such as adaptive sparse grids, for discontinuity detection. Our approach constructs the hyperspherical coordinate representation of the discontinuity surface of a function. Then sparse approximations of the transformed function are built in the hyperspherical coordinate system, with values at each point estimated by solving a one-dimensional discontinuity detection problem. Due to the smoothness of the hypersurface, the new technique can identify jump discontinuities with significantly reduced computational cost, compared to existing methods. Several approaches are used to approximate the transformed discontinuity surface in the hyperspherical system, including adaptive sparse grid and radial basis function interpolation, discrete least squares projection, and compressed sensing approximation. Moreover, hierarchical acceleration techniques are also incorporated to further reduce the overall complexity. In conclusion, rigorous complexity analyses of the new methods are provided, as are several numerical examples that illustrate the effectiveness of our approach.
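The hyperspherical coordinate representation the framework relies on is the standard Cartesian-to-hyperspherical transform; a minimal sketch (not the paper's implementation):

```python
import math

def to_hyperspherical(x):
    """Convert an n-dimensional Cartesian point to hyperspherical
    coordinates (r, phi_1, ..., phi_{n-1}), standard convention."""
    r = math.sqrt(sum(v * v for v in x))
    angles = []
    for i in range(len(x) - 1):
        tail = math.sqrt(sum(v * v for v in x[i:]))
        if i < len(x) - 2:
            angles.append(math.acos(x[i] / tail) if tail else 0.0)
        else:
            # Last angle covers the full circle, hence atan2.
            angles.append(math.atan2(x[-1], x[-2]))
    return r, angles

r, angles = to_hyperspherical([1.0, 1.0, 0.0])
print(round(r, 4))          # 1.4142
print(round(angles[0], 4))  # 0.7854 (45 degrees from the first axis)
```

In this representation a smooth discontinuity surface around a point becomes a function r(phi_1, ..., phi_{n-1}) of the angles alone, which is what makes the sparse approximation in the angular variables feasible.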
HYPOTHESIS TESTING FOR HIGH-DIMENSIONAL SPARSE BINARY REGRESSION
Mukherjee, Rajarshi; Pillai, Natesh S.; Lin, Xihong
2015-01-01
In this paper, we study the detection boundary for minimax hypothesis testing in the context of high-dimensional, sparse binary regression models. Motivated by genetic sequencing association studies for rare variant effects, we investigate the complexity of the hypothesis testing problem when the design matrix is sparse. We observe a new phenomenon in the behavior of the detection boundary which does not occur in the case of Gaussian linear regression. We derive the detection boundary as a function of two components: a design matrix sparsity index and signal strength, each of which is a function of the sparsity of the alternative. For any alternative, if the design matrix sparsity index is too high, any test is asymptotically powerless irrespective of the magnitude of signal strength. For binary design matrices with a sparsity index that is not too high, our results are parallel to those in the Gaussian case. In this context, we derive detection boundaries for both dense and sparse regimes. For the dense regime, we show that the generalized likelihood ratio is rate optimal; for the sparse regime, we propose an extended Higher Criticism Test and show it is rate optimal and sharp. We illustrate the finite sample properties of the theoretical results using simulation studies. PMID:26246645
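The standard Higher Criticism statistic of Donoho and Jin, which the paper extends, can be sketched directly from its definition; the extended variant proposed in the paper is not reproduced here:

```python
import math

def higher_criticism(pvals):
    """Donoho-Jin Higher Criticism statistic:
    max_i sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i)))
    over the sorted p-values p_(1) <= ... <= p_(n)."""
    n = len(pvals)
    best = -math.inf
    for i, p in enumerate(sorted(pvals), 1):
        if 0 < p < 1:
            hc = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1 - p))
            best = max(best, hc)
    return best

# Under the null, p-values look uniform and HC stays moderate;
# a few very small p-values (a sparse signal) push it up sharply.
null_p = [(i + 0.5) / 100 for i in range(100)]              # uniform grid
signal_p = [1e-6] * 5 + [(i + 0.5) / 100 for i in range(95)]
print(higher_criticism(signal_p) > higher_criticism(null_p))  # True
```

HC is well suited to the sparse regime precisely because it aggregates evidence from the few most extreme p-values rather than from the bulk.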
Visual exploration of high-dimensional data through subspace analysis and dynamic projections
Liu, S.; Wang, B.; Thiagarajan, J. J.; Bremer, P. -T.; Pascucci, V.
2015-06-01
Here, we introduce a novel interactive framework for visualizing and exploring high-dimensional datasets based on subspace analysis and dynamic projections. We assume the high-dimensional dataset can be represented by a mixture of low-dimensional linear subspaces with mixed dimensions, and provide a method to reliably estimate the intrinsic dimension and linear basis of each subspace extracted from the subspace clustering. Subsequently, we use these bases to define unique 2D linear projections as viewpoints from which to visualize the data. To understand the relationships among the different projections and to discover hidden patterns, we connect these projections through dynamic projections that create smooth animated transitions between pairs of projections. We introduce the view transition graph, which provides flexible navigation among these projections to facilitate an intuitive exploration. Finally, we provide detailed comparisons with related systems, and use real-world examples to demonstrate the novelty and usability of our proposed framework.
Bit-table based biclustering and frequent closed itemset mining in high-dimensional binary data.
Király, András; Gyenesei, Attila; Abonyi, János
2014-01-01
During the last decade, various algorithms have been developed and proposed for discovering overlapping clusters in high-dimensional data. The two most prominent application fields of this research, proposed independently, are frequent itemset mining (developed for market basket data) and biclustering (applied to gene expression data analysis). The common limitation of both methodologies is their limited applicability to very large binary data sets. In this paper we propose a novel and efficient method to find both frequent closed itemsets and biclusters in high-dimensional binary data. The method is based on simple but very powerful matrix and vector multiplication approaches that ensure all patterns can be discovered quickly. The proposed algorithm has been implemented in the commonly used MATLAB environment and is freely available to researchers.
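The bit-table idea of computing itemset supports and closures with boolean matrix and vector operations can be sketched as follows. This is an illustrative reconstruction in Python/NumPy rather than the authors' MATLAB implementation; the function names are assumptions.

```python
import numpy as np

def support(B, items):
    """Support of an itemset: number of bit-table rows containing all `items`."""
    return int(np.all(B[:, items], axis=1).sum())

def closure(B, items):
    """Closure of an itemset: all items shared by every supporting transaction.

    An itemset is closed exactly when it equals its own closure.
    """
    rows = np.all(B[:, items], axis=1)
    if not rows.any():
        return tuple(sorted(items))
    return tuple(np.flatnonzero(B[rows].all(axis=0)))

# Toy bit-table: 4 transactions x 4 items.
B = np.array([[1, 1, 0, 1],
              [1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]], dtype=bool)
support_01 = support(B, [0, 1])   # transactions containing items 0 and 1
closed_0 = closure(B, [0])        # item 1 co-occurs with item 0 everywhere
```

Each closed itemset together with its supporting rows is exactly a maximal all-ones submatrix of the bit-table, which is the link to biclustering that the abstract exploits.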
The cosmological lithium problem outside the Galaxy: the Sagittarius globular cluster M54
NASA Astrophysics Data System (ADS)
Mucciarelli, A.; Salaris, M.; Bonifacio, P.; Monaco, L.; Villanova, S.
2014-10-01
The cosmological Li problem is the observed discrepancy between the Li abundance (A(Li)) measured in Galactic dwarf, old and metal-poor stars (traditionally assumed to be equal to the initial value A(Li)0), and that predicted by standard big bang nucleosynthesis (BBN) calculations (A(Li)BBN). Here, we attack the Li problem by considering an alternative diagnostic, namely the surface Li abundance of red giant branch stars that in a colour-magnitude diagram populate the region between the completion of the first dredge-up and the red giant branch bump. We obtained high-resolution spectra with the FLAMES facility at the Very Large Telescope for a sample of red giants in the globular cluster M54, belonging to the Sagittarius dwarf galaxy. We obtain A(Li) = 0.93 ± 0.11 dex, which translates, after taking into account the dilution due to the dredge-up, to initial abundances (A(Li)0) in the range 2.35-2.29 dex, depending on whether or not atomic diffusion is considered. This is the first measurement of Li in the Sagittarius galaxy and the most distant estimate of A(Li)0 in old stars obtained so far. The A(Li)0 estimated in M54 is lower by ~0.35 dex than A(Li)BBN, hence incompatible at a level of ~3σ. Our result shows that this discrepancy is a universal problem concerning both the Milky Way and extragalactic systems. Either modifications of BBN calculations, or a combination of atomic diffusion plus a suitably tuned additional mixing during the main sequence, need to be invoked to solve the discrepancy.
Convex Clustering: An Attractive Alternative to Hierarchical Clustering
Chen, Gary K.; Chi, Eric C.; Ranola, John Michael O.; Lange, Kenneth
2015-01-01
The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering operates on multiple scales simultaneously. This is essential, for instance, in transcriptome data, where one may be interested in making qualitative inferences about how lower-order relationships like gene modules lead to higher-order relationships like pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. The current paper derives and tests a novel proximal distance algorithm for minimizing the objective function of convex clustering. The algorithm separates parameters, accommodates missing data, and supports prior information on relationships. Our program CONVEXCLUSTER incorporating the algorithm is implemented on ATI and nVidia graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems. CONVEXCLUSTER can be freely downloaded from the UCLA Human Genetics web site at http://www.genetics.ucla.edu/software/ PMID:25965340
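The convex clustering objective that CONVEXCLUSTER minimizes can be written down directly: a data-fidelity term plus a fusion penalty whose weight γ controls how many centroids coalesce. The sketch below evaluates this objective only; the paper's proximal distance algorithm for minimizing it (with missing data and prior information) is not reproduced.

```python
import numpy as np

def convex_clustering_objective(U, X, gamma, w=None):
    """Convex clustering objective:
       0.5 * sum_i ||x_i - u_i||^2 + gamma * sum_{i<j} w_ij * ||u_i - u_j||

    U holds one candidate centroid per data point; as gamma grows, the
    minimizer fuses centroids, tracing out a clustering path. The uniform
    default weights w_ij = 1 are an assumption for illustration.
    """
    n = X.shape[0]
    fit = 0.5 * np.sum((X - U) ** 2)
    fuse = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            wij = 1.0 if w is None else w[i, j]
            fuse += wij * np.linalg.norm(U[i] - U[j])
    return fit + gamma * fuse

# For two points at distance 2 and gamma = 1, a fully fused solution
# has lower objective than leaving each centroid at its own point.
X_toy = np.array([[0.0, 0.0], [2.0, 0.0]])
U_fused = np.array([[1.0, 0.0], [1.0, 0.0]])
obj_fused = convex_clustering_objective(U_fused, X_toy, gamma=1.0)
obj_identity = convex_clustering_objective(X_toy, X_toy, gamma=1.0)
```

Sweeping γ from 0 upward yields the solution path mentioned in the abstract, from n singleton clusters down to a single cluster.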
ClusterSculptor: Software for Expert-Steered Classification of Single Particle Mass Spectra
Zelenyuk, Alla; Imre, Dan G.; Nam, Eun Ju; Han, Yiping; Mueller, Klaus
2008-08-01
To take full advantage of the vast amount of highly detailed data acquired by single particle mass spectrometers requires that the data be organized according to some rules that have the potential to be insightful. Most commonly, statistical tools are used to cluster the individual particle mass spectra on the basis of their similarity. Cluster analysis is a powerful strategy for the exploration of high-dimensional data in the absence of a-priori hypotheses or data classification models, and the results of cluster analysis can then be used to form such models. More often than not, when examining the data clustering results we find that many clusters contain particles of different types and that many particles of one type end up in a number of separate clusters. Our experience with cluster analysis shows that we have a vast amount of non-compiled knowledge and intuition that should be brought to bear in this effort. We present new software we call ClusterSculptor that provides a comprehensive and intuitive framework to aid scientists in data classification. ClusterSculptor uses k-means as the overall clustering engine, but allows its parameters to be tuned interactively, based on a non-distorted compact visual presentation of the inherent characteristics of the data in high-dimensional space. ClusterSculptor provides all the tools necessary for a high-dimensional activity we call cluster sculpting. ClusterSculptor is designed to be coupled to SpectraMiner, our data mining and visualization software package. The data are first visualized with SpectraMiner, and identified problems are exported to ClusterSculptor, where the user steers the reclassification and recombination of clusters of tens of thousands of particle mass spectra in real time. The resulting sculpted clusters can then be imported back into SpectraMiner. Here we demonstrate greatly improved single-particle chemical speciation by applying this new tool to a number of atmospheric particle types.
Lee, Hyun Jung; McDonnell, Kevin T.; Zelenyuk, Alla; Imre, D.; Mueller, Klaus
2014-03-01
Although the Euclidean distance does well in measuring data distances within high-dimensional clusters, it does poorly when it comes to gauging inter-cluster distances. This significantly impacts the quality of global, low-dimensional space embedding procedures such as the popular multi-dimensional scaling (MDS) where one can often observe non-intuitive layouts. We were inspired by the perceptual processes evoked in the method of parallel coordinates which enables users to visually aggregate the data by the patterns the polylines exhibit across the dimension axes. We call the path of such a polyline its structure and suggest a metric that captures this structure directly in high-dimensional space. This allows us to better gauge the distances of spatially distant data constellations and so achieve data aggregations in MDS plots that are more cognizant of existing high-dimensional structure similarities. Our MDS plots also exhibit similar visual relationships as the method of parallel coordinates which is often used alongside to visualize the high-dimensional data in raw form. We then cast our metric into a bi-scale framework which distinguishes far-distances from near-distances. The coarser scale uses the structural similarity metric to separate data aggregates obtained by prior classification or clustering, while the finer scale employs the appropriate Euclidean distance.
An Overview of Air Pollution Problem in Megacities and City Clusters in China
NASA Astrophysics Data System (ADS)
Tang, X.
2007-05-01
China has experienced rapid economic growth over the last twenty years. City clusters, which consist of one or several mega cities in close vicinity together with many satellite cities and towns, are playing a leading role in Chinese economic growth owing to their collective economic capacity and interdependency. However, accompanying the economic boom, population growth, and increased energy consumption, air quality has degraded over the past two decades. Air pollution in those areas is characterized by the concurrent occurrence of high concentrations of multiple primary pollutants, leading to a complex secondary pollution problem. After decades-long efforts to control air pollution, both the government and scientific communities have realized that controlling regional-scale air pollution requires regional efforts. Field experiments covering regions such as the Pearl River Delta and Beijing with its surrounding areas are critical to understanding the chemical and physical processes leading to the formation of regional-scale air pollution. In order to formulate policy suggestions for air quality attainment during the 2008 Beijing Olympic Games and to propose objectives for air quality attainment in Beijing in 2010, CAREBEIJING (Campaigns of Air Quality Research in Beijing and Surrounding Region) was organized by Peking University in 2006 to assess the current air pollution situation of the region and to identify the transport and transformation processes through which the surrounding area affects air quality in Beijing. With the same aim of understanding the chemical and physical processes at the regional scale, fall and summer campaigns were carried out in the Pearl River Delta in 2004 and 2006. More than 16 domestic and foreign institutions were involved in these campaigns. The background, current status, problems, and some results of these campaigns will be introduced in this presentation.
Choosing ℓp norms in high-dimensional spaces based on hub analysis
Flexer, Arthur; Schnitzer, Dominik
2015-01-01
The hubness phenomenon is a recently discovered aspect of the curse of dimensionality. Hub objects have a small distance to an exceptionally large number of data points while anti-hubs lie far from all other data points. A closely related problem is the concentration of distances in high-dimensional spaces. Previous work has already advocated the use of fractional ℓp norms instead of the ubiquitous Euclidean norm to avoid the negative effects of distance concentration. However, which exact fractional norm to use is a largely unsolved problem. The contribution of this work is an empirical analysis of the relation of different ℓp norms and hubness. We propose an unsupervised approach for choosing an ℓp norm which minimizes hubs while simultaneously maximizing nearest neighbor classification. Our approach is evaluated on seven high-dimensional data sets and compared to three approaches that re-scale distances to avoid hubness. PMID:26640321
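Hubness is commonly quantified as the skewness of the k-occurrence distribution N_k, the number of times each point appears among the k nearest neighbors of other points; evaluating it under different ℓp norms is the core measurement behind the paper's analysis. The sketch below is illustrative only; the paper's unsupervised norm-selection criterion additionally folds in nearest neighbor classification accuracy, which is omitted here.

```python
import numpy as np

def k_occurrence_skewness(X, k=5, p=2.0):
    """Skewness of the k-occurrence counts N_k under the l_p (Minkowski) norm.

    Positive skewness indicates hubness: a few points are neighbors of
    exceptionally many others. Fractional p < 1 gives a fractional "norm"
    (a quasi-norm), which previous work advocates against concentration.
    """
    n = X.shape[0]
    # Pairwise l_p distances via broadcasting (fine for small n).
    D = np.sum(np.abs(X[:, None, :] - X[None, :, :]) ** p, axis=2) ** (1.0 / p)
    np.fill_diagonal(D, np.inf)          # exclude self-neighbors
    nn = np.argsort(D, axis=1)[:, :k]    # k nearest neighbors of each point
    counts = np.bincount(nn.ravel(), minlength=n).astype(float)
    m, s = counts.mean(), counts.std()
    return ((counts - m) ** 3).mean() / (s ** 3 + 1e-12)

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
skew_l2 = k_occurrence_skewness(X_demo, k=5, p=2.0)
skew_half = k_occurrence_skewness(X_demo, k=5, p=0.5)
```

Comparing such skewness values across a grid of p values, per data set, is the kind of hub analysis the abstract describes.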
Autonomous mental development in high dimensional context and action spaces.
Joshi, Ameet; Weng, Juyang
2003-01-01
Autonomous Mental Development (AMD) of robots opened a new paradigm for developing machine intelligence using neural-network-type techniques, and it fundamentally changed the way an intelligent machine is developed, from manual to autonomous. The work presented here is part of the SAIL (Self-Organizing Autonomous Incremental Learner) project, which deals with the autonomous development of a humanoid robot with vision, audition, manipulation and locomotion. The major issue addressed here is the challenge of a high-dimensional action space (5-10 dimensions) in addition to the high-dimensional context space (hundreds to thousands and beyond) typically required by an AMD machine. This is the first work that studies a high-dimensional (numeric) action space in conjunction with a high-dimensional perception (context state) space under the AMD mode. Two new learning algorithms, Direct Update on Direction Cosines (DUDC) and High-Dimensional Conjugate Gradient Search (HCGS), are developed, implemented and tested. The convergence properties of both algorithms and their targeted applications are discussed. Autonomous learning of speech production under reinforcement learning is studied as an example. PMID:12850025
An Effective Parameter Screening Strategy for High Dimensional Watershed Models
NASA Astrophysics Data System (ADS)
Khare, Y. P.; Martinez, C. J.; Munoz-Carpena, R.
2014-12-01
Watershed simulation models can assess the impacts of natural and anthropogenic disturbances on natural systems. These models have become important tools for tackling a range of water resources problems through their implementation in the formulation and evaluation of Best Management Practices, Total Maximum Daily Loads, and Basin Management Action Plans. For accurate application, watershed models need to be thoroughly evaluated through global uncertainty and sensitivity analyses (UA/SA). However, due to the high dimensionality of these models, such evaluation becomes extremely time- and resource-consuming. Parameter screening, the qualitative separation of important parameters, has been suggested as an essential step before applying rigorous evaluation techniques such as the Sobol' and Fourier Amplitude Sensitivity Test (FAST) methods in the UA/SA framework. The method of elementary effects (EE) (Morris, 1991) is one of the most widely used screening methodologies. Some of the common parameter sampling strategies for EE, e.g. Optimized Trajectories [OT] (Campolongo et al., 2007) and Modified Optimized Trajectories [MOT] (Ruano et al., 2012), suffer from inconsistencies in the generated parameter distributions, infeasible sample generation times, etc. In this work, we have formulated a new parameter sampling strategy for parameter screening, Sampling for Uniformity (SU), which is based on the principles of uniformity of the generated parameter distributions and the spread of the parameter sample. A rigorous multi-criteria evaluation (time, distribution, spread and screening efficiency) of OT, MOT, and SU indicated that SU is superior to the other sampling strategies. Comparison of the EE-based parameter importance rankings with those of Sobol' helped to quantify the qualitative nature of the EE parameter screening approach, reinforcing the fact that one should use EE only to reduce the resource burden required by FAST/Sobol' analyses, not to replace them.
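The elementary effects computation at the heart of Morris screening can be sketched for a single trajectory: consecutive points differ in exactly one coordinate, and each one-at-a-time move yields a finite-difference effect for that parameter. This is a minimal sketch of the screening step only; trajectory sampling strategies such as OT, MOT, or the proposed SU are not reproduced.

```python
import numpy as np

def elementary_effects(f, trajectory):
    """Elementary effects of f along one Morris trajectory.

    `trajectory` is a (k+1) x k array in which consecutive rows differ in
    exactly one coordinate. Returns one finite-difference effect per
    parameter. In a full screening study, effects from many trajectories
    are aggregated into mean-absolute-effect and standard-deviation scores.
    """
    k = trajectory.shape[1]
    effects = np.empty(k)
    for step in range(k):
        a, b = trajectory[step], trajectory[step + 1]
        j = int(np.flatnonzero(a != b)[0])            # the coordinate that moved
        effects[j] = (f(b) - f(a)) / (b[j] - a[j])    # finite-difference effect
    return effects

# Toy model: parameter 0 is influential, parameter 1 is nearly inert.
traj = np.array([[0.0, 0.0],
                 [0.5, 0.0],
                 [0.5, 0.5]])
effects = elementary_effects(lambda x: 3.0 * x[0] + 0.1 * x[1], traj)
```

Parameters whose aggregated effects are small across trajectories are screened out before the expensive FAST/Sobol' analyses.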
A quasi-Newton acceleration for high-dimensional optimization algorithms.
Zhou, Hua; Alexander, David; Lange, Kenneth
2011-01-01
In many statistical problems, maximum likelihood estimation by an EM or MM algorithm suffers from excruciatingly slow convergence. This tendency limits the application of these algorithms to modern high-dimensional problems in data mining, genomics, and imaging. Unfortunately, most existing acceleration techniques are ill-suited to complicated models involving large numbers of parameters. The squared iterative methods (SQUAREM) recently proposed by Varadhan and Roland constitute one notable exception. This paper presents a new quasi-Newton acceleration scheme that requires only modest increments in computation per iteration and overall storage and rivals or surpasses the performance of SQUAREM on several representative test problems.
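The SQUAREM scheme cited in the abstract as the notable existing accelerator can be sketched in a few lines: from two applications of the EM/MM fixed-point map, it extrapolates with a step length chosen from the two successive increments. This is a sketch of the basic Varadhan-Roland step under simplifying assumptions (no monotonicity safeguard); the paper's own quasi-Newton scheme is more involved and is not reproduced here.

```python
import numpy as np

def squarem_step(theta, fixed_point):
    """One basic SQUAREM acceleration step for a fixed-point map F.

    Uses r = F(theta) - theta and v = (F(F(theta)) - F(theta)) - r with
    step length alpha = -||r|| / ||v||, then extrapolates
    theta - 2*alpha*r + alpha^2 * v.
    """
    t1 = fixed_point(theta)
    t2 = fixed_point(t1)
    r = t1 - theta
    v = (t2 - t1) - r
    alpha = -np.sqrt((r @ r) / (v @ v + 1e-30))
    return theta - 2.0 * alpha * r + alpha ** 2 * v

# Demo: for the linear contraction F(x) = 0.5 x + 1 (fixed point x* = 2),
# one SQUAREM step from 0 lands on the fixed point exactly.
accelerated = squarem_step(np.array([0.0]), lambda x: 0.5 * x + 1.0)
```

Plain iteration of this map halves the error each step; the accelerated step removes it in one, which illustrates why such schemes help slowly converging EM/MM algorithms.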
Predicting Time Series from Short-Term High-Dimensional Data
NASA Astrophysics Data System (ADS)
Ma, Huanfei; Zhou, Tianshou; Aihara, Kazuyuki; Chen, Luonan
The prediction of future values of time series is a challenging task in many fields. In particular, making prediction based on short-term data is believed to be difficult. Here, we propose a method to predict systems' low-dimensional dynamics from high-dimensional but short-term data. Intuitively, it can be considered as a transformation from the inter-variable information of the observed high-dimensional data into the corresponding low-dimensional but long-term data, thereby equivalent to prediction of time series data. Technically, this method can be viewed as an inverse implementation of delayed embedding reconstruction. Both methods and algorithms are developed. To demonstrate the effectiveness of the theoretical result, benchmark examples and real-world problems from various fields are studied.
Altiparmak, Fatih; Ferhatosmanoglu, Hakan; Erdal, Selnur; Trost, Donald C
2006-04-01
An effective analysis of clinical trials data involves analyzing different types of data, such as heterogeneous and high dimensional time series data. Current time series analysis methods generally assume that the series at hand are long enough to apply statistical techniques to them. Other ideal-case assumptions are that data are collected at equal-length intervals and that the lengths of the time series being compared are equal to each other. However, these assumptions are not valid for many real data sets, especially for clinical trials data sets. In addition, the data sources differ from each other, the data are heterogeneous, and the sensitivity of the experiments varies by source. Approaches for mining time series data need to be revisited, keeping this wide range of requirements in mind. In this paper, we propose a novel approach for information mining that involves two major steps: applying a data mining algorithm over homogeneous subsets of data, and identifying common or distinct patterns over the information gathered in the first step. Our approach is implemented specifically for heterogeneous and high dimensional time series clinical trials data. Using this framework, we propose a new way of utilizing frequent itemset mining, as well as clustering and declustering techniques with novel distance metrics for measuring similarity between time series data. By clustering the data, we find groups of analytes (substances in blood) that are most strongly correlated. Most of the already-known relationships are verified by the clinical panels, and, in addition, we identify novel groups that need further biomedical analysis. A slight modification to our algorithm results in an effective declustering of high dimensional time series data, which is then used for "feature selection." Using industry-sponsored clinical trials data sets, we are able to identify a small set of analytes that effectively models the state of normal health.
NASA Astrophysics Data System (ADS)
Bastian, Nate; Cabrera-Ziri, Ivan; Salaris, Maurizio
2015-05-01
A number of stellar sources have been advocated as the origin of the enriched material required to explain the abundance anomalies seen in ancient globular clusters (GCs). Most studies to date have compared the yields from potential sources [asymptotic giant branch stars (AGBs), fast rotating massive stars (FRMS), high-mass interacting binaries (IBs), and very massive stars (VMS)] with observations of specific elements that are observed to vary from star to star in GCs, focusing on extreme GCs such as NGC 2808, which display large He variations. However, a consistency check between the results of fitting extreme cases and the requirements of more typical clusters has rarely been done. Such a check is particularly timely given the constraints on He abundances in GCs now available. Here, we show that all of the popular enrichment sources fail to reproduce the observed trends in GCs, focusing primarily on Na, O and He. In particular, we show that any model that can fit clusters like NGC 2808 will necessarily fail (by construction) to fit more typical clusters like 47 Tuc or NGC 288. All sources severely overproduce He for most clusters. Additionally, given the large differences in He spreads between clusters, but similar spreads observed in Na-O, only sources with a large degree of stochasticity in the resulting yields will be able to fit the observations. We conclude that no enrichment source put forward so far (AGBs, FRMS, IBs, VMS, or combinations thereof) is consistent with the observations of GCs. Finally, the observed trends of increasing [N/Fe] and He spread with increasing cluster mass cannot be resolved within a self-enrichment framework without further exacerbating the mass-budget problem.
High-Dimensional Statistical Learning: Roots, Justifications, and Potential Machineries
Zollanvari, Amin
2015-01-01
High-dimensional data generally refer to data in which the number of variables is larger than the sample size. Analyzing such datasets poses great challenges for classical statistical learning because the finite-sample performance of methods developed within classical statistical learning does not live up to classical asymptotic premises in which the sample size unboundedly grows for a fixed dimensionality of observations. Much work has been done in developing mathematical–statistical techniques for analyzing high-dimensional data. Despite remarkable progress in this field, many practitioners still utilize classical methods for analyzing such datasets. This state of affairs can be attributed, in part, to a lack of knowledge and, in part, to the ready-to-use computational and statistical software packages that are well developed for classical techniques. Moreover, many scientists working in a specific field of high-dimensional statistical learning are either not aware of other existing machineries in the field or are not willing to try them out. The primary goal in this work is to bring together various machineries of high-dimensional analysis, give an overview of the important results, and present the operating conditions upon which they are grounded. When appropriate, readers are referred to relevant review articles for more information on a specific subject. PMID:27081307
Querying Patterns in High-Dimensional Heterogenous Datasets
ERIC Educational Resources Information Center
Singh, Vishwakarma
2012-01-01
The recent technological advancements have led to the availability of a plethora of heterogenous datasets, e.g., images tagged with geo-location and descriptive keywords. An object in these datasets is described by a set of high-dimensional feature vectors. For example, a keyword-tagged image is represented by a color-histogram and a…
NASA Astrophysics Data System (ADS)
Verkhovtseva, É. T.; Gospodarev, I. A.; Grishaev, A. V.; Kovalenko, S. I.; Solnyshkin, D. D.; Syrkin, E. S.; Feodos'ev, S. B.
2003-05-01
The dependence of the rms amplitudes of atoms in free clusters of solidified inert gases on the cluster size is investigated theoretically and experimentally. Free clusters are produced by homogeneous nucleation in an adiabatically expanding supersonic stream. Electron diffraction is used to measure the rms amplitudes of the atoms; the Jacobi-matrix method is used for the theoretical calculations. A series of distinguishing features of the atomic dynamics of microclusters was found. This was necessary to determine the character of the formation and the stability conditions of the crystal structure. It was shown that for clusters consisting of fewer than N ≈ 10³ atoms, as the cluster size decreases, the rms amplitudes grow much more rapidly than expected from the increase in the specific contribution of the surface. It is also established that the fcc structure of a free cluster, as a rule, contains twinning defects (nuclei of an hcp phase). One reason for the appearance of such defects is the so-called vertex instability (anomalously large oscillation amplitudes) of the atoms in the coordination spheres.
ERIC Educational Resources Information Center
Brusco, Michael J.
2007-01-01
The study of human performance on discrete optimization problems has a considerable history that spans various disciplines. The two most widely studied problems are the Euclidean traveling salesperson problem and the quadratic assignment problem. The purpose of this paper is to outline a program of study for the measurement of human performance on…
Partially supervised speaker clustering.
Tang, Hao; Chu, Stephen Mingyu; Hasegawa-Johnson, Mark; Huang, Thomas S
2012-05-01
Content-based multimedia indexing, retrieval, and processing as well as multimedia databases demand the structuring of the media content (image, audio, video, text, etc.), one significant goal being to associate the identity of the content to the individual segments of the signals. In this paper, we specifically address the problem of speaker clustering, the task of assigning every speech utterance in an audio stream to its speaker. We offer a complete treatment of the idea of partially supervised speaker clustering, which refers to the use of our prior knowledge of speakers in general to assist the unsupervised speaker clustering process. By means of an independent training data set, we encode the prior knowledge at the various stages of the speaker clustering pipeline via 1) learning a speaker-discriminative acoustic feature transformation, 2) learning a universal speaker prior model, and 3) learning a discriminative speaker subspace, or equivalently, a speaker-discriminative distance metric. We study the directional scattering property of the Gaussian mixture model (GMM) mean supervector representation of utterances in the high-dimensional space, and advocate exploiting this property by using the cosine distance metric instead of the Euclidean distance metric for speaker clustering in the GMM mean supervector space. We propose to perform discriminant analysis based on the cosine distance metric, which leads to a novel distance metric learning algorithm, linear spherical discriminant analysis (LSDA). We show that the proposed LSDA formulation can be systematically solved within the elegant graph embedding general dimensionality reduction framework. Our speaker clustering experiments on the GALE database clearly indicate that 1) our speaker clustering methods based on the GMM mean supervector representation and vector-based distance metrics outperform traditional speaker clustering methods based on the "bag of acoustic features" representation and statistical
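The directional-scattering argument for preferring the cosine distance over the Euclidean distance in supervector space can be made concrete with a small sketch: two supervectors pointing in the same direction but with different magnitudes are identical under the cosine metric yet far apart in the Euclidean sense. This is an illustration of the metric choice only, not of the paper's full pipeline.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity: depends only on direction, not magnitude."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two "supervectors" along the same direction, different magnitudes:
# zero cosine distance, but Euclidean distance sees them as far apart.
d_parallel = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
d_orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

If utterances of one speaker scatter mainly along a common direction, as the directional scattering property suggests, the cosine metric groups them together regardless of magnitude variation, which is what motivates the spherical (cosine-based) discriminant analysis the abstract proposes.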
Improved shrunken centroid classifiers for high-dimensional class-imbalanced data
2013-01-01
Background: PAM, a nearest shrunken centroid method (NSC), is a popular classification method for high-dimensional data. ALP and AHP are NSC algorithms that were proposed to improve upon PAM. The NSC methods base their classification rules on shrunken centroids; in practice the amount of shrinkage is estimated by minimizing the overall cross-validated (CV) error rate. Results: We show that when data are class-imbalanced the three NSC classifiers are biased towards the majority class. The bias is larger when the number of variables or the class imbalance is larger and/or the differences between classes are smaller. To diminish the class-imbalance problem of the NSC classifiers we propose to estimate the amount of shrinkage by maximizing the CV geometric mean of the class-specific predictive accuracies (g-means). Conclusions: The results obtained on simulated and real high-dimensional class-imbalanced data show that our approach outperforms the currently used strategy based on the minimization of the overall error rate when NSC classifiers are biased towards the majority class. The number of variables included in the NSC classifiers when using our approach is much smaller than with the original approach. This result is supported by experiments on simulated and real high-dimensional class-imbalanced data. PMID:23433084
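The shrunken-centroid idea behind PAM can be sketched in a bare-bones form: class centroids are soft-thresholded toward the overall centroid, zeroing out variables that do not separate the classes, and a sample is assigned to the nearest shrunken centroid. This sketch omits the within-class standard error scaling used by PAM and the g-means-based choice of the shrinkage amount proposed in the paper; the parameter name `delta` is an assumption.

```python
import numpy as np

def shrink_centroids(X, y, delta):
    """Soft-threshold each class centroid toward the overall centroid.

    Variables whose class-vs-overall difference is below `delta` are
    shrunk to zero difference, performing implicit variable selection.
    """
    overall = X.mean(axis=0)
    shrunken = {}
    for c in np.unique(y):
        d = X[y == c].mean(axis=0) - overall
        d = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)  # soft threshold
        shrunken[c] = overall + d
    return shrunken

def classify(x, shrunken):
    """Assign x to the class whose shrunken centroid is nearest."""
    return min(shrunken, key=lambda c: np.linalg.norm(x - shrunken[c]))

# Toy data: two classes separated along the first variable only.
Xd = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
yd = np.array([0, 0, 1, 1])
cents = shrink_centroids(Xd, yd, delta=0.2)
```

The class-imbalance bias studied in the paper arises when one class dominates the training data, pulling the CV-chosen `delta`, and hence the decision rule, toward the majority class.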
Xue, Zhong; Shen, Dinggang; Davatzikos, Christos
2006-10-01
This paper proposes a 3D statistical model aiming at effectively capturing statistics of high-dimensional deformation fields and then uses this prior knowledge to constrain 3D image warping. The conventional statistical shape model methods, such as the active shape model (ASM), have been very successful in modeling shape variability. However, their accuracy and effectiveness typically drop dramatically in high-dimensionality problems involving relatively small training datasets, which is customary in 3D and 4D medical imaging applications. The proposed statistical model of deformation (SMD) uses wavelet-based decompositions coupled with PCA in each wavelet band, in order to more accurately estimate the pdf of high-dimensional deformation fields, when a relatively small number of training samples are available. SMD is further used as statistical prior to regularize the deformation field in an SMD-constrained deformable registration framework. As a result, more robust registration results are obtained relative to using generic smoothness constraints on deformation fields, such as Laplacian-based regularization. In experiments, we first illustrate the performance of SMD in representing the variability of deformation fields and then evaluate the performance of the SMD-constrained registration, via comparing a hierarchical volumetric image registration algorithm, HAMMER, with its SMD-constrained version, referred to as SMD+HAMMER. This SMD-constrained deformable registration framework can potentially incorporate various registration algorithms to improve robustness and stability via statistical shape constraints.
Lee, Jenny Hyunjung; McDonnell, Kevin T; Zelenyuk, Alla; Imre, Dan; Mueller, Klaus
2013-07-11
Although the Euclidean distance does well in measuring data distances within high-dimensional clusters, it does poorly when it comes to gauging inter-cluster distances. This significantly impacts the quality of global, low-dimensional space embedding procedures such as the popular multi-dimensional scaling (MDS) where one can often observe non-intuitive layouts. We were inspired by the perceptual processes evoked in the method of parallel coordinates which enables users to visually aggregate the data by the patterns the polylines exhibit across the dimension axes. We call the path of such a polyline its structure and suggest a metric that captures this structure directly in high-dimensional space. This allows us to better gauge the distances of spatially distant data constellations and so achieve data aggregations in MDS plots that are more cognizant of existing high-dimensional structure similarities. Our bi-scale framework distinguishes far-distances from near-distances. The coarser scale uses the structural similarity metric to separate data aggregates obtained by prior classification or clustering, while the finer scale employs the appropriate Euclidean distance.
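A toy rendering of the "structure" idea; the metric below is a simplified stand-in for the authors' structural similarity measure, comparing only the segment slopes of each parallel-coordinates polyline so that two polylines with the same zig-zag pattern are structurally close even when they are far apart in Euclidean terms.

```python
def structure(point):
    """Segment slopes of the parallel-coordinates polyline for one point."""
    return [b - a for a, b in zip(point, point[1:])]

def structural_distance(p, q):
    """Compare the *shape* of two polylines, ignoring vertical offset."""
    sp, sq = structure(p), structure(q)
    return sum(abs(a - b) for a, b in zip(sp, sq))

a = [1.0, 3.0, 2.0, 4.0]
b = [5.0, 7.0, 6.0, 8.0]   # same zig-zag pattern as a, shifted upward
c = [1.0, 1.0, 1.0, 1.0]   # flat polyline, different pattern

print(structural_distance(a, b))  # 0.0: identical structure
print(structural_distance(a, c))  # > 0: different structure
```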
Ma Xiang; Zabaras, Nicholas
2010-05-20
A computational methodology is developed to address the solution of high-dimensional stochastic problems. It utilizes the high-dimensional model representation (HDMR) technique in the stochastic space to represent the model output as a finite hierarchical correlated function expansion in terms of the stochastic inputs, starting from lower-order to higher-order component functions. HDMR is efficient at capturing the high-dimensional input-output relationship such that the behavior of many physical systems can be modeled to good accuracy by only the first few lower-order terms. An adaptive version of HDMR is also developed to automatically detect the important dimensions and construct higher-order terms using only the important dimensions. The newly developed adaptive sparse grid collocation (ASGC) method is incorporated into HDMR to solve the resulting sub-problems. By integrating HDMR and ASGC, it is computationally possible to construct a low-dimensional stochastic reduced-order model of the high-dimensional stochastic problem and easily perform various statistical analyses on the output. Several numerical examples involving elementary mathematical functions and fluid mechanics problems are considered to illustrate the proposed method. The cases examined show that the method provides accurate results for stochastic dimensionality as high as 500, even with large input variability. The efficiency of the proposed method is examined by comparison with Monte Carlo (MC) simulation.
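A minimal sketch of a first-order HDMR-style (cut-HDMR) expansion, using an invented test function rather than the stochastic problems of the paper: the model output is approximated by a constant term plus one-dimensional component functions anchored at a reference point, so a nearly additive function is captured well while the weak interaction term is missed.

```python
def cut_hdmr_first_order(f, ref, x):
    """First-order cut-HDMR: f(x) ~ f(ref) + sum_i [f(ref with x_i) - f(ref)]."""
    f0 = f(ref)
    total = f0
    for i in range(len(ref)):
        y = list(ref)
        y[i] = x[i]           # vary only the i-th input, others at the anchor
        total += f(y) - f0
    return total

# An additive function with a weak pairwise interaction, in 3 inputs.
def f(x):
    return x[0] + 2 * x[1] + 3 * x[2] + 0.01 * x[0] * x[1]

ref = [0.0, 0.0, 0.0]
x = [1.0, 1.0, 1.0]
print(f(x))                             # 6.01 (exact)
print(cut_hdmr_first_order(f, ref, x))  # 6.0 -- only the interaction is missed
```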
High-dimensional statistical inference: From vector to matrix
NASA Astrophysics Data System (ADS)
Zhang, Anru
Statistical inference for sparse signals or low-rank matrices in high-dimensional settings is of significant interest in a range of contemporary applications. It has attracted significant recent attention in many fields including statistics, applied mathematics and electrical engineering. In this thesis, we consider several problems, including sparse signal recovery (compressed sensing under restricted isometry) and low-rank matrix recovery (matrix recovery via rank-one projections and structured matrix completion). The first part of the thesis discusses compressed sensing and affine rank minimization in both noiseless and noisy cases and establishes sharp restricted isometry conditions for sparse signal and low-rank matrix recovery. The analysis relies on a key technical tool which represents points in a polytope by convex combinations of sparse vectors. The technique is elementary yet leads to sharp results. It is shown that, in compressed sensing, δ_k^A < 1/3, δ_k^A + θ_{k,k}^A < 1, or δ_{tk}^A < √((t-1)/t) for any given constant t ≥ 4/3 guarantees the exact recovery of all k-sparse signals in the noiseless case through constrained ℓ1 minimization; similarly, in affine rank minimization, δ_r^M < 1/3, δ_r^M + θ_{r,r}^M < 1, or δ_{tr}^M < √((t-1)/t) ensures the exact reconstruction of all matrices with rank at most r in the noiseless case via constrained nuclear norm minimization. Moreover, for any ε > 0, δ_k^A < 1/3 + ε, δ_k^A + θ_{k,k}^A < 1 + ε, or δ_{tk}^A < √((t-1)/t) + ε is not sufficient to guarantee the exact recovery of all k-sparse signals for large k. A similar result also holds for matrix recovery. In addition, the conditions δ_k^A < 1/3, δ_k^A + θ_{k,k}^A < 1, δ_{tk}^A < √((t-1)/t) and δ_r^M < 1/3, δ_r^M + θ_{r,r}^M < 1, δ_{tr}^M < √((t-1)/t) are also shown to be sufficient, respectively, for stable recovery of approximately sparse signals and low-rank matrices in the noisy case
High-dimensional modulation for coherent optical communications systems.
Millar, David S; Koike-Akino, Toshiaki; Arık, Sercan Ö; Kojima, Keisuke; Parsons, Kieran; Yoshida, Tsuyoshi; Sugihara, Takashi
2014-04-01
In this paper, we examine the performance of several modulation formats in more than four dimensions for coherent optical communications systems. We compare two high-dimensional modulation design methodologies, based on spherical cutting of lattices and on block coding of a 'base constellation' of binary phase shift keying (BPSK) on each dimension. The performance of modulation formats generated with these methodologies is analyzed in the asymptotic signal-to-noise ratio regime and for an additive white Gaussian noise (AWGN) channel. We then study the application of both types of high-dimensional modulation formats to standard single-mode fiber (SSMF) transmission systems. For modulation with spectral efficiencies comparable to dual-polarization (DP-) BPSK, polarization-switched quaternary phase shift keying (PS-QPSK) and DP-QPSK, we demonstrate SNR gains of up to 3 dB, 0.9 dB and 1 dB, respectively, at a BER of 10^-3.
Exploration, visualization, and preprocessing of high-dimensional data.
Wu, Zhijin; Wu, Zhiqiang
2010-01-01
The rapid advances in biotechnology have given rise to a variety of high-dimensional data. Many of these data, including DNA microarray data, mass spectrometry protein data, and high-throughput screening (HTS) assay data, are generated by complex experimental procedures that involve multiple steps such as sample extraction, purification and/or amplification, labeling, fragmentation, and detection. Therefore, the quantity of interest is not directly obtained, and a number of preprocessing procedures are necessary to convert the raw data into a format with biological relevance. This also makes exploratory data analysis and visualization essential steps to detect possible defects, anomalies, or distortions of the data, to test underlying assumptions, and thus to ensure data quality. The characteristics of the data structure revealed in exploratory analysis often motivate decisions in preprocessing procedures to produce data suitable for downstream analysis. In this chapter we review the common techniques for exploring and visualizing high-dimensional data and introduce the basic preprocessing procedures.
Why neurons mix: high dimensionality for higher cognition.
Fusi, Stefano; Miller, Earl K; Rigotti, Mattia
2016-04-01
Neurons often respond to diverse combinations of task-relevant variables. This form of mixed selectivity plays an important computational role which is related to the dimensionality of the neural representations: high-dimensional representations with mixed selectivity allow a simple linear readout to generate a huge number of different potential responses. In contrast, neural representations based on highly specialized neurons are low dimensional and they preclude a linear readout from generating several responses that depend on multiple task-relevant variables. Here we review the conceptual and theoretical framework that explains the importance of mixed selectivity and the experimental evidence that recorded neural representations are high-dimensional. We end by discussing the implications for the design of future experiments. PMID:26851755
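The computational point about mixed selectivity can be illustrated with the classic XOR example (a toy construction, not taken from the review): with only "pure" neurons, each encoding one task variable, no linear readout can produce an XOR-like response, but adding a single nonlinearly mixed neuron makes the same readout possible.

```python
from itertools import product

# Task conditions: two binary task variables a, b; target response = XOR(a, b).
conditions = [(0, 0), (0, 1), (1, 0), (1, 1)]
target = [a ^ b for a, b in conditions]

# "Pure selectivity": each neuron encodes exactly one task variable.
pure = [[a, b] for a, b in conditions]
# "Mixed selectivity": add a neuron with a conjunctive (nonlinear) response.
mixed = [[a, b, a * b] for a, b in conditions]

def linearly_separable(X, y):
    """Brute-force search for a linear readout w.x + c whose sign matches y."""
    grid = [i / 2 for i in range(-8, 9)]  # candidate weights -4.0 .. 4.0
    d = len(X[0])
    for w_and_c in product(grid, repeat=d + 1):
        w, c = w_and_c[:d], w_and_c[-1]
        out = [1 if sum(wi * xi for wi, xi in zip(w, x)) + c > 0 else 0
               for x in X]
        if out == y:
            return True
    return False

print(linearly_separable(pure, target))   # False: XOR needs mixing
print(linearly_separable(mixed, target))  # True: e.g. readout a + b - 2ab
```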
Detector-decoy high-dimensional quantum key distribution.
Bao, Haize; Bao, Wansu; Wang, Yang; Chen, Ruike; Zhou, Chun; Jiang, Musheng; Li, Hongwei
2016-09-19
The decoy-state high-dimensional quantum key distribution provides a practical, secure way to share more private information with high photon-information efficiency. In this paper, based on the detector-decoy method, we propose a detector-decoy high-dimensional quantum key distribution protocol. Employing threshold detectors and a variable attenuator, we can guarantee security under Gaussian collective attacks with much simpler operations in practical implementation. By numerical evaluation, we show that without varying the source intensity, our protocol performs much better than the one-decoy-state protocol and as well as the two-decoy-state protocol in the infinite-size regime. In the finite-size regime, our protocol can achieve better results. Especially when the detector efficiency is lower, the advantage of the detector-decoy method becomes more prominent. PMID:27661950
Quantum Teleportation of High-dimensional Atomic Momenta State
NASA Astrophysics Data System (ADS)
Qurban, Misbah; Abbas, Tasawar; Rameez-ul-Islam; Ikram, Manzoor
2016-06-01
Atomic momentum states of neutral atoms are known to be decoherence resistant and therefore present a viable solution for many quantum information tasks, including quantum teleportation. We present a systematic protocol for the teleportation of high-dimensional quantized momentum atomic states to the field state inside cavities by applying standard cavity QED techniques. The proposal can be executed under prevailing experimental scenarios.
Some Unsolved Problems, Questions, and Applications of the Brightsen Nucleon Cluster Model
NASA Astrophysics Data System (ADS)
Smarandache, Florentin
2010-10-01
The Brightsen Model is opposite to the Standard Model; it was built on John Wheeler's Resonating Group Structure Model and on Linus Pauling's Close-Packed Spheron Model. Among the Brightsen Model's predictions and applications we cite the fact that it derives the average number of prompt neutrons per fission event; it provides a theoretical way of understanding low-temperature/low-energy reactions and of approaching artificially induced fission; it predicts that forces within nucleon clusters are stronger than forces between such clusters within isotopes; and it predicts the unmatter entities inside nuclei that result from stable and neutral unions of matter and antimatter, and so on. But these predictions have to be tested in the future at the new CERN laboratory.
Cooling or Boiling? Cooling Flow Problem and MHD Instabilities in Galaxy Clusters
NASA Astrophysics Data System (ADS)
Bogdanovic, Tamara; Reynolds, C. S.; Balbus, S. A.; Parrish, I. J.
2010-03-01
In recent years our understanding of the action of thermal conduction in atmospheres such as the intracluster medium (ICM) has been undergoing a revolution. It has been realized that thermal conduction can lead to magnetohydrodynamic (MHD) instabilities at all radii in the ICM of clusters and in this way affect the evolution of their thermodynamic properties. I will describe findings based on several global models of cooling-core clusters in which we explored the role of heat conduction and the heat-flux buoyancy instability (HBI) in the evolution of these cores. Our main finding is that a cooling core in the aftermath of the HBI cannot be rescued from the cooling catastrophe by thermal conduction alone, although its action can significantly delay the catastrophic core collapse. This is because the HBI tends to wrap the lines of magnetic field onto spherical surfaces surrounding the cooling core and thereby greatly suppresses further conductive heating along the field lines. We speculate that in real clusters the central AGN, and possibly mergers, play the role of "stirrers", periodically disrupting the azimuthal field structure and allowing thermal conduction to sporadically heat the core. Support for this project is provided by NASA through Einstein Postdoctoral Fellowship Award PF9-00061 and by the National Science Foundation under grant AST0908212.
Mustanski, B.; Metzger, A.; Pine, D. S.; Kistner-Griffin, E.; Cook, E.; Wakschlag, L. S.
2013-01-01
This study illustrates the application of a latent modeling approach to genotype–phenotype relationships and gene × environment interactions, using a novel, multidimensional model of adult female problem behavior, including maternal prenatal smoking. The gene of interest is the monoamine oxidase A (MAOA) gene, which has been well studied in relation to antisocial behavior. Participants were adult women (N=192) who were sampled from a prospective pregnancy cohort of non-Hispanic, white individuals recruited from a neighborhood health clinic. Structural equation modeling was used to model a female problem behavior phenotype, which included conduct problems, substance use, impulsive-sensation seeking, interpersonal aggression, and prenatal smoking. All of the female problem behavior dimensions clustered together strongly, with the exception of prenatal smoking. A main effect of MAOA genotype and a MAOA × physical maltreatment interaction were detected with the Conduct Problems factor. Our phenotypic model showed that prenatal smoking is not simply a marker of other maternal problem behaviors. The risk variant in the MAOA main effect and interaction analyses was the high-activity MAOA genotype, which is discrepant from consensus findings in male samples. This result contributes to an emerging literature on sex-specific interaction effects for MAOA. PMID:22610759
Improving clustering by imposing network information
Gerber, Susanne; Horenko, Illia
2015-01-01
Cluster analysis is one of the most popular data analysis tools in a wide range of applied disciplines. We propose and justify a computationally efficient and straightforward-to-implement way of imposing the available information from networks/graphs (a priori available in many application areas) on a broad family of clustering methods. The introduced approach is illustrated on the problem of a noninvasive unsupervised brain signal classification. This task is faced with several challenging difficulties such as nonstationary noisy signals and a small sample size, combined with a high-dimensional feature space and huge noise-to-signal ratios. Applying this approach results in an exact unsupervised classification of very short signals, opening new possibilities for clustering methods in the area of a noninvasive brain-computer interface. PMID:26601225
Ma, Huanfei; Lin, Wei; Lai, Ying-Cheng
2013-05-01
Detecting unstable periodic orbits (UPOs) in chaotic systems based solely on time series is a fundamental but extremely challenging problem in nonlinear dynamics. Previous approaches were applicable but mostly for low-dimensional chaotic systems. We develop a framework, integrating approximation theory of neural networks and adaptive synchronization, to address the problem of time-series-based detection of UPOs in high-dimensional chaotic systems. An example of finding UPOs from the classic Mackey-Glass equation is presented.
Fiorentino, Lavinia; Rissling, Michelle; Liu, Lianqi; Ancoli-Israel, Sonia
2011-01-01
Breast cancer is the most commonly diagnosed cancer in women. Insomnia is a significant problem in breast cancer patients, affecting between 20% and 70% of newly diagnosed or recently treated cancer patients. Pain, fatigue, anxiety, and depression are also common conditions in breast cancer and often co-occur with insomnia in symptom clusters, exacerbating one another and decreasing quality of life (QOL). There have been no clinical trials of drugs for sleep in cancer. Cognitive behavioral psychotherapies, on the other hand, have shown some of the most positive results in alleviating the distressing symptoms that often accompany the breast cancer experience, but even these studies have not targeted the symptom cluster. Pharmacological as well as non-pharmacological treatments need to be explored. It might be that a combined pharmacological and behavioral treatment is most efficacious. In short, substantially more research is needed to fully understand and treat the symptom cluster of insomnia, fatigue, pain, depression and anxiety in breast cancer. PMID:22140397
Baker-Henningham, Helen; Scott, Stephen; Jones, Kelvyn; Walker, Susan
2012-01-01
Background There is an urgent need for effective, affordable interventions to prevent child mental health problems in low- and middle-income countries. Aims To determine the effects of a universal pre-school-based intervention on child conduct problems and social skills at school and at home. Method In a cluster randomised design, 24 community pre-schools in inner-city areas of Kingston, Jamaica, were randomly assigned to receive the Incredible Years Teacher Training intervention (n = 12) or to a control group (n = 12). Three children from each class with the highest levels of teacher-reported conduct problems were selected for evaluation, giving 225 children aged 3–6 years. The primary outcome was observed child behaviour at school. Secondary outcomes were child behaviour by parent and teacher report, child attendance and parents’ attitude to school. The study is registered as ISRCTN35476268. Results Children in intervention schools showed significantly reduced conduct problems (effect size (ES) = 0.42) and increased friendship skills (ES = 0.74) through observation, significant reductions to teacher-reported (ES = 0.47) and parent-reported (ES = 0.22) behaviour difficulties and increases in teacher-reported social skills (ES = 0.59) and child attendance (ES = 0.30). Benefits to parents’ attitude to school were not significant. Conclusions A low-cost, school-based intervention in a middle-income country substantially reduces child conduct problems and increases child social skills at home and at school. PMID:22500015
Hawking radiation of a high-dimensional rotating black hole
NASA Astrophysics Data System (ADS)
Ren, Zhao; Lichun, Zhang; Huaifan, Li; Yueqin, Wu
2010-01-01
We extend the classical Damour-Ruffini method and discuss the Hawking radiation spectrum of a high-dimensional rotating black hole using a tortoise coordinate transformation defined by taking the back-reaction of the radiation on the spacetime into consideration. Under the condition that energy and angular momentum are conserved, taking the self-gravitation action into account, we derive Hawking radiation spectra which satisfy the unitarity principle of quantum mechanics. It is shown that the process by which the black hole radiates particles with energy ω is a continuous tunneling process. We provide a theoretical basis for further studying the physical mechanism of black-hole radiation.
Suh, C.; Glynn, S.; Scharf, J.; Contreras, M. A.; Noufi, R.; Jones, W. B.; Biagioni, D.
2011-07-01
We demonstrate how advanced exploratory data analysis coupled to data-mining techniques can be used to scrutinize the high-dimensional data space of photovoltaics in the context of thin films of Al-doped ZnO (AZO), which are essential materials as a transparent conducting oxide (TCO) layer in CuInxGa1-xSe2 (CIGS) solar cells. AZO data space, wherein each sample is synthesized from a different process history and assessed with various characterizations, is transformed, reorganized, and visualized in order to extract optimal process conditions. The data-analysis methods used include parallel coordinates, diffusion maps, and hierarchical agglomerative clustering algorithms combined with diffusion map embedding.
MULTI-WAY BLOCKMODELS FOR ANALYZING COORDINATED HIGH-DIMENSIONAL RESPONSES
Airoldi, Edoardo M; Wang, Xiaopei; Lin, Xiaodong
2013-01-01
We consider the problem of quantifying temporal coordination between multiple high-dimensional responses. We introduce a family of multi-way stochastic blockmodels suited for this problem, which avoids pre-processing steps such as binning and thresholding commonly adopted for this type of problem in biology. We develop two inference procedures based on collapsed Gibbs sampling and variational methods. We provide a thorough evaluation of the proposed methods on simulated data, in terms of membership and blockmodel estimation, out-of-sample prediction, and run-time. We also quantify the effects of censoring procedures such as binning and thresholding on the estimation tasks. We use these models to carry out an empirical analysis of the functional mechanisms driving the coordination between gene expression and metabolite concentrations during carbon and nitrogen starvation in S. cerevisiae. PMID:24587846
An Adaptive ANOVA-based PCKF for High-Dimensional Nonlinear Inverse Modeling
LI, Weixuan; Lin, Guang; Zhang, Dongxiao
2014-02-01
The probabilistic collocation-based Kalman filter (PCKF) is a recently developed approach for solving inverse problems. It resembles the ensemble Kalman filter (EnKF) in every aspect, except that it represents and propagates model uncertainty by polynomial chaos expansion (PCE) instead of an ensemble of model realizations. Previous studies have shown PCKF to be a more efficient alternative to EnKF for many data assimilation problems. However, the accuracy and efficiency of PCKF depend on an appropriate truncation of the PCE series. Having more polynomial chaos bases in the expansion helps to capture uncertainty more accurately but increases computational cost. Basis selection is particularly important for high-dimensional stochastic problems because the number of polynomial chaos bases required to represent model uncertainty grows dramatically as the number of input parameters (random dimensions) increases. In classic PCKF algorithms, the PCE bases are pre-set based on users' experience. Also, for sequential data assimilation problems, the bases kept in the PCE expression remain unchanged in different Kalman filter loops, which could limit the accuracy and computational efficiency of classic PCKF algorithms. To address this issue, we present a new algorithm that adaptively selects PCE bases for different problems and automatically adjusts the number of bases in different Kalman filter loops. The algorithm is based on adaptive functional ANOVA (analysis of variance) decomposition, which approximates a high-dimensional function with the summation of a set of low-dimensional functions. Thus, instead of expanding the original model into PCE, we implement the PCE expansion on these low-dimensional functions, which is much less costly. We also propose a new adaptive criterion for ANOVA that is more suited for solving inverse problems. The new algorithm is tested with different examples and demonstrates great effectiveness in comparison with non-adaptive PCKF and EnKF.
Asymptotic Stability of High-dimensional Zakharov-Kuznetsov Solitons
NASA Astrophysics Data System (ADS)
Côte, Raphaël; Muñoz, Claudio; Pilod, Didier; Simpson, Gideon
2016-05-01
We prove that solitons (or solitary waves) of the Zakharov-Kuznetsov (ZK) equation, a physically relevant high dimensional generalization of the Korteweg-de Vries (KdV) equation appearing in Plasma Physics, and having mixed KdV and nonlinear Schrödinger (NLS) dynamics, are strongly asymptotically stable in the energy space. We also prove that the sum of well-arranged solitons is stable in the same space. Orbital stability of ZK solitons is well-known since the work of de Bouard [Proc R Soc Edinburgh 126:89-112, 1996]. Our proofs follow the ideas of Martel [SIAM J Math Anal 157:759-781, 2006] and Martel and Merle [Math Ann 341:391-427, 2008], applied for generalized KdV equations in one dimension. In particular, we extend to the high dimensional case several monotonicity properties for suitable half-portions of mass and energy; we also prove a new Liouville type property that characterizes ZK solitons, and a key Virial identity for the linear and nonlinear part of the ZK dynamics, obtained independently of the mixed KdV-NLS dynamics. This last Virial identity relies on a simple sign condition which is numerically tested for the two and three dimensional cases with no additional spectral assumptions required. Possible extensions to higher dimensions and different nonlinearities could be obtained after a suitable local well-posedness theory in the energy space, and the verification of a corresponding sign condition.
New data assimilation system DNDAS for high-dimensional models
NASA Astrophysics Data System (ADS)
Qun-bo, Huang; Xiao-qun, Cao; Meng-bin, Zhu; Wei-min, Zhang; Bai-nian, Liu
2016-05-01
The tangent linear (TL) and adjoint (AD) models have posed great difficulties for the development of variational data assimilation systems: it is hard to develop them correctly, whether by hand or with automatic differentiation tools. To overcome these limitations, a new data assimilation system, the dual-number data assimilation system (DNDAS), is designed based on dual-number automatic differentiation principles. We investigate the performance of DNDAS with two different optimization schemes and subsequently discuss whether DNDAS is appropriate for high-dimensional forecast models. The new data assimilation system avoids the complicated reverse integration of the adjoint model; it needs only the forward integration in the dual-number space to obtain the cost function and its gradient vector concurrently. To verify the correctness and effectiveness of DNDAS, we implemented DNDAS on a simple ordinary differential model and the Lorenz-63 model with different optimization methods. We then concentrate on the adaptability of DNDAS to the Lorenz-96 model with high-dimensional state variables. The results indicate that whether the system is simple or nonlinear, DNDAS can accurately reconstruct the initial condition for the forecast model and has a strong anti-noise characteristic. Given adequate computing resources, the quasi-Newton optimization method performs better than the conjugate gradient method in DNDAS. Project supported by the National Natural Science Foundation of China (Grant Nos. 41475094 and 41375113).
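A minimal sketch of the dual-number idea behind DNDAS: arithmetic on numbers of the form a + b·ε with ε² = 0 propagates derivatives alongside values, so one forward pass per input direction yields the cost function and its gradient together, with no adjoint (reverse) integration. The toy cost function here is invented for illustration.

```python
class Dual:
    """Dual number val + dot*eps with eps**2 == 0."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        # Product rule is encoded in the eps component.
        return Dual(self.val * o.val, self.val * o.dot + self.dot * o.val)
    __rmul__ = __mul__

def value_and_gradient(f, x):
    """Forward integration only: one dual-number pass per input dimension."""
    grad = []
    for i in range(len(x)):
        args = [Dual(v, 1.0 if j == i else 0.0) for j, v in enumerate(x)]
        grad.append(f(args).dot)
    return f([Dual(v) for v in x]).val, grad

# A toy "cost function" J(x) = x0*x1 + 3*x0.
J = lambda x: x[0] * x[1] + 3 * x[0]
val, grad = value_and_gradient(J, [2.0, 5.0])
print(val)   # 16.0
print(grad)  # [8.0, 2.0]
```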
Likelihood-Free Inference in High-Dimensional Models.
Kousathanas, Athanasios; Leuenberger, Christoph; Helfer, Jonas; Quinodoz, Mathieu; Foll, Matthieu; Wegmann, Daniel
2016-06-01
Methods that bypass analytical evaluations of the likelihood function have become an indispensable tool for statistical inference in many fields of science. These so-called likelihood-free methods rely on accepting and rejecting simulations based on summary statistics, which limits them to low-dimensional models for which the value of the likelihood is large enough to result in manageable acceptance rates. To get around these issues, we introduce a novel, likelihood-free Markov chain Monte Carlo (MCMC) method combining two key innovations: updating only one parameter per iteration and accepting or rejecting this update based on subsets of statistics approximately sufficient for this parameter. This increases acceptance rates dramatically, rendering this approach suitable even for models of very high dimensionality. We further derive that for linear models, a one-dimensional combination of statistics per parameter is sufficient and can be found empirically with simulations. Finally, we demonstrate that our method readily scales to models of very high dimensionality, using toy models as well as by jointly inferring the effective population size, the distribution of fitness effects (DFE) of segregating mutations, and selection coefficients for each locus from data of a recent experiment on the evolution of drug resistance in influenza. PMID:27052569
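A toy sketch of the one-parameter-per-iteration, subset-statistic idea; this is a crude ABC-style chain on an invented two-parameter Gaussian model with a flat prior, not the authors' algorithm. Each iteration perturbs a single parameter and accepts the update only if the summary statistic relevant to that parameter lands within a tolerance of the observed one.

```python
import random

random.seed(1)

# Toy model: two independent Gaussian location parameters. One approximately
# sufficient summary statistic per parameter.
def simulate(theta):
    return [random.gauss(theta[0], 1.0), random.gauss(theta[1], 1.0)]

observed = [2.0, -1.0]   # summaries of the "observed" data
eps = 0.5                # ABC tolerance
theta = [0.0, 0.0]
samples = []

for it in range(4000):
    i = it % 2                       # update ONE parameter per iteration
    prop = list(theta)
    prop[i] += random.gauss(0.0, 0.5)
    s = simulate(prop)
    # Accept using only the statistic subset relevant to parameter i.
    if abs(s[i] - observed[i]) < eps:
        theta = prop
    samples.append(theta[0])

burned = samples[1000:]
posterior_mean = sum(burned) / len(burned)
print(posterior_mean)  # close to the first observed summary, 2.0
```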
High-dimensional camera shake removal with given depth map.
Yue, Tao; Suo, Jinli; Dai, Qionghai
2014-06-01
Camera motion blur is drastically nonuniform for large depth-range scenes, and the nonuniformity caused by camera translation is depth dependent but not the case for camera rotations. To restore the blurry images of large-depth-range scenes deteriorated by arbitrary camera motion, we build an image blur model considering 6-degrees of freedom (DoF) of camera motion with a given scene depth map. To make this 6D depth-aware model tractable, we propose a novel parametrization strategy to reduce the number of variables and an effective method to estimate high-dimensional camera motion as well. The number of variables is reduced by temporal sampling motion function, which describes the 6-DoF camera motion by sampling the camera trajectory uniformly in time domain. To effectively estimate the high-dimensional camera motion parameters, we construct the probabilistic motion density function (PMDF) to describe the probability distribution of camera poses during exposure, and apply it as a unified constraint to guide the convergence of the iterative deblurring algorithm. Specifically, PMDF is computed through a back projection from 2D local blur kernels to 6D camera motion parameter space and robust voting. We conduct a series of experiments on both synthetic and real captured data, and validate that our method achieves better performance than existing uniform methods and nonuniform methods on large-depth-range scenes.
Power Enhancement in High Dimensional Cross-Sectional Tests
Fan, Jianqing; Liao, Yuan; Yao, Jiawei
2016-01-01
We propose a novel technique to boost the power of testing a high-dimensional vector H₀ : θ = 0 against sparse alternatives where the null hypothesis is violated by only a few components. Existing tests based on quadratic forms, such as the Wald statistic, often suffer from low power due to the accumulation of errors in estimating high-dimensional parameters. More powerful tests for sparse alternatives, such as thresholding and extreme-value tests, on the other hand, require either stringent conditions or a bootstrap to derive the null distribution, and often suffer from size distortions due to slow convergence. Based on a screening technique, we introduce a “power enhancement component”, which is zero under the null hypothesis with high probability but diverges quickly under sparse alternatives. The proposed test statistic combines the power enhancement component with an asymptotically pivotal statistic, and strengthens the power under sparse alternatives. The null distribution does not require stringent regularity conditions, and is completely determined by that of the pivotal statistic. As specific applications, the proposed methods are applied to testing factor pricing models and validating cross-sectional independence in panel data models. PMID:26778846
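The construction above, a pivotal quadratic-form statistic plus a screening component that vanishes under the null, can be sketched in a few lines. The threshold rate below is an illustrative stand-in for the paper's choice, and the inputs are standardized component estimates rather than the full factor-pricing setup:

```python
import numpy as np

def power_enhanced_statistic(theta_hat, se, n, delta=None):
    """Sketch of a power-enhancement test statistic in the spirit of
    Fan, Liao & Yao. J1 is an asymptotically pivotal standardized
    quadratic form; J0 is a screening component that is zero with high
    probability under the null but diverges under sparse alternatives.
    The threshold `delta` used here is a hypothetical rate, not the
    paper's exact constant."""
    p = len(theta_hat)
    if delta is None:
        # slowly growing threshold so that, under the null, no
        # component survives screening with high probability
        delta = np.sqrt(np.log(p) * np.log(np.log(n)))
    t = theta_hat / se                          # standardized components
    screened = np.abs(t) > delta                # components surviving screening
    j0 = np.sqrt(n) * np.sum(t[screened] ** 2)  # power enhancement component
    j1 = (np.sum(t ** 2) - p) / np.sqrt(2 * p)  # pivotal quadratic-form statistic
    return j0 + j1
```

Under the null (all standardized estimates small) the screened set is empty, so the statistic reduces to the pivotal part and its null distribution is unchanged.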
The Sparse MLE for Ultra-High-Dimensional Feature Screening
Xu, Chen; Chen, Jiahua
2014-01-01
Feature selection is fundamental for modeling high-dimensional data, where the number of features can be huge and much larger than the sample size. Since the feature space is so large, many traditional procedures become numerically infeasible. It is hence essential to first remove most apparently non-influential features before any elaborate analysis. Recently, several procedures have been developed for this purpose, including sure independence screening (SIS) as a widely used technique. To gain computational efficiency, SIS screens features based on their individual predictive power. In this paper, we propose a new screening method via the sparsity-restricted maximum likelihood estimator (SMLE). The new method naturally takes the joint effects of features into account in the screening process, which gives it an edge to potentially outperform existing methods. This conjecture is further supported by simulation studies under a number of modeling settings. We show that the proposed method is screening consistent in the context of ultra-high-dimensional generalized linear models. PMID:25382886
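The joint-effects screening idea can be illustrated with a simplified sketch: sparsity-restricted estimation via iterative hard thresholding on a Gaussian linear model (the paper treats general GLMs, and its algorithmic details differ). Because the gradient step uses all features jointly, correlated features are screened together rather than one at a time as in SIS:

```python
import numpy as np

def smle_screen(X, y, k, n_iter=100, lr=None):
    """Toy sparsity-restricted likelihood screening via iterative hard
    thresholding for a Gaussian linear model. A simplification of SMLE,
    for illustration only: at each step we take a gradient step on the
    log-likelihood and keep only the k largest coefficients."""
    n, p = X.shape
    beta = np.zeros(p)
    if lr is None:
        # safe step size: inverse Lipschitz constant of the gradient
        lr = 1.0 / (np.linalg.norm(X, 2) ** 2 / n)
    for _ in range(n_iter):
        grad = X.T @ (y - X @ beta) / n      # joint gradient over all features
        beta = beta + lr * grad
        keep = np.argsort(np.abs(beta))[-k:] # hard-threshold to k features
        mask = np.zeros(p, dtype=bool)
        mask[keep] = True
        beta[~mask] = 0.0
    return np.flatnonzero(beta)              # indices of retained features
```

On a noiseless toy problem with two active features, the retained index set recovers the true support.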
Likelihood-Free Inference in High-Dimensional Models.
Kousathanas, Athanasios; Leuenberger, Christoph; Helfer, Jonas; Quinodoz, Mathieu; Foll, Matthieu; Wegmann, Daniel
2016-06-01
Methods that bypass analytical evaluations of the likelihood function have become an indispensable tool for statistical inference in many fields of science. These so-called likelihood-free methods rely on accepting and rejecting simulations based on summary statistics, which limits them to low-dimensional models for which the value of the likelihood is large enough to result in manageable acceptance rates. To get around these issues, we introduce a novel, likelihood-free Markov chain Monte Carlo (MCMC) method combining two key innovations: updating only one parameter per iteration and accepting or rejecting this update based on subsets of statistics approximately sufficient for this parameter. This increases acceptance rates dramatically, rendering this approach suitable even for models of very high dimensionality. We further derive that for linear models, a one-dimensional combination of statistics per parameter is sufficient and can be found empirically with simulations. Finally, we demonstrate that our method readily scales to models of very high dimensionality, using toy models as well as by jointly inferring the effective population size, the distribution of fitness effects (DFE) of segregating mutations, and selection coefficients for each locus from data of a recent experiment on the evolution of drug resistance in influenza.
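The two innovations described above, single-parameter updates and per-parameter subsets of statistics, can be sketched as a toy likelihood-free MCMC. Everything here is an illustrative stand-in: the paper finds approximately sufficient per-parameter statistics via linear combinations, whereas this sketch takes a user-supplied `stats_for_param` and a fixed tolerance:

```python
import numpy as np

def abc_mcmc_componentwise(simulate, stats_for_param, observed, theta0,
                           n_iter=1000, eps=1.0, step=0.1, rng=None):
    """Toy component-wise likelihood-free MCMC. Each iteration proposes
    an update to a single parameter and accepts it only if the subset
    of summary statistics informative for that parameter lies within
    eps of the observed statistics. Because only a low-dimensional
    subset must match, acceptance rates stay manageable even when the
    full model is high dimensional."""
    rng = rng if rng is not None else np.random.default_rng(0)
    theta = np.array(theta0, dtype=float)
    chain = [theta.copy()]
    for it in range(n_iter):
        j = it % len(theta)                  # update one parameter per iteration
        prop = theta.copy()
        prop[j] += rng.normal(0.0, step)
        s = simulate(prop, rng)              # simulate summary statistics
        # accept/reject using only statistics relevant to parameter j
        d = np.abs(np.atleast_1d(stats_for_param(s, j))
                   - np.atleast_1d(stats_for_param(observed, j)))
        if np.all(d < eps):
            theta = prop
        chain.append(theta.copy())
    return np.array(chain)
```

With a deterministic toy "simulator" (statistics equal to the parameters), the chain simply explores the eps-ball around the observed statistics, one coordinate at a time.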
A potential implicit particle method for high-dimensional systems
NASA Astrophysics Data System (ADS)
Weir, B.; Miller, R. N.; Spitz, Y. H.
2013-11-01
This paper presents a particle method designed for high-dimensional state estimation. Instead of weighing random forecasts by their distance to given observations, the method samples an ensemble of particles around an optimal solution based on the observations (i.e., it is implicit). It differs from other implicit methods because it includes the state at the previous assimilation time as part of the optimal solution (i.e., it is a lag-1 smoother). This is accomplished through the use of a mixture model for the background distribution of the previous state. In a high-dimensional, linear, Gaussian example, the mixture-based implicit particle smoother does not collapse. Furthermore, using only a small number of particles, the implicit approach is able to detect transitions in two nonlinear, multi-dimensional generalizations of a double-well. Adding a step that trains the sampled distribution to the target distribution prevents collapse during the transitions, which are strongly nonlinear events. To produce similar estimates, other approaches require many more particles.
A cluster of brain tumours in a New South Wales colliery: a problem in interpretation.
Brown, A M; Christie, D; Devey, P; Nie, V M; Hicks, M N
1993-12-01
Following the reporting of a cluster of cases of brain tumour in the workforce of an underground coal mine (Mine A) in the Newcastle coalfield, a study was carried out to determine whether this phenomenon was due to chance alone or whether an environmental cause could be postulated. The study design was a historical cohort study over 15 years comparing the incidence of brain tumour (ICD9 191 and 192) in the index mine with that in two control mines (Mines B and C) in the same area and with that in the general Australian population. We compared environmental exposures (ionising and non-ionising radiation and chemical exposure) in the three mines. With Australian brain tumour incidence rates as reference, the standardised incidence ratio for brain tumour in Mine A was 5.3 (95 per cent confidence interval (CI) 1.08 to 14.04) and in Mines B and C combined was 1.23 (CI 0.02 to 3.80). On most environmental assessments the three mines were similar, but Mine A used larger volumes of solvents than the other mines. This study poses two questions: was the increase in cases of brain tumour in Mine A 'real', and if so, was it related to the use of solvents? Data from an investigation of a cluster such as this are unlikely to be conclusive. Nevertheless, such answers are demanded not only by those at risk but also by the mine management, which is responsible for a safe working environment. Some of the difficulties involved with this judgment are discussed.
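The standardised incidence ratios quoted above are observed counts divided by the counts expected under reference rates, with a Poisson confidence interval. A common computational sketch uses Byar's approximation to the exact Poisson limits (the study does not state which interval method it used, so this is one plausible choice):

```python
import math

def sir_with_ci(observed, expected, z=1.96):
    """Standardised incidence ratio with Byar's approximation to the
    exact Poisson confidence interval. `observed` is the case count in
    the index group; `expected` is the count implied by reference
    (e.g. national) incidence rates applied to the group's person-years."""
    sir = observed / expected
    if observed == 0:
        lo = 0.0
    else:
        lo = observed * (1 - 1 / (9 * observed)
                         - z / (3 * math.sqrt(observed))) ** 3 / expected
    o1 = observed + 1  # upper limit uses observed + 1
    hi = o1 * (1 - 1 / (9 * o1) + z / (3 * math.sqrt(o1))) ** 3 / expected
    return sir, lo, hi
```

Small observed counts give the very wide intervals seen in the abstract, which is exactly why such cluster data are "unlikely to be conclusive".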
A reduced-order model from high-dimensional frictional hysteresis.
Biswas, Saurabh; Chatterjee, Anindya
2014-06-01
Hysteresis in material behaviour includes both signum nonlinearities and high dimensionality. Available models for component-level hysteretic behaviour are empirical. Here, we derive a low-order model for rate-independent hysteresis from a high-dimensional massless frictional system. The original system, being given in terms of signs of velocities, is first solved incrementally using a linear complementarity problem formulation. From this numerical solution, to develop a reduced-order model, basis vectors are chosen using the singular value decomposition. The slip direction in generalized coordinates is identified as the minimizer of a dissipation-related function. That function includes terms for frictional dissipation through signum nonlinearities at many friction sites. Luckily, it allows a convenient analytical approximation. Upon solution of the approximated minimization problem, the slip direction is found. A final evolution equation for a few states is then obtained that gives a good match with the full solution. The model obtained here may lead to new insights into hysteresis as well as better empirical modelling thereof.
Arif, Muhammad
2012-06-01
In pattern classification problems, feature extraction is an important step. The quality of features in discriminating different classes plays an important role in pattern classification problems. In real life, pattern classification may require a high dimensional feature space, and it is impossible to visualize the feature space if its dimension is greater than four. In this paper, we have proposed a Similarity-Dissimilarity plot which can project a high dimensional space onto a two dimensional space while retaining the important characteristics required to assess the discrimination quality of the features. The Similarity-Dissimilarity plot can reveal information about the amount of overlap of features of different classes. Separable data points of different classes will also be visible on the plot and can be classified correctly using an appropriate classifier. Hence, approximate classification accuracy can be predicted. Moreover, it is possible to know with which class the misclassified data points will be confused by the classifier. Outlier data points can also be located on the similarity-dissimilarity plot. Various examples of synthetic data are used to highlight important characteristics of the proposed plot. Some real life examples from biomedical data are also used for the analysis. The proposed plot is independent of the number of dimensions of the feature space.
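One plausible reading of such a plot (not necessarily the paper's exact construction) is to give each point two coordinates: its distance to the nearest neighbour of the same class and its distance to the nearest neighbour of any other class. This projects any feature-space dimensionality onto two axes while preserving class-overlap information:

```python
import numpy as np

def similarity_dissimilarity(X, y):
    """For each point, compute the distance to its nearest same-class
    neighbour ("similarity" axis) and to its nearest other-class
    neighbour ("dissimilarity" axis). Points with dissimilarity greater
    than similarity lie on the separable side of the plot; heavily
    overlapped classes show the opposite pattern. An illustrative
    sketch of the plot's coordinates, assuming Euclidean distance."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)          # exclude each point itself
    same = y[:, None] == y[None, :]
    sim = np.where(same, D, np.inf).min(axis=1)
    dis = np.where(~same, D, np.inf).min(axis=1)
    return sim, dis
```

Scattering `dis` against `sim` then gives the 2-D view: well-separated classes fall entirely above the diagonal.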
Reconstructing high-dimensional two-photon entangled states via compressive sensing.
Tonolini, Francesco; Chan, Susan; Agnew, Megan; Lindsay, Alan; Leach, Jonathan
2014-10-13
Accurately establishing the state of large-scale quantum systems is an important tool in quantum information science; however, the large number of unknown parameters hinders the rapid characterisation of such states, and reconstruction procedures can become prohibitively time-consuming. Compressive sensing, a procedure for solving inverse problems by incorporating prior knowledge about the form of the solution, provides an attractive alternative to the problem of high-dimensional quantum state characterisation. Using a modified version of compressive sensing that incorporates the principles of singular value thresholding, we reconstruct the density matrix of a high-dimensional two-photon entangled system. The dimension of each photon is equal to d = 17, corresponding to a system of 83521 unknown real parameters. Accurate reconstruction is achieved with approximately 2500 measurements, only 3% of the total number of unknown parameters in the state. The algorithm we develop is fast, computationally inexpensive, and applicable to a wide range of quantum states, thus demonstrating compressive sensing as an effective technique for measuring the state of large-scale quantum systems.
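The core numerical ingredient named above, singular value thresholding, is the proximal operator of the nuclear norm: it shrinks singular values and so promotes the low-rank structure expected of a nearly pure entangled state. A real-valued toy sketch follows (the actual reconstruction works on a complex Hermitian density matrix under physical constraints, which this omits):

```python
import numpy as np

def svt_step(M, tau):
    """Singular value thresholding: soft-threshold the singular values
    of M by tau. This is the proximal operator of the nuclear norm and
    the building block of low-rank recovery algorithms."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def svt_complete(observed, mask, tau=1.0, step=1.0, n_iter=100):
    """Toy SVT iteration (after Cai, Candes & Shen) recovering a
    low-rank matrix from a subset of sampled entries, a loose analogue
    of reconstructing an 83521-parameter state from ~2500 measurements.
    `observed` holds sampled entries (zeros elsewhere); `mask` is 1.0
    where an entry was measured."""
    Y = np.zeros_like(observed)
    X = Y
    for _ in range(n_iter):
        X = svt_step(Y, tau)                  # low-rank estimate
        Y = Y + step * mask * (observed - X)  # correct on observed entries only
    return X
```

The key point mirrors the abstract: prior knowledge (low rank) replaces the missing measurements, so far fewer samples than unknowns are needed.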
High dimensional biological data retrieval optimization with NoSQL technology
2014-01-01
Background: High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, when querying relational databases for hundreds of different patient gene expression records, performance is poor. Non-relational data models, such as the key-value model implemented in NoSQL databases, hold promise to be more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. Results: In this paper we introduce a new data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase in query performance on MongoDB. Conclusions: The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new data
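The performance gain of a key-value design comes largely from row-key layout: composing the key so that one patient's expression values are contiguous turns a relational join into a single range scan. A minimal sketch, with a sorted Python dict standing in for an HBase table; the field names are illustrative, not tranSMART's actual schema:

```python
def expression_row_key(trial_id, patient_id, probe_id):
    """Composite row key for a key-value store. Ordering the key as
    trial | patient | probe makes 'all expression values for a patient'
    a contiguous slice of the sorted key space."""
    return f"{trial_id}|{patient_id}|{probe_id}".encode()

def scan_patient(store, trial_id, patient_id):
    """Prefix range scan over a sorted key space; a dict stands in for
    an HBase table here, but the access pattern is the same."""
    prefix = f"{trial_id}|{patient_id}|".encode()
    return {k: v for k, v in sorted(store.items()) if k.startswith(prefix)}
```

In HBase this prefix scan touches only adjacent rows on one region server, which is the behaviour behind the multi-fold speedups reported above.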
Nam, Julia EunJu; Mueller, Klaus
2013-02-01
Gaining a true appreciation of high-dimensional space remains difficult since all of the existing high-dimensional space exploration techniques serialize the space travel in some way. This is not so foreign to us since we, when traveling, also experience the world in a serial fashion. But we typically have access to a map to help with positioning, orientation, navigation, and trip planning. Here, we propose a multivariate data exploration tool that compares high-dimensional space navigation with a sightseeing trip. It decomposes this activity into five major tasks: 1) Identify the sights: use a map to identify the sights of interest and their location; 2) Plan the trip: connect the sights of interest along a specifiable path; 3) Go on the trip: travel along the route; 4) Hop off the bus: experience the location, look around, zoom into detail; and 5) Orient and localize: regain bearings in the map. We describe intuitive and interactive tools for all of these tasks, both global navigation within the map and local exploration of the data distributions. For the latter, we describe a polygonal touchpad interface which enables users to smoothly tilt the projection plane in high-dimensional space to produce multivariate scatterplots that best convey the data relationships under investigation. Motion parallax and illustrative motion trails aid in the perception of these transient patterns. We describe the use of our system within two applications: 1) the exploratory discovery of data configurations that best fit a personal preference in the presence of tradeoffs and 2) interactive cluster analysis via cluster sculpting in N-D.
Nakano, Takashi; Otsuka, Makoto; Yoshimoto, Junichiro; Doya, Kenji
2015-01-01
A theoretical framework of reinforcement learning plays an important role in understanding action selection in animals. Spiking neural networks provide a theoretically grounded means to test computational hypotheses on neurally plausible algorithms of reinforcement learning through numerical simulation. However, most of these models cannot handle observations that are noisy or that occurred in the past, even though these are inevitable and constraining features of learning in real environments. This class of problem is formally known as partially observable reinforcement learning (PORL) problems. It provides a generalization of reinforcement learning to partially observable domains. In addition, observations in the real world tend to be rich and high-dimensional. In this work, we use a spiking neural network model to approximate the free energy of a restricted Boltzmann machine and apply it to the solution of PORL problems with high-dimensional observations. Our spiking network model solves maze tasks with perceptually ambiguous high-dimensional observations without knowledge of the true environment. An extended model with working memory also solves history-dependent tasks. The way spiking neural networks handle PORL problems may provide a glimpse into the underlying laws of neural information processing which can only be discovered through such a top-down approach. PMID:25734662
Algorithmic Tools for Mining High-Dimensional Cytometry Data.
Chester, Cariad; Maecker, Holden T
2015-08-01
The advent of mass cytometry has led to an unprecedented increase in the number of analytes measured in individual cells, thereby increasing the complexity and information content of cytometric data. Although this technology is ideally suited to the detailed examination of the immune system, the applicability of the different methods for analyzing such complex data is less clear. Conventional data analysis by manual gating of cells in biaxial dot plots is often subjective, time consuming, and neglectful of much of the information contained in a highly dimensional cytometric dataset. Algorithmic data mining has the promise to eliminate these concerns, and several such tools have been applied recently to mass cytometry data. We review computational data mining tools that have been used to analyze mass cytometry data, outline their differences, and comment on their strengths and limitations. This review will help immunologists to identify suitable algorithmic tools for their particular projects.
Additivity principle in high-dimensional deterministic systems.
Saito, Keiji; Dhar, Abhishek
2011-12-16
The additivity principle (AP), conjectured by Bodineau and Derrida [Phys. Rev. Lett. 92, 180601 (2004)], is discussed for the case of heat conduction in three-dimensional disordered harmonic lattices to consider the effects of deterministic dynamics, higher dimensionality, and different transport regimes, i.e., ballistic, diffusive, and anomalous transport. The cumulant generating function (CGF) for heat transfer is accurately calculated and compared with the one given by the AP. In the diffusive regime, we find a clear agreement with the conjecture even if the system is high dimensional. Surprisingly, even in the anomalous regime the CGF is also well fitted by the AP. Lower-dimensional systems are also studied and the importance of three dimensionality for the validity is stressed. PMID:22243060
Parsimonious description for predicting high-dimensional dynamics
NASA Astrophysics Data System (ADS)
Hirata, Yoshito; Takeuchi, Tomoya; Horai, Shunsuke; Suzuki, Hideyuki; Aihara, Kazuyuki
2015-10-01
When we observe a system, we often cannot observe all of its variables and may have only a limited set of measurements. Under such a circumstance, delay coordinates, vectors made of successive measurements, are useful to reconstruct the states of the whole system. Although the method of delay coordinates is theoretically supported for high-dimensional dynamical systems, practically there is a limitation because the calculation for higher-dimensional delay coordinates becomes more expensive. Here, we propose a parsimonious description of virtually infinite-dimensional delay coordinates by evaluating their distances with exponentially decaying weights. This description enables us to predict the future values of the measurements faster because we can reuse the calculated distances, and more accurately because the description naturally reduces the bias of the classical delay coordinates toward the stable directions. We demonstrate the proposed method with toy models of the atmosphere and real datasets related to renewable energy.
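The "reuse the calculated distances" idea admits a simple recursion: with exponentially decaying weights, the squared distance between two delay vectors ending at times t and s is the current squared difference plus a discounted copy of the distance one step earlier. The recursion below is a plausible sketch of that bookkeeping, not the authors' exact formulation:

```python
import numpy as np

def weighted_delay_distances(x, lam=0.9):
    """Pairwise distances between virtually infinite-dimensional delay
    coordinate vectors with exponentially decaying weights, via the
    recursion D2[t, s] = (x_t - x_s)^2 + lam * D2[t-1, s-1]. Each entry
    reuses the previously computed distance instead of re-summing the
    whole history, which is what makes the description cheap."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    D2 = np.zeros((n, n))
    for t in range(n):
        for s in range(n):
            d2 = (x[t] - x[s]) ** 2
            if t > 0 and s > 0:
                d2 += lam * D2[t - 1, s - 1]
            D2[t, s] = d2
    return np.sqrt(D2)
```

Nearest neighbours under this distance can then drive a standard analogue (nearest-neighbour) predictor of the next measurement.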
High-dimensional quantum nature of ghost angular Young's diffraction
Chen Lixiang; Leach, Jonathan; Jack, Barry; Padgett, Miles J.; Franke-Arnold, Sonja; She Weilong
2010-09-15
We propose a technique to characterize the dimensionality of entangled sources affected by any environment, including phase and amplitude masks or atmospheric turbulence. We illustrate this technique on the example of angular ghost diffraction using the orbital angular momentum (OAM) spectrum generated by a nonlocal double slit. We realize a nonlocal angular double slit by placing single angular slits in the paths of the signal and idler modes of the entangled light field generated by parametric down-conversion. Based on the observed OAM spectrum and the measured Shannon dimensionality spectrum of the possible quantum channels that contribute to Young's ghost diffraction, we calculate the associated dimensionality D_total. The measured D_total ranges between 1 and 2.74 depending on the opening angle of the angular slits. The ability to quantify the nature of high-dimensional entanglement is vital when considering quantum information protocols.
Future of High-Dimensional Data-Driven Exoplanet Science
NASA Astrophysics Data System (ADS)
Ford, Eric B.
2016-03-01
The detection and characterization of exoplanets has come a long way since the 1990’s. For example, instruments specifically designed for Doppler planet surveys feature environmental controls to minimize instrumental effects and advanced calibration systems. Combining these instruments with powerful telescopes, astronomers have detected thousands of exoplanets. The application of Bayesian algorithms has improved the quality and reliability with which astronomers characterize the mass and orbits of exoplanets. Thanks to continued improvements in instrumentation, now the detection of extrasolar low-mass planets is limited primarily by stellar activity, rather than observational uncertainties. This presents a new set of challenges which will require cross-disciplinary research to combine improved statistical algorithms with an astrophysical understanding of stellar activity and the details of astronomical instrumentation. I describe these challenges and outline the roles of parameter estimation over high-dimensional parameter spaces, marginalizing over uncertainties in stellar astrophysics and machine learning for the next generation of Doppler planet searches.
Modeling for Process Control: High-Dimensional Systems
Lev S. Tsimring
2008-09-15
Many other technologically important systems (among them, powders and other granular systems) are intrinsically nonlinear. This project is focused on building dynamical models for granular systems as a prototype for nonlinear high-dimensional systems exhibiting complex non-equilibrium phenomena. Granular materials present a unique opportunity to study these issues in a technologically important and yet fundamentally interesting setting. Granular systems exhibit a rich variety of regimes, from gas-like to solid-like, depending on the external excitation. Based on the combination of rigorous asymptotic analysis, available experimental data, and nonlinear signal processing tools, we developed a multi-scale approach to the modeling of granular systems, from a detailed description of grain-grain interactions on the micro-scale to continuous modeling of large-scale granular flows with important geophysical applications.
High dimensional reflectance analysis of soil organic matter
NASA Technical Reports Server (NTRS)
Henderson, T. L.; Baumgardner, M. F.; Franzmeier, D. P.; Stott, D. E.; Coster, D. C.
1992-01-01
Recent breakthroughs in remote-sensing technology have led to the development of high spectral resolution imaging sensors for observation of earth surface features. This research was conducted to evaluate the effects of organic matter content and composition on narrowband soil reflectance across the visible and reflective infrared spectral ranges. Organic matter from four Indiana agricultural soils, ranging in organic C content from 0.99 to 1.72 percent, was extracted, fractionated, and purified. Six components of each soil were isolated and prepared for spectral analysis. Reflectance was measured in 210 narrow bands in the 400- to 2500-nm wavelength range. Statistical analysis of reflectance values indicated the potential of high dimensional reflectance data in specific visible, near-infrared, and middle-infrared bands to provide information about soil organic C content, but not organic matter composition. These bands also responded significantly to Fe- and Mn-oxide content.
High-Dimensional Single-Cell Cancer Biology
Doxie, Deon B.
2014-01-01
Cancer cells are distinguished from each other and from healthy cells by features that drive clonal evolution and therapy resistance. New advances in high-dimensional flow cytometry make it possible to systematically measure mechanisms of tumor initiation, progression, and therapy resistance on millions of cells from human tumors. Here we describe flow cytometry techniques that enable a ‘single-cell systems biology’ view of cancer. High-dimensional techniques like mass cytometry enable multiplexed single-cell analysis of cell identity, clinical biomarkers, signaling network phospho-proteins, transcription factors, and functional readouts of proliferation, cell cycle status, and apoptosis. This capability pairs well with a signaling profiles approach that dissects mechanism by systematically perturbing and measuring many nodes in a signaling network. Single-cell approaches enable study of cellular heterogeneity of primary tissues and turn cell subsets into experimental controls or opportunities for new discovery. Rare populations of stem cells or therapy resistant cancer cells can be identified and compared to other types of cells within the same sample. In the long term, these techniques will enable tracking of minimal residual disease (MRD) and disease progression. By better understanding biological systems that control development and cell-cell interactions in healthy and diseased contexts, we can learn to program cells to become therapeutic agents or target malignant signaling events to specifically kill cancer cells. Single-cell approaches that provide deep insight into cell signaling and fate decisions will be critical to optimizing the next generation of cancer treatments combining targeted approaches and immunotherapy. PMID:24671264
Spectral feature design in high dimensional multispectral data
NASA Technical Reports Server (NTRS)
Chen, Chih-Chien Thomas; Landgrebe, David A.
1988-01-01
The High resolution Imaging Spectrometer (HIRIS) is designed to acquire images simultaneously in 192 spectral bands in the 0.4 to 2.5 micrometers wavelength region. It will make possible the collection of essentially continuous reflectance spectra at a spectral resolution sufficient to extract significantly enhanced amounts of information from return signals as compared to existing systems. The advantages of such high dimensional data come at a cost of increased system and data complexity. For example, since the finer the spectral resolution, the higher the data rate, it becomes impractical to design the sensor to be operated continuously. It is essential to find new ways to preprocess the data which reduce the data rate while at the same time maintaining the information content of the high dimensional signal produced. Four spectral feature design techniques are developed from the Weighted Karhunen-Loeve Transforms: (1) non-overlapping band feature selection algorithm; (2) overlapping band feature selection algorithm; (3) Walsh function approach; and (4) infinite clipped optimal function approach. The infinite clipped optimal function approach is chosen since the features are easiest to find and their classification performance is the best. After the preprocessed data has been received at the ground station, canonical analysis is further used to find the best set of features under the criterion that maximal class separability is achieved. Both 100 dimensional vegetation data and 200 dimensional soil data were used to test the spectral feature design system. It was shown that the infinite clipped versions of the first 16 optimal features had excellent classification performance. The overall probability of correct classification is over 90 percent while providing for a reduced downlink data rate by a factor of 10.
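The feature-design techniques above all start from the (weighted) Karhunen-Loeve transform: optimal features are projections onto the top eigenvectors of the band covariance matrix, and the "infinite clipped" variant replaces each eigenvector by its sign pattern, trading a little optimality for a drastically simpler onboard implementation. A minimal unweighted sketch:

```python
import numpy as np

def kl_features(X, k):
    """Sketch of Karhunen-Loeve spectral feature design. X is an
    (n_samples, n_bands) reflectance matrix. Returns the optimal
    features (projections onto the top-k covariance eigenvectors) and
    the infinite-clipped features (projections onto the sign patterns
    of those eigenvectors, i.e. +/-1 weights per band). Unweighted
    version for illustration; the paper uses weighted transforms."""
    Xc = X - X.mean(axis=0)
    w, V = np.linalg.eigh(np.cov(Xc, rowvar=False))
    top = V[:, np.argsort(w)[::-1][:k]]   # top-k eigenvectors, by eigenvalue
    optimal = Xc @ top                    # optimal K-L features
    clipped = Xc @ np.sign(top)           # infinite-clipped (sign-only) features
    return optimal, clipped
```

The clipped projection needs only additions and subtractions of band values, which is why it is attractive for reducing the downlink data rate before ground processing.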
Shaffer, Patrick; Valsson, Omar; Parrinello, Michele
2016-01-01
The capabilities of molecular simulations have been greatly extended by a number of widely used enhanced sampling methods that facilitate escaping from metastable states and crossing large barriers. Despite these developments, many problems remain out of reach for these methods, which has led to a vigorous effort in this area. One of the most important unsolved problems is sampling high-dimensional free-energy landscapes and systems that are not easily described by a small number of collective variables. In this work we demonstrate a new way to compute free-energy landscapes of high dimensionality based on the previously introduced variationally enhanced sampling, and we apply it to the miniprotein chignolin. PMID:26787868
Williams, Kristine; Herman, Ruth; Bontempo, Daniel
2014-01-01
Purpose of the study: Assisted living (AL) residents are at risk for cognitive and functional declines that eventually reduce their ability to care for themselves, thereby triggering nursing home placement. In developing a method to slow this decline, the efficacy of Reasoning Exercises in Assisted Living (REAL), a cognitive training intervention that teaches everyday reasoning and problem-solving skills to AL residents, was tested. Design and methods: At thirteen randomized Midwestern facilities, AL residents whose Mini Mental State Examination scores ranged from 19–29 either were trained in REAL or a vitamin education attention control program or received no treatment at all. For 3 weeks, treated groups received personal training in their respective programs. Results: Scores on the Every Day Problems Test for Cognitively Challenged Elders (EPCCE) and on the Direct Assessment of Functional Status (DAFS) showed significant increases only for the REAL group. For EPCCE, change from baseline immediately postintervention was +3.10 (P<0.01), and there was significant retention at the 3-month follow-up (d=2.71; P<0.01). For DAFS, change from baseline immediately postintervention was +3.52 (P<0.001), although retention was not as strong. Neither the attention nor the no-treatment control groups had significant gains immediately postintervention or at follow-up assessments. Post hoc across-group comparison of baseline change also highlights the benefits of REAL training. For EPCCE, the magnitude of gain was significantly larger in the REAL group versus the no-treatment control group immediately postintervention (d=3.82; P<0.01) and at the 3-month follow-up (d=3.80; P<0.01). For DAFS, gain magnitude immediately postintervention for REAL was significantly greater compared with in the attention control group (d=4.73; P<0.01). Implications: REAL improves skills in everyday problem solving, which may allow AL residents to maintain self-care and extend AL residency. This benefit
Smart sampling and incremental function learning for very large high dimensional data.
Loyola R, Diego G; Pedergnana, Mattia; Gimeno García, Sebastián
2016-06-01
Very large high dimensional data are common nowadays and they impose new challenges to data-driven and data-intensive algorithms. Computational intelligence techniques have the potential to provide powerful tools for addressing these challenges, but the current literature focuses mainly on handling scalability issues related to data volume in terms of sample size for classification tasks. This work presents a systematic and comprehensive approach for optimally handling regression tasks with very large high dimensional data. The proposed approach is based on smart sampling techniques that minimize the number of samples to be generated, using an iterative procedure that creates new sample sets until the input and output spaces of the function to be approximated are optimally covered. Incremental function learning takes place in each sampling iteration: the new samples are used to fine-tune the regression results of the function learning algorithm. The accuracy and confidence levels of the resulting approximation function are assessed using the probably approximately correct computation framework. The smart sampling and incremental function learning techniques can be easily used in practical applications and scale well in the case of extremely large data. The feasibility and good results of the proposed techniques are demonstrated using benchmark functions as well as functions from real-world problems.
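The iterative sample-then-refit loop described above can be sketched in a few lines. This is a toy 1-D version with a polynomial surrogate; the function names, the surrogate choice, and the stopping rule are illustrative assumptions, not the paper's actual algorithm:

```python
import numpy as np

def incremental_fit(f, lo, hi, tol=1e-3, batch=20, max_iter=50, degree=5, rng=None):
    """Iteratively sample f on [lo, hi] and refit a polynomial surrogate
    until held-out error falls below tol (a toy stand-in for the paper's
    smart-sampling / incremental-learning loop)."""
    rng = np.random.default_rng(rng)
    x = rng.uniform(lo, hi, batch)
    y = f(x)
    for _ in range(max_iter):
        coef = np.polyfit(x, y, degree)        # refit surrogate on all samples so far
        x_test = rng.uniform(lo, hi, 200)      # probe coverage of the input space
        err = np.max(np.abs(np.polyval(coef, x_test) - f(x_test)))
        if err < tol:
            break
        x_new = rng.uniform(lo, hi, batch)     # new sample set for the next iteration
        x, y = np.concatenate([x, x_new]), np.concatenate([y, f(x_new)])
    return coef, err

coef, err = incremental_fit(np.sin, -1.0, 1.0, rng=0)
```

Each iteration only adds a batch of samples and refits, so the sample budget grows just until the surrogate covers the target function to the requested tolerance.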
Using High-Dimensional Image Models to Perform Highly Undetectable Steganography
NASA Astrophysics Data System (ADS)
Pevný, Tomáš; Filler, Tomáš; Bas, Patrick
This paper presents a complete methodology for designing practical and highly undetectable stegosystems for real digital media. The main design principle is to minimize a suitably defined distortion by means of an efficient coding algorithm. The distortion is defined as a weighted difference of extended state-of-the-art feature vectors already used in steganalysis. This allows us to "preserve" the model used by the steganalyst and thus remain undetectable even for large payloads. The framework can be efficiently implemented even when the dimensionality of the feature set used by the embedder is larger than 10^7. The high-dimensional model is necessary to avoid known security weaknesses. Although high-dimensional models might be a problem in steganalysis, we explain why they are acceptable in steganography. As an example, we introduce HUGO, a new embedding algorithm for spatial-domain digital images, and contrast its performance with LSB matching. On the BOWS2 image database, HUGO allows the embedder to hide a message 7× longer than LSB matching at the same level of security.
Visualization of High-Dimensional Point Clouds Using Their Density Distribution's Topology.
Oesterling, P; Heine, C; Janicke, H; Scheuermann, G; Heyer, G
2011-11-01
We present a novel method to visualize multidimensional point clouds. While conventional visualization techniques, like scatterplot matrices or parallel coordinates, have issues with either overplotting of entities or handling many dimensions, we abstract the data using topological methods before presenting it. We assume the input points to be samples of a random variable with a high-dimensional probability distribution which we approximate using kernel density estimates on a suitably reconstructed mesh. From the resulting scalar field we extract the join tree and present it as a topological landscape, a visualization metaphor that utilizes the human capability of understanding natural terrains. In this landscape, dense clusters of points show up as hills. The nesting of hills indicates the nesting of clusters. We augment the landscape with the data points to allow selection and inspection of single points and point sets. We also present optimizations to make our algorithm applicable to large data sets and to allow interactive adaptation of our visualization to the kernel window width used in the density estimation.
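The density-estimation step the landscape is built on can be illustrated with a plain Gaussian kernel density estimate. This is a minimal sketch: the paper evaluates the estimate on a reconstructed mesh and then extracts the join tree, both of which are omitted here.

```python
import numpy as np

def kde(points, grid, bandwidth):
    """Gaussian kernel density estimate of a point cloud, evaluated at the
    rows of `grid`.  Dense clusters show up as peaks, the "hills" of the
    topological landscape."""
    # squared distances between every grid point and every data point
    d2 = np.sum((grid[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    k = np.exp(-0.5 * d2 / bandwidth**2)
    norm = (2 * np.pi * bandwidth**2) ** (points.shape[1] / 2)
    return k.sum(axis=1) / (len(points) * norm)
```

Evaluating this on the vertices of a mesh gives the scalar field from which the join tree is computed; cluster centers yield higher density than the gaps between clusters.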
A Dynamical Clustering Model of Brain Connectivity Inspired by the N -Body Problem
Prasad, Gautam; Burkart, Josh; Joshi, Shantanu H.; Nir, Talia M.; Toga, Arthur W.; Thompson, Paul M.
2014-01-01
We present a method for studying brain connectivity by simulating a dynamical evolution of the nodes of the network. The nodes are treated as particles, and evolved under a simulated force analogous to gravitational acceleration in the well-known N -body problem. The particle nodes correspond to regions of the cortex. The locations of particles are defined as the centers of the respective regions on the cortex and their masses are proportional to each region’s volume. The force of attraction is modeled on the gravitational force, and explicitly made proportional to the elements of a connectivity matrix derived from diffusion imaging data. We present experimental results of the simulation on a population of 110 subjects from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), consisting of healthy elderly controls, early mild cognitively impaired (eMCI), late MCI (LMCI), and Alzheimer’s disease (AD) patients. Results show significant differences in the dynamic properties of connectivity networks in healthy controls, compared to eMCI as well as AD patients. PMID:25340177
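A minimal sketch of such a simulation, with the gravitational-style force scaled elementwise by the connectivity matrix; the softening parameter `eps` and the explicit integration scheme are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

def simulate(pos, mass, conn, steps=100, dt=0.01, eps=0.1):
    """Evolve node-particles under a gravity-like attraction whose strength
    between nodes i and j is scaled by conn[i, j] (eps softens the force at
    short range to avoid singularities)."""
    pos = pos.astype(float).copy()
    vel = np.zeros_like(pos)
    for _ in range(steps):
        diff = pos[None, :, :] - pos[:, None, :]            # r_j - r_i
        dist3 = (np.sum(diff**2, axis=-1) + eps**2) ** 1.5
        # acceleration on i: sum_j conn_ij * m_j * (r_j - r_i) / |r_ij|^3
        acc = np.sum(conn[:, :, None] * mass[None, :, None] * diff
                     / dist3[:, :, None], axis=1)
        vel += dt * acc
        pos += dt * vel
    return pos
```

Strongly connected regions fall toward each other while unconnected regions stay put, so clustering emerges from the dynamics of the node positions.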
Hopf Method Applied to Low and High Dimensional Dynamical Systems
NASA Astrophysics Data System (ADS)
Ma, Seungwook; Marston, Brad
2004-03-01
With an eye towards the goal of directly extracting statistical information from general circulation models (GCMs) of climate, thereby avoiding lengthy time integrations, we investigate the usage of the Hopf functional method (Uriel Frisch, Turbulence: The Legacy of A. N. Kolmogorov, Cambridge University Press, 1995, chapter 9.5). We use the method to calculate statistics over low-dimensional attractors, and for fluid flow on a rotating sphere. For the cases of the 3-dimensional Lorenz attractor and a 5-dimensional nonlinear system introduced by Orszag as a toy model of turbulence (Steven Orszag in Fluid Dynamics: Les Houches, 1977), a comparison of results obtained by low-order truncations of the cumulant expansion against statistics calculated by direct numerical integration forward in time shows surprisingly good agreement. The extension of the Hopf method to a high-dimensional barotropic model of inviscid fluid flow on a rotating sphere, which employs Arakawa's method to conserve energy and enstrophy (Akio Arakawa, J. Comp. Phys. 1, 119 (1966)), is discussed.
HASE: Framework for efficient high-dimensional association analyses
Roshchupkin, G. V.; Adams, H. H. H.; Vernooij, M. W.; Hofman, A.; Van Duijn, C. M.; Ikram, M. A.; Niessen, W. J.
2016-01-01
High-throughput technology can now provide rich information on a person's biological makeup and environmental surroundings. Important discoveries have been made by relating these data to various health outcomes in fields such as genomics, proteomics, and medical imaging. However, cross-investigations between several high-throughput technologies remain impractical due to demanding computational requirements (hundreds of years of computing resources) and unsuitability for collaborative settings (terabytes of data to share). Here we introduce the HASE framework, which overcomes both of these issues. Our approach dramatically reduces computational time from years to only hours and requires only several gigabytes to be exchanged between collaborators. We implemented a novel meta-analytical method that yields the same power as pooled analyses without the need to share individual participant data. The efficiency of the framework is illustrated by associating 9 million genetic variants with 1.5 million brain imaging voxels in three cohorts (total N = 4,034) followed by meta-analysis, on a standard computational infrastructure. These experiments indicate that HASE facilitates high-dimensional association studies, enabling large multicenter studies and future discoveries. PMID:27782180
High-dimensional quantum cryptography with twisted light
NASA Astrophysics Data System (ADS)
Mirhosseini, Mohammad; Magaña-Loaiza, Omar S.; O'Sullivan, Malcolm N.; Rodenburg, Brandon; Malik, Mehul; Lavery, Martin P. J.; Padgett, Miles J.; Gauthier, Daniel J.; Boyd, Robert W.
2015-03-01
Quantum key distribution (QKD) systems often rely on polarization of light for encoding, thus limiting the amount of information that can be sent per photon and placing tight bounds on the error rates that such a system can tolerate. Here we describe a proof-of-principle experiment that indicates the feasibility of high-dimensional QKD based on the transverse structure of the light field allowing for the transfer of more than 1 bit per photon. Our implementation uses the orbital angular momentum (OAM) of photons and the corresponding mutually unbiased basis of angular position (ANG). Our experiment uses a digital micro-mirror device for the rapid generation of OAM and ANG modes at 4 kHz, and a mode sorter capable of sorting single photons based on their OAM and ANG content with a separation efficiency of 93%. Through the use of a seven-dimensional alphabet encoded in the OAM and ANG bases, we achieve a channel capacity of 2.05 bits per sifted photon. Our experiment demonstrates that, in addition to having an increased information capacity, multilevel QKD systems based on spatial-mode encoding can be more resilient against intercept-resend eavesdropping attacks.
An efficient chemical kinetics solver using high dimensional model representation
Shorter, J.A.; Ip, P.C.; Rabitz, H.A.
1999-09-09
A high dimensional model representation (HDMR) technique is introduced to capture the input-output behavior of chemical kinetic models. The HDMR expresses the output chemical species concentrations as a rapidly convergent hierarchical correlated function expansion in the input variables. In this paper, the input variables are taken as the species concentrations at time t_i and the output is the concentrations at time t_i + Δ, where Δ can be much larger than conventional integration time steps. A specially designed set of model runs is performed to determine the correlated functions making up the HDMR. The resultant HDMR can be used to (1) identify the key input variables acting independently or cooperatively on the output, and (2) create a high speed fully equivalent operational model (FEOM) serving to replace the original kinetic model and its differential equation solver. A demonstration of the HDMR technique is presented for stratospheric chemical kinetics. The FEOM proved to give accurate and stable chemical concentrations out to long times of many years. In addition, the FEOM was found to be orders of magnitude faster than a conventional stiff equation solver. This computational acceleration should have significance in many chemical kinetic applications.
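The idea of a hierarchical expansion in low-dimensional component functions can be illustrated with a first-order RS-HDMR, f(x, y) ≈ f0 + f1(x) + f2(y), estimated here by plain Monte-Carlo binning. This is a toy sketch: the paper determines the component functions from a specially designed set of model runs rather than random sampling.

```python
import numpy as np

def hdmr_first_order(f, n=100_000, bins=50, rng=0):
    """Monte-Carlo estimate of the zeroth- and first-order HDMR components
    of a bivariate f on [0,1]^2: f0 is the global mean, f1 and f2 are the
    binned conditional means of f minus f0."""
    rng = np.random.default_rng(rng)
    x, y = rng.random(n), rng.random(n)
    z = f(x, y)
    f0 = z.mean()
    edges = np.linspace(0.0, 1.0, bins + 1)
    ix, iy = np.digitize(x, edges) - 1, np.digitize(y, edges) - 1
    # E[f | x in bin] - f0 and E[f | y in bin] - f0
    f1 = np.array([z[ix == b].mean() for b in range(bins)]) - f0
    f2 = np.array([z[iy == b].mean() for b in range(bins)]) - f0
    return f0, f1, f2, edges
```

For an additive function the first-order expansion is already essentially exact, which is what makes the hierarchical expansion "rapidly convergent" when cooperative effects are weak.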
Numerical Bifurcation Theory for High-Dimensional Neural Models.
Laing, Carlo R
2014-12-01
Numerical bifurcation theory involves finding and then following certain types of solutions of differential equations as parameters are varied, and determining whether they undergo any bifurcations (qualitative changes in behaviour). The primary technique for doing this is numerical continuation, where the solution of interest satisfies a parametrised set of algebraic equations, and branches of solutions are followed as the parameter is varied. An effective way to do this is with pseudo-arclength continuation. We give an introduction to pseudo-arclength continuation and then demonstrate its use in investigating the behaviour of a number of models from the field of computational neuroscience. The models we consider are high dimensional, as they result from the discretisation of neural field models: nonlocal differential equations used to model macroscopic pattern formation in the cortex. We consider both stationary and moving patterns in one spatial dimension, and then translating patterns in two spatial dimensions. A variety of results from the literature are discussed, and a number of extensions of the technique are given.
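Pseudo-arclength continuation is easiest to demonstrate on a scalar problem with a fold, where naive continuation in the parameter alone would fail. The sketch below follows f(u, λ) = 0 by a tangent predictor plus a Newton corrector on the system augmented with the arclength condition; the function and variable names are illustrative.

```python
import numpy as np

def continuation(f, fu, fl, u0, l0, ds=0.1, steps=60):
    """Pseudo-arclength continuation of the scalar equation f(u, l) = 0.
    fu, fl are the partial derivatives of f with respect to u and l."""
    branch = [(u0, l0)]
    # initial tangent solves fu*du + fl*dl = 0, normalised to unit length
    t = np.array([-fl(u0, l0), fu(u0, l0)])
    t /= np.linalg.norm(t)
    u, l = u0, l0
    for _ in range(steps):
        up, lp = u + ds * t[0], l + ds * t[1]        # tangent predictor
        for _ in range(20):                          # Newton corrector on
            F = np.array([f(up, lp),                 # [equation; arclength]
                          (up - u) * t[0] + (lp - l) * t[1] - ds])
            J = np.array([[fu(up, lp), fl(up, lp)], [t[0], t[1]]])
            dx = np.linalg.solve(J, -F)
            up, lp = up + dx[0], lp + dx[1]
            if np.linalg.norm(F) < 1e-12:
                break
        t = np.array([up - u, lp - l])               # secant tangent update
        t /= np.linalg.norm(t)
        u, l = up, lp
        branch.append((u, l))
    return np.array(branch)
```

Because the augmented Jacobian stays nonsingular at the fold (where fu = 0), the branch is traced smoothly around the turning point.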
Multigroup Equivalence Analysis for High-Dimensional Expression Data
Yang, Celeste; Bartolucci, Alfred A.; Cui, Xiangqin
2015-01-01
Hypothesis tests of equivalence are typically known for their application in bioequivalence studies and acceptance sampling. Their application to gene expression data, in particular high-dimensional gene expression data, has only recently been studied. In this paper, we examine how two multigroup equivalence tests, the F-test and the range test, perform when applied to microarray expression data. We adapted these tests to a well-known equivalence criterion, the difference ratio. Our simulation results showed that both tests can achieve moderate power while controlling the type I error at nominal level for typical expression microarray studies with the benefit of easy-to-interpret equivalence limits. For the range of parameters simulated in this paper, the F-test is more powerful than the range test. However, for comparing three groups, their powers are similar. Finally, the two multigroup tests were applied to a prostate cancer microarray dataset to identify genes whose expression follows a prespecified trajectory across five prostate cancer stages. PMID:26628859
A Hyperspherical Adaptive Sparse-Grid Method for High-Dimensional Discontinuity Detection
Zhang, Guannan; Webster, Clayton G.; Gunzburger, Max D.; Burkardt, John V.
2015-06-24
This study proposes and analyzes a hyperspherical adaptive hierarchical sparse-grid method for detecting jump discontinuities of functions in high-dimensional spaces. The method is motivated by the theoretical and computational inefficiencies of well-known adaptive sparse-grid methods for discontinuity detection. Our novel approach constructs a function representation of the discontinuity hypersurface of an N-dimensional discontinuous quantity of interest, by virtue of a hyperspherical transformation. Then, a sparse-grid approximation of the transformed function is built in the hyperspherical coordinate system, whose value at each point is estimated by solving a one-dimensional discontinuity detection problem. Due to the smoothness of the hypersurface, the new technique can identify jump discontinuities with significantly reduced computational cost, compared to existing methods. In addition, hierarchical acceleration techniques are also incorporated to further reduce the overall complexity. Rigorous complexity analyses of the new method are provided as are several numerical examples that illustrate the effectiveness of the approach.
Shrinkage-based diagonal discriminant analysis and its applications in high-dimensional data.
Pang, Herbert; Tong, Tiejun; Zhao, Hongyu
2009-12-01
High-dimensional data such as microarrays have brought us new statistical challenges. For example, using a large number of genes to classify samples based on a small number of microarrays remains a difficult problem. Diagonal discriminant analysis, support vector machines, and k-nearest neighbor have been suggested as among the best methods for small sample size situations, but none was found to be superior to the others. In this article, we propose an improved diagonal discriminant approach through shrinkage and regularization of the variances. The performance of our new approach along with the existing methods is studied through simulations and applications to real data. These studies show that the proposed shrinkage-based and regularized diagonal discriminant methods have lower misclassification rates than existing methods in many cases.
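The shrinkage idea can be sketched in a few lines: estimate per-feature variances, shrink them toward their pooled geometric mean, and classify with the diagonal discriminant rule. This is an illustrative variant of the approach, not the paper's exact estimator.

```python
import numpy as np

def fit_sdda(X, y, alpha=0.5):
    """Diagonal discriminant analysis with variance shrinkage: each
    per-feature variance is shrunk toward the geometric mean of all
    variances, controlled by alpha in [0, 1]."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    resid = X - means[np.searchsorted(classes, y)]
    var = resid.var(axis=0, ddof=len(classes))
    pooled = np.exp(np.mean(np.log(var)))            # geometric mean
    var_shrunk = var**(1 - alpha) * pooled**alpha    # multiplicative shrinkage
    return classes, means, var_shrunk

def predict_sdda(model, X):
    classes, means, var = model
    # discriminant: sum_j (x_j - mu_kj)^2 / var_j, smaller is better
    d = ((X[:, None, :] - means[None, :, :])**2 / var).sum(axis=-1)
    return classes[np.argmin(d, axis=1)]
```

Shrinking the variances stabilizes the denominator of the discriminant when the sample size is far smaller than the number of genes, which is where unshrunken diagonal rules break down.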
High dimensional linear regression models under long memory dependence and measurement error
NASA Astrophysics Data System (ADS)
Kaul, Abhishek
This dissertation consists of three chapters. The first chapter introduces the models under consideration and motivates the problems of interest. A brief literature review is also provided in this chapter. The second chapter investigates the properties of Lasso under long range dependent model errors. Lasso is a computationally efficient approach to model selection and estimation, and its properties are well studied when the regression errors are independent and identically distributed. We study the case where the regression errors form a long memory moving average process. We establish a finite sample oracle inequality for the Lasso solution, and then show asymptotic sign consistency in this setup. These results are established in the high dimensional setup (p > n) where p can increase exponentially with n. Finally, we show the n^(1/2-d)-consistency of Lasso, along with the oracle property of adaptive Lasso, in the case where p is fixed. Here d is the memory parameter of the stationary error sequence. The performance of Lasso in the present setup is also analysed with a simulation study. The third chapter proposes and investigates the properties of a penalized quantile based estimator for measurement error models. Standard formulations of prediction problems in high dimension regression models assume the availability of fully observed covariates and sub-Gaussian and homogeneous model errors. This makes these methods inapplicable to measurement error models, where covariates are unobservable and observations are possibly non sub-Gaussian and heterogeneous. We propose weighted penalized corrected quantile estimators for the regression parameter vector in linear regression models with additive measurement errors, where unobservable covariates are nonrandom. The proposed estimators forgo the need for the above mentioned model assumptions. We study these estimators in both the fixed dimension and high dimensional sparse setups, in the latter setup, the
Chen, Yi; Jakeman, John; Gittelson, Claude; Xiu, Dongbin
2015-01-08
In this paper we present a localized polynomial chaos expansion for partial differential equations (PDE) with random inputs. In particular, we focus on time independent linear stochastic problems with high dimensional random inputs, where the traditional polynomial chaos methods, and most of the existing methods, incur prohibitively high simulation cost. Furthermore, the local polynomial chaos method employs a domain decomposition technique to approximate the stochastic solution locally. In each subdomain, a subdomain problem is solved independently and, more importantly, in a much lower dimensional random space. In a postprocessing stage, accurate samples of the original stochastic problems are obtained from the samples of the local solutions by enforcing the correct stochastic structure of the random inputs and the coupling conditions at the interfaces of the subdomains. Overall, the method is able to solve stochastic PDEs in very large dimensions by solving a collection of low dimensional local problems and can be highly efficient. In our paper we present the general mathematical framework of the methodology and use numerical examples to demonstrate the properties of the method.
NASA Astrophysics Data System (ADS)
Krumholz, Mark R.
2014-06-01
Star formation lies at the center of a web of processes that drive cosmic evolution: generation of radiant energy, synthesis of elements, formation of planets, and development of life. Decades of observations have yielded a variety of empirical rules about how it operates, but at present we have no comprehensive, quantitative theory. In this review I discuss the current state of the field of star formation, focusing on three central questions: What controls the rate at which gas in a galaxy converts to stars? What determines how those stars are clustered, and what fraction of the stellar population ends up in gravitationally-bound structures? What determines the stellar initial mass function, and does it vary with star-forming environment? I use these three questions as a lens to introduce the basics of star formation, beginning with a review of the observational phenomenology and the basic physical processes. I then review the status of current theories that attempt to solve each of the three problems, pointing out links between them and opportunities for theoretical and numerical work that crosses the scale between them. I conclude with a discussion of prospects for theoretical progress in the coming years.
High dimensional spatial modeling of extremes with applications to United States Rainfalls
NASA Astrophysics Data System (ADS)
Zhou, Jie
2007-12-01
Spatial statistical models are used to predict unobserved variables based on observed variables and to estimate unknown model parameters. Extreme value theory (EVT) is used to study large or small observations from a random phenomenon. Both spatial statistics and extreme value theory have been studied in many areas, such as agriculture, finance, industry and environmental science. This dissertation proposes two spatial statistical models which concentrate on non-Gaussian probability densities with general spatial covariance structures. The two models are also applied in analyzing United States rainfalls and, especially, rainfall extremes. When the data set is not too large, the first model is used. The model constructs a generalized linear mixed model (GLMM) which can be considered an extension of Diggle's model-based geostatistical approach (Diggle et al. 1998). The approach improves conventional kriging with a form of generalized linear mixed structure. As for high dimensional problems, two different methods are established to improve the computational efficiency of Markov chain Monte Carlo (MCMC) implementation. The first method is based on spectral representation of spatial dependence structures, which provides good approximations on each MCMC iteration. The other method embeds high dimensional covariance matrices in matrices with block circulant structures. The eigenvalues and eigenvectors of block circulant matrices can be calculated exactly by fast Fourier transforms (FFT). The computational efficiency is gained by transforming the posterior matrices into lower dimensional matrices. This method gives us an exact update on each MCMC iteration. Future predictions are also made by keeping spatial dependence structures fixed and using the relationship between present days and future days provided by a global climate model (GCM). The predictions are refined by sampling techniques. Both ways of handling high dimensional covariance matrices are novel to analyze large
NASA Astrophysics Data System (ADS)
Taşkin Kaya, Gülşen
2013-10-01
-output relationships in high-dimensional systems for many problems in science and engineering. The HDMR method is developed to improve the efficiency of deducing high dimensional behaviors. The method is formed by a particular organization of low dimensional component functions, in which each function is the contribution of one or more input variables to the output variables.
Fast and accurate probability density estimation in large high dimensional astronomical datasets
NASA Astrophysics Data System (ADS)
Gupta, Pramod; Connolly, Andrew J.; Gardner, Jeffrey P.
2015-01-01
Astronomical surveys will generate measurements of hundreds of attributes (e.g. color, size, shape) on hundreds of millions of sources. Analyzing these large, high dimensional data sets will require efficient algorithms for data analysis. An example of this is probability density estimation, which is at the heart of many classification problems such as the separation of stars and quasars based on their colors. Popular density estimation techniques use binning or kernel density estimation. Kernel density estimation has a small memory footprint but often requires large computational resources. Binning has small computational requirements, but it is usually implemented with multi-dimensional arrays, which leads to memory requirements that scale exponentially with the number of dimensions. Hence neither technique scales well to large data sets in high dimensions. We present an alternative approach of binning implemented with hash tables (BASH tables). This approach uses the sparseness of data in the high dimensional space to ensure that the memory requirements are small. However, hashing requires some extra computation, so a priori it is not clear whether the reduction in memory requirements will lead to increased computational requirements. Through an implementation of BASH tables in C++ we show that the additional computational requirements of hashing are negligible. Hence this approach has small memory and computational requirements. We apply our density estimation technique to photometric selection of quasars using non-parametric Bayesian classification and show that the accuracy of the classification is the same as that of earlier approaches. Since the BASH table approach is one to three orders of magnitude faster than the earlier approaches, it may be useful in various other applications of density estimation in astrostatistics.
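The hash-table binning idea can be sketched directly: key a dictionary by integer bin coordinates, so only occupied bins consume memory. The names and the uniform bin width are illustrative assumptions; the paper's implementation is in C++.

```python
import numpy as np
from collections import Counter

def bash_density(points, bin_width):
    """Histogram density estimate stored in a hash table keyed by integer
    bin coordinates, so memory scales with the number of occupied bins
    rather than exponentially with dimension (the idea behind BASH tables)."""
    counts = Counter(tuple(c) for c in np.floor(points / bin_width).astype(int))
    n = len(points)
    vol = bin_width ** points.shape[1]
    def density(x):
        key = tuple(np.floor(np.asarray(x) / bin_width).astype(int))
        return counts.get(key, 0) / (n * vol)   # empty bins cost nothing
    return density, len(counts)
```

For n points the table holds at most n entries regardless of dimension, whereas a dense d-dimensional array of k bins per axis would need k^d cells.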
High-Dimensional Data Reduction, Image Inpainting and their Astronomical Applications
NASA Astrophysics Data System (ADS)
Pesenson, M.; Pesenson, I.; Carey, S.; McCollum, B.; Roby, W.
2009-09-01
Technological advances are revolutionizing multispectral astrophysics as well as the detection and study of transient sources. This new era of multitemporal and multispectral data sets demands new ways of data representation, processing and management thus making data dimension reduction instrumental in efficient data organization, retrieval, analysis and information visualization. Other astrophysical applications of data dimension reduction which require new paradigms of data analysis include knowledge discovery, cluster analysis, feature extraction and object classification, de-correlating data elements, discovering meaningful patterns and finding essential representation of correlated variables that form a manifold (e.g. the manifold of galaxies), tagging astronomical images, multiscale analysis synchronized across all available wavelengths, denoising, etc. The second part of this paper is dedicated to a new, active area of image processing: image inpainting that consists of automated methods for filling in missing or damaged regions in images. Inpainting has multiple astronomical applications including restoring images corrupted by instrument artifacts, removing undesirable objects like bright stars and their halos, sky estimating, and pre-processing for the Fourier or wavelet transforms. Applications of high-dimensional data reduction and mitigation of instrument artifacts are demonstrated on images taken by the Spitzer Space Telescope.
Unfold High-Dimensional Clouds for Exhaustive Gating of Flow Cytometry Data.
Qiu, Peng
2014-01-01
Flow cytometry is able to measure the expressions of multiple proteins simultaneously at the single-cell level. A flow cytometry experiment on one biological sample provides measurements of several protein markers on or inside a large number of individual cells in that sample. Analysis of such data often aims to identify subpopulations of cells with distinct phenotypes. Currently, the most widely used analytical approach in the flow cytometry community is manual gating on a sequence of nested biaxial plots, which is highly subjective, labor intensive, and not exhaustive. To address those issues, a number of methods have been developed to automate the gating analysis by clustering algorithms. However, completely removing the subjectivity can be quite challenging. This paper describes an alternative approach. Instead of automating the analysis, we develop novel visualizations to facilitate manual gating. The proposed method views single-cell data of one biological sample as a high-dimensional point cloud of cells, derives the skeleton of the cloud, and unfolds the skeleton to generate 2D visualizations. We demonstrate the utility of the proposed visualization using real data, and provide quantitative comparison to visualizations generated from principal component analysis and multidimensional scaling.
Gude, Tore; Hoffart, Asle
2008-04-01
The aim was to study whether patients with panic disorder with agoraphobia and co-occurring Cluster C traits would respond differently, with respect to change in interpersonal problems as part of their personality functioning, when receiving two different treatment modalities. Two cohorts of patients were followed through three-month in-patient treatment programs and assessed at follow-up one year after the end of treatment. One cohort comprised 18 patients treated with "treatment as usual" according to psychodynamic principles; the second comprised 24 patients treated in a cognitive agoraphobia and schema-focused therapy program. Patients in the cognitive condition showed greater improvement in interpersonal problems than patients in the treatment-as-usual condition. Although this quasi-experimental study has serious limitations, the results may indicate that agoraphobic patients with Cluster C traits should be treated in cognitive agoraphobia and schema-focused programs rather than in psychodynamic treatment-as-usual programs in order to reduce their level of interpersonal problems.
Dimensional strategies and the minimization problem: Barrier-avoiding algorithms
Faken, D.B.; Voter, A.F.; Freeman, D.L.; Doll, J.D.
1999-11-25
In the present paper the authors examine the role of dimensionality in the minimization problem. Since it has such a powerful influence on the topology of the associated potential energy landscape, the authors argue that it may prove useful to alter the dimensionality of the space of the original minimization problem. The general idea is explored in the context of finding the minimum energy geometries of Lennard-Jones clusters. It is shown that it is possible to locate barrier-free, high-dimensional pathways that connect local, three-dimensional cluster minima. The performance of the resulting, barrier-avoiding minimization algorithm is examined for clusters containing as many as 55 atoms.
Penalized likelihood methods for estimation of sparse high-dimensional directed acyclic graphs
SHOJAIE, ALI; MICHAILIDIS, GEORGE
2010-01-01
Summary Directed acyclic graphs are commonly used to represent causal relationships among random variables in graphical models. Applications of these models arise in the study of physical and biological systems where directed edges between nodes represent the influence of components of the system on each other. Estimation of directed graphs from observational data is computationally NP-hard. In addition, directed graphs with the same structure may be indistinguishable based on observations alone. When the nodes exhibit a natural ordering, the problem of estimating directed graphs reduces to the problem of estimating the structure of the network. In this paper, we propose an efficient penalized likelihood method for estimation of the adjacency matrix of directed acyclic graphs, when variables inherit a natural ordering. We study variable selection consistency of lasso and adaptive lasso penalties in high-dimensional sparse settings, and propose an error-based choice for selecting the tuning parameter. We show that although the lasso is only variable selection consistent under stringent conditions, the adaptive lasso can consistently estimate the true graph under the usual regularity assumptions. PMID:22434937
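With a known variable ordering, the estimation reduces to one lasso regression per node on its predecessors. The sketch below uses a plain ISTA solver; the penalty level and solver are illustrative choices, and the paper's adaptive-lasso variant and error-based tuning-parameter selection are omitted.

```python
import numpy as np

def lasso_ista(X, y, lam, iters=500):
    """Plain ISTA solver for 0.5/n * ||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    step = n / np.linalg.norm(X, 2) ** 2          # 1 / Lipschitz constant
    for _ in range(iters):
        grad = X.T @ (X @ b - y) / n
        b = b - step * grad
        b = np.sign(b) * np.maximum(np.abs(b) - step * lam, 0.0)  # soft-threshold
    return b

def estimate_dag(X, lam):
    """Estimate the (upper-triangular) adjacency matrix of a sparse DAG on
    naturally ordered variables: node j is lasso-regressed on nodes 0..j-1."""
    n, p = X.shape
    A = np.zeros((p, p))
    for j in range(1, p):
        A[:j, j] = lasso_ista(X[:, :j], X[:, j], lam)
    return A
```

On data generated from a chain x1 → x2 → x3, the direct edges are recovered with large coefficients while the absent edge x1 → x3 is shrunk toward zero.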
NASA Astrophysics Data System (ADS)
Friedenberg, David
2010-10-01
the rate of falsely detected active regions. Additionally we examine the more general field of clustering and develop a framework for clustering algorithms based around diffusion maps. Diffusion maps can be used to project high-dimensional data into a lower dimensional space while preserving much of the structure in the data. We demonstrate how diffusion maps can be used to solve clustering problems and examine the influence of tuning parameters on the results. We introduce two novel methods, the self-tuning diffusion map which replaces the global scaling parameter in the typical diffusion map framework with a local scaling parameter and an algorithm for automatically selecting tuning parameters based on a cross-validation style score called prediction strength. The methods are tested on several example datasets.
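A spectral sketch of the self-tuning idea: replace the single global diffusion-map bandwidth with each point's k-th nearest-neighbor distance, then cluster in the leading eigenvector coordinates (synthetic data; not the chapter's exact algorithm or its prediction-strength tuning):

```python
import numpy as np
from sklearn.cluster import KMeans

def self_tuning_diffusion_embed(X, k_scale=7):
    """Diffusion-style embedding with a local (self-tuning) scale:
    sigma_i is each point's k-th nearest-neighbor distance, used in
    place of one global bandwidth. Returns the top two eigenvectors of
    the symmetrically normalized affinity matrix (one is near-constant,
    the other separates well-connected groups)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    sigma = np.sort(D, axis=1)[:, k_scale]               # local scales
    W = np.exp(-(D ** 2) / (sigma[:, None] * sigma[None, :]))
    deg = W.sum(axis=1)
    M = W / np.sqrt(deg[:, None] * deg[None, :])         # D^-1/2 W D^-1/2
    vals, vecs = np.linalg.eigh(M)                       # ascending order
    return vecs[:, -2:]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (40, 5)), rng.normal(3.0, 0.3, (40, 5))])
emb = self_tuning_diffusion_embed(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
```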
Arif, Muhammad; Basalamah, Saleh
2013-06-01
In real-life biomedical classification applications, it is difficult to visualize the feature space due to its high dimensionality. In this paper, we have proposed a 3D similarity-dissimilarity plot to project the high-dimensional space to a three-dimensional space in which important information about the feature space can be extracted in the context of pattern classification. In this plot it is possible to visualize good data points (data points near their own class compared to other classes), bad data points (data points far away from their own class) and outlier points (data points far away from both their own class and other classes). Hence the separation of classes can easily be visualized. The density of data points near each other can provide useful information about the compactness of clusters within a certain class. Moreover, an index called percentage of data points above the similarity-dissimilarity line (PAS) is proposed, which is the fraction of data points above the similarity-dissimilarity line. Several synthetic and real-life biomedical datasets are used to show the effectiveness of the proposed 3D similarity-dissimilarity plot.
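One hedged reading of the PAS index: take "similarity" as the distance to the nearest same-class point and "dissimilarity" as the distance to the nearest other-class point, and count the fraction of points whose dissimilarity exceeds similarity. The exact distance definitions in the paper may differ:

```python
import numpy as np

def pas_index(X, y):
    """PAS: fraction of points whose nearest other-class distance
    (dissimilarity) exceeds their nearest same-class distance
    (similarity), i.e. points above the similarity-dissimilarity line."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)                 # exclude self-distances
    same = y[:, None] == y[None, :]
    sim = np.where(same, D, np.inf).min(axis=1)
    dis = np.where(~same, D, np.inf).min(axis=1)
    return float(np.mean(dis > sim))

# Two well-separated synthetic classes: PAS should be close to 1
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.5, (30, 4)), rng.normal(4.0, 0.5, (30, 4))])
y = np.array([0] * 30 + [1] * 30)
pas = pas_index(X, y)
```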
High-dimensional analysis of the murine myeloid cell system.
Becher, Burkhard; Schlitzer, Andreas; Chen, Jinmiao; Mair, Florian; Sumatoh, Hermi R; Teng, Karen Wei Weng; Low, Donovan; Ruedl, Christiane; Riccardi-Castagnoli, Paola; Poidinger, Michael; Greter, Melanie; Ginhoux, Florent; Newell, Evan W
2014-12-01
Advances in cell-fate mapping have revealed the complexity in phenotype, ontogeny and tissue distribution of the mammalian myeloid system. To capture this phenotypic diversity, we developed a 38-antibody panel for mass cytometry and used dimensionality reduction with machine learning-aided cluster analysis to build a composite of murine (mouse) myeloid cells in the steady state across lymphoid and nonlymphoid tissues. In addition to identifying all previously described myeloid populations, higher-order analysis allowed objective delineation of otherwise ambiguous subsets, including monocyte-macrophage intermediates and an array of granulocyte variants. Using mice that cannot sense granulocyte-macrophage colony-stimulating factor (GM-CSF) (Csf2rb(-/-)), which have discrete alterations in myeloid development, we confirmed differences in barrier tissue dendritic cells, lung macrophages and eosinophils. The methodology further identified unexpected variations in the monocyte and innate lymphoid cell compartments, confirming that this approach is a powerful tool for unambiguous and unbiased characterization of the myeloid system. PMID:25306126
ERIC Educational Resources Information Center
Jitendra, Asha K.; Harwell, Michael R.; Dupuis, Danielle N.; Karl, Stacy R.; Lein, Amy E.; Simonson, Gregory; Slater, Susan C.
2015-01-01
This experimental study evaluated the effectiveness of a research-based intervention, schema-based instruction (SBI), on students' proportional problem solving. SBI emphasizes the underlying mathematical structure of problems, uses schematic diagrams to represent information in the problem text, provides explicit problem-solving and metacognitive…
NASA Technical Reports Server (NTRS)
Soderblom, David R.; King, Jeremy R.; Hanson, Robert B.; Jones, Burton F.; Fischer, Debra; Stauffer, John R.; Pinsonneault, Marc H.
1998-01-01
This paper examines the discrepancy between distances to nearby open clusters as determined by parallaxes from Hipparcos compared to traditional main-sequence fitting. The biggest difference is seen for the Pleiades, and our hypothesis is that if the Hipparcos distance to the Pleiades is correct, then similar subluminous zero-age main-sequence (ZAMS) stars should exist elsewhere, including in the immediate solar neighborhood. We examine a color-magnitude diagram of very young and nearby solar-type stars and show that none of them lie below the traditional ZAMS, despite the fact that the Hipparcos Pleiades parallax would place its members 0.3 mag below that ZAMS. We also present analyses and observations of solar-type stars that do lie below the ZAMS, and we show that they are subluminous because of low metallicity and that they have the kinematics of old stars.
A decision-theory approach to interpretable set analysis for high-dimensional data.
Boca, Simina M; Bravo, Héctor Céorrada; Caffo, Brian; Leek, Jeffrey T; Parmigiani, Giovanni
2013-09-01
A key problem in high-dimensional significance analysis is to find pre-defined sets that show enrichment for a statistical signal of interest; the classic example is the enrichment of gene sets for differentially expressed genes. Here, we propose a new decision-theory approach to the analysis of gene sets which focuses on estimating the fraction of non-null variables in a set. We introduce the idea of "atoms," non-overlapping sets based on the original pre-defined set annotations. Our approach focuses on finding the union of atoms that minimizes a weighted average of the number of false discoveries and missed discoveries. We introduce a new false discovery rate for sets, called the atomic false discovery rate (afdr), and prove that the optimal estimator in our decision-theory framework is to threshold the afdr. These results provide a coherent and interpretable framework for the analysis of sets that addresses the key issues of overlapping annotations and difficulty in interpreting p values in both competitive and self-contained tests. We illustrate our method and compare it to a popular existing method using simulated examples, as well as gene-set and brain ROI data analyses.
A hyper-spherical adaptive sparse-grid method for high-dimensional discontinuity detection
Zhang, Guannan; Webster, Clayton G; Gunzburger, Max D; Burkardt, John V
2014-03-01
This work proposes and analyzes a hyper-spherical adaptive hierarchical sparse-grid method for detecting jump discontinuities of functions in high-dimensional spaces. The method is motivated by the theoretical and computational inefficiencies of well-known adaptive sparse-grid methods for discontinuity detection. Our novel approach constructs a function representation of the discontinuity hyper-surface of an N-dimensional discontinuous quantity of interest, by virtue of a hyper-spherical transformation. Then, a sparse-grid approximation of the transformed function is built in the hyper-spherical coordinate system, whose value at each point is estimated by solving a one-dimensional discontinuity detection problem. Due to the smoothness of the hyper-surface, the new technique can identify jump discontinuities with significantly reduced computational cost, compared to existing methods. Moreover, hierarchical acceleration techniques are also incorporated to further reduce the overall complexity. Rigorous error estimates and complexity analyses of the new method are provided as are several numerical examples that illustrate the effectiveness of the approach.
Individual-based models for adaptive diversification in high-dimensional phenotype spaces.
Ispolatov, Iaroslav; Madhok, Vaibhav; Doebeli, Michael
2016-02-01
Most theories of evolutionary diversification are based on equilibrium assumptions: they are either based on optimality arguments involving static fitness landscapes, or they assume that populations first evolve to an equilibrium state before diversification occurs, as exemplified by the concept of evolutionary branching points in adaptive dynamics theory. Recent results indicate that adaptive dynamics may often not converge to equilibrium points and instead generate complicated trajectories if evolution takes place in high-dimensional phenotype spaces. Even though some analytical results on diversification in complex phenotype spaces are available, to study this problem in general we need to reconstruct individual-based models from the adaptive dynamics generating the non-equilibrium dynamics. Here we first provide a method to construct individual-based models such that they faithfully reproduce the given adaptive dynamics attractor without diversification. We then show that a propensity to diversify can be introduced by adding Gaussian competition terms that generate frequency dependence while still preserving the same adaptive dynamics. For sufficiently strong competition, the disruptive selection generated by frequency-dependence overcomes the directional evolution along the selection gradient and leads to diversification in phenotypic directions that are orthogonal to the selection gradient. PMID:26598329
Sivakumar, Vidyashankar; Banerjee, Arindam; Ravikumar, Pradeep
2016-01-01
We consider the problem of high-dimensional structured estimation with norm-regularized estimators, such as Lasso, when the design matrix and noise are drawn from sub-exponential distributions. Existing results only consider sub-Gaussian designs and noise, and both the sample complexity and non-asymptotic estimation error have been shown to depend on the Gaussian width of suitable sets. In contrast, for the sub-exponential setting, we show that the sample complexity and the estimation error will depend on the exponential width of the corresponding sets, and the analysis holds for any norm. Further, using generic chaining, we show that the exponential width for any set will be at most log p times the Gaussian width of the set, yielding Gaussian width based results even for the sub-exponential case. Further, for certain popular estimators, viz. Lasso and Group Lasso, using a VC-dimension based analysis, we show that the sample complexity will in fact be of the same order as for Gaussian designs. Our general analysis and results are the first in the sub-exponential setting, and are readily applicable to special sub-exponential families such as log-concave and extreme-value distributions. PMID:27563230
A Hyperspherical Adaptive Sparse-Grid Method for High-Dimensional Discontinuity Detection
Zhang, Guannan; Webster, Clayton G.; Gunzburger, Max D.; Burkardt, John V.
2015-06-24
This study proposes and analyzes a hyperspherical adaptive hierarchical sparse-grid method for detecting jump discontinuities of functions in high-dimensional spaces. The method is motivated by the theoretical and computational inefficiencies of well-known adaptive sparse-grid methods for discontinuity detection. Our novel approach constructs a function representation of the discontinuity hypersurface of an N-dimensional discontinuous quantity of interest, by virtue of a hyperspherical transformation. Then, a sparse-grid approximation of the transformed function is built in the hyperspherical coordinate system, whose value at each point is estimated by solving a one-dimensional discontinuity detection problem. Due to the smoothness of the hypersurface, the new technique can identify jump discontinuities with significantly reduced computational cost, compared to existing methods. In addition, hierarchical acceleration techniques are also incorporated to further reduce the overall complexity. Rigorous complexity analyses of the new method are provided as are several numerical examples that illustrate the effectiveness of the approach.
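The one-dimensional subproblem at the heart of the method can be illustrated with bisection along a ray: assuming a star-shaped discontinuity surface, the jump radius in each angular direction is found by a 1-D search. A toy 2-D sketch, not the sparse-grid construction itself:

```python
import numpy as np

def radius_of_jump(f, theta, r_max=2.0, tol=1e-6):
    """Bisect along the ray at angle theta for the radius where the
    quantity of interest f changes value (1-D discontinuity detection)."""
    direction = np.array([np.cos(theta), np.sin(theta)])
    lo, hi = 0.0, r_max
    f_lo = f(lo * direction)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid * direction) == f_lo:
            lo = mid        # still on the inner side of the jump
        else:
            hi = mid        # jump lies below mid
    return 0.5 * (lo + hi)

# Toy quantity of interest that jumps across the unit circle
f = lambda x: 1.0 if np.linalg.norm(x) < 1.0 else 0.0
r = radius_of_jump(f, theta=0.7)
```

Evaluating this jump radius at sparse-grid points in the angular coordinates is what lets the method approximate the whole hypersurface.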
Defining and Evaluating Classification Algorithm for High-Dimensional Data Based on Latent Topics
Luo, Le; Li, Li
2014-01-01
Automatic text categorization is one of the key techniques in information retrieval and the data mining field. Classification is usually time-consuming when the training dataset is large and high-dimensional. Many methods have been proposed to solve this problem, but few can achieve satisfactory efficiency. In this paper, we present a method which combines the Latent Dirichlet Allocation (LDA) algorithm and the Support Vector Machine (SVM). LDA is first used to generate a reduced-dimensional representation of topics as features in the vector space model (VSM). It reduces the number of features dramatically while keeping the necessary semantic information. The SVM is then employed to classify the data based on the generated features. We evaluate the algorithm on the 20 Newsgroups and Reuters-21578 datasets, respectively. The experimental results show that classification based on our proposed LDA+SVM model achieves high performance in terms of precision, recall and F1 measure. Further, it can achieve this within a much shorter time-frame. Our process improves greatly upon the previous work in this field and displays strong potential to achieve a streamlined classification process for a wide range of applications. PMID:24416136
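A minimal sklearn pipeline in the spirit of LDA+SVM, on a tiny invented corpus (the paper uses 20 Newsgroups and Reuters-21578; the vectorizer settings and topic count here are illustrative guesses, not the authors' configuration):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny invented corpus; LDA compresses bag-of-words counts into topic
# proportions, and the SVM classifies in that low-dimensional space.
docs = ["goal match team score win", "team player match goal",
        "stock market trade price", "price trade market profit"] * 5
labels = ["sport", "sport", "finance", "finance"] * 5

model = make_pipeline(
    CountVectorizer(),
    LatentDirichletAllocation(n_components=2, random_state=0),
    LinearSVC(),
)
model.fit(docs, labels)
pred = model.predict(["market price profit", "match team win"])
```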
Thompson, Paul M; Hayashi, Kiralee M; de Zubicaray, Greig; Janke, Andrew L; Rose, Stephen E; Semple, James; Doddrell, David M; Cannon, Tyrone D; Toga, Arthur W
2002-01-01
We briefly describe a set of algorithms to detect and visualize effects of disease and genetic factors on the brain. Extreme variations in cortical anatomy, even among normal subjects, complicate the detection and mapping of systematic effects on brain structure in human populations. We tackle this problem in two stages. First, we develop a cortical pattern matching approach, based on metrically covariant partial differential equations (PDEs), to associate corresponding regions of cortex in an MRI brain image database (N=102 scans). Second, these high-dimensional deformation maps are used to transfer within-subject cortical signals, including measures of gray matter distribution, shape asymmetries, and degenerative rates, to a common anatomic template for statistical analysis. We illustrate these techniques in two applications: (1) mapping dynamic patterns of gray matter loss in longitudinally scanned Alzheimer's disease patients; and (2) mapping genetic influences on brain structure. We extend statistics used widely in behavioral genetics to cortical manifolds. Specifically, we introduce methods based on h-squared distributed random fields to map hereditary influences on brain structure in human populations. PMID:19759832
A simple new filter for nonlinear high-dimensional data assimilation
NASA Astrophysics Data System (ADS)
Tödter, Julian; Kirchgessner, Paul; Ahrens, Bodo
2015-04-01
performance with a realistic ensemble size. The results confirm that, in principle, it can be applied successfully, and as simply as the ETKF, in high-dimensional problems without further modifications of the algorithm, even though it is only based on the particle weights. This proves that the suggested method constitutes a useful filter for nonlinear, high-dimensional data assimilation, and is able to overcome the curse of dimensionality even in deterministic systems.
Approximating high-dimensional dynamics by barycentric coordinates with linear programming
Hirata, Yoshito; Aihara, Kazuyuki; Suzuki, Hideyuki; Shiro, Masanori; Takahashi, Nozomu; Mas, Paloma
2015-01-15
The increasing development of novel methods and techniques facilitates the measurement of high-dimensional time series but challenges our ability for accurate modeling and prediction. The use of a general mathematical model requires the inclusion of many parameters, which are difficult to fit to the relatively short high-dimensional time series observed. Here, we propose a novel method to accurately model a high-dimensional time series. Our method extends the barycentric coordinates to high-dimensional phase space by employing linear programming, while allowing for approximation errors explicitly. The extension helps to produce free-running time-series predictions that preserve typical topological, dynamical, and/or geometric characteristics of the underlying attractors more accurately than the widely used radial basis function model. The method can be broadly applied, from helping to improve weather forecasting, to creating electronic instruments that sound more natural, and to comprehensively understanding complex biological data.
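The core step, expressing a query state as a convex combination of stored states via linear programming, can be sketched as follows (L1 reconstruction error with slack variables; the paper's full free-running predictor is not reproduced):

```python
import numpy as np
from scipy.optimize import linprog

def barycentric_weights(points, query):
    """Find convex-combination weights w (w >= 0, sum w = 1) minimizing
    the L1 error |points.T @ w - query| via slack variables e."""
    n, d = points.shape
    c = np.concatenate([np.zeros(n), np.ones(d)])          # minimize sum(e)
    A_ub = np.block([[points.T, -np.eye(d)],
                     [-points.T, -np.eye(d)]])             # +/-(Aw - q) <= e
    b_ub = np.concatenate([query, -query])
    A_eq = np.concatenate([np.ones(n), np.zeros(d)])[None, :]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (n + d))
    return res.x[:n]

# A query inside the triangle is reproduced exactly
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
w = barycentric_weights(pts, np.array([0.25, 0.25]))
recon = pts.T @ w
```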
Nadeau, R.M.
1995-10-01
This document contains information about the characterization and application of microearthquake clusters and fault zone dynamics. Topics discussed include: Seismological studies; fault-zone dynamics; periodic recurrence; scaling of microearthquakes to large earthquakes; implications of fault mechanics and seismic hazards; and wave propagation and temporal changes.
Hagen, Nathan; Kester, Robert T.; Gao, Liang; Tkaczyk, Tomasz S.
2012-01-01
The snapshot advantage is a large increase in light collection efficiency available to high-dimensional measurement systems that avoid filtering and scanning. After discussing this advantage in the context of imaging spectrometry, where the greatest effort towards developing snapshot systems has been made, we describe the types of measurements where it is applicable. We then generalize it to the larger context of high-dimensional measurements, where the advantage increases geometrically with measurement dimensionality. PMID:22791926
NASA Astrophysics Data System (ADS)
Wagstaff, Kiri L.
2012-03-01
particular application involves considerations of the kind of data being analyzed, algorithm runtime efficiency, and how much prior knowledge is available about the problem domain, which can dictate the nature of clusters sought. Fundamentally, the clustering method and its representations of clusters carries with it a definition of what a cluster is, and it is important that this be aligned with the analysis goals for the problem at hand. In this chapter, I emphasize this point by identifying for each algorithm the cluster representation as a model, m_j , even for algorithms that are not typically thought of as creating a “model.” This chapter surveys a basic collection of clustering methods useful to any practitioner who is interested in applying clustering to a new data set. The algorithms include k-means (Section 25.2), EM (Section 25.3), agglomerative (Section 25.4), and spectral (Section 25.5) clustering, with side mentions of variants such as kernel k-means and divisive clustering. The chapter also discusses each algorithm’s strengths and limitations and provides pointers to additional in-depth reading for each subject. Section 25.6 discusses methods for incorporating domain knowledge into the clustering process. This chapter concludes with a brief survey of interesting applications of clustering methods to astronomy data (Section 25.7). The chapter begins with k-means because it is both generally accessible and so widely used that understanding it can be considered a necessary prerequisite for further work in the field. EM can be viewed as a more sophisticated version of k-means that uses a generative model for each cluster and probabilistic item assignments. Agglomerative clustering is the most basic form of hierarchical clustering and provides a basis for further exploration of algorithms in that vein. Spectral clustering permits a departure from feature-vector-based clustering and can operate on data sets instead represented as affinity, or similarity
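As a concrete reference point for the survey above, here is a minimal k-means in which each cluster's "model" m_j is explicitly its centroid (deterministic farthest-first seeding is used only to keep the sketch reproducible; library implementations differ):

```python
import numpy as np

def kmeans(X, k, iters=100):
    """Lloyd's k-means: each cluster model m_j is simply its centroid,
    and items are assigned to the nearest model."""
    centers = [X[0]]                              # farthest-first seeding
    for _ in range(k - 1):
        dist = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels, centers

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-3.0, 0.4, (50, 2)), rng.normal(3.0, 0.4, (50, 2))])
labels, centers = kmeans(X, k=2)
```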
NASA Astrophysics Data System (ADS)
Tripathy, Rohit; Bilionis, Ilias; Gonzalez, Marcial
2016-09-01
Uncertainty quantification (UQ) tasks, such as model calibration, uncertainty propagation, and optimization under uncertainty, typically require several thousand evaluations of the underlying computer codes. To cope with the cost of simulations, one replaces the real response surface with a cheap surrogate based, e.g., on polynomial chaos expansions, neural networks, support vector machines, or Gaussian processes (GP). However, the number of simulations required to learn a generic multivariate response grows exponentially as the input dimension increases. This curse of dimensionality can only be addressed if the response exhibits some special structure that can be discovered and exploited. A wide range of physical responses exhibit a special structure known as an active subspace (AS). An AS is a linear manifold of the stochastic space characterized by maximal response variation. The idea is that one should first identify this low dimensional manifold, project the high-dimensional input onto it, and then link the projection to the output. If the dimensionality of the AS is low enough, then learning the link function is a much easier problem than the original problem of learning a high-dimensional function. The classic approach to discovering the AS requires gradient information, a fact that severely limits its applicability. Furthermore, and partly because of its reliance on gradients, it is not able to handle noisy observations. The latter is an essential trait if one wants to be able to propagate uncertainty through stochastic simulators, e.g., through molecular dynamics codes. In this work, we develop a probabilistic version of AS which is gradient-free and robust to observational noise. Our approach relies on a novel Gaussian process regression with built-in dimensionality reduction. In particular, the AS is represented as an orthogonal projection matrix that serves as yet another covariance function hyper-parameter to be estimated from the data. To train the
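For contrast with the gradient-free approach described, the classic gradient-based active-subspace recipe averages outer products of gradients and takes the leading eigenvectors. A toy function whose AS is one-dimensional by construction:

```python
import numpy as np

# f(x) = sin(w.x) varies only along w, so its active subspace is the
# 1-D span of w; the gradient outer-product matrix recovers it exactly.
rng = np.random.default_rng(7)
d = 10
w = np.ones(d) / np.sqrt(d)
grad = lambda x: np.cos(w @ x) * w          # gradient of sin(w.x)
samples = rng.normal(size=(500, d))
C = sum(np.outer(grad(x), grad(x)) for x in samples) / 500.0
vals, vecs = np.linalg.eigh(C)              # eigenvalues in ascending order
as_dir = vecs[:, -1]                        # leading direction, ~ +/- w
```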
A Personalized Collaborative Recommendation Approach Based on Clustering of Customers
NASA Astrophysics Data System (ADS)
Wang, Pu
Collaborative filtering has been known to be the most successful recommender technique in recommendation systems. Collaborative methods recommend items based on aggregated user ratings of those items, and these techniques do not depend on the availability of textual descriptions. They share the common goal of assisting users in their search for items of interest, and thus attempt to address one of the key research problems of information overload. Collaborative filtering systems can deal with large numbers of customers and with many different products. However, the set of ratings is sparse, such that any two customers will most likely have only a few co-rated products. The high-dimensional sparsity of the rating matrix and the problem of scalability result in low-quality recommendations. In this paper, a personalized collaborative recommendation approach based on clustering of customers is presented. This method uses clustering technology to form customer centers. The personalized collaborative filtering approach based on clustering of customers can alleviate the scalability problem in collaborative recommendations.
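A minimal sketch of clustering-based collaborative filtering: cluster users by their rating rows, then recommend the unrated item with the highest mean rating among cluster peers. The ratings matrix and the 0-means-unrated convention are invented for illustration; the paper's clustering of customer centers is not reproduced:

```python
import numpy as np
from sklearn.cluster import KMeans

def recommend(R, user, n_clusters=2, top_n=1):
    """Cluster users by rating rows (0 = unrated), then return the
    unrated item(s) with the highest mean rating among cluster peers."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(R)
    peers = R[labels == labels[user]].astype(float)
    masked = np.where(peers > 0, peers, np.nan)     # ignore unrated cells
    scores = np.nanmean(masked, axis=0)
    scores = np.where(np.isnan(scores), -np.inf, scores)
    scores[R[user] > 0] = -np.inf                   # skip already-rated items
    return np.argsort(scores)[::-1][:top_n]

# Invented ratings: users 0-1 and users 2-3 form two taste groups
R = np.array([[5, 4, 0, 1, 0],
              [4, 5, 5, 0, 1],
              [0, 1, 1, 5, 4],
              [1, 0, 0, 4, 5]])
rec = recommend(R, user=0)
```

Because only a user's cluster peers are scanned, scoring cost scales with the cluster size rather than the full customer base, which is the scalability point made above.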
SketchPadN-D: WYDIWYG sculpting and editing in high-dimensional space.
Wang, Bing; Ruchikachorn, Puripant; Mueller, Klaus
2013-12-01
High-dimensional data visualization has been attracting much attention. To fully test related software and algorithms, researchers require a diverse pool of data with known and desired features. Test data do not always provide this, or only partially. Here we propose the paradigm WYDIWYG (What You Draw Is What You Get). Its embodiment, SketchPadND, is a tool that allows users to generate high-dimensional data in the same interface they also use for visualization. This provides for an immersive and direct data generation activity, and furthermore it also enables users to interactively edit and clean existing high-dimensional data from possible artifacts. SketchPadND offers two visualization paradigms, one based on parallel coordinates and the other based on a relatively new framework using an N-D polygon to navigate in high-dimensional space. The first interface allows users to draw arbitrary profiles of probability density functions along each dimension axis and sketch shapes for data density and connections between adjacent dimensions. The second interface embraces the idea of sculpting. Users can carve data at arbitrary orientations and refine them wherever necessary. This guarantees that the data generated is truly high-dimensional. We demonstrate our tool's usefulness in real data visualization scenarios.
Competing Risks Data Analysis with High-dimensional Covariates: An Application in Bladder Cancer
Tapak, Leili; Saidijam, Massoud; Sadeghifar, Majid; Poorolajal, Jalal; Mahjub, Hossein
2015-01-01
Analysis of microarray data is associated with the methodological problems of high dimension and small sample size. Various methods have been used for variable selection in high-dimension and small sample size cases with a single survival endpoint. However, little effort has been directed toward addressing competing risks, where there is more than one failure risk. This study compared three typical variable selection techniques, including Lasso, elastic net, and likelihood-based boosting, for high-dimensional time-to-event data with competing risks. The performance of these methods was evaluated via a simulation study by analyzing a real dataset related to bladder cancer patients using time-dependent receiver operator characteristic (ROC) curves and bootstrap .632+ prediction error curves. The elastic net penalization method was shown to outperform Lasso and boosting. Based on the elastic net, 33 genes out of 1381 genes related to bladder cancer were selected. By fitting to the Fine and Gray model, eight genes were highly significant (P < 0.001). Among them, expression of RTN4, SON, IGF1R, SNRPE, PTGR1, PLEK, and ETFDH was associated with a decrease in survival time, whereas SMARCAD1 expression was associated with an increase in survival time. This study indicates that the elastic net has a higher capacity than the Lasso and boosting for the prediction of survival time in bladder cancer patients. Moreover, genes selected by all methods improved the predictive power of the model based on only clinical variables, indicating the value of information contained in the microarray features. PMID:25907251
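The gene-selection step can be sketched with an ordinary elastic net on synthetic n << p data. This stands in for the competing-risks survival fit (which needs a dedicated package such as a Fine-and-Gray implementation); the penalty values are illustrative:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Synthetic n << p data: 5 informative 'genes' out of 300
rng = np.random.default_rng(4)
n, p = 80, 300
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                       # only the first 5 features matter
y = X @ beta + 0.5 * rng.normal(size=n)

enet = ElasticNet(alpha=0.3, l1_ratio=0.7, max_iter=5000).fit(X, y)
selected = np.flatnonzero(enet.coef_)   # indices of retained features
```

The mixed L1/L2 penalty is what lets the elastic net retain groups of correlated informative features that a pure lasso would thin out, which is the behavior credited above.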
Extracting sparse signals from high-dimensional data: A statistical mechanics approach
NASA Astrophysics Data System (ADS)
Ramezanali, Mohammad
Sparse reconstruction algorithms aim to retrieve high-dimensional sparse signals from a limited number of measurements under suitable conditions. As the number of variables goes to infinity, these algorithms exhibit sharp phase transition boundaries where sparse retrieval breaks down. Several sparse reconstruction algorithms are formulated as optimization problems. A few of the prominent ones have been analyzed in the literature by statistical mechanical methods. The function to be optimized plays the role of energy. The treatment involves finite-temperature replica mean-field theory followed by the zero-temperature limit. Although this approach has been successful in reproducing the algorithmic phase transition boundaries, the replica trick and the non-trivial zero-temperature limit obscure the underlying reasons for the failure of the algorithms. In this thesis, we employ the "cavity method" to give an alternative derivation of the phase transition boundaries, working directly in the zero-temperature limit. This approach provides insight into the origin of the different terms in the mean-field self-consistency equations. The cavity method naturally generates a local susceptibility which leads to an identity that clearly indicates the existence of two phases. The identity also gives us a novel route to the known parametric expressions for the phase boundary of the Basis Pursuit algorithm and to new ones for the Elastic Net. These transitions being continuous (second order), we explore the scaling laws and critical exponents that are uniquely determined by the nature of the distribution of the density of the nonzero components of the sparse signal. Not only is the phase boundary of the Elastic Net different from that of the Basis Pursuit, we show that the critical behaviors of the two algorithms belong to different universality classes.
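The Basis Pursuit problem analyzed in the thesis, min ||x||_1 subject to Ax = y, has a standard linear-programming form via the split x = u - v with u, v >= 0. A small instance well inside the successful-recovery regime of the phase diagram:

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, y):
    """min ||x||_1 s.t. Ax = y, as an LP via the split x = u - v."""
    m, n = A.shape
    c = np.ones(2 * n)                 # sum(u) + sum(v) = ||x||_1
    A_eq = np.hstack([A, -A])          # A(u - v) = y
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * n))
    return res.x[:n] - res.x[n:]

# A 3-sparse signal recovered from 25 Gaussian measurements in R^40
rng = np.random.default_rng(6)
n, m, k = 40, 25, 3
x_true = np.zeros(n)
x_true[rng.choice(n, size=k, replace=False)] = rng.normal(size=k)
A = rng.normal(size=(m, n)) / np.sqrt(m)
x_hat = basis_pursuit(A, A @ x_true)
```

Shrinking m or growing k pushes such instances across the phase boundary, at which point exact recovery abruptly fails; that breakdown is the transition the cavity analysis characterizes.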
Machine learning etudes in astrophysics: selection functions for mock cluster catalogs
Hajian, Amir; Alvarez, Marcelo A.; Bond, J. Richard
2015-01-01
Making mock simulated catalogs is an important component of astrophysical data analysis. Selection criteria for observed astronomical objects are often too complicated to be derived from first principles. However, reproducing the selection of an observed group of objects is a problem well suited to machine learning classification. In this paper we use one-class classifiers to learn the properties of an observed catalog of clusters of galaxies from ROSAT and to pick clusters from mock simulations that resemble the observed ROSAT catalog. We show how this method can be used to study the cross-correlations of thermal Sunyaev-Zel'dovich signals with number density maps of X-ray selected cluster catalogs. The method reduces the bias due to hand-tuning the selection function and is readily scalable to large catalogs with a high-dimensional space of astrophysical features.
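The selection-function idea can be sketched with a one-class SVM: train on "observed" objects, then keep only mock objects the classifier accepts. Two invented 2-D features stand in for the astrophysical feature space, and the kernel settings are guesses, not the paper's configuration:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# The far mock population should be rejected by a selection function
# learned from the observed catalog alone (no labeled negatives needed).
rng = np.random.default_rng(5)
observed = rng.normal(loc=[2.0, 1.0], scale=0.3, size=(200, 2))
mock = np.vstack([rng.normal([2.0, 1.0], 0.3, (100, 2)),    # resembles data
                  rng.normal([-2.0, -1.0], 0.3, (100, 2))]) # does not
clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(observed)
keep = clf.predict(mock) == 1            # +1 = resembles the observed set
```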
Fickler, Robert; Lapkiewicz, Radek; Huber, Marcus; Lavery, Martin P J; Padgett, Miles J; Zeilinger, Anton
2014-07-30
Photonics has become a mature field of quantum information science, where integrated optical circuits offer a way to scale the complexity of the set-up as well as the dimensionality of the quantum state. On photonic chips, paths are the natural way to encode information. To distribute those high-dimensional quantum states over large distances, transverse spatial modes, like orbital angular momentum possessing Laguerre Gauss modes, are favourable as flying information carriers. Here we demonstrate a quantum interface between these two vibrant photonic fields. We create three-dimensional path entanglement between two photons in a nonlinear crystal and use a mode sorter as the quantum interface to transfer the entanglement to the orbital angular momentum degree of freedom. Thus our results show a flexible way to create high-dimensional spatial mode entanglement. Moreover, they pave the way to implement broad complex quantum networks where high-dimensionally entangled states could be distributed over distant photonic chips.
Metamodel-based global optimization using fuzzy clustering for design space reduction
NASA Astrophysics Data System (ADS)
Li, Yulin; Liu, Li; Long, Teng; Dong, Weili
2013-09-01
High-fidelity analyses are utilized in modern engineering design optimization problems, which often involve expensive black-box models. For computation-intensive engineering design problems, efficient global optimization methods must be developed to relieve the computational burden. A new metamodel-based global optimization method using fuzzy clustering for design space reduction (MGO-FCR) is presented. Uniformly distributed initial sample points are generated by Latin hypercube design to construct the radial basis function metamodel, whose accuracy is gradually improved by adding sample points. The fuzzy c-means method and the Gath-Geva clustering method are applied to divide the design space into several small cluster spaces of interest for low- and high-dimensional problems, respectively. Modeling efficiency and accuracy are directly related to the design space, so unconcerned spaces are eliminated by the proposed reduction principle and two pseudo reduction algorithms. The reduction principle determines whether the current design space should be reduced and which space is eliminated. The first pseudo reduction algorithm improves the speed of clustering, while the second ensures that the design space is reduced. Through several numerical benchmark functions, comparative studies with the adaptive response surface method, the approximated unimodal region elimination method and mode-pursuing sampling are carried out. The optimization results reveal that this method captures the real global optimum for all the numerical benchmark functions, and the number of function evaluations shows that its efficiency is favorable, especially for high-dimensional problems. Based on this global design optimization method, a design optimization of a lifting surface in high-speed flow is carried out, and this method saves about 10 h compared with genetic algorithms. This method possesses favorable performance in efficiency and robustness.
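The fuzzy c-means step used to partition the design space can be sketched in a few lines of numpy (the standard FCM updates with fuzzifier m = 2; the sample data and cluster count are illustrative assumptions, and the Gath-Geva variant and reduction principle from the paper are not implemented here):

```python
import numpy as np

def fuzzy_cmeans(X, c, m=2.0, iters=100, seed=0):
    """Fuzzy c-means: returns cluster centers and a soft membership
    matrix U of shape (c, N) with columns summing to 1."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, len(X)))
    U /= U.sum(axis=0)                       # memberships sum to 1 per point
    for _ in range(iters):
        Um = U ** m
        centers = Um @ X / Um.sum(axis=1, keepdims=True)
        d = np.linalg.norm(X[None, :, :] - centers[:, None, :], axis=2) + 1e-12
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=0)            # standard FCM membership update
    return centers, U

rng = np.random.default_rng(0)
# Two well-separated "interesting" regions of a 2-D design space.
X = np.vstack([rng.normal([0.0, 0.0], 0.5, (100, 2)),
               rng.normal([10.0, 10.0], 0.5, (100, 2))])
centers, U = fuzzy_cmeans(X, c=2)
```

In the MGO-FCR setting, each recovered cluster center marks a candidate region to keep, and low-membership regions are candidates for elimination.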
Reduced basis ANOVA methods for partial differential equations with high-dimensional random inputs
NASA Astrophysics Data System (ADS)
Liao, Qifeng; Lin, Guang
2016-07-01
In this paper we present a reduced basis ANOVA approach for partial differential equations (PDEs) with random inputs. The ANOVA method combined with stochastic collocation methods provides model reduction in high-dimensional parameter space through decomposing high-dimensional inputs into unions of low-dimensional inputs. In this work, to further reduce the computational cost, we investigate spatial low-rank structures in the ANOVA-collocation method, and develop efficient spatial model reduction techniques using hierarchically generated reduced bases. We present a general mathematical framework of the methodology, validate its accuracy and demonstrate its efficiency with numerical experiments.
Wilkinson, Leland; Anand, Anushka; Grossman, Robert
2006-01-01
We introduce a method for organizing multivariate displays and for guiding interactive exploration through high-dimensional data. The method is based on nine characterizations of the 2D distributions of orthogonal pairwise projections on a set of points in multidimensional Euclidean space. These characterizations include such measures as density, skewness, shape, outliers, and texture. Statistical analysis of these measures leads to ways for 1) organizing 2D scatterplots of points for coherent viewing, 2) locating unusual (outlying) marginal 2D distributions of points for anomaly detection, and 3) sorting multivariate displays based on high-dimensional data, such as trees, parallel coordinates, and glyphs.
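A toy version of this idea scores every pairwise 2D projection by a few simple measures and ranks the scatterplots accordingly. The measures below (absolute correlation, marginal skewness, a crude 1.5×IQR outlier fraction) are stand-ins for the paper's nine characterizations, and the data are synthetic:

```python
import numpy as np
from itertools import combinations
from scipy.stats import skew

def characterize(X):
    """Score every 2D projection of the columns of X by simple measures."""
    scores = {}
    for i, j in combinations(range(X.shape[1]), 2):
        x, y = X[:, i], X[:, j]
        corr = abs(np.corrcoef(x, y)[0, 1])       # association strength
        sk = abs(skew(x)) + abs(skew(y))          # marginal skewness
        out = 0.0                                 # crude outlier fraction
        for v in (x, y):
            q1, q3 = np.percentile(v, [25, 75])
            lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
            out += np.mean((v < lo) | (v > hi))
        scores[(i, j)] = {"corr": corr, "skew": sk, "outliers": out / 2}
    return scores

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 4))
X[:, 3] = X[:, 0] + 0.05 * rng.standard_normal(500)   # pair (0, 3) is structured
scores = characterize(X)
best = max(scores, key=lambda p: scores[p]["corr"])   # most "interesting" plot
```

Sorting the projections by such scores is what lets a viewer, or an anomaly detector, see the unusual panels first.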
Kandrup, H.E.; Morrison, P.J. (Inst. for Fusion Studies)
1992-11-01
The Hamiltonian formulation of the Vlasov-Einstein system, which is appropriate for collisionless, self-gravitating systems like clusters of stars that are so dense that gravity must be described by the Einstein equation, is presented. In particular, it is demonstrated explicitly in the context of a 3 + 1 splitting that, for spherically symmetric configurations, the Vlasov-Einstein system can be viewed as a Hamiltonian system, where the dynamics is generated by a noncanonical Poisson bracket, with the Hamiltonian generating the evolution of the distribution function f (a noncanonical variable) being the conserved ADM mass-energy H_ADM. An explicit expression is derived for the energy δ²H_ADM associated with an arbitrary phase space preserving perturbation of an arbitrary spherical equilibrium, and it is shown that the equilibrium must be linearly stable if δ²H_ADM is positive semi-definite. Insight into the Hamiltonian reformulation is provided by a description of general finite degree of freedom systems.
High-Dimensional Explanatory Random Item Effects Models for Rater-Mediated Assessments
ERIC Educational Resources Information Center
Kelcey, Ben; Wang, Shanshan; Cox, Kyle
2016-01-01
Valid and reliable measurement of unobserved latent variables is essential to understanding and improving education. A common and persistent approach to assessing latent constructs in education is the use of rater inferential judgment. The purpose of this study is to develop high-dimensional explanatory random item effects models designed for…
Controlling chaos in a high dimensional system with periodic parametric perturbations
Mirus, K.A.; Sprott, J.C.
1998-10-01
The effect of applying a periodic perturbation to an accessible parameter of a high-dimensional (coupled-Lorenz) chaotic system is examined. Numerical results indicate that perturbation frequencies near the natural frequencies of the unstable periodic orbits of the chaotic system can result in limit cycles or significantly reduced dimension for relatively small perturbations.
High-Dimensional Exploratory Item Factor Analysis by a Metropolis-Hastings Robbins-Monro Algorithm
ERIC Educational Resources Information Center
Cai, Li
2010-01-01
A Metropolis-Hastings Robbins-Monro (MH-RM) algorithm for high-dimensional maximum marginal likelihood exploratory item factor analysis is proposed. The sequence of estimates from the MH-RM algorithm converges with probability one to the maximum likelihood solution. Details on the computer implementation of this algorithm are provided. The…
NASA Astrophysics Data System (ADS)
Ahmad, Farooq; Malik, Manzoor A.; Bhat, M. Maqbool
2016-07-01
We derive the spatial pair correlation function in gravitational clustering for extended structures of galaxies (e.g. galaxies with halos) by using the statistical mechanics of the cosmological many-body problem. Our results indicate that in the limit of point masses (ε = 0) the two-point correlation function varies as the inverse square of the relative separation of two galaxies. The effect of the softening parameter ε on the pair correlation function is also studied, and the results indicate that the two-point correlation function is affected by the softening parameter when the distance between galaxies is small. However, for larger distances between galaxies, the two-point correlation function is not affected at all. The correlation length r0 derived by our method depends on the random dispersion velocities ⟨v²⟩^(1/2) and the mean number density n̄, which is in agreement with N-body simulations and observations. Further, our results are applicable to clusters of galaxies for their correlation functions, and we apply our results to obtain the correlation length r0 for such systems, which again agrees with the data of N-body simulations and observations.
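The two-point correlation function discussed above can be estimated from a point set with the simple "natural" estimator ξ(r) = (DD/RR) · N_R(N_R−1)/(N_D(N_D−1)) − 1, comparing data pair counts DD to pair counts RR in a random catalog. A minimal 2-D sketch with synthetic clustered points (this is the generic estimator, not the paper's analytic derivation):

```python
import numpy as np
from scipy.spatial.distance import pdist

def pair_count(pts, r_lo, r_hi):
    """Number of point pairs with separation in [r_lo, r_hi)."""
    d = pdist(pts)
    return np.count_nonzero((d >= r_lo) & (d < r_hi))

def xi_natural(data, randoms, r_lo, r_hi):
    """Natural estimator of the two-point correlation in one radial bin."""
    nd, nr = len(data), len(randoms)
    dd = pair_count(data, r_lo, r_hi)
    rr = pair_count(randoms, r_lo, r_hi)
    return (dd / rr) * (nr * (nr - 1)) / (nd * (nd - 1)) - 1.0

rng = np.random.default_rng(3)
# Strongly clustered data: 100 tight clumps of 5 points each in the unit box.
centers = rng.random((100, 2))
data = np.repeat(centers, 5, axis=0) + 0.005 * rng.standard_normal((500, 2))
randoms = rng.random((500, 2))
xi_small = xi_natural(data, randoms, 0.0, 0.02)   # excess pairs at small r
```

For an unclustered (Poisson) data set the estimator scatters around zero; the strong small-scale excess here is what a power-law ξ(r) encodes.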
Snellings, Patrick; van der Leij, Aryan; Blok, Henk; de Jong, Peter F.
2010-01-01
This study investigated the role of speech perception accuracy and speed in fluent word decoding of reading disabled (RD) children. A same-different phoneme discrimination task with natural speech tested the perception of single consonants and consonant clusters by young but persistent RD children. RD children were slower than chronological age (CA) controls in recognizing identical sounds, suggesting less distinct phonemic categories. In addition, after controlling for phonetic similarity Tallal’s (Brain Lang 9:182–198, 1980) fast transitions account of RD children’s speech perception problems was contrasted with Studdert-Kennedy’s (Read Writ Interdiscip J 15:5–14, 2002) similarity explanation. Results showed no specific RD deficit in perceiving fast transitions. Both phonetic similarity and fast transitions influenced accurate speech perception for RD children as well as CA controls. PMID:20652455
NASA Astrophysics Data System (ADS)
Denis, Pablo A.
2014-04-01
By means of coupled cluster theory and correlation consistent basis sets we investigated the thermochemistry of dimethyl sulphide (DMS), dimethyl disulphide (DMDS) and four closely related sulphur-containing molecules: CH3SS, CH3S, CH3SH and CH3CH2SH. For the four closed-shell molecules studied, the computed enthalpies of formation (EOFs) were compared with experimental values derived from bomb calorimetry. We found that the deviation of the EOF with respect to experiment was 0.96, 0.65, 1.24 and 1.29 kcal/mol for CH3SH, CH3CH2SH, DMS and DMDS, respectively, when ΔH_f,0(S) = 65.6 kcal/mol was utilised (JANAF value). However, if the recently proposed ΔH_f,0(S) = 66.2 kcal/mol was used to estimate the EOF, the errors dropped to 0.36, 0.05, 0.64 and 0.09 kcal/mol, respectively. In contrast, for the CH3SS radical, better agreement with experiment was obtained if the 65.6 kcal/mol value was used. To compare with experiment while avoiding the problem of ΔH_f,0(S), we determined the CH3-S and CH3-SS bond dissociation energies (BDEs) in CH3S and CH3SS. At the coupled cluster with singles, doubles and perturbative triples [CCSD(T)] level of theory, these values are 48.0 and 71.4 kcal/mol, respectively. The latter BDEs are 1.5 and 1.2 kcal/mol larger than the experimental values. The agreement can be considered acceptable if we take into consideration that these two radicals present important challenges when determining their EOFs. It is our hope that this work stimulates new studies which help elucidate the problem of the EOF of atomic sulphur.
Xue, Hongqi; Wu, Yichao; Wu, Hulin
2013-01-01
In many regression problems, the relations between the covariates and the response may be nonlinear. Motivated by the application of reconstructing a gene regulatory network, we consider a sparse high-dimensional additive model with the additive components being known nonlinear functions with unknown parameters. To identify the subset of important covariates, we propose a new method for simultaneous variable selection and parameter estimation by iteratively combining a large-scale variable screening (the nonlinear independence screening, NLIS) and a moderate-scale model selection (the nonnegative garrote, NNG) for the nonlinear additive regressions. We show that the NLIS procedure possesses the sure screening property and is able to handle problems with non-polynomial dimensionality; for finite-dimension problems, the NNG for the nonlinear additive regressions has selection consistency for the unimportant covariates and also estimation consistency for the parameter estimates of the important covariates. The proposed method is applied to simulated data and a real data example for identifying gene regulations to illustrate its numerical performance. PMID:25170239
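The two-stage screen-then-select pattern can be sketched as follows, with marginal correlation standing in for the authors' nonlinear independence screening and the lasso standing in for the nonnegative garrote (both substitutions are simplifying assumptions; the data are synthetic and linear):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p = 200, 2000                          # far more covariates than samples
X = rng.standard_normal((n, p))
y = 3.0 * X[:, 5] - 2.0 * X[:, 17] + 0.1 * rng.standard_normal(n)

# Stage 1: large-scale screening by marginal association (stand-in for NLIS).
corr = np.abs(X.T @ (y - y.mean())) / n
top = np.argsort(corr)[-20:]              # keep a moderate-scale candidate set

# Stage 2: moderate-scale selection on the screened set (stand-in for NNG).
fit = Lasso(alpha=0.1).fit(X[:, top], y)
selected = top[np.abs(fit.coef_) > 1e-6]
```

The screening stage makes the second stage tractable at non-polynomial dimensionality; the sure screening property is what guarantees the important covariates survive stage 1.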
Elfwing, Stefan; Uchibe, Eiji; Doya, Kenji
2013-01-01
Free-energy based reinforcement learning (FERL) was proposed for learning in high-dimensional state and action spaces, which cannot be handled by standard function approximation methods. In this study, we propose a scaled version of free-energy based reinforcement learning to achieve more robust and more efficient learning performance. The action-value function is approximated by the negative free-energy of a restricted Boltzmann machine, divided by a constant scaling factor that is related to the size of the Boltzmann machine (the square root of the number of state nodes in this study). Our first task is a digit floor gridworld task, where the states are represented by images of handwritten digits from the MNIST data set. The purpose of the task is to investigate the proposed method's ability, through the extraction of task-relevant features in the hidden layer, to cluster images of the same digit and to cluster images of different digits that correspond to states with the same optimal action. We also test the method's robustness with respect to different exploration schedules, i.e., different settings of the initial temperature and the temperature discount rate in softmax action selection. Our second task is a robot visual navigation task, where the robot can learn its position by the different colors of the lower part of four landmarks and it can infer the correct corner goal area by the color of the upper part of the landmarks. The state space consists of binarized camera images with, at most, nine different colors, which is equal to 6642 binary states. For both tasks, the learning performance is compared with standard FERL and with function approximation where the action-value function is approximated by a two-layered feedforward neural network. PMID:23450126
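The central quantity, the RBM's negative free energy divided by a scaling factor, has a closed form: −F(v) = c·v + Σ_j softplus(b_j + W_{:,j}·v) for binary visible units v, weights W, hidden biases b and visible biases c. A minimal sketch (the random weights and sizes are illustrative; in FERL, v would encode the state-action pair):

```python
import numpy as np

def scaled_neg_free_energy(v, W, b_hid, c_vis, scale):
    """Action-value approximation -F(v)/scale for a binary RBM:
    -F(v) = c.v + sum_j softplus(b_j + W[:, j].v)."""
    pre = b_hid + v @ W
    softplus = np.logaddexp(0.0, pre)    # log(1 + exp(pre)), numerically stable
    return (c_vis @ v + softplus.sum()) / scale

rng = np.random.default_rng(5)
n_vis, n_hid = 12, 8
W = 0.1 * rng.standard_normal((n_vis, n_hid))
b = np.zeros(n_hid)
c = 0.1 * rng.standard_normal(n_vis)
v = rng.integers(0, 2, n_vis).astype(float)

# Dividing by sqrt(n_vis) is the paper's scaling by the square root of the
# number of state nodes.
q = scaled_neg_free_energy(v, W, b, c, scale=np.sqrt(n_vis))
```

A quick sanity check: with W = 0 and b = 0 each hidden unit contributes exactly log 2, so −F(v) reduces to c·v + n_hid·log 2.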
NASA Astrophysics Data System (ADS)
Mandrà, Salvatore; Zhu, Zheng; Wang, Wenlong; Perdomo-Ortiz, Alejandro; Katzgraber, Helmut G.
2016-08-01
To date, a conclusive detection of quantum speedup remains elusive. Recently, a team from Google Inc. [V. S. Denchev et al., Phys. Rev. X 6, 031015 (2016), 10.1103/PhysRevX.6.031015] proposed a weak-strong cluster model tailored to have tall and narrow energy barriers separating local minima, with the aim of highlighting the value of finite-range tunneling. More precisely, results from quantum Monte Carlo simulations, as well as the D-Wave 2X quantum annealer, scale considerably better than state-of-the-art simulated annealing simulations. Moreover, the D-Wave 2X quantum annealer is ~10^8 times faster than simulated annealing on conventional computer hardware for problems with approximately 10^3 variables. Here, an overview of different sequential, nontailored, as well as specialized tailored algorithms on the Google instances is given. We show that the quantum speedup is limited to sequential approaches and study the typical complexity of the benchmark problems using insights from the study of spin glasses.
Amniotic fluid: the use of high-dimensional biology to understand fetal well-being.
Kamath-Rayne, Beena D; Smith, Heather C; Muglia, Louis J; Morrow, Ardythe L
2014-01-01
Our aim was to review the use of high-dimensional biology techniques, specifically transcriptomics, proteomics, and metabolomics, in amniotic fluid to elucidate the mechanisms behind preterm birth or assessment of fetal development. We performed a comprehensive MEDLINE literature search on the use of transcriptomic, proteomic, and metabolomic technologies for amniotic fluid analysis. All abstracts were reviewed for pertinence to preterm birth or fetal maturation in human subjects. Nineteen articles qualified for inclusion. Most articles described the discovery of biomarker candidates, but few larger, multicenter replication or validation studies have been done. We conclude that the use of high-dimensional systems biology techniques to analyze amniotic fluid has significant potential to elucidate the mechanisms of preterm birth and fetal maturation. However, further multicenter collaborative efforts are needed to replicate and validate candidate biomarkers before they can become useful tools for clinical practice. Ideally, amniotic fluid biomarkers should be translated to a noninvasive test performed in maternal serum or urine.
Gentry, Amanda Elswick; Jackson-Cook, Colleen K; Lyon, Debra E; Archer, Kellie J
2015-01-01
The pathological description of the stage of a tumor is an important clinical designation and is considered, like many other forms of biomedical data, an ordinal outcome. Currently, statistical methods for predicting an ordinal outcome using clinical, demographic, and high-dimensional correlated features are lacking. In this paper, we propose a method that fits an ordinal response model to predict an ordinal outcome for high-dimensional covariate spaces. Our method penalizes some covariates (high-throughput genomic features) without penalizing others (such as demographic and/or clinical covariates). We demonstrate the application of our method to predict the stage of breast cancer. In our model, breast cancer subtype is a nonpenalized predictor, and CpG site methylation values from the Illumina Human Methylation 450K assay are penalized predictors. The method has been made available in the ordinalgmifs package in the R programming environment. PMID:26052223
Efficient scheme for experimental quantification of non-Markovianity in high-dimensional systems
NASA Astrophysics Data System (ADS)
Dong, S.-J.; Liu, B.-H.; Han, Y.-J.; Guo, G.-C.; He, Lixin
2015-04-01
Non-Markovianity is a prominent concept in the dynamics of open quantum systems, which is of fundamental importance in quantum mechanics and quantum information. Despite many efforts, the experimental measurement of the non-Markovianity of an open system is still limited to very small systems. Presently, it is still impossible to experimentally quantify the non-Markovianity of high-dimensional systems with the widely used Breuer-Laine-Piilo trace distance measure. In this paper, we propose a method, combining experimental measurements and numerical calculations, that allows quantifying the non-Markovianity of an N-dimensional system at a cost that scales only as N², successfully avoiding the exponential scaling with the dimension of the open system in the current method. After benchmarking with a two-dimensional open system, we demonstrate the method by quantifying the non-Markovianity of a high-dimensional open quantum random walk system.
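The Breuer-Laine-Piilo measure mentioned above sums all temporary increases of the trace distance D(ρ₁(t), ρ₂(t)) = ½ Tr|ρ₁ − ρ₂| between two evolving states; any increase signals information flowing back into the system. A minimal qubit-dephasing sketch (the decay rates and the cosine-modulated coherence are illustrative assumptions, not the paper's experimental channel):

```python
import numpy as np

def trace_distance(rho1, rho2):
    """D(rho1, rho2) = 0.5 * Tr |rho1 - rho2| via eigenvalues."""
    return 0.5 * np.abs(np.linalg.eigvalsh(rho1 - rho2)).sum()

def blp_measure(coherence):
    """BLP measure for qubit dephasing: sum of all increases of the trace
    distance between the |+> and |-> states as the coherence f(t) evolves."""
    D = [trace_distance(np.array([[0.5, 0.5 * f], [0.5 * f, 0.5]]),
                        np.array([[0.5, -0.5 * f], [-0.5 * f, 0.5]]))
         for f in coherence]
    dD = np.diff(D)
    return dD[dD > 0].sum()      # only intervals where distinguishability revives

t = np.linspace(0.0, 10.0, 2001)
markovian = np.exp(-0.3 * t)                    # monotonic decay: no revivals
non_markovian = np.exp(-0.1 * t) * np.cos(t)    # oscillating coherence
```

For the monotonic channel the measure vanishes; the oscillating coherence yields a strictly positive value, which is exactly the signature the scalable scheme is designed to estimate for large N.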
Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data
Deng, Yi; Chang, Changgee; Ido, Moges Seyoum; Long, Qi
2016-01-01
Multiple imputation (MI) has been widely used for handling missing data in biomedical research. In the presence of high-dimensional data, regularized regression has been used as a natural strategy for building imputation models, but limited research has been conducted for handling general missing data patterns where multiple variables have missing values. Using the idea of multiple imputation by chained equations (MICE), we investigate two approaches of using regularized regression to impute missing values of high-dimensional data that can handle general missing data patterns. We compare our MICE methods with several existing imputation methods in simulation studies. Our simulation results demonstrate the superiority of the proposed MICE approach based on an indirect use of regularized regression in terms of bias. We further illustrate the proposed methods using two data examples. PMID:26868061
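A single chained-equations step with a regularized imputation model can be sketched as below: fit a lasso on the rows where the target column is observed, then fill in the missing rows from its predictions. The synthetic data and the single-column, single-pass setting are simplifying assumptions; full MICE cycles over all incomplete variables, adds residual noise draws, and repeats to produce multiple imputations:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
n, p = 300, 10
X = rng.standard_normal((n, p))
X[:, 0] = X[:, 1] - X[:, 2] + 0.1 * rng.standard_normal(n)  # col 0 predictable

truth = X[:, 0].copy()
miss = rng.random(n) < 0.2            # ~20% of column 0 goes missing

# One chained-equations step for column 0: regularized regression on the
# observed rows, prediction on the missing rows.
model = Lasso(alpha=0.01).fit(X[~miss][:, 1:], truth[~miss])
imputed = model.predict(X[miss][:, 1:])

# Baseline: mean imputation.
mean_fill = np.full(miss.sum(), truth[~miss].mean())
err_reg = np.mean((imputed - truth[miss]) ** 2)
err_mean = np.mean((mean_fill - truth[miss]) ** 2)
```

When the missing variable is predictable from the others, the regularized model recovers far more of the signal than mean filling, which is the advantage the simulation studies quantify.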
Simple, Scalable Proteomic Imaging for High-Dimensional Profiling of Intact Systems.
Murray, Evan; Cho, Jae Hun; Goodwin, Daniel; Ku, Taeyun; Swaney, Justin; Kim, Sung-Yon; Choi, Heejin; Park, Young-Gyun; Park, Jeong-Yoon; Hubbert, Austin; McCue, Margaret; Vassallo, Sara; Bakh, Naveed; Frosch, Matthew P; Wedeen, Van J; Seung, H Sebastian; Chung, Kwanghun
2015-12-01
Combined measurement of diverse molecular and anatomical traits that span multiple levels remains a major challenge in biology. Here, we introduce a simple method that enables proteomic imaging for scalable, integrated, high-dimensional phenotyping of both animal tissues and human clinical samples. This method, termed SWITCH, uniformly secures tissue architecture, native biomolecules, and antigenicity across an entire system by synchronizing the tissue preservation reaction. The heat- and chemical-resistant nature of the resulting framework permits multiple rounds (>20) of relabeling. We have performed 22 rounds of labeling of a single tissue with precise co-registration of multiple datasets. Furthermore, SWITCH synchronizes labeling reactions to improve probe penetration depth and uniformity of staining. With SWITCH, we performed combinatorial protein expression profiling of the human cortex and also interrogated the geometric structure of the fiber pathways in mouse brains. Such integrated high-dimensional information may accelerate our understanding of biological systems at multiple levels. PMID:26638076
Atom-centered symmetry functions for constructing high-dimensional neural network potentials
NASA Astrophysics Data System (ADS)
Behler, Jörg
2011-02-01
Neural networks offer an unbiased and numerically very accurate approach to represent high-dimensional ab initio potential-energy surfaces. Once constructed, neural network potentials can provide the energies and forces many orders of magnitude faster than electronic structure calculations, and thus enable molecular dynamics simulations of large systems. However, Cartesian coordinates are not a good choice to represent the atomic positions, and a transformation to symmetry functions is required. Using simple benchmark systems, the properties of several types of symmetry functions suitable for the construction of high-dimensional neural network potential-energy surfaces are discussed in detail. The symmetry functions are general and can be applied to all types of systems such as molecules, crystalline and amorphous solids, and liquids.
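The simplest of these descriptors is the radial symmetry function G_i = Σ_{j≠i} exp(−η (R_ij − R_s)²) f_c(R_ij) with the cosine cutoff f_c(r) = ½(cos(πr/R_c) + 1) for r < R_c. A minimal sketch (parameter values and the random atomic configuration are illustrative assumptions):

```python
import numpy as np

def cutoff(r, r_c):
    """Cosine cutoff f_c(r): decays smoothly to zero at the cutoff radius."""
    return np.where(r < r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

def g2(coords, i, eta=1.0, r_s=0.0, r_c=6.0):
    """Radial atom-centered symmetry function for atom i:
    G_i = sum_{j != i} exp(-eta * (R_ij - R_s)^2) * f_c(R_ij)."""
    diffs = np.delete(coords, i, axis=0) - coords[i]
    r = np.linalg.norm(diffs, axis=1)
    return np.sum(np.exp(-eta * (r - r_s) ** 2) * cutoff(r, r_c))

rng = np.random.default_rng(7)
coords = rng.random((8, 3)) * 3.0   # 8 atoms in a 3 Angstrom box (illustrative)

# The descriptor depends only on interatomic distances, so it is exactly
# invariant under rotation and translation of the whole structure.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
moved = coords @ R.T + np.array([1.0, -2.0, 0.5])
```

This invariance is exactly why symmetry functions, rather than raw Cartesian coordinates, are fed to the neural network.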
Compressively Characterizing High-Dimensional Entangled States with Complementary, Random Filtering
NASA Astrophysics Data System (ADS)
Howland, Gregory A.; Knarr, Samuel H.; Schneeloch, James; Lum, Daniel J.; Howell, John C.
2016-04-01
The resources needed to conventionally characterize a quantum system are overwhelmingly large for high-dimensional systems. This obstacle may be overcome by abandoning traditional cornerstones of quantum measurement, such as general quantum states, strong projective measurement, and assumption-free characterization. Following this reasoning, we demonstrate an efficient technique for characterizing high-dimensional, spatial entanglement with one set of measurements. We recover sharp distributions with local, random filtering of the same ensemble in momentum followed by position—something the uncertainty principle forbids for projective measurements. Exploiting the expectation that entangled signals are highly correlated, we use fewer than 5000 measurements to characterize a 65,536-dimensional state. Finally, we use entropic inequalities to witness entanglement without a density matrix. Our method represents the sea change unfolding in quantum measurement, where methods influenced by the information theory and signal-processing communities replace unscalable, brute-force techniques—a progression previously followed by classical sensing.
Buckley-James Boosting for Survival Analysis with High-Dimensional Biomarker Data*
Wang, Zhu; Wang, C.Y.
2010-01-01
There has been increasing interest in predicting patients' survival after therapy by investigating gene expression microarray data. In regression and classification models with high-dimensional genomic data, boosting has been successfully applied to build accurate predictive models and conduct variable selection simultaneously. We propose Buckley-James boosting for semiparametric accelerated failure time models with right-censored survival data, which can be used to predict the survival of future patients using high-dimensional genomic data. In the spirit of the adaptive LASSO, twin boosting is also incorporated to fit more sparse models. The proposed methods provide a unified approach to fitting linear models and nonlinear effects models with possible interactions, and can perform variable selection and parameter estimation simultaneously. The proposed methods are evaluated by simulations and applied to a recent microarray gene expression data set for patients with diffuse large B-cell lymphoma under the current gold standard therapy. PMID:20597850
A Shell Multi-dimensional Hierarchical Cubing Approach for High-Dimensional Cube
NASA Astrophysics Data System (ADS)
Zou, Shuzhi; Zhao, Li; Hu, Kongfa
The pre-computation of data cubes is critical for improving the response time of OLAP systems and accelerating data mining tasks in large data warehouses. However, as the sizes of data warehouses grow, the time it takes to perform this pre-computation becomes a significant performance bottleneck. In a high-dimensional data warehouse, it might not be practical to build all these cuboids and their indices. In this paper, we propose a shell multi-dimensional hierarchical cubing algorithm, based on an extension of the previous minimal cubing approach. This method partitions the high-dimensional data cube into low-dimensional hierarchical cubes. Experimental results show that the proposed method is significantly more efficient than other existing cubing methods.
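The core of any shell-style approach is to precompute only the low-dimensional cuboids instead of all 2^d of them. A toy sketch of that idea with plain group-by counts (this illustrates shell fragments only; the paper's hierarchical partitioning and indexing are not modeled):

```python
from itertools import combinations
from collections import Counter

def shell_fragments(rows, dims, max_dim=2):
    """Precompute only the low-dimensional cuboids (the 'shell'):
    group-by counts for every subset of dimensions up to max_dim."""
    frags = {}
    for k in range(1, max_dim + 1):
        for subset in combinations(range(dims), k):
            frags[subset] = Counter(tuple(r[d] for d in subset) for r in rows)
    return frags

# Tiny fact table: three dimension columns.
rows = [("a", "x", 1), ("a", "y", 1), ("b", "x", 2), ("a", "x", 2)]
frags = shell_fragments(rows, dims=3)
```

For d dimensions the shell holds only d + d(d−1)/2 cuboids up to max_dim = 2, instead of 2^d − 1 for the full cube, which is what makes pre-computation feasible in high dimensions.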
Luan, Xiaoli; Chen, Qiang; Liu, Fei
2014-09-01
This article presents a new scheme to design a full matrix controller for high-dimensional multivariable processes based on equivalent transfer functions (ETFs). Differing from existing ETF methods, the proposed ETF is derived directly by exploiting the relationship between the equivalent closed-loop transfer function and the inverse of the open-loop transfer function. Based on the obtained ETF, the full matrix controller is designed utilizing existing PI tuning rules. The newly proposed ETF model can more accurately represent the original processes. Furthermore, the full matrix centralized controller design method proposed in this paper is applicable to high-dimensional multivariable systems with satisfactory performance. Comparison with other multivariable controllers shows that the designed ETF-based controller is superior with respect to design complexity and obtained performance.
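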
Multivariate Boosting for Integrative Analysis of High-Dimensional Cancer Genomic Data
Xiong, Lie; Kuan, Pei-Fen; Tian, Jianan; Keles, Sunduz; Wang, Sijian
2015-01-01
In this paper, we propose a novel multivariate component-wise boosting method for fitting multivariate response regression models under the high-dimension, low sample size setting. Our method is motivated by modeling the association among different biological molecules based on multiple types of high-dimensional genomic data. Particularly, we are interested in two applications: studying the influence of DNA copy number alterations on RNA transcript levels and investigating the association between DNA methylation and gene expression. For this purpose, we model the dependence of the RNA expression levels on DNA copy number alterations and the dependence of gene expression on DNA methylation through multivariate regression models and utilize boosting-type method to handle the high dimensionality as well as model the possible nonlinear associations. The performance of the proposed method is demonstrated through simulation studies. Finally, our multivariate boosting method is applied to two breast cancer studies. PMID:26609213
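The component-wise boosting engine underneath such methods is easy to sketch: at each step, fit the single covariate that most reduces the residual sum of squares and take a small shrunken step. The univariate-response version below is a simplification (the paper's method handles multivariate responses and nonlinear associations), and the data are synthetic:

```python
import numpy as np

def l2_boost(X, y, steps=200, nu=0.1):
    """Component-wise L2 boosting: repeatedly fit the best single covariate
    to the current residuals and take a step of size nu toward that fit."""
    n, p = X.shape
    beta = np.zeros(p)
    Xc = X - X.mean(axis=0)
    resid = y - y.mean()
    norms = (Xc ** 2).sum(axis=0)
    for _ in range(steps):
        coefs = Xc.T @ resid / norms        # univariate LS coefficient per covariate
        sse_red = coefs ** 2 * norms        # RSS reduction each covariate offers
        j = np.argmax(sse_red)              # pick the best single covariate
        beta[j] += nu * coefs[j]
        resid -= nu * coefs[j] * Xc[:, j]
    return beta

rng = np.random.default_rng(8)
n, p = 150, 50
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 3] + 1.5 * X[:, 7] + 0.1 * rng.standard_normal(n)
beta = l2_boost(X, y)
```

Because only one coordinate moves per step, the fitted coefficient vector stays sparse, which is how boosting performs variable selection and estimation simultaneously in the high-dimension, low-sample-size setting.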
Controlling chaos in low and high dimensional systems with periodic parametric perturbations
Mirus, K.A.; Sprott, J.C.
1998-06-01
The effect of applying a periodic perturbation to an accessible parameter of various chaotic systems is examined. Numerical results indicate that perturbation frequencies near the natural frequencies of the unstable periodic orbits of the chaotic systems can result in limit cycles for relatively small perturbations. Such perturbations can also control or significantly reduce the dimension of high-dimensional systems. Initial application to the control of fluctuations in a prototypical magnetic fusion plasma device will be reviewed.
Hirata, Yoshito; Aihara, Kazuyuki
2012-06-01
We introduce a low-dimensional description for a high-dimensional system, which is a piecewise affine model whose state space is divided by permutations. We show that the proposed model tends to predict wind speeds and photovoltaic outputs on time scales from seconds to 100 s better than global affine models. In addition, computations using the piecewise affine model are much faster than those of usual nonlinear models such as radial basis function models.
Prediction of high-dimensional states subject to respiratory motion: a manifold learning approach.
Liu, Wenyang; Sawant, Amit; Ruan, Dan
2016-07-01
The development of high-dimensional imaging systems in image-guided radiotherapy provides important pathways to the ultimate goal of real-time full volumetric motion monitoring. Effective motion management during radiation treatment usually requires prediction to account for system latency and extra signal/image processing time. It is challenging to predict high-dimensional respiratory motion due to the complexity of the motion pattern combined with the curse of dimensionality. Linear dimension reduction methods such as PCA have been used to construct a linear subspace from the high-dimensional data, followed by efficient predictions on the lower-dimensional subspace. In this study, we extend such rationale to a more general manifold and propose a framework for high-dimensional motion prediction with manifold learning, which allows one to learn more descriptive features compared to linear methods with comparable dimensions. Specifically, a kernel PCA is used to construct a proper low-dimensional feature manifold, where accurate and efficient prediction can be performed. A fixed-point iterative pre-image estimation method is used to recover the predicted value in the original state space. We evaluated and compared the proposed method with a PCA-based approach on level-set surfaces reconstructed from point clouds captured by a 3D photogrammetry system. The prediction accuracy was evaluated in terms of root-mean-squared-error. Our proposed method achieved consistent higher prediction accuracy (sub-millimeter) for both 200 ms and 600 ms lookahead lengths compared to the PCA-based approach, and the performance gain was statistically significant.
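The learn-on-a-manifold, map-back-to-state-space pipeline can be sketched with scikit-learn's `KernelPCA`, whose `fit_inverse_transform` option learns a ridge-based pre-image map (a stand-in for the fixed-point iteration used in the paper; the synthetic "surface states" driven by a one-dimensional phase are an illustrative assumption):

```python
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(9)
# Synthetic high-dimensional states: an 80-D signal driven by a 1-D phase,
# standing in for surfaces observed over a respiratory cycle.
phase = np.linspace(0.0, 4.0 * np.pi, 200)
basis = rng.standard_normal((2, 80))
states = np.outer(np.sin(phase), basis[0]) + np.outer(np.cos(phase), basis[1])

# Kernel PCA builds a low-dimensional feature manifold; fit_inverse_transform
# additionally learns a pre-image map back to the original state space.
kpca = KernelPCA(n_components=4, kernel="rbf", gamma=0.01,
                 fit_inverse_transform=True, alpha=1e-3)
Z = kpca.fit_transform(states)

# In the full method, prediction happens on the latent manifold; here we
# simply map the latent codes back and check the round trip.
recon = kpca.inverse_transform(Z)
err = np.sqrt(np.mean((recon - states) ** 2))
baseline = np.sqrt(np.mean((states - states.mean(axis=0)) ** 2))
```

The prediction step itself (e.g. an autoregressive model on Z) would run in the low-dimensional space, and the pre-image map carries the forecast back to the full-resolution state, sidestepping the curse of dimensionality.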
Maximal violation of tight Bell inequalities for maximal high-dimensional entanglement
Lee, Seung-Woo; Jaksch, Dieter
2009-07-15
We propose a Bell inequality for high-dimensional bipartite systems obtained by binning local measurement outcomes and show that it is tight. We find a binning method for even d-dimensional measurement outcomes for which this Bell inequality is maximally violated by maximally entangled states. Furthermore, we demonstrate that the Bell inequality is applicable to continuous variable systems and yields strong violations for two-mode squeezed states.
Survey on granularity clustering.
Ding, Shifei; Du, Mingjing; Zhu, Hong
2015-12-01
With the rapid development of uncertain artificial intelligence and the arrival of the big data era, conventional clustering analysis and granular computing fail to satisfy the requirements of intelligent information processing in this new setting. There is an essential relationship between granular computing and clustering analysis, so some researchers have tried to combine the two. Working from the idea of granularity, researchers have extended clustering analysis and searched for the best clustering results with the help of the basic theories and methods of granular computing. The granularity clustering methods proposed and studied so far have attracted more and more attention. This paper first summarizes the background of granularity clustering and the intrinsic connection between granular computing and clustering analysis, and then reviews the research status and the various methods of granularity clustering. Finally, we analyze existing problems and propose directions for further research.
Algamal, Zakariya Yahya; Lee, Muhammad Hisyam
2015-12-01
Cancer classification and gene selection in high-dimensional data have been popular research topics in genetics and molecular biology. Recently, adaptive regularized logistic regression using the elastic net regularization, which is called the adaptive elastic net, has been successfully applied in high-dimensional cancer classification to tackle both estimating the gene coefficients and performing gene selection simultaneously. The adaptive elastic net originally used elastic net estimates as the initial weight; however, using this weight may not be preferable for certain reasons: first, the elastic net estimator is biased in selecting genes; second, it does not perform well when the pairwise correlations between variables are not high. Adjusted adaptive regularized logistic regression (AAElastic) is proposed to address these issues and to encourage grouping effects simultaneously. The real data results indicate that AAElastic is significantly more consistent in selecting genes compared to the other three competitor regularization methods. Additionally, the classification performance of AAElastic is comparable to the adaptive elastic net and better than the other regularization methods. Thus, we can conclude that AAElastic is a reliable adaptive regularized logistic regression method in the field of high-dimensional cancer classification.
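The two-stage idea behind adaptive penalties can be sketched as follows. This is a generic adaptive elastic net, not the paper's AAElastic: an initial ridge logistic fit supplies per-gene weights, and an elastic-net fit on rescaled columns penalizes each coefficient in proportion to its weight. The simulated data and all parameter values are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 150, 40
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:4] = 2.0                    # 4 informative "genes"
y = (X @ beta + rng.normal(size=n) > 0).astype(int)

# Stage 1: an initial ridge fit supplies the adaptive weights w_j = 1/|b_j|
init = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)
w = 1.0 / (np.abs(init.coef_.ravel()) + 1e-4)

# Stage 2: elastic-net fit on the rescaled columns X_j / w_j, then map back,
# so each coefficient is effectively penalized in proportion to w_j
enet = LogisticRegression(penalty="elasticnet", l1_ratio=0.8, C=0.5,
                          solver="saga", max_iter=5000).fit(X / w, y)
coef = enet.coef_.ravel() / w                         # coefficients on the original scale
```

Genes with weak initial signal receive large weights and are shrunk to zero, while strongly informative genes are penalized lightly.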
Tian, Xinyu; Wang, Xuefeng; Chen, Jun
2014-01-01
The classic multinomial logit model, commonly used in multiclass regression problems, is restricted to few predictors and does not take into account the relationships among variables. It has limited use for genomic data, where the number of genomic features far exceeds the sample size. Genomic features such as gene expressions are usually related through an underlying biological network. Efficient use of the network information is important to improve classification performance as well as biological interpretability. We propose a multinomial logit model that is capable of addressing both the high dimensionality of the predictors and the underlying network information. Group lasso is used to induce model sparsity, and a network constraint is imposed to induce smoothness of the coefficients with respect to the underlying network structure. To deal with the non-smoothness of the objective function in optimization, we developed a proximal gradient algorithm for efficient computation. The proposed model was compared to models with no prior structure information in both simulations and a problem of cancer subtype prediction with real TCGA (The Cancer Genome Atlas) gene expression data. The network-constrained model outperformed the traditional ones in both cases. PMID:25635165
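The proximal gradient scheme mentioned above can be sketched with a simplified objective: squared loss plus a lasso penalty (a stand-in for group lasso) plus a graph-Laplacian smoothness term, whose prox step is plain soft-thresholding. The toy network and penalty values are assumptions, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 20
X = rng.normal(size=(n, p))
b_true = np.zeros(p); b_true[0], b_true[1] = 3.0, 2.5   # two linked "genes"
y = X @ b_true + 0.1 * rng.normal(size=n)

# toy network: features 0 and 1 are connected; L is the graph Laplacian
L = np.zeros((p, p)); L[0, 0] = L[1, 1] = 1.0; L[0, 1] = L[1, 0] = -1.0

def soft(v, t):
    # proximal operator of the L1 norm: soft-thresholding
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_grad(X, y, L, lam1=5.0, lam2=1.0, iters=500):
    # minimize 0.5*||y - Xb||^2 + lam2 * b'Lb   (smooth part, in the gradient)
    #          + lam1 * ||b||_1                 (non-smooth, handled by the prox)
    step = 1.0 / (np.linalg.norm(X, 2) ** 2 + 4 * lam2)  # 1/Lipschitz bound
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (X @ b - y) + 2 * lam2 * (L @ b)
        b = soft(b - step * grad, step * lam1)
    return b

b = prox_grad(X, y, L)
```

The Laplacian term pulls the coefficients of connected features toward each other, while soft-thresholding zeroes out unconnected noise features.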
Convex Discriminative Multitask Clustering.
Zhang, Xiao-Lei
2015-01-01
Multitask clustering tries to improve the clustering performance of multiple tasks simultaneously by taking their relationship into account. Most existing multitask clustering algorithms fall into the type of generative clustering, and none are formulated as convex optimization problems. In this paper, we propose two convex Discriminative Multitask Clustering (DMTC) objectives to address the problems. The first one aims to learn a shared feature representation, which can be seen as a technical combination of the convex multitask feature learning and the convex Multiclass Maximum Margin Clustering (M3C). The second one aims to learn the task relationship, which can be seen as a combination of the convex multitask relationship learning and M3C. The objectives of the two algorithms are solved in a uniform procedure by the efficient cutting-plane algorithm and further unified in the Bayesian framework. Experimental results on a toy problem and two benchmark data sets demonstrate the effectiveness of the proposed algorithms. PMID:26353206
ERIC Educational Resources Information Center
Dishion, Thomas J.; Ha, Thao; Veronneau, Marie-Helene
2012-01-01
The authors propose that peer relationships should be included in a life history perspective on adolescent problem behavior. Longitudinal analyses were used to examine deviant peer clustering as the mediating link between attenuated family ties, peer marginalization, and social disadvantage in early adolescence and sexual promiscuity in middle…
Matlab Cluster Ensemble Toolbox
Sapio, Vincent De; Kegelmeyer, Philip
2009-04-27
This is a Matlab toolbox for investigating the application of cluster ensembles to data classification, with the objective of improving the accuracy and/or speed of clustering. The toolbox divides the cluster ensemble problem into four areas, providing functionality for each. These include: (1) synthetic data generation, (2) clustering to generate individual data partitions and similarity matrices, (3) consensus function generation and final clustering to generate an ensemble data partitioning, and (4) implementation of accuracy metrics. With regard to data generation, Gaussian data of arbitrary dimension can be generated. The kcenters algorithm can then be used to generate individual data partitions by either (a) subsampling the data and clustering each subsample, or (b) randomly initializing the algorithm and generating a clustering for each initialization. In either case, an overall similarity matrix can be computed using a consensus function operating on the individual similarity matrices. A final clustering can be performed, and performance metrics are provided for evaluation purposes.
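Steps (2) and (3) of the toolbox's pipeline can be sketched outside Matlab as well. This hypothetical Python version builds a co-association (consensus) matrix from repeated k-means runs and extracts a final partition from its thresholded graph; the data, number of runs, and threshold are illustrative choices, not the toolbox's defaults.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, rng, iters=20):
    # plain Lloyd's algorithm with random initial centers
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        lab = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(lab == j):
                C[j] = X[lab == j].mean(0)
    return lab

# two well-separated Gaussian blobs
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(5, 0.3, (30, 2))])

# (1) ensemble of partitions from random restarts
labels = [kmeans(X, 2, rng) for _ in range(10)]

# (2) co-association (consensus) matrix: fraction of runs grouping i and j
S = np.mean([(l[:, None] == l[None, :]).astype(float) for l in labels], axis=0)

# (3) final clustering: connected components of the thresholded similarity
A = S > 0.5
final = -np.ones(len(X), dtype=int)
c = 0
for i in range(len(X)):
    if final[i] < 0:
        stack, final[i] = [i], c
        while stack:
            u = stack.pop()
            for v in np.nonzero(A[u])[0]:
                if final[v] < 0:
                    final[v] = c
                    stack.append(v)
        c += 1
```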
Jiang, Xia; Cai, Binghuang; Xue, Diyang; Lu, Xinghua; Cooper, Gregory F; Neapolitan, Richard E
2014-01-01
Objective: The objective of this investigation is to evaluate binary prediction methods for predicting disease status using high-dimensional genomic data. The central hypothesis is that the Bayesian network (BN)-based method called efficient Bayesian multivariate classifier (EBMC) will do well at this task because EBMC builds on BN-based methods that have performed well at learning epistatic interactions. Method: We evaluate how well eight methods perform binary prediction using high-dimensional discrete genomic datasets containing epistatic interactions. The methods are as follows: naive Bayes (NB), model averaging NB (MANB), feature selection NB (FSNB), EBMC, logistic regression (LR), support vector machines (SVM), Lasso, and extreme learning machines (ELM). We use a hundred 1000-single nucleotide polymorphism (SNP) simulated datasets, ten 10 000-SNP datasets, six semi-synthetic sets, and two real genome-wide association studies (GWAS) datasets in our evaluation. Results: In fivefold cross-validation studies, the SVM performed best on the 1000-SNP dataset, while the BN-based methods performed best on the other datasets, with EBMC exhibiting the best overall performance. In-sample testing indicates that LR, SVM, Lasso, ELM, and NB tend to overfit the data. Discussion: EBMC performed better than NB when there are several strong predictors, whereas NB performed better when there are many weak predictors. Furthermore, for all BN-based methods, prediction capability did not degrade as the dimension increased. Conclusions: Our results support the hypothesis that EBMC performs well at binary outcome prediction using high-dimensional discrete datasets containing epistatic-like interactions. Future research using more GWAS datasets is needed to further investigate the potential of EBMC. PMID:24737607
Non-Asymptotic Oracle Inequalities for the High-Dimensional Cox Regression via Lasso
Kong, Shengchun; Nan, Bin
2013-01-01
We consider finite sample properties of the regularized high-dimensional Cox regression via lasso. Existing literature focuses on linear models or generalized linear models with Lipschitz loss functions, where the empirical risk functions are the summations of independent and identically distributed (iid) losses. The summands in the negative log partial likelihood function for censored survival data, however, are neither iid nor Lipschitz. We first approximate the negative log partial likelihood function by a sum of iid non-Lipschitz terms, then derive the non-asymptotic oracle inequalities for the lasso penalized Cox regression using pointwise arguments to tackle the difficulties caused by lacking iid Lipschitz losses. PMID:24516328
Scale-Invariant Sparse PCA on High Dimensional Meta-elliptical Data
Han, Fang; Liu, Han
2014-01-01
We propose a semiparametric method for conducting scale-invariant sparse principal component analysis (PCA) on high dimensional non-Gaussian data. Compared with sparse PCA, our method has weaker modeling assumptions and is more robust to possible data contamination. Theoretically, the proposed method achieves a parametric rate of convergence in estimating the parameters of interest under a flexible semiparametric distribution family; computationally, it exploits a rank-based procedure and is as efficient as sparse PCA; empirically, it outperforms most competing methods on both synthetic and real-world datasets. PMID:24932056
High-dimensional chaos from self-sustained collisions of solitons
Yildirim, O. Ozgur; Ham, Donhee
2014-06-16
We experimentally demonstrate chaos generation based on collisions of electrical solitons on a nonlinear transmission line. The nonlinear line creates solitons, and an amplifier connected to it provides gain to these solitons for their self-excitation and self-sustenance. Critically, the amplifier also provides a mechanism to enable and intensify collisions among solitons. These collisional interactions are of intrinsically nonlinear nature, modulating the phase and amplitude of solitons, thus causing chaos. This chaos generated by the exploitation of the nonlinear wave phenomena is inherently high-dimensional, which we also demonstrate.
Fast time-series prediction using high-dimensional data: Evaluating confidence interval credibility
NASA Astrophysics Data System (ADS)
Hirata, Yoshito
2014-05-01
I propose an index for evaluating the credibility of confidence intervals for future observables predicted from high-dimensional time-series data. The index evaluates the distance from the current state to the data manifold. I demonstrate the index with artificial datasets generated from the Lorenz'96 II model [Lorenz, in Proceedings of the Seminar on Predictability, Vol. 1 (ECMWF, Reading, UK, 1996), p. 1], the Lorenz'96 I model [Hansen and Smith, J. Atmos. Sci. 57, 2859 (2000), 10.1175/1520-0469(2000)057<2859:TROOCI>2.0.CO;2], and the coupled map lattice, and a real dataset for the solar irradiation around Japan.
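A minimal version of such an index can be sketched as a nearest-neighbor distance to the training data; the paper's exact definition of the distance to the data manifold may differ, so treat the function below, its `k` parameter, and the circle-shaped toy manifold as assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# training "manifold": delay vectors lying on a circle (a 1-D manifold in 2-D)
theta = rng.uniform(0, 2 * np.pi, 500)
train = np.c_[np.cos(theta), np.sin(theta)]

def credibility_index(state, data, k=5):
    # distance from the current state to the data manifold, approximated
    # by the mean distance to the k nearest training states; a large value
    # flags a state where predicted confidence intervals deserve less trust
    d = np.sqrt(((data - state) ** 2).sum(axis=1))
    return float(np.sort(d)[:k].mean())

on_manifold = credibility_index(np.array([1.0, 0.0]), train)
off_manifold = credibility_index(np.array([3.0, 3.0]), train)
```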
Nesting Monte Carlo EM for high-dimensional item factor analysis
An, Xinming; Bentler, Peter M.
2012-01-01
The item factor analysis model for investigating multidimensional latent spaces has proved to be useful. Parameter estimation in this model requires computationally demanding high-dimensional integrations. While several approaches to approximate such integrations have been proposed, they suffer various computational difficulties. This paper proposes a Nesting Monte Carlo Expectation-Maximization (MCEM) algorithm for item factor analysis with binary data. Simulation studies and a real data example suggest that the Nesting MCEM approach can significantly improve computational efficiency while also enjoying the good properties of stable convergence and easy implementation. PMID:23329857
Inferring biological tasks using Pareto analysis of high-dimensional data.
Hart, Yuval; Sheftel, Hila; Hausser, Jean; Szekely, Pablo; Ben-Moshe, Noa Bossel; Korem, Yael; Tendler, Avichai; Mayo, Avraham E; Alon, Uri
2015-03-01
We present the Pareto task inference method (ParTI; http://www.weizmann.ac.il/mcb/UriAlon/download/ParTI) for inferring biological tasks from high-dimensional biological data. Data are described as a polytope, and features maximally enriched closest to the vertices (or archetypes) allow identification of the tasks the vertices represent. We demonstrate that human breast tumors and mouse tissues are well described by tetrahedrons in gene expression space, with specific tumor types and biological functions enriched at each of the vertices, suggesting four key tasks. PMID:25622107
NASA Astrophysics Data System (ADS)
Laloy, Eric; Linde, Niklas; Jacques, Diederik; Mariethoz, Grégoire
2016-04-01
The sequential geostatistical resampling (SGR) algorithm is a Markov chain Monte Carlo (MCMC) scheme for sampling from possibly non-Gaussian, complex spatially-distributed prior models such as geologic facies or categorical fields. In this work, we highlight the limits of standard SGR for posterior inference of high-dimensional categorical fields with realistically complex likelihood landscapes and benchmark a parallel tempering implementation (PT-SGR). Our proposed PT-SGR approach is demonstrated using synthetic (error corrupted) data from steady-state flow and transport experiments in categorical 7575- and 10,000-dimensional 2D conductivity fields. In both case studies, every SGR trial gets trapped in local optima while PT-SGR maintains a higher diversity in the sampled model states. The advantage of PT-SGR is most apparent in an inverse transport problem where the posterior distribution is made bimodal by construction. PT-SGR then converges towards the appropriate data misfit much faster than SGR and partly recovers the two modes. In contrast, for the same computational resources SGR does not fit the data to the appropriate error level and hardly produces even a locally optimal solution that looks visually similar to one of the two reference modes. Although PT-SGR clearly surpasses SGR in performance, our results also indicate that using a small number (16-24) of temperatures (and thus parallel cores) may not permit complete sampling of the posterior distribution by PT-SGR within a reasonable computational time (less than 1-2 weeks).
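The parallel tempering machinery underlying PT-SGR can be illustrated on a toy bimodal target. This is generic parallel tempering with Metropolis updates, not SGR proposals; the temperature ladder, target density, and chain length are invented for the demo. Swaps between neighbouring temperatures let the cold chain escape the local mode it would otherwise be trapped in.

```python
import numpy as np

rng = np.random.default_rng(1)

def logp(x):
    # bimodal target: mixture of two well-separated unit Gaussians
    return np.logaddexp(-0.5 * (x - 4) ** 2, -0.5 * (x + 4) ** 2)

temps = [1.0, 2.0, 4.0, 8.0]           # temperature ladder
x = np.zeros(len(temps))               # one walker per temperature
cold = []                              # samples from the T=1 chain

for it in range(20000):
    for i, T in enumerate(temps):      # within-chain Metropolis updates
        prop = x[i] + rng.normal(0, 1.0)
        if np.log(rng.random()) < (logp(prop) - logp(x[i])) / T:
            x[i] = prop
    j = rng.integers(len(temps) - 1)   # attempt one neighbour swap
    a = (logp(x[j]) - logp(x[j + 1])) * (1 / temps[j + 1] - 1 / temps[j])
    if np.log(rng.random()) < a:
        x[j], x[j + 1] = x[j + 1], x[j]
    cold.append(x[0])

cold = np.array(cold[5000:])           # discard burn-in
```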
NASA Astrophysics Data System (ADS)
Venghaus, Florian; Eisfeld, Wolfgang
2016-03-01
Robust diabatization techniques are key for the development of high-dimensional coupled potential energy surfaces (PESs) to be used in multi-state quantum dynamics simulations. In the present study we demonstrate that, besides the actual diabatization technique, common problems with the underlying electronic structure calculations can be the reason why a diabatization fails. After giving a short review of the theoretical background of diabatization, we propose a method based on block-diagonalization to analyse the electronic structure data. This analysis tool can be used in three different ways: First, it allows one to detect issues with the ab initio reference data and is used to optimize the setup of the electronic structure calculations. Second, the data from the block-diagonalization are utilized for the development of optimal parametrized diabatic model matrices by identifying the most significant couplings. Third, the block-diagonalization data are used to fit the parameters of the diabatic model, which yields an optimal initial guess for the non-linear fitting required by standard or more advanced energy-based diabatization methods. The new approach is demonstrated by the diabatization of 9 electronic states of the propargyl radical, yielding fully coupled full-dimensional (12D) PESs in closed form.
Unbiased Prediction and Feature Selection in High-Dimensional Survival Regression
Laimighofer, Michael; Krumsiek, Jan; Theis, Fabian J.
2016-01-01
With widespread availability of omics profiling techniques, the analysis and interpretation of high-dimensional omics data, for example, for biomarkers, is becoming an increasingly important part of clinical medicine because such datasets constitute a promising resource for predicting survival outcomes. However, early experience has shown that biomarkers often generalize poorly. Thus, it is crucial that models are not overfitted and give accurate results with new data. In addition, reliable detection of multivariate biomarkers with high predictive power (feature selection) is of particular interest in clinical settings. We present an approach that addresses both aspects in high-dimensional survival models. Within a nested cross-validation (CV), we fit a survival model, evaluate a dataset in an unbiased fashion, and select features with the best predictive power by applying a weighted combination of CV runs. We evaluate our approach using simulated toy data, as well as three breast cancer datasets, to predict the survival of breast cancer patients after treatment. In all datasets, we achieve more reliable estimation of predictive power for unseen cases and better predictive performance compared to the standard CoxLasso model. Taken together, we present a comprehensive and flexible framework for survival models, including performance estimation, final feature selection, and final model construction. The proposed algorithm is implemented in an open source R package (SurvRank) available on CRAN. PMID:26894327
Yu, Yafei; Wen, Hua; Li, Hua; Peng, Xinhua
2011-03-15
One can uniquely identify an unknown state of a quantum system S by measuring a "quorum" consisting of a complete set of noncommuting observables. It is also possible to determine the quantum state by repeated measurements of a single, factorized observable, when the system S is coupled to an assistant system A whose initial state is known. This is because the redistribution of the information about the unknown quantum state into the composite system A+S results in a one-to-one mapping between the unknown density matrix elements and the probabilities of the occurrence of the eigenvalues of a single, factorized observable of the composite system. Here we focus on quantum state tomography of high-dimensional quantum systems (e.g., a spin greater than 1/2) via a single observable. We determine the condition for the best determination and the upper bound to achieve the most robust measurements. From an experimental point of view, we require a suitable interaction Hamiltonian to maximize the measurement efficiency. For this we numerically investigate a three-level system. Moreover, the error analysis for the different-dimensional quantum states shows that the present measurement method is still very effective in determining an unknown state of a high-dimensional quantum system.
The cross-validated AUC for MCP-logistic regression with high-dimensional data.
Jiang, Dingfeng; Huang, Jian; Zhang, Ying
2013-10-01
We propose a cross-validated area under the receiver operating characteristic (ROC) curve (CV-AUC) criterion for tuning parameter selection for penalized methods in sparse, high-dimensional logistic regression models. We use this criterion in combination with the minimax concave penalty (MCP) method for variable selection. The CV-AUC criterion is specifically designed for optimizing the classification performance for binary outcome data. To implement the proposed approach, we derive an efficient coordinate descent algorithm to compute the MCP-logistic regression solution surface. Simulation studies are conducted to evaluate the finite sample performance of the proposed method and to compare it with existing methods, including the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and the extended BIC (EBIC). The model selected based on the CV-AUC criterion tends to have a larger predictive AUC and a smaller classification error than those with tuning parameters selected using the AIC, BIC, or EBIC. We illustrate the application of the MCP-logistic regression with the CV-AUC criterion on three microarray datasets from studies that attempt to identify genes related to cancers. Our simulation studies and data examples demonstrate that the CV-AUC is an attractive method for tuning parameter selection for penalized methods in high-dimensional logistic regression models.
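Tuning-parameter selection by cross-validated AUC can be sketched with an L1-penalized logistic model as a stand-in for MCP (scikit-learn has no MCP penalty); the grid, simulated data, and fold count below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:3] = 2.0          # 3 informative features
y = (X @ beta + rng.normal(size=n) > 0).astype(int)

def cv_auc(C, folds=5):
    # mean out-of-fold AUC for one value of the tuning parameter C
    aucs = []
    for tr, te in StratifiedKFold(folds, shuffle=True, random_state=0).split(X, y):
        m = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X[tr], y[tr])
        aucs.append(roc_auc_score(y[te], m.decision_function(X[te])))
    return float(np.mean(aucs))

grid = [0.01, 0.1, 1.0, 10.0]
best_C = max(grid, key=cv_auc)              # pick the tuning value maximizing CV-AUC
```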
Challenges and approaches to statistical design and inference in high-dimensional investigations.
Gadbury, Gary L; Garrett, Karen A; Allison, David B
2009-01-01
Advances in modern technologies have facilitated high-dimensional experiments (HDEs) that generate tremendous amounts of genomic, proteomic, and other "omic" data. HDEs involving whole-genome sequences and polymorphisms, expression levels of genes, protein abundance measurements, and combinations thereof have become a vanguard for new analytic approaches to the analysis of HDE data. Such situations demand creative approaches to the processes of statistical inference, estimation, prediction, classification, and study design. The novel and challenging biological questions asked from HDE data have resulted in many specialized analytic techniques being developed. This chapter discusses some of the unique statistical challenges facing investigators studying high-dimensional biology and describes some approaches being developed by statistical scientists. We have included some focus on the increasing interest in questions involving testing multiple propositions simultaneously, appropriate inferential indicators for the types of questions biologists are interested in, and the need for replication of results across independent studies, investigators, and settings. A key consideration inherent throughout is the challenge in providing methods that a statistician judges to be sound and a biologist finds informative.
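One concrete tool for the simultaneous-testing issue raised above is the Benjamini-Hochberg step-up procedure, shown here as a self-contained sketch; the chapter itself does not prescribe this particular method, and the p-values below are invented for the demo.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    # step-up FDR control: find the largest k with p_(k) <= (k/m) * q
    # and reject the hypotheses with the k smallest p-values
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    thresh = q * (np.arange(1, m + 1) / m)
    below = p[order] <= thresh
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject

reject = benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.3, 0.9], q=0.05)
```

Here only the two smallest p-values survive: the third-smallest, 0.039, exceeds its threshold 3/8 * 0.05 = 0.01875, so the step-up stops at k = 2.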
High-dimensional quantum state transfer in a noisy network environment
NASA Astrophysics Data System (ADS)
Qin, Wei; Li, Jun-Lin; Long, Gui-Lu
2015-04-01
We propose and analyze an efficient high-dimensional quantum state transfer protocol in an XX coupling spin network with a hypercube structure or chain structure. Under free spin wave approximation, unitary evolution results in a perfect high-dimensional quantum swap operation requiring neither external manipulation nor weak coupling. Evolution time is independent of either distance between registers or dimensions of sent states, which can improve the computational efficiency. In the low temperature regime and thermodynamic limit, the decoherence caused by a noisy environment is studied with a model of an antiferromagnetic spin bath coupled to quantum channels via an Ising-type interaction. It is found that while the decoherence reduces the fidelity of state transfer, increasing intra-channel coupling can strongly suppress such an effect. These observations demonstrate the robustness of the proposed scheme. Project supported by the National Natural Science Foundation of China (Grant Nos. 11175094 and 91221205) and the National Basic Research Program of China (Grant No. 2011CB9216002). Long Gui-Lu also thanks the support of Center of Atomic and Molecular Nanoscience of Tsinghua University, China.
Validi, AbdoulAhad
2014-03-01
This study introduces a non-intrusive approach in the context of low-rank separated representation to construct a surrogate of high-dimensional stochastic functions, e.g., PDEs/ODEs, in order to decrease the computational cost of Markov Chain Monte Carlo simulations in Bayesian inference. The surrogate model is constructed via a regularized alternative least-square regression with Tikhonov regularization using a roughening matrix computing the gradient of the solution, in conjunction with a perturbation-based error indicator to detect optimal model complexities. The model approximates a vector of a continuous solution at discrete values of a physical variable. The required number of random realizations to achieve a successful approximation linearly depends on the function dimensionality. The computational cost of the model construction is quadratic in the number of random inputs, which potentially tackles the curse of dimensionality in high-dimensional stochastic functions. Furthermore, this vector-valued separated representation-based model, in comparison to the available scalar-valued case, leads to a significant reduction in the cost of approximation by an order of magnitude equal to the vector size. The performance of the method is studied through its application to three numerical examples including a 41-dimensional elliptic PDE and a 21-dimensional cavity flow.
A Robust Supervised Variable Selection for Noisy High-Dimensional Data
Kalina, Jan; Schlenker, Anna
2015-01-01
The Minimum Redundancy Maximum Relevance (MRMR) approach to supervised variable selection represents a successful methodology for dimensionality reduction, which is suitable for high-dimensional data observed in two or more different groups. Various available versions of the MRMR approach have been designed to search for variables with the largest relevance for a classification task while controlling for redundancy of the selected set of variables. However, usual relevance and redundancy criteria have the disadvantages of being too sensitive to the presence of outlying measurements and/or being inefficient. We propose a novel approach called Minimum Regularized Redundancy Maximum Robust Relevance (MRRMRR), suitable for noisy high-dimensional data observed in two groups. It combines principles of regularization and robust statistics. Particularly, redundancy is measured by a new regularized version of the coefficient of multiple correlation and relevance is measured by a highly robust correlation coefficient based on the least weighted squares regression with data-adaptive weights. We compare various dimensionality reduction methods on three real data sets. To investigate the influence of noise or outliers on the data, we perform the computations also for data artificially contaminated by severe noise of various forms. The experimental results confirm the robustness of the method with respect to outliers. PMID:26137474
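The relevance-minus-redundancy idea can be sketched with a rank-based (hence somewhat outlier-resistant) correlation in place of the paper's regularized and least-weighted-squares criteria; this greedy loop is a simplified MRMR, not MRRMRR, and the toy data are invented.

```python
import numpy as np

def spearman(a, b):
    # rank correlation: Pearson correlation of the rank transforms
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean(); rb -= rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))

def mrmr(X, y, n_select):
    # greedy: maximize relevance to y minus mean redundancy to the chosen set
    remaining = list(range(X.shape[1]))
    chosen = []
    while len(chosen) < n_select:
        def score(j):
            rel = abs(spearman(X[:, j], y))
            red = np.mean([abs(spearman(X[:, j], X[:, k])) for k in chosen]) if chosen else 0.0
            return rel - red
        best = max(remaining, key=score)
        chosen.append(best); remaining.remove(best)
    return chosen

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 6))
X[:, 3] = X[:, 0] + 0.05 * rng.normal(size=n)   # redundant copy of feature 0
y = X[:, 0] + 0.2 * rng.normal(size=n)
sel = mrmr(X, y, 3)
```

The informative feature (or its near-duplicate) is picked first; the redundancy term then discourages picking the duplicate again.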
Pang, Herbert; Jung, Sin-Ho
2013-04-01
A variety of prediction methods are used to relate high-dimensional genome data with a clinical outcome using a prediction model. Once a prediction model is developed from a data set, it should be validated using a resampling method or an independent data set. Although the existing prediction methods have been intensively evaluated by many investigators, there has not been a comprehensive study investigating the performance of the validation methods, especially with a survival clinical outcome. Understanding the properties of the various validation methods can allow researchers to perform more powerful validations while controlling for type I error. In addition, a sample size calculation strategy based on these validation methods is lacking. We conduct extensive simulations to examine the statistical properties of these validation strategies. In both simulations and a real data example, we have found that 10-fold cross-validation with permutation gave the best power while controlling type I error close to the nominal level. Based on this, we have also developed a sample size calculation method that will be used to design a validation study with a user-chosen combination of prediction and validation methods. Microarray and genome-wide association studies data are used as illustrations. The power calculation method presented here can be used for the design of any biomedical studies involving high-dimensional data and survival outcomes.
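Cross-validation with permutation, the best-performing validation strategy above, can be sketched as follows for a binary outcome (the paper studies survival outcomes; the classifier, simulated data, and number of permutations here are illustrative stand-ins).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=120) > 0).astype(int)

def cv_score(labels):
    # 10-fold cross-validated accuracy for a given labeling
    return cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=10).mean()

observed = cv_score(y)
# null distribution: CV score under permuted (signal-free) labels
null = [cv_score(rng.permutation(y)) for _ in range(30)]
pval = (1 + sum(s >= observed for s in null)) / (1 + len(null))
```

A small p-value indicates the cross-validated performance exceeds what label-shuffled data can produce, which is the type I error control the abstract refers to.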
Methods for discovery and characterization of cell subsets in high dimensional mass cytometry data.
Diggins, Kirsten E; Ferrell, P Brent; Irish, Jonathan M
2015-07-01
The flood of high-dimensional data resulting from mass cytometry experiments that measure more than 40 features of individual cells has stimulated creation of new single cell computational biology tools. These tools draw on advances in the field of machine learning to capture multi-parametric relationships and reveal cells that are easily overlooked in traditional analysis. Here, we introduce a workflow for high dimensional mass cytometry data that emphasizes unsupervised approaches and visualizes data in both single cell and population level views. This workflow includes three central components that are common across mass cytometry analysis approaches: (1) distinguishing initial populations, (2) revealing cell subsets, and (3) characterizing subset features. In the implementation described here, viSNE, SPADE, and heatmaps were used sequentially to comprehensively characterize and compare healthy and malignant human tissue samples. The use of multiple methods helps provide a comprehensive view of results, and the largely unsupervised workflow facilitates automation and helps researchers avoid missing cell populations with unusual or unexpected phenotypes. Together, these methods develop a framework for future machine learning of cell identity.
Prediction of Incident Diabetes in the Jackson Heart Study Using High-Dimensional Machine Learning
Casanova, Ramon; Saldana, Santiago; Simpson, Sean L.; Lacy, Mary E.; Subauste, Angela R.; Blackshear, Chad; Wagenknecht, Lynne; Bertoni, Alain G.
2016-01-01
Statistical models to predict incident diabetes are often based on limited variables. Here we pursued two main goals: 1) investigate the relative performance of a machine learning method such as Random Forests (RF) for detecting incident diabetes in a high-dimensional setting defined by a large set of observational data, and 2) uncover potential predictors of diabetes. The Jackson Heart Study collected data at baseline and in two follow-up visits from 5,301 African Americans. We excluded those with baseline diabetes and those without follow-up, leaving 3,633 individuals for analyses. Over a mean 8-year follow-up, 584 participants developed diabetes. The full RF model evaluated 93 variables including demographic, anthropometric, blood biomarker, medical history, and echocardiogram data. We also used RF metrics of variable importance to rank variables according to their contribution to diabetes prediction. We implemented other models based on logistic regression and RF where features were preselected. The full RF model's performance (AUC = 0.82) was similar to that of the more parsimonious models. The top-ranked variables according to RF included hemoglobin A1C, fasting plasma glucose, waist circumference, adiponectin, C-reactive protein, triglycerides, leptin, left ventricular mass, high-density lipoprotein cholesterol, and aldosterone. This work shows the potential of RF for incident diabetes prediction while dealing with high-dimensional data. PMID:27727289
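A full random forest is too long to reproduce here, but a model-agnostic permutation importance conveys the same variable-ranking idea the abstract describes: shuffle one feature and measure the drop in score. The scorer and data below are illustrative, not the Jackson Heart Study pipeline:

```python
import numpy as np

def permutation_importance(X, y, score_fn, n_repeats=10, seed=0):
    """Rank features by the average drop in score when each feature is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = score_fn(X, y)
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # destroy feature j's association
            drops.append(baseline - score_fn(Xp, y))
        importance[j] = np.mean(drops)
    return importance
```

Features whose shuffling costs the most accuracy rank highest, analogous to RF importance placing hemoglobin A1C and fasting glucose at the top.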
Incremental isometric embedding of high-dimensional data using connected neighborhood graphs.
Zhao, Dongfang; Yang, Li
2009-01-01
Most nonlinear data embedding methods use bottom-up approaches for capturing the underlying structure of data distributed on a manifold in high dimensional space. These methods often share the first step, which defines neighbor points of every data point by building a connected neighborhood graph so that all data points can be embedded to a single coordinate system. These methods are required to work incrementally for dimensionality reduction in many applications. Because the input data stream may be under-sampled or skewed from time to time, building a connected neighborhood graph is crucial to the success of incremental data embedding using these methods. This paper presents algorithms for updating k-edge-connected and k-connected neighborhood graphs after a new data point is added or an old data point is deleted. It further utilizes a simple algorithm for updating all-pair shortest distances on the neighborhood graph. Together with incremental classical multidimensional scaling using iterative subspace approximation, this paper devises an incremental version of Isomap with enhancements to deal with under-sampled or unevenly distributed data. Experiments on both synthetic and real-world data sets show that the algorithm is efficient and maintains low dimensional configurations of high dimensional data under various data distributions. PMID:19029548
A novel divide-and-merge classification for high dimensional datasets.
Seo, Minseok; Oh, Sejong
2013-02-01
High dimensional datasets contain up to thousands of features, and can result in immense computational costs for classification tasks. Therefore, these datasets need a feature selection step before the classification process. The main idea behind feature selection is to choose a useful subset of features to significantly improve the comprehensibility of a classifier and maximize the performance of a classification algorithm. In this paper, we propose a one-per-class model for high dimensional datasets. In the proposed method, we extract a different feature subset for each class in a dataset and apply the classification process on the multiple feature subsets. Finally, we merge the prediction results of the feature subsets and determine the final class label of an unknown instance. The originality of the proposed model is to use an appropriate feature subset for each class. To show the usefulness of the proposed approach, we have developed an application method following the proposed model. Our results confirm that the method produces higher classification accuracy than previously proposed feature selection and classification methods.
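The one-per-class idea, a feature subset per class with the per-subset predictions merged at the end, can be sketched as follows. The scoring rule, centroid classifier, and all names are simplified stand-ins for the paper's method:

```python
import numpy as np

def per_class_subsets(X, y, k=3):
    """Pick, for each class, the k features that best separate it from the rest."""
    subsets = {}
    for c in np.unique(y):
        in_c = y == c
        score = np.abs(X[in_c].mean(0) - X[~in_c].mean(0)) / (X.std(0) + 1e-12)
        subsets[c] = np.argsort(score)[-k:]
    return subsets

def divide_and_merge_predict(X_train, y_train, X_test, k=3):
    """Classify on each class's own feature subset, then merge by nearest centroid."""
    subsets = per_class_subsets(X_train, y_train, k)
    classes = sorted(subsets)
    dist = np.empty((len(X_test), len(classes)))
    for i, c in enumerate(classes):
        cols = subsets[c]
        centroid = X_train[y_train == c][:, cols].mean(0)
        # standardize so distances over different subsets are comparable
        scale = X_train[:, cols].std(0) + 1e-12
        dist[:, i] = (((X_test[:, cols] - centroid) / scale) ** 2).sum(1)
    return np.array(classes)[dist.argmin(1)]
```

Each class is judged only on the features that matter for it, which is the key departure from selecting one global feature subset.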
Array-representation Integration Factor Method for High-dimensional Systems
Wang, Dongyong; Zhang, Lei; Nie, Qing
2013-01-01
High order spatial derivatives and stiff reactions often introduce severe temporal stability constraints on the time step in numerical methods. The implicit integration factor (IIF) method, which treats diffusion exactly and reactions implicitly, provides excellent stability properties with good efficiency by decoupling the treatment of reactions and diffusions. One major challenge for IIF is the storage and calculation of the potentially dense exponential matrices of the sparse discretization matrices resulting from the linear differential operators. Motivated by a compact representation for IIF (cIIF) for Laplacian operators in two and three dimensions, we introduce an array-representation technique for efficient handling of exponential matrices from a general linear differential operator that may include cross-derivatives and non-constant diffusion coefficients. In this approach, exponentials are only needed for matrices of small size that depend only on the order of derivatives and number of discretization points, independent of the size of spatial dimensions. This method is particularly advantageous for high dimensional systems, and it can be easily incorporated with IIF to preserve the excellent stability of IIF. Implementation and direct simulations of the array-representation compact IIF (AcIIF) on systems, such as Fokker-Planck equations in three and four dimensions and chemical master equations, in addition to reaction-diffusion equations, show efficiency, accuracy, and robustness of the new method. Such array-representation-based methods may have broad applications for simulating other complex systems involving high-dimensional data. PMID:24415797
Origins of Stochasticity and Burstiness in High-Dimensional Biochemical Networks
2009-01-01
Two major approaches are known in the field of stochastic dynamics of intracellular biochemical networks. The first one places the focus of attention on the fact that many biochemical constituents vitally important for the network functionality may be present only in small quantities within the cell, and therefore the regulatory process is essentially discrete and prone to relatively large fluctuations. The second approach treats the regulatory process as essentially continuous. Complex pseudostochastic behavior in such processes may occur due to multistability and oscillatory motions within limit cycles. In this paper we outline a third scenario of stochasticity in the regulatory process. This scenario is only conceivable in high-dimensional highly nonlinear systems. In particular, we show that burstiness, a well-known phenomenon in the biology of gene expression, is a natural consequence of high dimensionality coupled with high nonlinearity. In mathematical terms, burstiness is associated with heavy-tailed probability distributions of stochastic processes describing the dynamics of the system. We demonstrate how the "shot" noise originates from purely deterministic behavior of the underlying dynamical system. We conclude that the limiting stochastic process may be accurately approximated by the "heavy-tailed" generalized Pareto process which is a direct mathematical expression of burstiness. PMID:18946549
Does the Cerebral Cortex Exploit High-Dimensional, Non-linear Dynamics for Information Processing?
Singer, Wolf; Lazar, Andreea
2016-01-01
The discovery of stimulus induced synchronization in the visual cortex suggested the possibility that the relations among low-level stimulus features are encoded by the temporal relationship between neuronal discharges. In this framework, temporal coherence is considered a signature of perceptual grouping. This insight triggered a large number of experimental studies which sought to investigate the relationship between temporal coordination and cognitive functions. While some core predictions derived from the initial hypothesis were confirmed, these studies also revealed a rich dynamical landscape beyond simple coherence whose role in signal processing is still poorly understood. In this paper, a framework is presented which establishes links between the various manifestations of cortical dynamics by assigning specific coding functions to low-dimensional dynamic features such as synchronized oscillations and phase shifts on the one hand and high-dimensional non-linear, non-stationary dynamics on the other. The data serving as the basis for this synthetic approach have been obtained with chronic multisite recordings from the visual cortex of anesthetized cats and from monkeys trained to solve cognitive tasks. It is proposed that the low-dimensional dynamics characterized by synchronized oscillations and large-scale correlations are substates that represent the results of computations performed in the high-dimensional state-space provided by recurrently coupled networks. PMID:27713697
Quality metrics in high-dimensional data visualization: an overview and systematization.
Bertini, Enrico; Tatu, Andrada; Keim, Daniel
2011-12-01
In this paper, we present a systematization of techniques that use quality metrics to help in the visual exploration of meaningful patterns in high-dimensional data. In a number of recent papers, different quality metrics are proposed to automate the demanding search through large spaces of alternative visualizations (e.g., alternative projections or ordering), allowing the user to concentrate on the most promising visualizations suggested by the quality metrics. Over the last decade, this approach has witnessed a remarkable development but few reflections exist on how these methods are related to each other and how the approach can be developed further. For this purpose, we provide an overview of approaches that use quality metrics in high-dimensional data visualization and propose a systematization based on a thorough literature review. We carefully analyze the papers and derive a set of factors for discriminating the quality metrics, visualization techniques, and the process itself. The process is described through a reworked version of the well-known information visualization pipeline. We demonstrate the usefulness of our model by applying it to several existing approaches that use quality metrics, and we provide reflections on implications of our model for future research.
Compound Structure-Independent Activity Prediction in High-Dimensional Target Space.
Balfer, Jenny; Hu, Ye; Bajorath, Jürgen
2014-08-01
Profiling of compound libraries against arrays of targets has become an important approach in pharmaceutical research. The prediction of multi-target compound activities also represents an attractive task for machine learning with potential for drug discovery applications. Herein, we have explored activity prediction in high-dimensional target space. Different types of models were derived to predict multi-target activities. The models included naïve Bayesian (NB) and support vector machine (SVM) classifiers based upon compound structure information and NB models derived on the basis of activity profiles, without considering compound structure. Because the latter approach can be applied to incomplete training data and principally depends on the feature independence assumption, SVM modeling was not applicable in this case. Furthermore, iterative hybrid NB models making use of both activity profiles and compound structure information were built. In high-dimensional target space, NB models utilizing activity profile data were found to yield more accurate activity predictions than structure-based NB and SVM models or hybrid models. An in-depth analysis of activity profile-based models revealed the presence of correlation effects across different targets and rationalized prediction accuracy. Taken together, the results indicate that activity profile information can be effectively used to predict the activity of test compounds against novel targets.
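The profile-based NB models above, which use activity profiles rather than compound structure, can be illustrated with a minimal Bernoulli naive Bayes over binary activity vectors. The data layout and names are illustrative:

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Bernoulli naive Bayes on binary activity-profile features (Laplace-smoothed)."""
    classes = np.unique(y)
    logprior = np.log(np.array([(y == c).mean() for c in classes]))
    theta = np.array([(X[y == c].sum(0) + alpha) / ((y == c).sum() + 2 * alpha)
                      for c in classes])  # per-class probability that each bit is 1
    return classes, logprior, theta

def predict_nb(X, classes, logprior, theta):
    """Assign each profile to the class with the highest log posterior."""
    loglik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
    return classes[(loglik + logprior).argmax(1)]
```

Because each feature contributes an independent term to the log likelihood, the model handles incomplete profiles naturally, which is why, as the abstract notes, SVMs were not applicable in that setting.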
Quantum secret sharing based on modulated high-dimensional time-bin entanglement
Takesue, Hiroki; Inoue, Kyo
2006-07-15
We propose a scheme for quantum secret sharing (QSS) that uses a modulated high-dimensional time-bin entanglement. By modulating the relative phase randomly by {0, π}, a sender with the entanglement source can randomly change the sign of the correlation of the measurement outcomes obtained by two distant recipients. The two recipients must cooperate if they are to obtain the sign of the correlation, which is used as a secret key. We show that our scheme is secure against intercept-and-resend (IR) and beam splitting attacks by an outside eavesdropper thanks to the nonorthogonality of high-dimensional time-bin entangled states. We also show that a cheating attempt based on an IR attack by one of the recipients can be detected by changing the dimension of the time-bin entanglement randomly, inserting two 'vacant' slots between the packets, and monitoring the count rate in the vacant slots. The proposed scheme has better experimental feasibility than previously proposed entanglement-based QSS schemes.
Steinwand, Daniel R.; Maddox, Brian; Beckmann, Tim; Hamer, George
2003-01-01
Beowulf clusters can provide a cost-effective way to compute numerical models and process large amounts of remote sensing image data. Usually a Beowulf cluster is designed to accomplish a specific set of processing goals, and processing is very efficient when the problem remains inside the constraints of the original design. There are cases, however, when one might wish to compute a problem that is beyond the capacity of the local Beowulf system. In these cases, spreading the problem to multiple clusters or to other machines on the network may provide a cost-effective solution.
An Efficient Initialization Method for K-Means Clustering of Hyperspectral Data
NASA Astrophysics Data System (ADS)
Alizade Naeini, A.; Jamshidzadeh, A.; Saadatseresht, M.; Homayouni, S.
2014-10-01
K-means is the most frequently used partitional clustering algorithm in the remote sensing community. Due to its gradient-descent nature, however, the algorithm is highly sensitive to the initial placement of cluster centers. This problem worsens for high-dimensional data such as hyperspectral remotely sensed imagery. To tackle this problem, in this paper, the spectral signatures of the endmembers in the image scene are extracted and used as the initial positions of the cluster centers. For this purpose, in the first step, a Neyman-Pearson detection-theory-based eigen-thresholding method (the HFC method) is employed to estimate the number of endmembers in the image. Afterwards, the spectral signatures of the endmembers are obtained using the Minimum Volume Enclosing Simplex (MVES) algorithm. Eventually, these spectral signatures are used to initialize the k-means clustering algorithm. The proposed method is implemented on a hyperspectral dataset acquired by the ROSIS sensor with 103 spectral bands over the Pavia University campus, Italy. For comparative evaluation, two other commonly used initialization methods (the Bradley & Fayyad (BF) and Random methods) are implemented and compared. The confusion matrix, overall accuracy, and Kappa coefficient are employed to assess the methods' performance. The evaluations demonstrate that the proposed solution outperforms the other initialization methods and can be applied for unsupervised classification of hyperspectral imagery for land-cover mapping.
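The benefit of informed starting centers can be illustrated with a plain Lloyd's k-means that accepts user-supplied initial centers. Here known points stand in for the HFC/MVES-derived endmember signatures, so the example is a sketch of the initialization mechanism, not of the endmember extraction itself:

```python
import numpy as np

def kmeans(X, init_centers, n_iter=50):
    """Lloyd's k-means started from user-supplied centers
    (e.g., extracted endmember signatures)."""
    centers = np.asarray(init_centers, dtype=float).copy()
    for _ in range(n_iter):
        # assign each point to its nearest center
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # move each center to the mean of its assigned points
        for k in range(len(centers)):
            if (labels == k).any():
                centers[k] = X[labels == k].mean(0)
    return labels, centers
```

Starting near the true spectral signatures lets Lloyd's iteration converge to the intended partition instead of a poor local minimum that random initialization can fall into.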
High-Dimensional Circular Quantum Secret Sharing Using Orbital Angular Momentum
NASA Astrophysics Data System (ADS)
Tang, Dawei; Wang, Tie-jun; Mi, Sichen; Geng, Xiao-Meng; Wang, Chuan
2016-07-01
Quantum secret sharing is the secure distribution of a secret message among multiple parties. Here, exploiting the orbital angular momentum (OAM) state of single photons as the information carrier, we propose a high-dimensional circular quantum secret sharing protocol which greatly increases the channel capacity. In the proposed protocol, the secret message is split into two parts, each encoded on the OAM state of single photons. The security of the protocol is guaranteed by the no-cloning theorem, and the secret message cannot be recovered unless the two receivers collaborate with each other. Moreover, the proposed protocol can be extended to higher-dimensional quantum systems with enhanced security.
A two-state hysteresis model from high-dimensional friction.
Biswas, Saurabh; Chatterjee, Anindya
2015-07-01
In prior work (Biswas & Chatterjee 2014 Proc. R. Soc. A 470, 20130817 (doi:10.1098/rspa.2013.0817)), we developed a six-state hysteresis model from a high-dimensional frictional system. Here, we use a more intuitively appealing frictional system that resembles one studied earlier by Iwan. The basis functions now have simple analytical description. The number of states required decreases further, from six to the theoretical minimum of two. The number of fitted parameters is reduced by an order of magnitude, to just six. An explicit and faster numerical solution method is developed. Parameter fitting to match different specified hysteresis loops is demonstrated. In summary, a new two-state model of hysteresis is presented that is ready for practical implementation. Essential Matlab code is provided. PMID:26587279
INFUSE: Interactive Feature Selection for Predictive Modeling of High Dimensional Data.
Krause, Josua; Perer, Adam; Bertini, Enrico
2014-12-01
Predictive modeling techniques are increasingly being used by data scientists to understand the probability of predicted outcomes. However, for data that is high-dimensional, a critical step in predictive modeling is determining which features should be included in the models. Feature selection algorithms are often used to remove non-informative features from models. However, there are many different classes of feature selection algorithms. Deciding which one to use is problematic as the algorithmic output is often not amenable to user interpretation. This limits the ability for users to utilize their domain expertise during the modeling process. To improve on this limitation, we developed INFUSE, a novel visual analytics system designed to help analysts understand how predictive features are being ranked across feature selection algorithms, cross-validation folds, and classifiers. We demonstrate how our system can lead to important insights in a case study involving clinical researchers predicting patient outcomes from electronic medical records.
DecisionFlow: Visual Analytics for High-Dimensional Temporal Event Sequence Data.
Gotz, David; Stavropoulos, Harry
2014-12-01
Temporal event sequence data is increasingly commonplace, with applications ranging from electronic medical records to financial transactions to social media activity. Previously developed techniques have focused on low-dimensional datasets (e.g., with less than 20 distinct event types). Real-world datasets are often far more complex. This paper describes DecisionFlow, a visual analysis technique designed to support the analysis of high-dimensional temporal event sequence data (e.g., thousands of event types). DecisionFlow combines a scalable and dynamic temporal event data structure with interactive multi-view visualizations and ad hoc statistical analytics. We provide a detailed review of our methods, and present the results from a 12-person user study. The study results demonstrate that DecisionFlow enables the quick and accurate completion of a range of sequence analysis tasks for datasets containing thousands of event types and millions of individual events.
Detection meeting control: Unstable steady states in high-dimensional nonlinear dynamical systems.
Ma, Huanfei; Ho, Daniel W C; Lai, Ying-Cheng; Lin, Wei
2015-10-01
We articulate an adaptive and reference-free framework based on the principle of random switching to detect and control unstable steady states in high-dimensional nonlinear dynamical systems, without requiring any a priori information about the system or about the target steady state. Starting from an arbitrary initial condition, a proper control signal finds the nearest unstable steady state adaptively and drives the system to it in finite time, regardless of the type of the steady state. We develop a mathematical analysis based on fast-slow manifold separation and Markov chain theory to validate the framework. Numerical demonstration of the control and detection principle using both classic chaotic systems and models of biological and physical significance is provided. PMID:26565299
Dishion, Thomas J; Ha, Thao; Véronneau, Marie-Hélène
2012-05-01
The authors propose that peer relationships should be included in a life history perspective on adolescent problem behavior. Longitudinal analyses were used to examine deviant peer clustering as the mediating link between attenuated family ties, peer marginalization, and social disadvantage in early adolescence and sexual promiscuity in middle adolescence and childbearing by early adulthood. Specifically, 998 youths, along with their families, were assessed at age 11 years and periodically through age 24 years. Structural equation modeling revealed that the peer-enhanced life history model provided a good fit to the longitudinal data, with deviant peer clustering strongly predicting adolescent sexual promiscuity and other correlated problem behaviors. Sexual promiscuity, as expected, also strongly predicted the number of children by ages 22-24 years. Consistent with a life history perspective, family social disadvantage directly predicted deviant peer clustering and number of children in early adulthood, controlling for all other variables in the model. These data suggest that deviant peer clustering is a core dimension of a fast life history strategy, with strong links to sexual activity and childbearing. The implications of these findings are discussed with respect to the need to integrate an evolutionary-based model of self-organized peer groups in developmental and intervention science.
NASA Astrophysics Data System (ADS)
Gastegger, Michael; Kauffmann, Clemens; Behler, Jörg; Marquetand, Philipp
2016-05-01
Many approaches, which have been developed to express the potential energy of large systems, exploit the locality of the atomic interactions. A prominent example is fragmentation methods, in which quantum chemical calculations are carried out for overlapping small fragments of a given molecule that are then combined in a second step to yield the system's total energy. Here we compare the accuracy of the systematic molecular fragmentation approach with the performance of high-dimensional neural network (HDNN) potentials introduced by Behler and Parrinello. HDNN potentials are similar in spirit to the fragmentation approach in that the total energy is constructed as a sum of environment-dependent atomic energies, which are derived indirectly from electronic structure calculations. As a benchmark set, we use all-trans alkanes containing up to eleven carbon atoms at the coupled cluster level of theory. These molecules have been chosen because they allow reliable reference energies to be extrapolated for very long chains, enabling an assessment of the energies obtained by both methods for alkanes containing up to 10,000 carbon atoms. We find that both methods predict high-quality energies, with the HDNN potentials yielding smaller errors with respect to the coupled cluster reference.
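The core HDNN construction, a total energy assembled as a sum of per-atom network outputs, can be sketched as follows. The network sizes and weights are illustrative, and the descriptor step (Behler-Parrinello symmetry functions) is assumed precomputed:

```python
import numpy as np

def atomic_energy(descriptor, W1, b1, w2, b2):
    """One atomic network: environment descriptor -> scalar atomic energy."""
    h = np.tanh(descriptor @ W1 + b1)
    return h @ w2 + b2

def total_energy(descriptors, params_by_element, elements):
    """HDNN total energy: sum of environment-dependent atomic energies,
    with one shared network per element."""
    return sum(atomic_energy(d, *params_by_element[e])
               for d, e in zip(descriptors, elements))
```

Because the total is a plain sum of per-atom terms, the energy is invariant under reordering atoms of the same element and extends naturally to arbitrarily long chains, which is what makes the 10,000-atom extrapolation in the abstract feasible.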
NASA Astrophysics Data System (ADS)
Haussaire, Jean-Matthieu; Bocquet, Marc
2016-04-01
Atmospheric chemistry models are becoming increasingly complex, with multiphasic chemistry, size-resolved particulate matter, and possibly coupling to numerical weather prediction models. In the meantime, data assimilation methods have also become more sophisticated. Hence, it will become increasingly difficult to disentangle the merits of data assimilation schemes, of models, and of their numerical implementation in a successful high-dimensional data assimilation study. That is why we believe that the increasing variety of problems encountered in the field of atmospheric chemistry data assimilation puts forward the need for simple low-order models, albeit complex enough to capture the relevant dynamics, physics and chemistry that could impact the performance of data assimilation schemes. Following this analysis, we developed a low-order coupled chemistry meteorology model named L95-GRS [1]. The advective wind is simulated by the Lorenz-95 model, while the chemistry is made of 6 reactive species and simulates ozone concentrations. With this model, we carried out data assimilation experiments to estimate the state of the system as well as the forcing parameter of the wind and the emissions of chemical compounds. This model proved to be a powerful testbed, giving insight into the difficulties of online and offline estimation of atmospheric pollution. Building on the results on this low-order model, we test advanced data assimilation methods on a state-of-the-art chemical transport model to check if the conclusions obtained with our low-order model still stand. References [1] Haussaire, J.-M. and Bocquet, M.: A low-order coupled chemistry meteorology model for testing online and offline data assimilation schemes, Geosci. Model Dev. Discuss., 8, 7347-7394, doi:10.5194/gmdd-8-7347-2015, 2015.
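The Lorenz-95 (often written Lorenz-96) advective core of such a low-order model is compact enough to write out. The time step and forcing below are conventional choices, not necessarily those of L95-GRS:

```python
import numpy as np

def lorenz96_tendency(x, forcing=8.0):
    """dx_i/dt = (x_{i+1} - x_{i-2}) * x_{i-1} - x_i + F, with cyclic indices."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + forcing

def rk4_step(x, dt=0.05, forcing=8.0):
    """One classical fourth-order Runge-Kutta step of the Lorenz-96 model."""
    k1 = lorenz96_tendency(x, forcing)
    k2 = lorenz96_tendency(x + 0.5 * dt * k1, forcing)
    k3 = lorenz96_tendency(x + 0.5 * dt * k2, forcing)
    k4 = lorenz96_tendency(x + dt * k3, forcing)
    return x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
```

With the standard forcing F = 8 the model is chaotic, which makes it a convenient surrogate for the advective wind when benchmarking data assimilation schemes.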
NASA Astrophysics Data System (ADS)
Tutukov, A. V.; Dremov, V. V.; Dremova, G. N.
2009-10-01
Numerical N-body studies of the dynamical evolution of a cluster of 1000 galaxies were carried out in order to investigate the role of dark matter in the formation of cD galaxies. Two models explicitly describing the dark matter as a full-fledged component of the cluster having its own physical characteristics are constructed. These treat the dark matter as a continuous underlying substrate and as “grainy” matter. The ratio of the masses of the dark and luminous matter of the cluster is varied in the range 3-100. The observed logarithmic spectrum dN ∝ dM/M is used as an initial mass spectrum for the galaxies. A comparative numerical analysis of the evolution of the mass spectrum, the dynamics of mergers of the cluster galaxies, and the evolution of the growth of the central, supermassive cD galaxy suggests that dynamical friction associated with dark matter accelerates the formation of the cD galaxy via the absorption of galaxies colliding with it. Taking into account a dark-matter “substrate” suppresses the formation of multiple mass-accumulation centers, and makes it easier to form a cD galaxy that accumulates 1-2% of the cluster mass within the Hubble time scale (3-8 billion years), consistent with observations.
Muetterties, Earl L.
1980-05-01
Metal cluster chemistry is one of the most rapidly developing areas of inorganic and organometallic chemistry. Prior to 1960 only a few metal clusters were well characterized. However, shortly after the early development of boron cluster chemistry, the field of metal cluster chemistry began to grow at a very rapid rate and a structural and a qualitative theoretical understanding of clusters came quickly. Analyzed here is the chemistry and the general significance of clusters with particular emphasis on the cluster research within my group. The importance of coordinatively unsaturated, very reactive metal clusters is the major subject of discussion.
Slonim, Noam; Atwal, Gurinder Singh; Tkačik, Gašper; Bialek, William
2005-01-01
In an age of increasingly large data sets, investigators in many different disciplines have turned to clustering as a tool for data analysis and exploration. Existing clustering methods, however, typically depend on several nontrivial assumptions about the structure of data. Here, we reformulate the clustering problem from an information theoretic perspective that avoids many of these assumptions. In particular, our formulation obviates the need for defining a cluster “prototype,” does not require an a priori similarity metric, is invariant to changes in the representation of the data, and naturally captures nonlinear relations. We apply this approach to different domains and find that it consistently produces clusters that are more coherent than those extracted by existing algorithms. Finally, our approach provides a way of clustering based on collective notions of similarity rather than the traditional pairwise measures. PMID:16352721
ERIC Educational Resources Information Center
Snellings, Patrick; van der Leij, Aryan; Blok, Henk; de Jong, Peter F.
2010-01-01
This study investigated the role of speech perception accuracy and speed in fluent word decoding of reading disabled (RD) children. A same-different phoneme discrimination task with natural speech tested the perception of single consonants and consonant clusters by young but persistent RD children. RD children were slower than chronological age…
McCarthy, John F; Marx, Kenneth A; Hoffman, Patrick E; Gee, Alexander G; O'Neil, Philip; Ujwal, M L; Hotchkiss, John
2004-05-01
Recent technical advances in combinatorial chemistry, genomics, and proteomics have made available large databases of biological and chemical information that have the potential to dramatically improve our understanding of cancer biology at the molecular level. Such an understanding of cancer biology could have a substantial impact on how we detect, diagnose, and manage cancer cases in the clinical setting. One of the biggest challenges facing clinical oncologists is how to extract clinically useful knowledge from the overwhelming amount of raw molecular data that are currently available. In this paper, we discuss how the exploratory data analysis techniques of machine learning and high-dimensional visualization can be applied to extract clinically useful knowledge from a heterogeneous assortment of molecular data. After an introductory overview of machine learning and visualization techniques, we describe two proprietary algorithms (PURS and RadViz) that we have found to be useful in the exploratory analysis of large biological data sets. We next illustrate, by way of three examples, the applicability of these techniques to cancer detection, diagnosis, and management using three very different types of molecular data. We first discuss the use of our exploratory analysis techniques on proteomic mass spectroscopy data for the detection of ovarian cancer. Next, we discuss the diagnostic use of these techniques on gene expression data to differentiate between squamous and adenocarcinoma of the lung. Finally, we illustrate the use of such techniques in selecting from a database of chemical compounds those most effective in managing patients with melanoma versus leukemia.
Hou, Jiayi
2015-01-01
An ordinal scale is commonly used to measure health status and disease related outcomes in hospital settings as well as in translational medical research. In addition, repeated measurements are common in clinical practice for tracking and monitoring the progression of complex diseases. Classical methodology based on statistical inference, in particular, ordinal modeling, has contributed to the analysis of data in which the response categories are ordered and the number of covariates (p) remains smaller than the sample size (n). With the emergence of genomic technologies being increasingly applied for more accurate diagnosis and prognosis, high-dimensional data, where the number of covariates (p) is much larger than the number of samples (n), are generated. To meet the emerging needs, we introduce a two-stage algorithm: (1) extend the Generalized Monotone Incremental Forward Stagewise (GMIFS) method to the cumulative logit ordinal model; and (2) combine the GMIFS procedure with the classical mixed-effects model for classifying disease status in disease progression over time. We demonstrate the efficiency and accuracy of the proposed models in classification using a time-course microarray dataset collected from the Inflammation and the Host Response to Injury study. PMID:25720102
Spanning high-dimensional expression space using ribosome-binding site combinatorics.
Zelcbuch, Lior; Antonovsky, Niv; Bar-Even, Arren; Levin-Karp, Ayelet; Barenholz, Uri; Dayagi, Michal; Liebermeister, Wolfram; Flamholz, Avi; Noor, Elad; Amram, Shira; Brandis, Alexander; Bareia, Tasneem; Yofe, Ido; Jubran, Halim; Milo, Ron
2013-05-01
Protein levels are a dominant factor shaping natural and synthetic biological systems. Although proper functioning of metabolic pathways relies on precise control of enzyme levels, the experimental ability to balance the levels of many genes in parallel is a major outstanding challenge. Here, we introduce a rapid and modular method to span the expression space of several proteins in parallel. By combinatorially pairing genes with a compact set of ribosome-binding sites, we modulate protein abundance by several orders of magnitude. We demonstrate our strategy by using a synthetic operon containing fluorescent proteins to span a 3D color space. Using the same approach, we modulate a recombinant carotenoid biosynthesis pathway in Escherichia coli to reveal a diversity of phenotypes, each characterized by a distinct carotenoid accumulation profile. In a single combinatorial assembly, we achieve a yield of the industrially valuable compound astaxanthin 4-fold higher than previously reported. The methodology presented here provides an efficient tool for exploring a high-dimensional expression space to locate desirable phenotypes. PMID:23470993
Semi-implicit Integration Factor Methods on Sparse Grids for High-Dimensional Systems
Wang, Dongyong; Chen, Weitao; Nie, Qing
2015-01-01
Numerical methods for partial differential equations in high-dimensional spaces are often limited by the curse of dimensionality. Though the sparse grid technique, based on a one-dimensional hierarchical basis through tensor products, is popular for handling challenges such as those associated with spatial discretization, the stability conditions on time step size due to temporal discretization, such as those associated with high-order derivatives in space and stiff reactions, remain. Here, we incorporate the sparse grids with the implicit integration factor method (IIF) that is advantageous in terms of stability conditions for systems containing stiff reactions and diffusions. We combine IIF, in which the reaction is treated implicitly and the diffusion is treated explicitly and exactly, with various sparse grid techniques based on the finite element and finite difference methods and a multi-level combination approach. The overall method is found to be efficient in terms of both storage and computational time for solving a wide range of PDEs in high dimensions. In particular, the IIF with the sparse grid combination technique is flexible and effective in solving systems that may include cross-derivatives and non-constant diffusion coefficients. Extensive numerical simulations in both linear and nonlinear systems in high dimensions, along with applications of diffusive logistic equations and Fokker-Planck equations, demonstrate the accuracy, efficiency, and robustness of the new methods, indicating potential broad applications of the sparse grid-based integration factor method. PMID:25897178
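As a hedged aside on the "multi-level combination approach" mentioned in this abstract: in the sparse grid literature this is usually the combination technique, which (in one common indexing convention; the paper's exact variant may differ) builds a level-n sparse grid approximation in d dimensions from many cheap anisotropic full-grid solutions $f_{\boldsymbol{\ell}}$:

```latex
f_n^{c}(\mathbf{x}) \;=\; \sum_{q=0}^{d-1} (-1)^{q} \binom{d-1}{q}
\sum_{|\boldsymbol{\ell}|_1 \,=\, n+(d-1)-q} f_{\boldsymbol{\ell}}(\mathbf{x})
```

Here $\boldsymbol{\ell} = (\ell_1,\dots,\ell_d)$ indexes a grid refined to level $\ell_i$ in dimension $i$, so high accuracy comes from combining many small anisotropic grids rather than one full tensor product grid.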
Zhang, Zheng; Yang, Xiu; Oseledets, Ivan; Karniadakis, George E.; Daniel, Luca
2015-01-31
Hierarchical uncertainty quantification can reduce the computational cost of stochastic circuit simulation by employing spectral methods at different levels. This paper presents an efficient framework for hierarchical simulation of challenging stochastic circuits/systems that include high-dimensional subsystems. Due to the high parameter dimensionality, it is challenging both to extract surrogate models at the low level of the design hierarchy and to handle them in the high-level simulation. In this paper, we develop an efficient analysis-of-variance-based stochastic circuit/microelectromechanical systems simulator to extract the surrogate models at the low level. In order to avoid the curse of dimensionality, we employ tensor-train decomposition at the high level to construct the basis functions and Gauss quadrature points. As a demonstration, we verify our algorithm on a stochastic oscillator with four MEMS capacitors and 184 random parameters. This challenging example is efficiently simulated by our simulator at a cost of only 10 minutes in MATLAB on a regular personal computer.
Revealing the diversity of extracellular vesicles using high-dimensional flow cytometry analyses
Marcoux, Geneviève; Duchez, Anne-Claire; Cloutier, Nathalie; Provost, Patrick; Nigrovic, Peter A.; Boilard, Eric
2016-01-01
Extracellular vesicles (EV) are small membrane vesicles produced by cells upon activation and apoptosis. EVs are heterogeneous according to their origin, mode of release, membrane composition, organelle and biochemical content, and other factors. Whereas it is apparent that EVs are implicated in intercellular communication, they can also be used as biomarkers. Continuous improvements in pre-analytical parameters and flow cytometry permit more efficient assessment of EVs; however, methods to more objectively distinguish EVs from cells and background, and to interpret multiple single-EV parameters are lacking. We used spanning-tree progression analysis of density-normalized events (SPADE) as a computational approach for the organization of EV subpopulations released by platelets and erythrocytes. SPADE distinguished EVs, and logically organized EVs detected by high-sensitivity flow cytofluorometry based on size estimation, granularity, mitochondrial content, and phosphatidylserine and protein receptor surface expression. Plasma EVs were organized by hierarchy, permitting appreciation of their heterogeneity. Furthermore, SPADE was used to analyze EVs present in the synovial fluid of patients with inflammatory arthritis. Its algorithm efficiently revealed subtypes of arthritic patients based on EV heterogeneity patterns. Our study reveals that computational algorithms are useful for the analysis of high-dimensional single EV data, thereby facilitating comprehension of EV functions and biomarker development. PMID:27786276
Feature Augmentation via Nonparametrics and Selection (FANS) in High-Dimensional Classification
Feng, Yang; Jiang, Jiancheng; Tong, Xin
2015-01-01
We propose a high dimensional classification method that involves nonparametric feature augmentation. Knowing that marginal density ratios are the most powerful univariate classifiers, we use the ratio estimates to transform the original feature measurements. Subsequently, penalized logistic regression is invoked, taking as input the newly transformed or augmented features. This procedure trains models equipped with local complexity and global simplicity, thereby avoiding the curse of dimensionality while creating a flexible nonlinear decision boundary. The resulting method is called Feature Augmentation via Nonparametrics and Selection (FANS). We motivate FANS by generalizing the Naive Bayes model, writing the log ratio of joint densities as a linear combination of those of marginal densities. It is related to generalized additive models, but has better interpretability and computability. Risk bounds are developed for FANS. In numerical analysis, FANS is compared with competing methods, so as to provide a guideline on its best application domain. Real data analysis demonstrates that FANS performs very competitively on benchmark email spam and gene expression data sets. Moreover, FANS is implemented by an extremely fast algorithm through parallel computing. PMID:27185970
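As an illustration of the core FANS idea (a minimal sketch, not the authors' implementation), the following replaces each feature by an estimated marginal log density ratio and then fits a penalized logistic regression on the transformed features. The Gaussian kernel density estimator, the ridge penalty (the paper uses an L1 penalty), and all parameter choices here are illustrative assumptions:

```python
import numpy as np

def gauss_kde(sample):
    # 1-D Gaussian kernel density estimate with Silverman's bandwidth
    x = np.asarray(sample, dtype=float)
    h = 1.06 * x.std() * len(x) ** (-0.2) + 1e-9
    def pdf(t):
        z = (np.atleast_1d(t)[:, None] - x[None, :]) / h
        return np.exp(-0.5 * z ** 2).mean(axis=1) / (h * np.sqrt(2 * np.pi))
    return pdf

def fans_transform(X, y):
    # Replace each feature with its estimated marginal log density ratio
    # log f1_j(x) / f0_j(x), the most powerful univariate classifier.
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    Z = np.empty_like(X)
    for j in range(X.shape[1]):
        f1, f0 = gauss_kde(X[y == 1, j]), gauss_kde(X[y == 0, j])
        Z[:, j] = np.log(f1(X[:, j]) + 1e-12) - np.log(f0(X[:, j]) + 1e-12)
    return Z

def logistic_fit(Z, y, lam=0.1, lr=0.1, steps=2000):
    # Ridge-penalized logistic regression via plain gradient descent
    # (FANS itself uses an L1 penalty; ridge keeps the sketch short).
    n, p = Z.shape
    w, b = np.zeros(p), 0.0
    for _ in range(steps):
        prob = 1.0 / (1.0 + np.exp(-np.clip(Z @ w + b, -30, 30)))
        w -= lr * (Z.T @ (prob - y) / n + lam * w)
        b -= lr * (prob - y).mean()
    return w, b
```

Because the augmented features are nonlinear functions of the originals, the final linear classifier can separate classes that differ only in variance, a case where logistic regression on the raw features fails.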
Nonparametric Independence Screening in Sparse Ultra-High Dimensional Varying Coefficient Models.
Fan, Jianqing; Ma, Yunbei; Dai, Wei
2014-01-01
The varying-coefficient model is an important class of nonparametric statistical model that allows us to examine how the effects of covariates vary with exposure variables. When the number of covariates is large, the issue of variable selection arises. In this paper, we propose and investigate marginal nonparametric screening methods to screen variables in sparse ultra-high dimensional varying-coefficient models. The proposed nonparametric independence screening (NIS) selects variables by ranking a measure of the nonparametric marginal contributions of each covariate given the exposure variable. The sure independent screening property is established under some mild technical conditions when the dimensionality is of nonpolynomial order, and the dimensionality reduction of NIS is quantified. To enhance the practical utility and finite sample performance, two data-driven iterative NIS methods are proposed for selecting thresholding parameters and variables: conditional permutation and greedy methods, resulting in Conditional-INIS and Greedy-INIS. The effectiveness and flexibility of the proposed methods are further illustrated by simulation studies and real data applications. PMID:25309009
Nonparametric Independence Screening in Sparse Ultra-High Dimensional Additive Models.
Fan, Jianqing; Feng, Yang; Song, Rui
2011-06-01
A variable screening procedure via correlation learning was proposed in Fan and Lv (2008) to reduce dimensionality in sparse ultra-high dimensional models. Even when the true model is linear, the marginal regression can be highly nonlinear. To address this issue, we further extend the correlation learning to marginal nonparametric learning. Our nonparametric independence screening is called NIS, a specific member of the sure independence screening. Several closely related variable screening procedures are proposed. Under general nonparametric models, it is shown that under some mild technical conditions, the proposed independence screening methods enjoy a sure screening property. The extent to which the dimensionality can be reduced by independence screening is also explicitly quantified. As a methodological extension, a data-driven thresholding and an iterative nonparametric independence screening (INIS) are also proposed to enhance the finite sample performance for fitting sparse additive models. The simulation results and a real data analysis demonstrate that the proposed procedure works well with moderate sample size and large dimension and performs better than competing methods.
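To make the screening idea concrete, here is a minimal sketch (illustrative, not the paper's spline-based procedure): each covariate is scored by the R² of a marginal cubic-polynomial fit of the response on that covariate alone, and covariates are ranked by this marginal utility. The function name and choice of polynomial basis are assumptions:

```python
import numpy as np

def nis_rank(X, y, degree=3):
    # Score each covariate by the R^2 of a marginal polynomial fit of y
    # on that covariate alone, then rank covariates by the score.
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        B = np.vander(X[:, j], degree + 1)   # polynomial basis matrix
        coef, *_ = np.linalg.lstsq(B, y, rcond=None)
        scores[j] = 1.0 - np.var(y - B @ coef) / np.var(y)
    return np.argsort(scores)[::-1], scores
```

A covariate whose effect is purely quadratic, which plain correlation screening would miss, is ranked first; in practice one keeps the top-ranked covariates up to a data-driven threshold and proceeds with a sparse additive fit.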
Vounou, Maria; Nichols, Thomas E; Montana, Giovanni
2010-11-15
There is growing interest in performing genome-wide searches for associations between genetic variants and brain imaging phenotypes. While much work has focused on single scalar valued summaries of brain phenotype, accounting for the richness of imaging data requires a brain-wide, genome-wide search. In particular, the standard approach based on mass-univariate linear modelling (MULM) does not account for the structured patterns of correlations present in each domain. In this work, we propose sparse reduced rank regression (sRRR), a strategy for multivariate modelling of high-dimensional imaging responses (measurements taken over regions of interest or individual voxels) and genetic covariates (single nucleotide polymorphisms or copy number variations), which enforces sparsity in the regression coefficients. Such sparsity constraints ensure that the model performs simultaneous genotype and phenotype selection. Using simulation procedures that accurately reflect realistic human genetic variation and imaging correlations, we present detailed evaluations of the sRRR method in comparison with the more traditional MULM approach. In all settings considered, sRRR has better power to detect deleterious genetic variants compared to MULM. Important issues concerning model selection and connections to existing latent variable models are also discussed. This work shows that sRRR offers a promising alternative for detecting brain-wide, genome-wide associations.
Viewpoints: A High-Performance High-Dimensional Exploratory Data Analysis Tool
NASA Astrophysics Data System (ADS)
Gazis, P. R.; Levit, C.; Way, M. J.
2010-12-01
Scientific data sets continue to increase in both size and complexity. In the past, dedicated graphics systems at supercomputing centers were required to visualize large data sets, but as the price of commodity graphics hardware has dropped and its capability has increased, it is now possible, in principle, to view large complex data sets on a single workstation. To do this in practice, an investigator will need software that is written to take advantage of the relevant graphics hardware. The Viewpoints visualization package described herein is an example of such software. Viewpoints is an interactive tool for exploratory visual analysis of large high-dimensional (multivariate) data. It leverages the capabilities of modern graphics boards (GPUs) to run on a single workstation or laptop. Viewpoints is minimalist: it attempts to do a small set of useful things very well (or at least very quickly) in comparison with similar packages today. Its basic feature set includes linked scatter plots with brushing, dynamic histograms, normalization, and outlier detection/removal. Viewpoints was originally designed for astrophysicists, but it has since been used in a variety of fields, including astronomy, quantum chemistry, fluid dynamics, machine learning, bioinformatics, finance, and information technology server log mining. In this article, we describe the Viewpoints package and show examples of its usage.
A common, high-dimensional model of the representational space in human ventral temporal cortex
Haxby, James V.; Guntupalli, J. Swaroop; Connolly, Andrew C.; Halchenko, Yaroslav O.; Conroy, Bryan R.; Gobbini, M. Ida; Hanke, Michael; Ramadge, Peter J.
2011-01-01
We present a high-dimensional model of the representational space in human ventral temporal (VT) cortex in which dimensions are response-tuning functions that are common across individuals and patterns of response are modeled as weighted sums of basis patterns associated with these response-tunings. We map response pattern vectors, measured with fMRI, from individual subjects’ voxel spaces into this common model space using a new method, ‘hyperalignment’. Hyperalignment parameters based on responses during one experiment – movie-viewing – identified 35 common response-tuning functions that captured fine-grained distinctions among a wide range of stimuli in the movie and in two category perception experiments. Between-subject classification (BSC, multivariate pattern classification based on other subjects’ data) of response pattern vectors in common model space greatly exceeded BSC of anatomically-aligned responses and matched within-subject classification. Results indicate that population codes for complex visual stimuli in VT cortex are based on response-tuning functions that are common across individuals. PMID:22017997
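The core computational step of hyperalignment can be pictured as an orthogonal Procrustes problem: find the orthogonal transformation of one subject's voxel space that best matches another subject's response matrix over matching timepoints. The sketch below is an illustration of that single step, not the authors' full iterative implementation (which aligns all subjects to an evolving group template); it uses the standard SVD solution:

```python
import numpy as np

def procrustes_map(source, target):
    # Orthogonal Procrustes: the matrix R minimizing ||source @ R - target||
    # over orthogonal R, where source and target are (timepoints x voxels)
    # response matrices with matching timepoints.
    U, _, Vt = np.linalg.svd(source.T @ target)
    return U @ Vt   # voxels x voxels orthogonal mapping
```

If the target really is an orthogonal transform of the source, the mapping is recovered exactly; with real fMRI data it is only a least-squares best fit, applied per subject to project responses into the common model space.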
A sparse grid based method for generative dimensionality reduction of high-dimensional data
NASA Astrophysics Data System (ADS)
Bohn, Bastian; Garcke, Jochen; Griebel, Michael
2016-03-01
Generative dimensionality reduction methods play an important role in machine learning applications because they construct an explicit mapping from a low-dimensional space to the high-dimensional data space. We discuss a general framework to describe generative dimensionality reduction methods, where the main focus lies on a regularized principal manifold learning variant. Since most generative dimensionality reduction algorithms exploit the representer theorem for reproducing kernel Hilbert spaces, their computational costs grow at least quadratically in the number n of data. Instead, we introduce a grid-based discretization approach which automatically scales just linearly in n. To circumvent the curse of dimensionality of full tensor product grids, we use the concept of sparse grids. Furthermore, in real-world applications, some embedding directions are usually more important than others and it is reasonable to refine the underlying discretization space only in these directions. To this end, we employ a dimension-adaptive algorithm which is based on the ANOVA (analysis of variance) decomposition of a function. In particular, the reconstruction error is used to measure the quality of an embedding. As an application, the study of large simulation data from an engineering application in the automotive industry (car crash simulation) is performed.
Snyder, Abigail C.; Jiao, Yu
2010-10-01
Neutron experiments at the Spallation Neutron Source (SNS) at Oak Ridge National Laboratory (ORNL) frequently generate large amounts of data (on the order of 10^6-10^12 data points). Hence, traditional data analysis tools run on a single CPU take too long to be practical and scientists are unable to efficiently analyze all data generated by experiments. Our goal is to develop a scalable algorithm to efficiently compute high-dimensional integrals of arbitrary functions. This algorithm can then be used to integrate the four-dimensional integrals that arise as part of modeling intensity from the experiments at the SNS. Here, three different one-dimensional numerical integration solvers from the GNU Scientific Library were modified and implemented to solve four-dimensional integrals. The results of these solvers on a final integrand provided by scientists at the SNS can be compared to the results of other methods, such as quasi-Monte Carlo methods, computing the same integral. A parallelized version of the most efficient method can allow scientists the opportunity to more effectively analyze all experimental data.
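The nesting strategy of composing 1-D solvers into a 4-D integral can be sketched as follows. This is a minimal illustration with a fixed Gauss-Legendre rule, not the project's modified GSL solvers; the function name and node count are assumptions:

```python
import numpy as np

def integrate_4d(f, a=0.0, b=1.0, npts=8):
    # Integrate f(x, y, z, t) over the hypercube [a, b]^4 by nesting a
    # one-dimensional Gauss-Legendre quadrature rule in each dimension.
    x, w = np.polynomial.legendre.leggauss(npts)
    x = 0.5 * (b - a) * x + 0.5 * (b + a)   # map nodes from [-1, 1] to [a, b]
    w = 0.5 * (b - a) * w                   # rescale weights accordingly
    total = 0.0
    for i, wi in enumerate(w):
        for j, wj in enumerate(w):
            for k, wk in enumerate(w):
                for l, wl in enumerate(w):
                    total += wi * wj * wk * wl * f(x[i], x[j], x[k], x[l])
    return total
```

The cost grows as npts^4, which is exactly the scaling pressure that motivates the quasi-Monte Carlo comparison and the parallelization discussed above.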
Dan Maljovec; Bei Wang; Valerio Pascucci; Peer-Timo Bremer; Michael Pernice; Robert Nourgaliev
2013-05-01
The next generation of methodologies for nuclear reactor Probabilistic Risk Assessment (PRA) explicitly accounts for the time element in modeling the probabilistic system evolution and uses numerical simulation tools to account for possible dependencies between failure events. The Monte-Carlo (MC) and the Dynamic Event Tree (DET) approaches belong to this new class of dynamic PRA methodologies. A challenge of dynamic PRA algorithms is the large amount of data they produce which may be difficult to visualize and analyze in order to extract useful information. We present a software tool that is designed to address these goals. We model a large-scale nuclear simulation dataset as a high-dimensional scalar function defined over a discrete sample of the domain. First, we provide structural analysis of such a function at multiple scales and provide insight into the relationship between the input parameters and the output. Second, we enable exploratory analysis for users, where we help the users to differentiate features from noise through multi-scale analysis on an interactive platform, based on domain knowledge and data characterization. Our analysis is performed by exploiting the topological and geometric properties of the domain, building statistical models based on its topological segmentations and providing interactive visual interfaces to facilitate such explorations. We provide a user’s guide to our software tool by highlighting its analysis and visualization capabilities, along with a use case involving dataset from a nuclear reactor safety simulation.
Quantum tomography of near-unitary processes in high-dimensional quantum systems
NASA Astrophysics Data System (ADS)
Lysne, Nathan; Sosa Martinez, Hector; Jessen, Poul; Baldwin, Charles; Kalev, Amir; Deutsch, Ivan
2016-05-01
Quantum Tomography (QT) is often considered the ideal tool for experimental debugging of quantum devices, capable of delivering complete information about quantum states (QST) or processes (QPT). In practice, the protocols used for QT are resource intensive and scale poorly with system size. In this situation, a well behaved model system with access to large state spaces (qudits) can serve as a useful platform for examining the tradeoffs between resource cost and accuracy inherent in QT. In past years we have developed one such experimental testbed, consisting of the electron-nuclear spins in the electronic ground state of individual Cs atoms. Our available toolkit includes high fidelity state preparation, complete unitary control, arbitrary orthogonal measurements, and accurate and efficient QST in Hilbert space dimensions up to d = 16. Using these tools, we have recently completed a comprehensive study of QPT in 4, 7 and 16 dimensions. Our results show that QPT of near-unitary processes is quite feasible if one chooses optimal input states and efficient QST on the outputs. We further show that for unitary processes in high dimensional spaces, one can use informationally incomplete QPT to achieve high-fidelity process reconstruction (90% in d = 16) with greatly reduced resource requirements.
Improving the text classification using clustering and a novel HMM to reduce the dimensionality.
Seara Vieira, A; Borrajo, L; Iglesias, E L
2016-11-01
In text classification problems, the representation of a document has a strong impact on the performance of learning systems. The high dimensionality of the classical structured representations can lead to burdensome computations due to the great size of real-world data. Consequently, there is a need for reducing the quantity of handled information to improve the classification process. In this paper, we propose a method to reduce the dimensionality of a classical text representation based on a clustering technique to group documents, and a previously developed Hidden Markov Model to represent them. We have applied tests with the k-NN and SVM classifiers on the OHSUMED and TREC benchmark text corpora using the proposed dimensionality reduction technique. The experimental results obtained are very satisfactory compared to commonly used techniques like InfoGain and the statistical tests performed demonstrate the suitability of the proposed technique for the preprocessing step in a text classification task. PMID:27686709
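The clustering half of such a pipeline can be sketched generically as follows (a stand-in using plain k-means with centroid-similarity features; the paper's actual reduction relies on its Hidden Markov document model, which is not reproduced here, and all names and parameters are assumptions):

```python
import numpy as np

def centroid_features(D, k=5, iters=50, seed=0):
    # Cluster documents with plain k-means, then represent every document
    # by its similarity to each cluster centroid: |vocabulary| -> k dims.
    D = np.asarray(D, dtype=float)
    rng = np.random.default_rng(seed)
    C = D[rng.choice(len(D), size=k, replace=False)]   # initial centroids
    for _ in range(iters):
        # assign each document to its nearest centroid, then recenter
        d2 = ((D[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        lab = d2.argmin(axis=1)
        for c in range(k):
            if np.any(lab == c):
                C[c] = D[lab == c].mean(axis=0)
    d2 = ((D[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return 1.0 / (1.0 + d2)   # soft similarity features, n_docs x k
```

The reduced k-dimensional representation can then be fed to k-NN or SVM classifiers in place of the full bag-of-words vectors.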
Altermann, Susanne; Leavitt, Steven D; Goward, Trevor; Nelsen, Matthew P; Lumbsch, H Thorsten
2014-01-01
The inclusion of molecular data is increasingly an integral part of studies assessing species boundaries. Analyses based on predefined groups may obscure patterns of differentiation, and population assignment tests provide an alternative for identifying population structure and barriers to gene flow. In this study, we apply population assignment tests implemented in the programs STRUCTURE and BAPS to single nucleotide polymorphisms from DNA sequence data generated for three previous studies of the lichenized fungal genus Letharia. Previous molecular work employing a gene genealogical approach circumscribed six species-level lineages within the genus, four putative lineages within the nominal taxon L. columbiana (Nutt.) J.W. Thomson and two sorediate lineages. We show that Bayesian clustering implemented in the program STRUCTURE was generally able to recover the same six putative Letharia lineages. Population assignments were largely consistent across a range of scenarios, including: extensive amounts of missing data, the exclusion of SNPs from variable markers, and inferences based on SNPs from as few as three gene regions. While our study provided additional evidence corroborating the six candidate Letharia species, the equivalence of these genetic clusters with species-level lineages is uncertain due, in part, to limited phylogenetic signal. Furthermore, both the BAPS analysis and the ad hoc ΔK statistic from results of the STRUCTURE analysis suggest that population structure can possibly be captured with fewer genetic groups. Our findings also suggest that uneven sampling across taxa may be responsible for the contrasting inferences of population substructure. Our results consistently supported two distinct sorediate groups, 'L. lupina' and L. vulpina, and subtle morphological differences support this distinction. Similarly, the putative apotheciate species 'L. lucida' was also consistently supported as a distinct genetic cluster. However, additional studies
Giorla, Jean; Masson, Annie; Poggi, Francoise; Quach, Robert; Seytor, Patricia; Garnier, Josselin
2009-03-15
Inertial confinement fusion targets must be carefully designed to ignite their central hot spots and burn. Changes in the optimal implosion could reduce the fusion energy or even prevent ignition. Since there are unavoidable uncertainties due to technological defects and imperfect shot-to-shot reproducibility, the fusion energy will remain uncertain. The degree to which a target can tolerate deviations larger than its specifications, and the probability with which a particular yield is exceeded, are possible measures of the robustness of a design. This robustness must be assessed, using high-fidelity simulations, in a very high-dimensional parameter space whose variables include every characteristic of the given target and of the associated laser pulse shape; such studies are therefore computationally very intensive. In this paper we propose an approach that consists first of constructing an accurate metamodel of the yield over the whole parameter space from a reasonably sized data set of simulations. The robustness can then be assessed very quickly for any set of specifications with this surrogate. The yield is approximated by a neural network, and an iterative method adds new points to the data set by means of D-optimal experimental designs. The robustness study of the baseline Laser Megajoule target against one-dimensional defects illustrates this approach. A set of 2000 simulations is sufficient to metamodel the fusion energy over a large 22-dimensional parameter space around the nominal point. Furthermore, a metamodel of the robustness margin against all specifications has been obtained, providing guidance for target fabrication research and development.
Zhao, Lue Ping; Bolouri, Hamid
2016-04-01
Maturing omics technologies enable researchers to generate high-dimensional omics data (HDOD) routinely in translational clinical studies. In the field of oncology, The Cancer Genome Atlas (TCGA) provided funding support to researchers to generate different types of omics data on a common set of biospecimens with accompanying clinical data and has made the data available for the research community to mine. One important application, and the focus of this manuscript, is to build predictive models for prognostic outcomes based on HDOD. To complement prevailing regression-based approaches, we propose to use an object-oriented regression (OOR) methodology to identify exemplars specified by HDOD patterns and to assess their associations with prognostic outcome. By computing a patient's similarities to these exemplars, the OOR-based predictive model produces a risk estimate using that patient's HDOD. The primary advantages of OOR are twofold: reducing the penalty of high dimensionality and retaining interpretability for clinical practitioners. To illustrate its utility, we apply OOR to gene expression data from non-small cell lung cancer patients in TCGA and build a predictive model for prognostic survivorship among stage I patients, i.e., we stratify these patients by their prognostic survival risks beyond histological classifications. Identification of these high-risk patients helps oncologists to develop effective treatment protocols and post-treatment disease management plans. Using the TCGA data, the total sample is divided into training and validation data sets. After building a predictive model on the training set, we compute risk scores from the predictive model and validate associations of risk scores with prognostic outcome in the validation data (P-value=0.015). PMID:26972839
Landfors, Mattias; Philip, Philge; Rydén, Patrik; Stenberg, Per
2011-01-01
Genome-wide analysis of gene expression or protein binding patterns using different array or sequencing based technologies is now routinely performed to compare different populations, such as treatment and reference groups. It is often necessary to normalize the data obtained to remove technical variation introduced in the course of conducting experimental work, but standard normalization techniques are not capable of eliminating technical bias in cases where the distribution of the truly altered variables is skewed, i.e. when a large fraction of the variables are either positively or negatively affected by the treatment. However, several experiments are likely to generate such skewed distributions, including ChIP-chip experiments for the study of chromatin, gene expression experiments for the study of apoptosis, and SNP-studies of copy number variation in normal and tumour tissues. A preliminary study using spike-in array data established that the capacity of an experiment to identify altered variables and generate unbiased estimates of the fold change decreases as the fraction of altered variables and the skewness increases. We propose the following work-flow for analyzing high-dimensional experiments with regions of altered variables: (1) Pre-process raw data using one of the standard normalization techniques. (2) Investigate if the distribution of the altered variables is skewed. (3) If the distribution is not believed to be skewed, no additional normalization is needed. Otherwise, re-normalize the data using a novel HMM-assisted normalization procedure. (4) Perform downstream analysis. Here, ChIP-chip data and simulated data were used to evaluate the performance of the work-flow. It was found that skewed distributions can be detected by using the novel DSE-test (Detection of Skewed Experiments). Furthermore, applying the HMM-assisted normalization to experiments where the distribution of the truly altered variables is skewed results in considerably higher
A novel multi-manifold classification model via path-based clustering for image retrieval
NASA Astrophysics Data System (ADS)
Zhu, Rong; Yuan, Zhijun; Xuan, Junying
2011-12-01
Nowadays, with digital cameras and mass storage devices becoming increasingly affordable, thousands of pictures are taken each day and images appear on the Internet at an astonishing rate. Image retrieval is the process of searching huge image collections for the information a user demands. However, satisfactory results are hard to obtain due to the well-known "semantic gap". Image classification plays an essential role in the retrieval process, but traditional methods encounter problems when dealing with high-dimensional, large-scale image sets in applications. Here, we propose a novel multi-manifold classification model for image retrieval. Firstly, we simplify the classification of images from high-dimensional space into classification on low-dimensional manifolds, largely reducing the complexity of the classification process. Secondly, considering that traditional distance measures often fail to capture the correct visual semantics of manifolds, especially for images with complex data distributions, we define two new distance measures based on path-based clustering and apply them to the construction of a multi-class image manifold. An experiment was conducted on 2890 Web images. Comparison among three methods shows that the proposed method achieves the highest classification accuracy.
Peeling the onion of order and chaos in a high-dimensional Hamiltonian system
NASA Astrophysics Data System (ADS)
Kaneko, Kunihiko; Konishi, Tetsuro
1994-02-01
Coexistence of various ordered chaotic states in a Hamiltonian system is studied with the use of a symplectic coupled map lattice. Besides the clustered states for the attractive interaction, a novel chaotic ordered state is found for a system with repulsive interaction, characterized by a dispersed state of particles. The dispersed and clustered states form an onion-like structure in phase space. The degree of order increases towards the center of the onion, while chaos is enhanced at the edge between ordered and random chaotic states. For a longer time scale, orbits itinerate over ordered and random states. The existence of these ordered states leads to anomalous long-time correlation for many quantifiers such as the global diffusion.
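The kind of model the abstract studies can be illustrated with a minimal sketch (not the authors' code) of a globally coupled symplectic map lattice in Python; the kick-drift update and all parameter values below are assumptions chosen for clarity, in the spirit of the Kaneko-Konishi model. The pairwise sine kicks are antisymmetric, so total momentum is conserved exactly, a basic sanity check for this class of Hamiltonian maps.

```python
import math

def step(xs, ps, K=1.0):
    """One iteration of a globally coupled symplectic map lattice:
    p_i <- p_i + (K/N) * sum_j sin(2*pi*(x_j - x_i))   (kick)
    x_i <- x_i + p_i                                    (drift, on the unit torus)
    The force on i from j is minus the force on j from i, so the
    total momentum sum(ps) is an exact invariant of the map."""
    n = len(xs)
    new_ps = [p + (K / n) * sum(math.sin(2 * math.pi * (xj - xi)) for xj in xs)
              for xi, p in zip(xs, ps)]
    new_xs = [(xi + pi) % 1.0 for xi, pi in zip(xs, new_ps)]
    return new_xs, new_ps

# Four particles with small random-looking momenta (illustrative values)
xs = [0.1, 0.35, 0.6, 0.85]
ps = [0.02, -0.01, 0.0, -0.01]
total0 = sum(ps)
for _ in range(100):
    xs, ps = step(xs, ps)
```

With K > 0 the interaction is attractive (clustered states); flipping the sign of K gives the repulsive, dispersed regime discussed in the abstract.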
Bayesian Decision Theoretical Framework for Clustering
ERIC Educational Resources Information Center
Chen, Mo
2011-01-01
In this thesis, we establish a novel probabilistic framework for the data clustering problem from the perspective of Bayesian decision theory. The Bayesian decision theory view justifies the important questions: what is a cluster and what a clustering algorithm should optimize. We prove that the spectral clustering (to be specific, the…
Histamine headache; Headache - histamine; Migrainous neuralgia; Headache - cluster; Horton's headache; Vascular headache - cluster ... be related to the body's sudden release of histamine (chemical in the body released during an allergic ...
Management of cluster headache.
Tfelt-Hansen, Peer C; Jensen, Rigmor H
2012-07-01
For most cluster headache patients there are fairly good treatment options, both for acute attacks and for prophylaxis. The big problem is the diagnosis of cluster headache, as demonstrated by the diagnostic delay of 7 years. However, the relatively short-lasting attacks of pain in one eye with typical associated symptoms should lead the family doctor to suspect cluster headache, resulting in a referral to a neurologist or a headache centre with experience in the treatment of cluster headache. PMID:22650381
Sanfilippo, Antonio P.; Calapristi, Augustin J.; Crow, Vernon L.; Hetzler, Elizabeth G.; Turner, Alan E.
2004-05-26
We present an approach to the disambiguation of cluster labels that capitalizes on the notion of semantic similarity to assign WordNet senses to cluster labels. The approach provides interesting insights on how document clustering can provide the basis for developing a novel approach to word sense disambiguation.
NASA Astrophysics Data System (ADS)
Katgert, P.; Murdin, P.
2000-11-01
Abell clusters are the most conspicuous groupings of galaxies identified by George Abell on the plates of the first photographic survey made with the SCHMIDT TELESCOPE at Mount Palomar in the 1950s. Sometimes, the term Abell clusters is used as a synonym of nearby, optically selected galaxy clusters....
Matlab Cluster Ensemble Toolbox
2009-04-27
This is a Matlab toolbox for investigating the application of cluster ensembles to data classification, with the objective of improving the accuracy and/or speed of clustering. The toolbox divides the cluster ensemble problem into four areas, providing functionality for each. These include: (1) synthetic data generation, (2) clustering to generate individual data partitions and similarity matrices, (3) consensus function generation and final clustering to generate ensemble data partitioning, and (4) implementation of accuracy metrics. With regard to data generation, Gaussian data of arbitrary dimension can be generated. The kcenters algorithm can then be used to generate individual data partitions by either (a) subsampling the data and clustering each subsample, or (b) randomly initializing the algorithm and generating a clustering for each initialization. In either case an overall similarity matrix can be computed using a consensus function operating on the individual similarity matrices. A final clustering can be performed and performance metrics are provided for evaluation purposes.
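The consensus step described above, combining individual partitions through their similarity matrices into a final ensemble partition, can be sketched as follows. This is an illustrative Python re-implementation of the general co-association idea, not code from the MATLAB toolbox itself; the 0.5 threshold and the connected-components consensus rule are assumptions.

```python
def coassociation(partitions):
    """Average co-association matrix: S[i][j] is the fraction of the
    input partitions that place items i and j in the same cluster."""
    n = len(partitions[0])
    S = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    S[i][j] += 1.0 / len(partitions)
    return S

def consensus_clusters(S, threshold=0.5):
    """Final ensemble partition: connected components of the graph
    linking pairs whose co-association exceeds the threshold."""
    n = len(S)
    label = [-1] * n
    current = 0
    for seed in range(n):
        if label[seed] != -1:
            continue
        stack = [seed]
        label[seed] = current
        while stack:
            i = stack.pop()
            for j in range(n):
                if label[j] == -1 and S[i][j] > threshold:
                    label[j] = current
                    stack.append(j)
        current += 1
    return label

# Three individual partitions of six items (e.g. from subsampled runs)
partitions = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
]
labels = consensus_clusters(coassociation(partitions))
```

Note that the third partition uses different label names for the same grouping; the co-association matrix is invariant to label permutation, which is the point of this style of consensus function.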
Cairoli, Andrea; Piovani, Duccio; Jensen, Henrik Jeldtoft
2014-12-31
We propose a new procedure to monitor and forecast the onset of transitions in high-dimensional complex systems. We describe our procedure by an application to the tangled nature model of evolutionary ecology. The quasistable configurations of the full stochastic dynamics are taken as input for a stability analysis by means of the deterministic mean-field equations. Numerical analysis of the high-dimensional stability matrix allows us to identify unstable directions associated with eigenvalues with a positive real part. The overlap of the instantaneous configuration vector of the full stochastic system with the eigenvectors of the unstable directions of the deterministic mean-field approximation is found to be a good early warning of the transitions occurring intermittently.
Adams, Dean C
2014-09-01
Phylogenetic signal is the tendency for closely related species to display similar trait values due to their common ancestry. Several methods have been developed for quantifying phylogenetic signal in univariate traits and for sets of traits treated simultaneously, and the statistical properties of these approaches have been extensively studied. However, methods for assessing phylogenetic signal in high-dimensional multivariate traits like shape are less well developed, and their statistical performance is not well characterized. In this article, I describe a generalization of the K statistic of Blomberg et al. that is useful for quantifying and evaluating phylogenetic signal in highly dimensional multivariate data. The method (K(mult)) is found from the equivalency between statistical methods based on covariance matrices and those based on distance matrices. Using computer simulations based on Brownian motion, I demonstrate that the expected value of K(mult) remains at 1.0 as trait variation among species is increased or decreased, and as the number of trait dimensions is increased. By contrast, estimates of phylogenetic signal found with a squared-change parsimony procedure for multivariate data change with increasing trait variation among species and with increasing numbers of trait dimensions, confounding biological interpretations. I also evaluate the statistical performance of hypothesis testing procedures based on K(mult) and find that the method displays appropriate Type I error and high statistical power for detecting phylogenetic signal in high-dimensional data. Statistical properties of K(mult) were consistent for simulations using bifurcating and random phylogenies, for simulations using different numbers of species, for simulations that varied the number of trait dimensions, and for different underlying models of trait covariance structure. 
Overall these findings demonstrate that K(mult) provides a useful means of evaluating phylogenetic signal in high-dimensional
NASA Astrophysics Data System (ADS)
Zhan, You-Bang; Zhang, Qun-Yong; Wang, Yu-Wu; Ma, Peng-Cheng
2010-01-01
We propose a scheme to teleport an unknown single-qubit state by using a high-dimensional entangled state as the quantum channel. As a special case, a scheme for teleportation of an unknown single-qubit state via three-dimensional entangled state is investigated in detail. Also, this scheme can be directly generalized to an unknown f-dimensional state by using a d-dimensional entangled state (d > f) as the quantum channel.
Cool Cluster Correctly Correlated
Varganov, Sergey Aleksandrovich
2005-01-01
tens of atoms. Therefore, they are quantum objects. Some qualitative information about the geometries of such clusters can be obtained with classical empirical methods, for example geometry optimization using an empirical Lennard-Jones potential. However, to predict their accurate geometries and other physical and chemical properties it is necessary to solve a Schroedinger equation. If one is not interested in dynamics of clusters it is enough to solve the stationary (time-independent) Schroedinger equation (HΦ=EΦ). This equation represents a multidimensional eigenvalue problem. The solution of the Schroedinger equation is a set of eigenvectors (wave functions) and their eigenvalues (energies). The lowest energy solution (wave function) corresponds to the ground state of the cluster. The other solutions correspond to excited states. The wave function gives all information about the quantum state of the cluster and can be used to calculate different physical and chemical properties, such as photoelectron, X-ray, NMR, EPR spectra, dipole moment, polarizability etc. The dimensionality of the Schroedinger equation is determined by the number of particles (nuclei and electrons) in the cluster. The analytic solution is only known for a two particle problem. In order to solve the equation for clusters of interest it is necessary to make a number of approximations and use numerical methods.
Schuster, Tibor; Pang, Menglan; Platt, Robert W
2016-01-01
PURPOSE: The high-dimensional propensity score algorithm attempts to improve control of confounding in typical treatment effect studies in pharmacoepidemiology and is increasingly being used for the analysis of large administrative databases. Within this multi-step variable selection algorithm, the marginal prevalence of non-zero covariate values is considered to be an indicator for a count variable's potential confounding impact. We investigate the role of the marginal prevalence of confounder variables on potentially caused bias magnitudes when estimating risk ratios in point exposure studies with binary outcomes. METHODS: We apply the law of total probability in conjunction with an established bias formula to derive and illustrate relative bias boundaries with respect to marginal confounder prevalence. RESULTS: We show that maximum possible bias magnitudes can occur at any marginal prevalence level of a binary confounder variable. In particular, we demonstrate that, in case of rare or very common exposures, low and high prevalent confounder variables can still have large confounding impact on estimated risk ratios. CONCLUSIONS: Covariate pre-selection by prevalence may lead to sub-optimal confounder sampling within the high-dimensional propensity score algorithm. While we believe that the high-dimensional propensity score has important benefits in large-scale pharmacoepidemiologic studies, we recommend omitting the prevalence-based empirical identification of candidate covariates. PMID:25866189
Simon, Richard M; Subramanian, Jyothi; Li, Ming-Chung; Menezes, Supriya
2011-05-01
Developments in whole genome biotechnology have stimulated statistical focus on prediction methods. We review here methodology for classifying patients into survival risk groups and for using cross-validation to evaluate such classifications. Measures of discrimination for survival risk models include separation of survival curves, time-dependent ROC curves and Harrell's concordance index. For high-dimensional data applications, however, computing these measures as re-substitution statistics on the same data used for model development results in highly biased estimates. Most developments in methodology for survival risk modeling with high-dimensional data have utilized separate test data sets for model evaluation. Cross-validation has sometimes been used for optimization of tuning parameters. In many applications, however, the data available are too limited for effective division into training and test sets and consequently authors have often either reported re-substitution statistics or analyzed their data using binary classification methods in order to utilize familiar cross-validation. In this article we have tried to indicate how to utilize cross-validation for the evaluation of survival risk models; specifically how to compute cross-validated estimates of survival distributions for predicted risk groups and how to compute cross-validated time-dependent ROC curves. We have also discussed evaluation of the statistical significance of a survival risk model and evaluation of whether high-dimensional genomic data adds predictive accuracy to a model based on standard covariates alone.
Bhadra, Anindya; Mallick, Bani K
2013-06-01
We describe a Bayesian technique to (a) perform a sparse joint selection of significant predictor variables and significant inverse covariance matrix elements of the response variables in a high-dimensional linear Gaussian sparse seemingly unrelated regression (SSUR) setting and (b) perform an association analysis between the high-dimensional sets of predictors and responses in such a setting. To search the high-dimensional model space, where both the number of predictors and the number of possibly correlated responses can be larger than the sample size, we demonstrate that a marginalization-based collapsed Gibbs sampler, in combination with spike and slab type of priors, offers a computationally feasible and efficient solution. As an example, we apply our method to an expression quantitative trait loci (eQTL) analysis on publicly available single nucleotide polymorphism (SNP) and gene expression data for humans where the primary interest lies in finding the significant associations between the sets of SNPs and possibly correlated genetic transcripts. Our method also allows for inference on the sparse interaction network of the transcripts (response variables) after accounting for the effect of the SNPs (predictor variables). We exploit properties of Gaussian graphical models to make statements concerning conditional independence of the responses. Our method compares favorably to existing Bayesian approaches developed for this purpose. PMID:23607608
Semi-Supervised Kernel Mean Shift Clustering.
Anand, Saket; Mittal, Sushil; Tuzel, Oncel; Meer, Peter
2014-06-01
Mean shift clustering is a powerful nonparametric technique that does not require prior knowledge of the number of clusters and does not constrain the shape of the clusters. However, being completely unsupervised, its performance suffers when the original distance metric fails to capture the underlying cluster structure. Despite recent advances in semi-supervised clustering methods, there has been little effort towards incorporating supervision into mean shift. We propose a semi-supervised framework for kernel mean shift clustering (SKMS) that uses only pairwise constraints to guide the clustering procedure. The points are first mapped to a high-dimensional kernel space where the constraints are imposed by a linear transformation of the mapped points. This is achieved by modifying the initial kernel matrix by minimizing a log det divergence-based objective function. We show the advantages of SKMS by evaluating its performance on various synthetic and real datasets while comparing with state-of-the-art semi-supervised clustering algorithms. PMID:26353281
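For readers unfamiliar with the base algorithm that SKMS extends, a minimal unsupervised mean shift (Gaussian kernel, no pairwise constraints) can be sketched in a few lines of Python. The bandwidth, the mode-merging rule, and the toy data are illustrative assumptions; the semi-supervised kernel machinery of the paper is not reproduced here.

```python
import math

def mean_shift(points, bandwidth=1.0, iters=50):
    """Plain mean shift with a Gaussian kernel: each point iteratively
    moves to the kernel-weighted mean of all data points, climbing
    toward a mode of the estimated density. No cluster count needed."""
    modes = [list(p) for p in points]
    for _ in range(iters):
        for m in modes:
            weights, shifted = [], [0.0] * len(m)
            for p in points:
                d2 = sum((a - b) ** 2 for a, b in zip(m, p))
                w = math.exp(-d2 / (2 * bandwidth ** 2))
                weights.append(w)
                for k in range(len(m)):
                    shifted[k] += w * p[k]
            total = sum(weights)
            for k in range(len(m)):
                m[k] = shifted[k] / total
    # Merge modes that converged to (almost) the same location
    labels, centers = [], []
    for m in modes:
        for idx, c in enumerate(centers):
            if sum((a - b) ** 2 for a, b in zip(m, c)) < bandwidth ** 2:
                labels.append(idx)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels

# Two well-separated toy blobs; mean shift finds two modes on its own
pts = [(0.0, 0.0), (0.2, 0.1), (0.1, -0.1), (5.0, 5.0), (5.1, 4.9)]
labels = mean_shift(pts, bandwidth=0.8)
```

SKMS's contribution is precisely that when the plain Euclidean kernel above fails to capture the cluster structure, pairwise constraints reshape the kernel before this procedure runs.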
ERIC Educational Resources Information Center
Ackerman, Brian P.; Schoff, Kristen; Levinson, Karen; Youngstrom, Eric; Izard, Carroll E.
1999-01-01
Examined relations between alternative representations of poverty cofactors and promotion processes, and problem behaviors of 6- and 7-year-olds from disadvantaged families. Found that single-index risk representations and promotion variables predicted aggression but not anxiety/depression. An additive model of individual risk indicators performed…
ERIC Educational Resources Information Center
Hou, Huei-Tse
2011-01-01
In some higher education courses that focus on case studies, teachers can provide situated scenarios (such as business bottlenecks and medical cases) and problem-solving discussion tasks for students to promote their cognitive skills. There is limited research on the content, performance, and behavioral patterns of teaching using online…
Computational analysis of high-dimensional flow cytometric data for diagnosis and discovery.
Aghaeepour, Nima; Brinkman, Ryan
2014-01-01
Recent technological advancements have enabled the flow cytometric measurement of tens of parameters on millions of cells. Conventional manual data analysis and bioinformatics tools cannot provide a complete analysis of these datasets due to this complexity. In this chapter we will provide an overview of a general data analysis pipeline both for automatic identification of cell populations of known importance (e.g., diagnosis by identification of predefined cell population) and for exploratory analysis of cohorts of flow cytometry assays (e.g., discovery of new correlates of a malignancy). We provide three real-world examples of how unsupervised discovery has been used in basic and clinical research. We also discuss challenges for evaluation of the algorithms developed for (1) identification of cell populations using clustering, (2) identification of specific cell populations, and (3) supervised analysis for discriminating between patient subgroups.
NASA Technical Reports Server (NTRS)
Stothers, Richard B.; Chin, Chao-Wen
1992-01-01
New theoretical evolutionary sequences of models for stars with low metallicities, appropriate to the Small Magellanic Cloud, are derived with both standard Cox-Stewart opacities and the new Rogers-Iglesias opacities. Only those sequences with little or no convective core overshooting are found to be capable of reproducing the two most critical observations: the maximum effective temperature displayed by the hot evolved stars and the difference between the average bolometric magnitudes of the hot and cool evolved stars. An upper limit to the ratio of the mean overshoot distance beyond the classical Schwarzschild core boundary to the local pressure scale height is set at 0.2. It is inferred from the frequency of cool supergiants in NGC 330 that the Ledoux criterion, rather than the Schwarzschild criterion, for convection and semiconvection in the envelopes of massive stars is strongly favored. Residuals from the fitting for NGC 330 suggest the possibility of fast interior rotation in the stars of this cluster. NGC 330 and NGC 458 have ages of about 3 × 10^7 and about 1 × 10^8 yr, respectively.
Gene expression data clustering using a multiobjective symmetry based clustering technique.
Saha, Sriparna; Ekbal, Asif; Gupta, Kshitija; Bandyopadhyay, Sanghamitra
2013-11-01
The invention of microarrays has rapidly changed the state of biological and biomedical research. Clustering algorithms play an important role in analyzing microarray data sets, where identifying groups of co-expressed genes is a very difficult task. Here we have posed the problem of clustering microarray data as a multiobjective clustering problem. A new symmetry based fuzzy clustering technique is developed to solve this problem. The effectiveness of the proposed technique is demonstrated on five publicly available benchmark data sets. Results are compared with some widely used microarray clustering techniques. Statistical and biological significance tests have also been carried out. PMID:24209942
Respiration correction by clustering in ultrasound images
NASA Astrophysics Data System (ADS)
Wu, Kaizhi; Chen, Xi; Ding, Mingyue; Sang, Nong
2016-03-01
Respiratory motion is a challenging factor for image acquisition, image-guided procedures, and perfusion quantification using contrast-enhanced ultrasound in the abdominal and thoracic region. In order to reduce the influence of respiratory motion, respiratory correction methods were investigated. In this paper we propose a novel, cluster-based respiratory correction method. In the proposed method, we first assign the image frames to their corresponding respiratory phases using spectral clustering, and then achieve image correction automatically by finding a cluster whose points are close to each other. Unlike traditional gating methods, we do not need to estimate the breathing cycle accurately, because images from the same respiratory phase are similar to each other and therefore close in high-dimensional space. The proposed method is tested on a simulated image sequence and a real ultrasound image sequence. The experimental results show the effectiveness of the proposed method, both quantitatively and qualitatively.
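The spectral grouping step that such a method relies on can be illustrated with a toy two-way spectral split: given a frame-similarity matrix, the sign pattern of the Fiedler vector (the eigenvector of the second-smallest eigenvalue of the graph Laplacian L = D - W) separates the two groups. This is a generic sketch under an assumed similarity matrix, not the authors' pipeline, and real spectral clustering of many phases would use a full eigensolver and k-means on several eigenvectors.

```python
import math

def fiedler_partition(W, iters=200):
    """Two-way spectral split via the sign of the Fiedler vector,
    found by power iteration on shift*I - L (which reverses the
    eigenvalue order of L) while deflating the constant eigenvector."""
    n = len(W)
    deg = [sum(row) for row in W]
    shift = 2 * max(deg) + 1.0           # exceeds L's largest eigenvalue
    v = [math.sin(i + 1.0) for i in range(n)]   # arbitrary non-constant start
    for _ in range(iters):
        mean = sum(v) / n                # remove the constant (lambda = 0) component
        v = [x - mean for x in v]
        # multiply v by (shift*I - L), where (L v)_i = deg_i*v_i - (W v)_i
        v = [shift * v[i] - (deg[i] * v[i] - sum(W[i][j] * v[j] for j in range(n)))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return [0 if x < 0 else 1 for x in v]

# Toy similarity of 6 frames: frames 0-2 and 3-5 share a respiratory
# phase (high mutual similarity); similarity across phases is low.
W = [[0.9 if (i < 3) == (j < 3) else 0.1 for j in range(6)] for i in range(6)]
labels = fiedler_partition(W)
```

Because an eigenvector is only defined up to sign, which group gets label 0 is arbitrary; only the partition itself is meaningful.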
Pyne, Saumyadipta; Lee, Sharon X.; Wang, Kui; Irish, Jonathan; Tamayo, Pablo; Nazaire, Marc-Danie; Duong, Tarn; Ng, Shu-Kay; Hafler, David; Levy, Ronald; Nolan, Garry P.; Mesirov, Jill; McLachlan, Geoffrey J.
2014-01-01
In biomedical applications, an experimenter encounters different potential sources of variation in data such as individual samples, multiple experimental conditions, and multivariate responses of a panel of markers such as from a signaling network. In multiparametric cytometry, which is often used for analyzing patient samples, such issues are critical. While computational methods can identify cell populations in individual samples, without the ability to automatically match them across samples, it is difficult to compare and characterize the populations in typical experiments, such as those responding to various stimulations or distinctive of particular patients or time-points, especially when there are many samples. Joint Clustering and Matching (JCM) is a multi-level framework for simultaneous modeling and registration of populations across a cohort. JCM models every population with a robust multivariate probability distribution. Simultaneously, JCM fits a random-effects model to construct an overall batch template – used for registering populations across samples, and classifying new samples. By tackling systems-level variation, JCM supports practical biomedical applications involving large cohorts. Software for fitting the JCM models have been implemented in an R package EMMIX-JCM, available from http://www.maths.uq.edu.au/~gjm/mix_soft/EMMIX-JCM/. PMID:24983991
Local-Learning-Based Feature Selection for High-Dimensional Data Analysis
Sun, Yijun; Todorovic, Sinisa; Goodison, Steve
2012-01-01
This paper considers feature selection for data classification in the presence of a huge number of irrelevant features. We propose a new feature-selection algorithm that addresses several major issues with prior work, including problems with algorithm implementation, computational complexity, and solution accuracy. The key idea is to decompose an arbitrarily complex nonlinear problem into a set of locally linear ones through local learning, and then learn feature relevance globally within the large margin framework. The proposed algorithm is based on well-established machine learning and numerical analysis techniques, without making any assumptions about the underlying data distribution. It is capable of processing many thousands of features within minutes on a personal computer while maintaining a very high accuracy that is nearly insensitive to a growing number of irrelevant features. Theoretical analyses of the algorithm’s sample complexity suggest that the algorithm has a logarithmical sample complexity with respect to the number of features. Experiments on 11 synthetic and real-world data sets demonstrate the viability of our formulation of the feature-selection problem for supervised learning and the effectiveness of our algorithm. PMID:20634556
NASA Astrophysics Data System (ADS)
Xiongwen, Wang; Huazhong, Wang; Xiaopeng, Zheng
2014-12-01
The spatial aliasing of seismic data is usually serious because of the sub-sampling rate of the acquisition system. It induces amplitude artifacts or blurs the migration result if it is not removed before migration. The compressed sensing (CS) method has been proven to be an effective tool for restoring a sub-sampled signal that is compressible in another domain. Since the wave-fronts of seismic data are sparse and linear in a local spatiotemporal window, they can be significantly compressed by the linear Radon transform or the Fourier transform; seismic data interpolation can therefore be considered a CS problem. The approximate solution of a CS problem using the L0-norm can be achieved by the matching pursuit (MP) algorithm, but MP becomes intractable due to the high computing cost induced by the increasing dimension of the problem. In order to tackle this issue, a variant of MP, weighted matching pursuit (WMP), is presented in this paper. Since there is little spatial aliasing in the low-frequency data and the events are supposed to be linear, the linear Radon spectrogram of the interpolated low-frequency data can be used to predict the energy distribution of the high-frequency data in the frequency-wavenumber (FK) domain. The predicted energy distribution is then used to form the weighting factor of WMP. With this factor, WMP is able to distinguish the linear events from the spatial aliasing in the FK domain, and WMP is also shown to be an efficient algorithm. Since projection onto convex sets (POCS) is another common sparsity-based method, we use Fourier POCS and WMP to realize high-dimensional interpolation in numerical examples. The numerical examples show that the interpolation results of WMP significantly improve the quality of the seismic data and, in turn, the quality of the migration result.
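Plain matching pursuit, the greedy L0 solver that WMP modifies, can be sketched as below. The toy orthonormal dictionary and signal are assumptions for illustration, and the frequency-dependent weighting that defines WMP is only indicated in the docstring, not implemented.

```python
def matching_pursuit(signal, dictionary, n_iter=10):
    """Greedy L0 sparse approximation: at each step pick the unit-norm
    dictionary atom most correlated with the residual and subtract its
    projection. A weighted variant (WMP-style) would multiply each
    correlation by a precomputed weight before taking the argmax."""
    residual = list(signal)
    coeffs = [0.0] * len(dictionary)
    for _ in range(n_iter):
        # correlation of every atom with the current residual
        corrs = [sum(r * a for r, a in zip(residual, atom)) for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda k: abs(corrs[k]))
        coeffs[best] += corrs[best]
        residual = [r - corrs[best] * a for r, a in zip(residual, dictionary[best])]
    return coeffs, residual

# Orthonormal toy dictionary (4-sample "Fourier-like" atoms)
d = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
    [0.5, 0.5, -0.5, -0.5],
]
# A sparse signal built from atoms 0 and 2 only
signal = [3.0 * a + 1.0 * b for a, b in zip(d[0], d[2])]
coeffs, residual = matching_pursuit(signal, d, n_iter=4)
```

For an orthonormal dictionary MP recovers the sparse coefficients exactly in as many steps as there are active atoms; with the coherent, redundant dictionaries used in seismic interpolation, convergence is slower, which motivates the weighting trick.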
The control of high-dimensional chaos in time-delay systems to an arbitrary goal dynamics.
Bunner, M. J.
1999-03-01
We present the control of high-dimensional chaos, with a possibly large number of positive Lyapunov exponents, of unknown time-delay systems to an arbitrary goal dynamics. We give an existence-and-uniqueness theorem for the control force. In the case of an unknown system, a formula to compute a model-based control force is derived. We give an example by demonstrating the control of the Mackey-Glass system toward a fixed point and toward Rössler dynamics. (c) 1999 American Institute of Physics.
Okuno, Yuta; Small, Michael; Gotoda, Hiroshi
2015-04-01
We have examined the dynamics of self-excited thermoacoustic instability in a fundamentally and practically important gas-turbine model combustion system on the basis of complex network approaches. We have incorporated sophisticated complex networks consisting of cycle networks and phase space networks, neither of which has been considered in the areas of combustion physics and science. Pseudo-periodicity and high-dimensionality exist in the dynamics of thermoacoustic instability, including the possible presence of a clear power-law distribution and small-world-like nature. PMID:25933655
Dynamical analysis of Grover's search algorithm in arbitrarily high-dimensional search spaces
NASA Astrophysics Data System (ADS)
Jin, Wenliang
2016-01-01
We discuss at length the dynamical behavior of Grover's search algorithm for which all the Walsh-Hadamard transformations contained in this algorithm are exposed to their respective random perturbations inducing the augmentation of the dimension of the search space. We give the concise and general mathematical formulations for approximately characterizing the maximum success probabilities of finding a unique desired state in a large unsorted database and their corresponding numbers of Grover iterations, which are applicable to the search spaces of arbitrary dimension and are used to answer a salient open problem posed by Grover (Phys Rev Lett 80:4329-4332, 1998).
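For the unperturbed algorithm with a single marked item, the success probability the paper generalizes reduces to the standard Grover formula p(k) = sin²((2k+1)θ) with θ = arcsin(1/√N); a minimal sketch:

```python
import math

def grover_success(N, iterations=None):
    """Success probability of finding the single marked item in an
    unsorted database of size N after k Grover iterations.
    With theta = arcsin(1/sqrt(N)), p(k) = sin^2((2k+1)*theta)."""
    theta = math.asin(1.0 / math.sqrt(N))
    if iterations is None:
        # near-optimal iteration count, k ~ pi/(4*theta) - 1/2
        iterations = round(math.pi / (4 * theta) - 0.5)
    p = math.sin((2 * iterations + 1) * theta) ** 2
    return iterations, p

k, p = grover_success(1_000_000)   # ~785 iterations, success prob ~1
```

The perturbed Walsh-Hadamard transformations studied in the paper modify both the effective rotation angle and the optimal iteration count; this sketch only reproduces the ideal baseline.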
NASA Technical Reports Server (NTRS)
1999-01-01
Penetrating 25,000 light-years of obscuring dust and myriad stars, NASA's Hubble Space Telescope has provided the clearest view yet of one of the largest young clusters of stars inside our Milky Way galaxy, located less than 100 light-years from the very center of the Galaxy. With a mass equivalent to more than 10,000 stars like our Sun, the monster cluster is ten times larger than typical young star clusters scattered throughout our Milky Way. It is destined to be ripped apart in just a few million years by gravitational tidal forces in the galaxy's core, but in its brief lifetime it shines more brightly than any other star cluster in the Galaxy. The Quintuplet Cluster is 4 million years old. It has stars on the verge of blowing up as supernovae, and it is the home of the brightest star seen in the Galaxy, called the Pistol star. This image was taken in infrared light by Hubble's NICMOS camera in September 1997. The false colors correspond to infrared wavelengths: the galactic center stars are white, the red stars are enshrouded in dust or behind dust, and the blue stars are foreground stars between us and the Milky Way's center. The cluster is hidden from direct view behind black dust clouds in the constellation Sagittarius. If the cluster could be seen from Earth, it would appear to the naked eye as a 3rd magnitude star spanning 1/6th of the full moon's diameter.
A Nonparametric Bayesian Model for Nested Clustering.
Lee, Juhee; Müller, Peter; Zhu, Yitan; Ji, Yuan
2016-01-01
We propose a nonparametric Bayesian model for clustering where clusters of experimental units are determined by a shared pattern of clustering another set of experimental units. The proposed model is motivated by the analysis of protein activation data, where we cluster proteins such that all proteins in one cluster give rise to the same clustering of patients. That is, we define clusters of proteins by the way that patients group with respect to the corresponding protein activations. This is in contrast to (almost) all currently available models that use shared parameters in the sampling model to define clusters, including, in particular, model-based clustering, Dirichlet process mixtures, product partition models, and more. We show results for two typical biostatistical inference problems that give rise to clustering. PMID:26519174
NASA Astrophysics Data System (ADS)
Sangaletti Terçariol, César Augusto; de Moura Kiipper, Felipe; Souto Martinez, Alexandre
2007-03-01
Consider that the coordinates of N points are randomly generated along the edges of a d-dimensional hypercube (random point problem). The probability P_{m,n}(d,N) that an arbitrary point is the mth nearest neighbour to its own nth nearest neighbour (Cox probabilities) plays an important role in spatial statistics. Also, it has been useful in the description of physical processes in disordered media. Here we propose a simpler derivation of Cox probabilities, where we stress the role played by the system dimensionality d. In the limit d → ∞, the distances between pairs of points become independent (random link model) and closed analytical forms for the neighbourhood probabilities are obtained both for the thermodynamic limit and for finite-size systems. Breaking the distance symmetry constraint leads us to the random map model, for which the Cox probabilities are obtained for two cases: whether a point is its own nearest neighbour or not.
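A Cox-type neighbourhood probability can also be estimated by direct simulation. The sketch below handles the m = n = 1 case (the chance that a point and its nearest neighbour are mutual nearest neighbours) for uniform points in a d-cube; it is an illustrative special case, not the paper's analytical derivation.

```python
import math
import random

def mutual_nn_fraction(n_points=200, d=2, trials=20, seed=0):
    """Monte Carlo estimate of the probability that a point is the
    nearest neighbour of its own nearest neighbour, for points drawn
    uniformly in the unit d-cube."""
    rng = random.Random(seed)
    mutual = total = 0
    for _ in range(trials):
        pts = [[rng.random() for _ in range(d)] for _ in range(n_points)]

        def nearest(i):
            best, best_d = None, math.inf
            for j in range(n_points):
                if j == i:
                    continue
                dist = sum((a - b) ** 2 for a, b in zip(pts[i], pts[j]))
                if dist < best_d:
                    best, best_d = j, dist
            return best

        nn = [nearest(i) for i in range(n_points)]
        mutual += sum(nn[nn[i]] == i for i in range(n_points))
        total += n_points
    return mutual / total

frac = mutual_nn_fraction()   # close to ~0.62 in two dimensions
```

For a planar Poisson process the exact value is about 0.6215; the finite-box simulation lands near it, with mild edge effects.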
Joint Estimation of Multiple Graphical Models from High Dimensional Time Series
Qiu, Huitong; Han, Fang; Liu, Han; Caffo, Brian
2015-01-01
In this manuscript we consider the problem of jointly estimating multiple graphical models in high dimensions. We assume that the data are collected from n subjects, each of which consists of T possibly dependent observations. The graphical models of subjects vary, but are assumed to change smoothly corresponding to a measure of closeness between subjects. We propose a kernel-based method for jointly estimating all graphical models. Theoretically, under a double asymptotic framework, where both (T, n) and the dimension d can increase, we provide the explicit rate of convergence in parameter estimation. It characterizes the strength one can borrow across different individuals and the impact of data dependence on parameter estimation. Empirically, experiments on both synthetic and real resting-state functional magnetic resonance imaging (rs-fMRI) data illustrate the effectiveness of the proposed method. PMID:26924939
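A minimal sketch of borrowing strength across subjects via a kernel on the closeness measure: the paper estimates graphical models (precision matrices), whereas this toy only smooths sample covariances, and the Gaussian kernel choice is ours.

```python
import numpy as np

def kernel_weighted_covariance(samples_by_subject, closeness, target, bandwidth=1.0):
    """Kernel-smoothed covariance estimate at a target closeness value:
    each subject's sample covariance is weighted by a Gaussian kernel on
    the closeness measure, so nearby subjects contribute more."""
    w = np.array([np.exp(-0.5 * ((c - target) / bandwidth) ** 2)
                  for c in closeness])
    w /= w.sum()
    covs = [np.cov(np.asarray(X).T, bias=True) for X in samples_by_subject]
    return sum(wi * S for wi, S in zip(w, covs))

# Two subjects at closeness 0.0 and 1.0; with a narrow bandwidth the
# estimate at target 0.0 is dominated by the first subject.
X0 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
X1 = [[2.0, 2.0], [-2.0, -2.0], [2.0, -2.0], [-2.0, 2.0]]
S_hat = kernel_weighted_covariance([X0, X1], [0.0, 1.0], target=0.0,
                                   bandwidth=0.05)
```

A graphical-model estimator would then be obtained by, e.g., applying a sparse inverse-covariance step to the smoothed estimate.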
Zawadzka-Kazimierczuk, Anna; Koźmiński, Wiktor; Billeter, Martin
2012-09-01
While NMR studies of proteins typically aim at structure, dynamics or interactions, resonance assignments represent in almost all cases the initial step of the analysis. With increasing complexity of the NMR spectra, for example due to a decreasing extent of ordered structure, this task often becomes both difficult and time-consuming, and the recording of high-dimensional data with high resolution may be essential. Random sampling of the evolution time space, combined with sparse multidimensional Fourier transform (SMFT), allows for efficient recording of very high dimensional spectra (≥4 dimensions) while maintaining high resolution. However, the nature of these data demands automation of the assignment process. Here we present the program TSAR (Tool for SMFT-based Assignment of Resonances), which exploits all advantages of SMFT input. Moreover, its flexibility allows it to process data from any type of experiment that provides sequential connectivities. The algorithm was tested on several protein samples, including a disordered 81-residue fragment of the δ subunit of RNA polymerase from Bacillus subtilis containing various repetitive sequences. For our test examples, TSAR achieves a high percentage of assigned residues without any erroneous assignments. PMID:22806130
NASA Astrophysics Data System (ADS)
Miller, Christopher J.
2012-03-01
There are many examples of clustering in astronomy. Stars in our own galaxy are often seen as being gravitationally bound into tight globular or open clusters. The Solar System's Trojan asteroids cluster at the stable Lagrangian points of Jupiter's orbit. On the largest of scales, we find gravitationally bound clusters of galaxies, the Virgo cluster (in the constellation of Virgo at a distance of ~50 million light years) being a prime nearby example. The Virgo cluster subtends an angle of nearly 8° on the sky and is known to contain over a thousand member galaxies. Galaxy clusters play an important role in our understanding of the Universe. Clusters exist at peaks in the three-dimensional large-scale matter density field. Their sky (2D) locations are easy to detect in astronomical imaging data, and their mean galaxy redshifts (redshift is related to the third spatial dimension: distance) are often better (spectroscopically) and cheaper (photometrically) when compared with the entire galaxy population in large sky surveys. Photometric redshift (z) [photometric techniques use the broad-band filter magnitudes of a galaxy to estimate the redshift; spectroscopic techniques use the galaxy spectra and emission/absorption line features to measure the redshift] determinations of galaxies within clusters are accurate to better than delta_z = 0.05 [7], and when studied as a cluster population, the central galaxies form a line in color-magnitude space (called the E/S0 ridgeline and visible in Figure 16.3) that contains galaxies with similar stellar populations [15]. The shape of this E/S0 ridgeline enables astronomers to measure the cluster redshift to within delta_z = 0.01 [23]. The most accurate cluster redshift determinations come from spectroscopy of the member galaxies, where only a fraction of the members need to be spectroscopically observed [25,42] to get an accurate redshift for the whole system. If light traces mass in the Universe, then the locations
Cluster synchronization in oscillatory networks
NASA Astrophysics Data System (ADS)
Belykh, Vladimir N.; Osipov, Grigory V.; Petrov, Valentin S.; Suykens, Johan A. K.; Vandewalle, Joos
2008-09-01
Synchronous behavior in networks of coupled oscillators is a commonly observed phenomenon attracting a growing interest in physics, biology, communication, and other fields of science and technology. Besides global synchronization, one can also observe splitting of the full network into several clusters of mutually synchronized oscillators. In this paper, we study the conditions for such cluster partitioning into ensembles for the case of identical chaotic systems. We focus mainly on the existence and the stability of unique unconditional clusters whose rise does not depend on the origin of the other clusters. Also, conditional clusters in arrays of globally nonsymmetrically coupled identical chaotic oscillators are investigated. The design problem of organizing clusters into a given configuration is discussed.
Using Enrichment Clusters for Performance Based Identification.
ERIC Educational Resources Information Center
Renzulli, Joseph S.
2000-01-01
This article describes an enrichment cluster approach designed to create highly challenging learning opportunities that allow high potential students to identify themselves. The enrichment clusters focus students' attention on authentic learning applied to real-life problems. Guidelines for enrichment clusters are discussed, along with the teacher…
ERIC Educational Resources Information Center
Pottawattamie County School System, Council Bluffs, IA.
The 15 occupational clusters (transportation, fine arts and humanities, communications and media, personal service occupations, construction, hospitality and recreation, health occupations, marine science occupations, consumer and homemaking-related occupations, agribusiness and natural resources, environment, public service, business and office…
Donchev, Todor I.; Petrov, Ivan G.
2011-05-31
Described herein is an apparatus and a method for producing atom clusters based on a gas discharge within a hollow cathode. The hollow cathode includes one or more walls. The one or more walls define a sputtering chamber within the hollow cathode and include a material to be sputtered. A hollow anode is positioned at an end of the sputtering chamber, and atom clusters are formed when a gas discharge is generated between the hollow anode and the hollow cathode.
WExplore: hierarchical exploration of high-dimensional spaces using the weighted ensemble algorithm.
Dickson, Alex; Brooks, Charles L
2014-04-01
As most relevant motions in biomolecular systems are inaccessible to conventional molecular dynamics simulations, algorithms that enhance sampling of rare events are indispensable. Increasing interest in intrinsically disordered systems and the desire to target ensembles of protein conformations (rather than single structures) in drug development motivate the need for enhanced sampling algorithms that are not limited to "two-basin" problems, and can efficiently determine structural ensembles. For systems that are not well-studied, this must often be done with little or no information about the dynamics of interest. Here we present a novel strategy to determine structural ensembles that uses dynamically defined sampling regions that are organized in a hierarchical framework. It is based on the weighted ensemble algorithm, where an ensemble of copies of the system ("replicas") is directed to new regions of configuration space through merging and cloning operations. The sampling hierarchy allows for a large number of regions to be defined, while using only a small number of replicas that can be balanced over multiple length scales. We demonstrate this algorithm on two model systems that are analytically solvable and examine the 10-residue peptide chignolin in explicit solvent. The latter system is analyzed using a configuration space network, and novel hydrogen bonds are found that facilitate folding.
Statistical Significance of Clustering using Soft Thresholding
Huang, Hanwen; Liu, Yufeng; Yuan, Ming; Marron, J. S.
2015-01-01
Clustering methods have led to a number of important discoveries in bioinformatics and beyond. A major challenge in their use is determining which clusters represent important underlying structure, as opposed to spurious sampling artifacts. This challenge is especially serious, and very few methods are available, when the data are very high in dimension. Statistical Significance of Clustering (SigClust) is a recently developed cluster evaluation tool for high dimensional low sample size data. An important component of the SigClust approach is the very definition of a single cluster as a subset of data sampled from a multivariate Gaussian distribution. The implementation of SigClust requires the estimation of the eigenvalues of the covariance matrix for the null multivariate Gaussian distribution. We show that the original eigenvalue estimation can lead to a test that suffers from severe inflation of type-I error, in the important case where there are a few very large eigenvalues. This paper addresses this critical challenge using a novel likelihood based soft thresholding approach to estimate these eigenvalues, which leads to a much improved SigClust. Major improvements in SigClust performance are shown by both mathematical analysis, based on the new notion of Theoretical Cluster Index, and extensive simulation studies. Applications to some cancer genomic data further demonstrate the usefulness of these improvements. PMID:26755893
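The soft-thresholding idea can be sketched as follows. This is an illustrative reading of the approach: shrink the sample eigenvalues by a common amount τ, floor them at the background noise level, and choose τ so the total variance is preserved. The paper's likelihood-based procedure is more refined than this grid search.

```python
import numpy as np

def soft_threshold_eigenvalues(sample_eigvals, sigma2_bg):
    """Soft-threshold sample covariance eigenvalues toward a background
    noise level sigma2_bg while preserving total variance: find the
    smallest tau with sum(max(lam - tau, sigma2_bg)) <= sum(lam)."""
    lam = np.asarray(sample_eigvals, dtype=float)
    total = lam.sum()
    for tau in np.linspace(0.0, lam.max(), 10000):
        shrunk = np.maximum(lam - tau, sigma2_bg)
        if shrunk.sum() <= total:
            return shrunk
    return np.full_like(lam, sigma2_bg)

# One dominant eigenvalue plus small ones below the noise floor: the
# small eigenvalues are lifted to the noise level and the large one is
# shrunk to compensate.
shrunk_eigs = soft_threshold_eigenvalues([10.0, 0.5, 0.5, 0.5, 0.5],
                                         sigma2_bg=1.0)
```

The resulting eigenvalues define the null multivariate Gaussian against which the cluster index is calibrated; avoiding overestimated large eigenvalues is what controls the type-I error inflation described above.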
He, Ling Yan; Wang, Tie-Jun; Wang, Chuan
2016-07-11
High-dimensional quantum systems provide a higher capacity of quantum channel, which exhibits potential applications in quantum information processing. However, high-dimensional universal quantum logic gates are difficult to achieve directly with only high-dimensional interactions between two quantum systems, and a large number of two-dimensional gates is required to build even a small high-dimensional quantum circuit. In this paper, we propose a scheme to implement a general controlled-flip (CF) gate in which a high-dimensional single photon serves as the target qudit and stationary qubits work as the control logic qudit, by employing a three-level Λ-type system coupled with a whispering-gallery-mode microresonator. In our scheme, the required number of interactions between the photon and the solid-state system is greatly reduced compared with the traditional method, which decomposes the high-dimensional Hilbert space into 2-dimensional quantum spaces, and the experimental realization proceeds on a shorter temporal scale. Moreover, we discuss the performance and feasibility of our hybrid CF gate, concluding that it can be easily extended to a 2n-dimensional case and is feasible with current technology. PMID:27410818
The molecular matching problem
NASA Technical Reports Server (NTRS)
Kincaid, Rex K.
1993-01-01
Molecular chemistry contains many difficult optimization problems that have begun to attract the attention of optimizers in the Operations Research community. Problems including protein folding, molecular conformation, molecular similarity, and molecular matching have been addressed. Minimum energy conformations for simple molecular structures such as water clusters, Lennard-Jones microclusters, and short polypeptides have dominated the literature to date. However, a variety of interesting problems exist and we focus here on a molecular structure matching (MSM) problem.
Cowley, Benjamin R.; Kaufman, Matthew T.; Butler, Zachary S.; Churchland, Mark M.; Ryu, Stephen I.; Shenoy, Krishna V.; Yu, Byron M.
2014-01-01
Objective: Analyzing and interpreting the activity of a heterogeneous population of neurons can be challenging, especially as the number of neurons, experimental trials, and experimental conditions increases. One approach is to extract a set of latent variables that succinctly captures the prominent co-fluctuation patterns across the neural population. A key problem is that the number of latent variables needed to adequately describe the population activity is often greater than three, thereby preventing direct visualization of the latent space. By visualizing a small number of 2-d projections of the latent space or each latent variable individually, it is easy to miss salient features of the population activity. Approach: To address this limitation, we developed a Matlab graphical user interface (called DataHigh) that allows the user to quickly and smoothly navigate through a continuum of different 2-d projections of the latent space. We also implemented a suite of additional visualization tools (including playing out population activity timecourses as a movie and displaying summary statistics, such as covariance ellipses and average timecourses) and an optional tool for performing dimensionality reduction. Main results: To demonstrate the utility and versatility of DataHigh, we used it to analyze single-trial spike count and single-trial timecourse population activity recorded using a multi-electrode array, as well as trial-averaged population activity recorded using single electrodes. Significance: DataHigh was developed to fulfill a need for visualization in exploratory neural data analysis, which can provide intuition that is critical for building scientific hypotheses and models of population activity. PMID:24216250
Pseudospectral sampling of Gaussian basis sets as a new avenue to high-dimensional quantum dynamics
NASA Astrophysics Data System (ADS)
Heaps, Charles
This thesis presents a novel approach to modeling quantum molecular dynamics (QMD). Theoretical approaches to QMD are essential to understanding and predicting chemical reactivity and spectroscopy. We implement a method based on a trajectory-guided basis set. In this case, the nuclei are propagated in time using classical mechanics. Each nuclear configuration corresponds to a basis function in the quantum mechanical expansion. Using the time-dependent configurations as a basis set, we are able to evolve in time using relatively little information at each time step. We use a basis set of moving frozen (time-independent width) Gaussian functions that are well known to provide a simple and efficient basis set for nuclear dynamics. We introduce a new perspective on trajectory-guided Gaussian basis sets based on existing numerical methods, distinguishing between the Galerkin and collocation methods. In the former, the basis set is tested using basis functions, projecting the solution onto the functional space of the problem and requiring integration over all space. In the collocation method, the Dirac delta function tests the basis set, projecting the solution onto discrete points in space. This effectively reduces integral evaluation to function evaluation, a fundamental characteristic of pseudospectral methods. We adopt this idea for independent trajectory-guided Gaussian basis functions. We investigate a series of anharmonic vibrational models describing dynamics in up to six dimensions. The pseudospectral sampling is found to be as accurate as full integral evaluation; moreover, it is fully general, whereas exact integration is possible only on very particular model potential energy surfaces. Nonadiabatic dynamics are also investigated in models of photodissociation and collinear triatomic vibronic coupling. Using Ehrenfest trajectories to guide the basis set on multiple surfaces, we observe convergence to exact results using hundreds of basis functions.
Detecting alternative graph clusterings.
Mandala, Supreet; Kumara, Soundar; Yao, Tao
2012-07-01
The problem of graph clustering or community detection has enjoyed a lot of attention in complex networks literature. A quality function, modularity, quantifies the strength of clustering and on maximization yields sensible partitions. However, in most real world networks, there are an exponentially large number of near-optimal partitions with some being very different from each other. Therefore, picking an optimal clustering among the alternatives does not provide complete information about network topology. To tackle this problem, we propose a graph perturbation scheme which can be used to identify an ensemble of near-optimal and diverse clusterings. We establish analytical properties of modularity function under the perturbation which ensures diversity. Our approach is algorithm independent and therefore can leverage any of the existing modularity maximizing algorithms. We numerically show that our methodology can systematically identify very different partitions on several existing data sets. The knowledge of diverse partitions sheds more light into the topological organization and helps gain a more complete understanding of the underlying complex network.
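For reference, the modularity function being maximized is Q = (1/2m) Σ_ij [A_ij − k_i k_j / 2m] δ(c_i, c_j); a direct implementation on a toy graph shows how different partitions of the same network score differently:

```python
def modularity(adj, labels):
    """Newman-Girvan modularity of a partition of an undirected graph,
    given its adjacency matrix (list of lists) and one community label
    per node."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    two_m = sum(deg)                  # 2m: each undirected edge counted twice
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += adj[i][j] - deg[i] * deg[j] / two_m
    return q / two_m

# Two triangles joined by a single bridge edge.
adj = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
q_good = modularity(adj, [0, 0, 0, 1, 1, 1])   # the natural partition
q_bad = modularity(adj, [0, 1, 0, 1, 0, 1])    # a mixed partition scores lower
```

The perturbation scheme described above would rescore many such near-optimal partitions to expose the diverse alternatives, rather than stopping at the single maximizer.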
Clustering of High Throughput Gene Expression Data
Pirim, Harun; Ekşioğlu, Burak; Perkins, Andy; Yüceer, Çetin
2012-01-01
High throughput biological data need to be processed, analyzed, and interpreted to address problems in life sciences. Bioinformatics, computational biology, and systems biology deal with biological problems using computational methods. Clustering is one of the methods used to gain insight into biological processes, particularly at the genomics level. Clearly, clustering can be used in many areas of biological data analysis. However, this paper presents a review of the current clustering algorithms designed especially for analyzing gene expression data. It is also intended to introduce one of the main problems in bioinformatics - clustering gene expression data - to the operations research community. PMID:23144527
NASA Astrophysics Data System (ADS)
Song, Yunquan; Lin, Lu; Jian, Ling
2016-07-01
Single-index varying-coefficient model is an important mathematical modeling method to model nonlinear phenomena in science and engineering. In this paper, we develop a variable selection method for high-dimensional single-index varying-coefficient models using a shrinkage idea. The proposed procedure can simultaneously select significant nonparametric components and parametric components. Under defined regularity conditions, with appropriate selection of tuning parameters, the consistency of the variable selection procedure and the oracle property of the estimators are established. Moreover, due to the robustness of the check loss function to outliers in the finite samples, our proposed variable selection method is more robust than the ones based on the least squares criterion. Finally, the method is illustrated with numerical simulations.
Murphy, Thomas Brendan; Dean, Nema; Raftery, Adrian E
2010-03-01
Food authenticity studies are concerned with determining if food samples have been correctly labelled or not. Discriminant analysis methods are an integral part of the methodology for food authentication. Motivated by food authenticity applications, a model-based discriminant analysis method that includes variable selection is presented. The discriminant analysis model is fitted in a semi-supervised manner using both labeled and unlabeled data. The method is shown to give excellent classification performance on several high-dimensional multiclass food authenticity datasets with more variables than observations. The variables selected by the proposed method provide information about which variables are meaningful for classification purposes. A headlong search strategy for variable selection is shown to be efficient in terms of computation and achieves excellent classification performance. In applications to several food authenticity datasets, our proposed method outperformed default implementations of Random Forests, AdaBoost, transductive SVMs and Bayesian Multinomial Regression by substantial margins.
NASA Astrophysics Data System (ADS)
Miao, Yan-Gang; Xu, Zhen-Ming
2016-04-01
Considering non-Gaussian smeared matter distributions, we investigate the thermodynamic behaviors of the noncommutative high-dimensional Schwarzschild-Tangherlini anti-de Sitter black hole, and we obtain the condition for the existence of extreme black holes. We indicate that the Gaussian smeared matter distribution, which is a special case of non-Gaussian smeared matter distributions, is not applicable for the six- and higher-dimensional black holes due to the hoop conjecture. In particular, the phase transition is analyzed in detail. Moreover, we point out that the Maxwell equal area law holds for the noncommutative black hole whose Hawking temperature is within a specific range, but fails for one whose Hawking temperature is beyond this range.
Wawer, Mathias J; Li, Kejie; Gustafsdottir, Sigrun M; Ljosa, Vebjorn; Bodycombe, Nicole E; Marton, Melissa A; Sokolnicki, Katherine L; Bray, Mark-Anthony; Kemp, Melissa M; Winchester, Ellen; Taylor, Bradley; Grant, George B; Hon, C Suk-Yee; Duvall, Jeremy R; Wilson, J Anthony; Bittker, Joshua A; Dančík, Vlado; Narayan, Rajiv; Subramanian, Aravind; Winckler, Wendy; Golub, Todd R; Carpenter, Anne E; Shamji, Alykhan F; Schreiber, Stuart L; Clemons, Paul A
2014-07-29
High-throughput screening has become a mainstay of small-molecule probe and early drug discovery. The question of how to build and evolve efficient screening collections systematically for cell-based and biochemical screening is still unresolved. It is often assumed that chemical structure diversity leads to diverse biological performance of a library. Here, we confirm earlier results showing that this inference is not always valid and suggest instead using biological measurement diversity derived from multiplexed profiling in the construction of libraries with diverse assay performance patterns for cell-based screens. Rather than using results from tens or hundreds of completed assays, which is resource intensive and not easily extensible, we use high-dimensional image-based cell morphology and gene expression profiles. We piloted this approach using over 30,000 compounds. We show that small-molecule profiling can be used to select compound sets with high rates of activity and diverse biological performance.
Li, Ke; Liu, Yi; Wang, Quanxin; Wu, Yalei; Song, Shimin; Sun, Yi; Liu, Tengchong; Wang, Jun; Li, Yang; Du, Shaoyi
2015-01-01
This paper proposes a novel multi-label classification method for resolving spacecraft electrical characteristics problems, which involve much unlabeled test data, high-dimensional features, long computing times and slow identification rates. Firstly, both the fuzzy c-means (FCM) offline clustering and the principal component feature extraction algorithms are applied for the feature selection process. Secondly, the approximate weighted proximal support vector machine (WPSVM) online classification algorithm is used to reduce the feature dimension and further improve the rate of recognition of spacecraft electrical characteristics. Finally, the data capture contribution method using thresholds is proposed to guarantee the validity and consistency of the data selection. The experimental results indicate that the proposed method can obtain better data features of the spacecraft electrical characteristics, improve the accuracy of identification and shorten the computing time effectively. PMID:26544549
Active matter clusters at interfaces.
NASA Astrophysics Data System (ADS)
Copenhagen, Katherine; Gopinathan, Ajay
2016-03-01
Collective and directed motility or swarming is an emergent phenomenon displayed by many self-organized assemblies of active biological matter such as clusters of embryonic cells during tissue development, cancerous cells during tumor formation and metastasis, colonies of bacteria in a biofilm, or even flocks of birds and schools of fish at the macro-scale. Such clusters typically encounter very heterogeneous environments. What happens when a cluster encounters an interface between two different environments has implications for its function and fate. Here we study this problem by using a mathematical model of a cluster that treats it as a single cohesive unit that moves in two dimensions by exerting a force/torque per unit area whose magnitude depends on the nature of the local environment. We find that low-speed (overdamped) clusters encountering an interface with a moderate difference in properties can undergo refraction or even total internal reflection. For large speeds (underdamped), where inertia dominates, the clusters show more complex behaviors, crossing the interface multiple times and deviating from the predictable refraction and reflection of the low-velocity clusters. We then present an extreme limit of the model in the absence of rotational damping where clusters can become stuck spiraling along the interface or move in large circular trajectories after leaving the interface. Our results show the wide range of behaviors that occur when collectively moving active biological matter moves across interfaces, and these insights can be used to control motion by patterning environments.
HCsnip: An R Package for Semi-supervised Snipping of the Hierarchical Clustering Tree.
Obulkasim, Askar; van de Wiel, Mark A
2015-01-01
Hierarchical clustering (HC) is one of the most frequently used methods in computational biology in the analysis of high-dimensional genomics data. Given a data set, HC outputs a binary tree, the leaves of which are the data points and the internal nodes of which represent clusters of various sizes. Normally, a fixed-height cut on the HC tree is chosen, and each contiguous branch of data points below that height is considered a separate cluster. However, the fixed-height branch cut may not be ideal in situations where one expects a complicated tree structure with nested clusters. Furthermore, because related background information is not utilized in selecting the cutoff, induced clusters are often difficult to interpret. This paper describes a novel procedure that aims to automatically extract meaningful clusters from the HC tree in a semi-supervised way. The procedure is implemented in the R package HCsnip, available from Bioconductor. Rather than cutting the HC tree at a fixed height, HCsnip probes various ways of snipping, possibly at variable heights, to tease out hidden clusters ensconced deep down in the tree. The cluster extraction process utilizes, along with the data set from which the HC tree is derived, commonly available background information. Consequently, the extracted clusters are highly reproducible and robust against the various sources of variation that haunt high-dimensional genomics data. Since the clustering process is guided by the background information, clusters are easy to interpret. Unlike existing packages, no constraint is placed on the data type on which clustering is desired. In particular, the package accepts patient follow-up data for guiding the cluster extraction process. To our knowledge, HCsnip is the first package able to decompose the HC tree into clusters with piecewise snipping under the guidance of patient time-to-event information. Our implementation of the semi-supervised HC tree snipping framework is generic, and can
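The baseline that HCsnip improves on, the fixed-height branch cut, can be sketched in a few lines. For single-linkage HC, cutting the tree at height h is equivalent to taking the connected components of the graph that links every pair of points closer than h; the toy points and cut height below are illustrative, not from the paper.

```python
# Minimal sketch of a fixed-height cut on a single-linkage HC tree,
# implemented as connected components over a distance threshold (union-find).
def fixed_height_cut(points, h):
    """Cluster labels from cutting the single-linkage tree at height h."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    for i in range(n):
        for j in range(i + 1, n):
            if dist(points[i], points[j]) < h:
                parent[find(i)] = find(j)  # merge components below the cut

    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
labels = fixed_height_cut(pts, h=1.0)
print(labels)  # -> [0, 0, 1, 1, 2]: two tight pairs plus a singleton
```

Everything below the cut height falls into one rigid partition, which is exactly why nested or variable-height clusters motivate the snipping approach.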
de Lara-Castells, M P; Villarreal, P; Delgado-Barrio, G; Mitrushchenkov, A O
2009-11-21
An efficient full-configuration-interaction nuclear orbital treatment has been recently developed as a benchmark quantum-chemistry-like method to calculate ground and excited "solvent" energies and wave functions in small doped (3)He clusters (N < or = 4) [M. P. de Lara-Castells, G. Delgado-Barrio, P. Villarreal, and A. O. Mitrushchenkov, J. Chem. Phys. 125, 221101 (2006)]. Additional methodological and computational details of the implementation, which uses an iterative Jacobi-Davidson diagonalization algorithm to properly address the inherent "hard-core" He-He interaction problem, are described here. The convergence of total energies, average pair He-He interaction energies, and relevant one- and two-body properties upon increasing the angular part of the one-particle basis set (expanded in spherical harmonics) has been analyzed, considering Cl(2) as the dopant and a semiempirical model (T-shaped) He-Cl(2)(B) potential. Converged results are used to analyze global energetic and structural aspects as well as the configuration makeup of the wave functions associated with the ground and low-lying "solvent" excited states. Our study reveals that besides the fermionic nature of (3)He atoms, key roles in determining total binding energies and wave-function structures are played by the strong repulsive core of the He-He potential as well as its very weak attractive region, the most stable arrangement somehow departing from that of N He atoms equally spaced on an equatorial "ring" around the dopant. The present results for N = 4 fermions indicate the structural "pairing" of two (3)He atoms at opposite sides on a broad "belt" around the dopant, executing a sort of asymmetric umbrella motion. This pairing is a compromise between maximizing the (3)He-(3)He and the He-dopant attractions, and suppressing at the same time the "hard-core" repulsion. Although the He-He attractive interaction is rather weak, its contribution to the total energy is found to scale as a
Systolic architecture for hierarchical clustering
Ku, L.C.
1984-01-01
Several hierarchical clustering methods (including the single-linkage, complete-linkage, centroid, and absolute overlap methods) are reviewed. The absolute overlap clustering method is selected for the design of a systolic architecture, mainly because of its simplicity. Two versions of systolic architectures for the absolute overlap hierarchical clustering algorithm are proposed: a one-dimensional version that leads to the development of a two-dimensional version which fully takes advantage of the underlying data structure of the problem. The two-dimensional systolic architecture can achieve a time complexity of O(m + n), in comparison with the O(m^2 n) time complexity of a conventional computer implementation.
Fuzzy and hard clustering analysis for thyroid disease.
Azar, Ahmad Taher; El-Said, Shaimaa Ahmed; Hassanien, Aboul Ella
2013-07-01
Thyroid hormones produced by the thyroid gland help regulate the body's metabolism. A variety of methods have been proposed in the literature for thyroid disease classification. As far as we know, clustering techniques have not been used on thyroid disease data sets so far. This paper proposes a comparison between hard and fuzzy clustering algorithms for a thyroid disease data set in order to find the optimal number of clusters. Different scalar validity measures are used in comparing the performances of the proposed clustering systems. To demonstrate the performance of each algorithm, the feature values that represent thyroid disease are used as input for the system. Several runs are carried out and recorded with a different number of clusters specified for each run (between 2 and 11), so as to establish the optimum number of clusters. To find the optimal number of clusters, the so-called elbow criterion is applied. The experimental results revealed that for all algorithms, the elbow was located at c=3. The clustering results for all algorithms are then visualized by the Sammon mapping method to find a low-dimensional (normally 2D or 3D) representation of a set of points distributed in a high-dimensional pattern space. At the end of this study, some recommendations are formulated to improve determination of the actual number of clusters present in the data set. PMID:23357404
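The elbow criterion the study applies can be sketched as follows: compute a within-cluster distortion for each candidate number of clusters c and pick the c where the curve bends most sharply, e.g. via the largest second difference. The distortion values below are made up for illustration; the paper's actual curves come from its thyroid data runs.

```python
# Illustrative elbow detection: choose the c with the sharpest bend
# (largest discrete second difference) in the distortion curve.
def elbow(cs, distortions):
    """Return the candidate c at the sharpest bend of the curve."""
    bends = [distortions[i - 1] - 2 * distortions[i] + distortions[i + 1]
             for i in range(1, len(cs) - 1)]
    return cs[1 + bends.index(max(bends))]

cs = list(range(2, 12))                # c = 2 .. 11, as in the study
distortions = [90, 40, 30, 26, 23, 21, 20, 19, 18.5, 18]  # fabricated example
print(elbow(cs, distortions))  # -> 3: big drop up to c=3, little gain after
```

The second-difference rule is only one way to formalize "the elbow"; visual inspection, as in the paper, is equally common.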
A Flocking Based algorithm for Document Clustering Analysis
Cui, Xiaohui; Gao, Jinzhu; Potok, Thomas E
2006-01-01
Social animals or insects in nature often exhibit a form of emergent collective behavior known as flocking. In this paper, we present a novel flocking-based approach for document clustering analysis. Our flocking clustering algorithm uses stochastic and heuristic principles discovered from observing bird flocks and fish schools. Unlike other partition clustering algorithms such as K-means, the flocking-based algorithm does not require initial partition seeds. The algorithm generates a clustering of a given set of data through the embedding of the high-dimensional data items on a two-dimensional grid for easy clustering result retrieval and visualization. Inspired by the self-organized behavior of bird flocks, we represent each document object with a flock boid. The simple local rules followed by each flock boid result in the entire document flock generating complex global behaviors, which eventually result in a clustering of the documents. We evaluate the efficiency of our algorithm with both a synthetic dataset and a real document collection that includes 100 news articles collected from the Internet. Our results show that the flocking clustering algorithm achieves better performance compared to the K-means and the Ant clustering algorithms for real document clustering.
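The "simple local rules" underlying boid flocking are classically cohesion (steer toward nearby boids), alignment (match their velocity), and separation (avoid crowding). A minimal deterministic sketch of one update step is below; in the paper's setting each boid would carry a document and similarity would modulate the rules, whereas here the weights and radii are purely illustrative.

```python
# One synchronous update of the three classic boid rules in 2-D.
# Boids are (px, py, vx, vy) tuples; all parameters are illustrative.
def step(boids, radius=3.0, too_close=0.5, w_coh=0.05, w_ali=0.3, w_sep=0.2):
    out = []
    for i, (px, py, vx, vy) in enumerate(boids):
        nbrs = [b for j, b in enumerate(boids) if j != i
                and (b[0] - px) ** 2 + (b[1] - py) ** 2 < radius ** 2]
        if nbrs:
            cx = sum(b[0] for b in nbrs) / len(nbrs)  # cohesion target
            cy = sum(b[1] for b in nbrs) / len(nbrs)
            ax = sum(b[2] for b in nbrs) / len(nbrs)  # alignment target
            ay = sum(b[3] for b in nbrs) / len(nbrs)
            vx += w_coh * (cx - px) + w_ali * (ax - vx)
            vy += w_coh * (cy - py) + w_ali * (ay - vy)
            for b in nbrs:                            # separation: push apart
                if (b[0] - px) ** 2 + (b[1] - py) ** 2 < too_close ** 2:
                    vx += w_sep * (px - b[0])
                    vy += w_sep * (py - b[1])
        out.append((px + vx, py + vy, vx, vy))
    return out

boids = [(0.0, 0.0, 0.0, 0.0), (2.0, 0.0, 0.0, 0.0)]
boids = step(boids)
print(boids[0][0], boids[1][0])  # the two boids drift toward each other
```

Iterating this step makes similar boids (documents) aggregate on the grid, which is the mechanism the clustering builds on.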
2013-01-01
determine the path between two competing models, we provide evidence that a single well-chosen sigmoid shape value requires less computational efforts in order to approximate the true value of the (log) Bayes factor compared to the original approach. We show that the (log) Bayes factors calculated using path sampling and stepping-stone sampling differ drastically from those estimated using either of the harmonic mean estimators, supporting earlier claims that the latter systematically overestimate the performance of high-dimensional models, which we show can lead to erroneous conclusions. Based on our results, we argue that highly accurate estimation of differences in model fit for high-dimensional models requires much more computational effort than suggested in recent studies on marginal likelihood estimation. PMID:23497171
NASA Technical Reports Server (NTRS)
Socolovsky, Eduardo A.; Bushnell, Dennis M. (Technical Monitor)
2002-01-01
The cosine or correlation measures of similarity used to cluster high dimensional data are interpreted as projections, and the orthogonal components are used to define a complementary dissimilarity measure to form a similarity-dissimilarity measure pair. Using a geometrical approach, a number of properties of this pair are established. This approach is also extended to general inner-product spaces of any dimension. These properties include the triangle inequality for the defined dissimilarity measure, error estimates for the triangle inequality and bounds on both measures that can be obtained with a few floating-point operations from previously computed values of the measures. The bounds and error estimates for the similarity and dissimilarity measures can be used to reduce the computational complexity of clustering algorithms and enhance their scalability, and the triangle inequality allows the design of clustering algorithms for high dimensional distributed data.
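One natural reading of the projection view is this: for unit-normalized vectors, the cosine similarity s is the length of the projection of one vector on the other, and the length of the orthogonal component, sqrt(1 - s^2) (the sine of the angle), serves as the complementary dissimilarity. Whether this matches the paper's exact definition is an assumption; the sketch below only illustrates the geometry and checks the triangle inequality on one example.

```python
# Cosine similarity as a projection, and the orthogonal-component length
# as a candidate complementary dissimilarity (an assumption, not the
# paper's verified definition).
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def orth_dissim(a, b):
    # For unit vectors, sqrt(1 - s^2) is the norm of the component of one
    # vector orthogonal to the other: sin of the angle between them.
    s = cos_sim(a, b)
    return math.sqrt(max(0.0, 1.0 - s * s))

u, v, w = (1, 0, 0), (1, 1, 0), (0, 1, 0)
d_uw, d_uv, d_vw = orth_dissim(u, w), orth_dissim(u, v), orth_dissim(v, w)
print(d_uw, d_uv + d_vw)  # triangle inequality holds on this example
```

Such an inequality is what lets a clustering algorithm skip distance computations: if d(u, v) and d(v, w) are known, d(u, w) is bounded without touching u and w again.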
Two generalizations of Kohonen clustering
NASA Technical Reports Server (NTRS)
Bezdek, James C.; Pal, Nikhil R.; Tsao, Eric C. K.
1993-01-01
The relationship between the sequential hard c-means (SHCM), learning vector quantization (LVQ), and fuzzy c-means (FCM) clustering algorithms is discussed. LVQ and SHCM suffer from several major problems. For example, they depend heavily on initialization. If the initial values of the cluster centers are outside the convex hull of the input data, such algorithms, even if they terminate, may not produce meaningful results in terms of prototypes for cluster representation. This is due in part to the fact that they update only the winning prototype for every input vector. The impact and interaction of these two families with Kohonen's self-organizing feature mapping (SOFM), which is not a clustering method, but which often lends ideas to clustering algorithms, is discussed. Then two generalizations of LVQ that are explicitly designed as clustering algorithms are presented; these algorithms are referred to as generalized LVQ (GLVQ) and fuzzy LVQ (FLVQ). Learning rules are derived to optimize an objective function whose goal is to produce 'good clusters'. GLVQ/FLVQ (may) update every node in the clustering net for each input vector. Neither GLVQ nor FLVQ depends upon a choice for the update neighborhood or learning rate distribution - these are taken care of automatically. Segmentation of a gray tone image is used as a typical application of these algorithms to illustrate the performance of GLVQ/FLVQ.
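The winner-only update that the abstract criticizes in SHCM/LVQ (and that GLVQ/FLVQ generalize by updating every prototype) can be sketched in a few lines; the learning rate, prototypes, and input below are illustrative, not from the paper.

```python
# Winner-only prototype update, the rule shared by SHCM and unsupervised
# LVQ: only the nearest prototype moves toward the input vector.
def winner_only_step(prototypes, x, lr=0.5):
    """Move the closest prototype a fraction lr toward input x; return its index."""
    d = [sum((pi - xi) ** 2 for pi, xi in zip(p, x)) for p in prototypes]
    win = d.index(min(d))                      # the winning prototype
    prototypes[win] = tuple(pi + lr * (xi - pi)
                            for pi, xi in zip(prototypes[win], x))
    return win

protos = [(0.0, 0.0), (10.0, 10.0)]
win = winner_only_step(protos, (2.0, 0.0))
print(win, protos)  # -> 0 [(1.0, 0.0), (10.0, 10.0)]: only the winner moved
```

If the initial prototypes lie outside the convex hull of the data, this rule can leave losing prototypes stranded forever, which is precisely the failure mode GLVQ/FLVQ address by spreading each update over all nodes.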
Active matter clusters at interfaces
NASA Astrophysics Data System (ADS)
Copenhagen, Katherine; Gopinathan, Ajay
Collective and directed motility or swarming is an emergent phenomenon displayed by many self-organized assemblies of active biological matter such as clusters of embryonic cells during tissue development and flocks of birds. Such clusters typically encounter very heterogeneous environments. What happens when a cluster encounters an interface between two different environments has implications for its function and fate. Here we study this problem by using a mathematical model of a cluster that treats it as a single cohesive unit whose movement depends on the nature of the local environment. We find that low-speed clusters, which exert forces but no active torques, can undergo refraction or even total internal reflection when they encounter an interface with a moderate difference in properties. At large speeds, and for clusters with active torques, the behavior is more complex: clusters cross the interface multiple times, become trapped at the interface, and deviate from the predictable refraction and reflection of the low-velocity clusters. Our results show the wide range of behaviors that occur when collectively moving active biological matter moves across interfaces, and these insights can be used to control motion by patterning environments.
Multitask spectral clustering by exploring intertask correlation.
Yang, Yang; Ma, Zhigang; Yang, Yi; Nie, Feiping; Shen, Heng Tao
2015-05-01
Clustering, as one of the most classical research problems in pattern recognition and data mining, has been widely explored and applied to various applications. Due to the rapid evolution of data on the Web, new challenges have been posed to traditional clustering techniques: 1) correlations among related clustering tasks and/or within an individual task are not well captured; 2) the problem of clustering out-of-sample data is seldom considered; and 3) the discriminative property of the cluster label matrix is not well explored. In this paper, we propose a novel clustering model, namely multitask spectral clustering (MTSC), to cope with the above challenges. Specifically, two types of correlations are considered: 1) intertask clustering correlation, which refers to the relations among different clustering tasks; and 2) intratask learning correlation, which enables the processes of learning cluster labels and learning the mapping function to reinforce each other. We incorporate a novel l2,p-norm regularizer to control the coherence of all the tasks based on the assumption that related tasks should share a common low-dimensional representation. Moreover, for each individual task, an explicit mapping function is simultaneously learnt for predicting cluster labels by mapping features to the cluster label matrix. Meanwhile, we show that the learning process can naturally incorporate discriminative information to further improve clustering performance. We explore and discuss the relationships between our proposed model and several representative clustering techniques, including spectral clustering, k-means and discriminative k-means. Extensive experiments on various real-world datasets illustrate the advantage of the proposed MTSC model compared to state-of-the-art clustering approaches. PMID:25252288
Du, Jing; Wang, Jian
2015-11-01
Bessel beams carrying orbital angular momentum (OAM) with helical phase fronts exp(ilφ)(l=0;±1;±2;…), where φ is the azimuthal angle and l corresponds to the topological number, are orthogonal with each other. This feature of Bessel beams provides a new dimension to code/decode data information on the OAM state of light, and the theoretical infinity of topological number enables possible high-dimensional structured light coding/decoding for free-space optical communications. Moreover, Bessel beams are nondiffracting beams having the ability to recover by themselves in the face of obstructions, which is important for free-space optical communications relying on line-of-sight operation. By utilizing the OAM and nondiffracting characteristics of Bessel beams, we experimentally demonstrate 12 m distance obstruction-free optical m-ary coding/decoding using visible Bessel beams in a free-space optical communication system. We also study the bit error rate (BER) performance of hexadecimal and 32-ary coding/decoding based on Bessel beams with different topological numbers. After receiving 500 symbols at the receiver side, a zero BER of hexadecimal coding/decoding is observed when the obstruction is placed along the propagation path of light.
NASA Astrophysics Data System (ADS)
Cavaglieri, Daniele; Bewley, Thomas
2015-04-01
Implicit/explicit (IMEX) Runge-Kutta (RK) schemes are effective for time-marching ODE systems with both stiff and nonstiff terms on the RHS; such schemes implement an (often A-stable or better) implicit RK scheme for the stiff part of the ODE, which is often linear, and, simultaneously, a (more convenient) explicit RK scheme for the nonstiff part of the ODE, which is often nonlinear. Low-storage RK schemes are especially effective for time-marching high-dimensional ODE discretizations of PDE systems on modern (cache-based) computational hardware, in which memory management is often the most significant computational bottleneck. In this paper, we develop and characterize eight new low-storage implicit/explicit RK schemes which have higher accuracy and better stability properties than the only low-storage implicit/explicit RK scheme available previously, the venerable second-order Crank-Nicolson/Runge-Kutta-Wray (CN/RKW3) algorithm that has dominated the DNS/LES literature for the last 25 years, while requiring similar storage (two, three, or four registers of length N) and comparable floating-point operations per timestep.
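The stiff/nonstiff splitting these schemes build on is easiest to see in its first-order form: a single IMEX Euler step treats the stiff linear term implicitly and the nonstiff term explicitly. The paper's schemes are higher-order, low-storage RK generalizations of this idea; the ODE, lambda, and step size below are illustrative, not from the paper.

```python
# First-order IMEX Euler for u' = lam*u + f(u): the stiff linear term is
# implicit (unconditionally stable for lam < 0), the nonstiff f is explicit.
def imex_euler(u, dt, lam, f, steps):
    for _ in range(steps):
        # u_{n+1} = u_n + dt*(lam*u_{n+1} + f(u_n))  ->  solve for u_{n+1}
        u = (u + dt * f(u)) / (1.0 - dt * lam)
    return u

# Stiff decay plus a mild nonlinear source term (illustrative values).
u = imex_euler(1.0, dt=0.1, lam=-50.0, f=lambda u: 0.1 * u * u, steps=100)
print(u)  # decays toward 0, stable even though dt*|lam| = 5
```

A fully explicit Euler step with the same dt would be violently unstable (the stability limit is dt*|lam| < 2), which is why the implicit treatment of the stiff part is worth its extra cost; the low-storage aspect of the paper's schemes concerns how few length-N registers the multi-stage versions of this update require.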
Cuny, Jérôme; Xie, Yu; Pickard, Chris J; Hassanali, Ali A
2016-02-01
Nuclear magnetic resonance (NMR) spectroscopy is one of the most powerful experimental tools to probe the local atomic order of a wide range of solid-state compounds. However, due to the complexity of the related spectra, in particular for amorphous materials, their interpretation in terms of structural information is often challenging. These difficulties can be overcome by combining molecular dynamics simulations to generate realistic structural models with an ab initio evaluation of the corresponding chemical shift and quadrupolar coupling tensors. However, due to computational constraints, this approach is limited to relatively small system sizes which, for amorphous materials, prevents an adequate statistical sampling of the distribution of the local environments that is required to quantitatively describe the system. In this work, we present an approach to efficiently and accurately predict the NMR parameters of very large systems. This is achieved by using a high-dimensional neural-network representation of NMR parameters that are calculated using an ab initio formalism. To illustrate the potential of this approach, we applied this neural-network NMR (NN-NMR) method to the (17)O and (29)Si quadrupolar coupling and chemical shift parameters of various crystalline silica polymorphs and silica glasses. This approach is, in principle, general and has the potential to be applied to predict the NMR properties of various materials. PMID:26730889
Awale, Mahendra; Reymond, Jean-Louis
2015-08-24
An Internet portal accessible at www.gdb.unibe.ch has been set up to automatically generate color-coded similarity maps of the ChEMBL database in relation to up to two sets of active compounds taken from the enhanced Directory of Useful Decoys (eDUD), a random set of molecules, or up to two sets of user-defined reference molecules. These maps visualize the relationships between the selected compounds and ChEMBL in six different high-dimensional chemical spaces, namely MQN (42-D molecular quantum numbers), SMIfp (34-D SMILES fingerprint), APfp (20-D shape fingerprint), Xfp (55-D pharmacophore fingerprint), Sfp (1024-bit substructure fingerprint), and ECfp4 (1024-bit extended connectivity fingerprint). The maps are supplied in the form of Java-based desktop applications called "similarity mapplets" allowing interactive content browsing and linked to a "Multifingerprint Browser for ChEMBL" (also accessible directly at www.gdb.unibe.ch) to perform nearest neighbor searches. One can obtain six similarity mapplets of ChEMBL relative to random reference compounds, 606 similarity mapplets relative to single eDUD active sets, 30,300 similarity mapplets relative to pairs of eDUD active sets, and any number of similarity mapplets relative to user-defined reference sets to help visualize the structural diversity of compound series in drug optimization projects and their relationship to other known bioactive compounds. PMID:26207526
Chen, Shuo; Bowman, F DuBois
2011-12-01
Recent technological advances have made it possible for many studies to collect high dimensional data (HDD) longitudinally, for example images collected during different scanning sessions. Such studies may yield temporal changes of selected features that, when incorporated with machine learning methods, are able to predict disease status or responses to a therapeutic treatment. Support vector machine (SVM) techniques are robust and effective tools well-suited for the classification and prediction of HDD. However, current SVM methods for HDD analysis typically consider cross-sectional data collected during one time period or session (e.g. baseline). We propose a novel support vector classifier (SVC) for longitudinal HDD that allows simultaneous estimation of the SVM separating hyperplane parameters and temporal trend parameters, which determine the optimal means to combine the longitudinal data for classification and prediction. Our approach is based on an augmented reproducing kernel function and uses quadratic programming for optimization. We demonstrate the use and potential advantages of our proposed methodology using a simulation study and a data example from the Alzheimer's disease Neuroimaging Initiative. The results indicate that our proposed method leverages the additional longitudinal information to achieve higher accuracy than methods using only cross-sectional data and methods that combine longitudinal data by naively expanding the feature space.
Gravitational clustering: an overview
NASA Astrophysics Data System (ADS)
Labini, Francesco Sylos
2008-01-01
We discuss the differences and analogies of gravitational clustering in finite and infinite systems. The process of collective, or violent, relaxation leading to the formation of quasi-stationary states is one of the distinguishing features of the dynamics of self-gravitating systems. This occurs, under different conditions, both in finite and in infinite systems, the latter embedded in a static or an expanding background. We then discuss, by considering some simple and paradigmatic examples, the problems related to the definition of a mean-field approach to gravitational clustering, focusing on the role of discrete fluctuations. The effect of these fluctuations is a basic issue to be clarified in order to establish the range of scales and times over which a collision-less approximation may describe the evolution of a self-gravitating system, and for the theoretical modeling of the non-linear phase.
2010-01-01
Introduction The revised International Headache Society (IHS) criteria for cluster headache are: attacks of severe or very severe, strictly unilateral pain, which is orbital, supraorbital, or temporal, lasting 15 to 180 minutes and occurring from once every other day to eight times daily. Methods and outcomes We conducted a systematic review and aimed to answer the following clinical questions: What are the effects of interventions to abort cluster headache? What are the effects of interventions to prevent cluster headache? We searched: Medline, Embase, The Cochrane Library, and other important databases up to June 2009 (Clinical Evidence reviews are updated periodically; please check our website for the most up-to-date version of this review). We included harms alerts from relevant organisations, such as the US Food and Drug Administration (FDA) and the UK Medicines and Healthcare products Regulatory Agency (MHRA). Results We found 23 systematic reviews, RCTs, or observational studies that met our inclusion criteria. We performed a GRADE evaluation of the quality of evidence for interventions. Conclusions In this systematic review, we present information relating to the effectiveness and safety of the following interventions: baclofen (oral); botulinum toxin (intramuscular); capsaicin (intranasal); chlorpromazine; civamide (intranasal); clonidine (transdermal); corticosteroids; ergotamine and dihydroergotamine (oral or intranasal); gabapentin (oral); greater occipital nerve injections (betamethasone plus xylocaine); high-dose and high-flow-rate oxygen; hyperbaric oxygen; leuprolide; lidocaine (intranasal); lithium (oral); melatonin; methysergide (oral); octreotide (subcutaneous); pizotifen (oral); sodium valproate (oral); sumatriptan (oral, subcutaneous, and intranasal); topiramate (oral); tricyclic antidepressants (TCAs); verapamil; and zolmitriptan (oral and intranasal). PMID:21718584
Clustering PPI data by combining FA and SHC method
2015-01-01
Clustering is one of the main methods to identify functional modules from protein-protein interaction (PPI) data. Nevertheless, traditional clustering methods may not be effective for clustering PPI data. In this paper, we propose a novel method for clustering PPI data by combining the firefly algorithm (FA) and the synchronization-based hierarchical clustering (SHC) algorithm. Firstly, the PPI data are preprocessed via spectral clustering (SC), which transforms the high-dimensional similarity matrix into a low-dimensional matrix. Then the SHC algorithm is used to perform clustering. In the SHC algorithm, hierarchical clustering is achieved by continuously enlarging the neighborhood radius of synchronized objects, but the hierarchical search makes it difficult to find the optimal neighborhood radius of synchronization and is not efficient. So we adopt the firefly algorithm to determine the optimal threshold of the neighborhood radius of synchronization automatically. The proposed algorithm is tested on the MIPS PPI dataset. The results show that our proposed algorithm is better than traditional algorithms in precision, recall and f-measure value. PMID:25707632
Progeny Clustering: A Method to Identify Biological Phenotypes.
Hu, Chenyue W; Kornblau, Steven M; Slater, John H; Qutub, Amina A
2015-01-01
Estimating the optimal number of clusters is a major challenge in applying cluster analysis to any type of dataset, especially to biomedical datasets, which are high-dimensional and complex. Here, we introduce an improved method, Progeny Clustering, which is stability-based and exceptionally efficient in computing, to find the ideal number of clusters. The algorithm employs a novel Progeny Sampling method to reconstruct cluster identity, a co-occurrence probability matrix to assess the clustering stability, and a set of reference datasets to overcome inherent biases in the algorithm and data space. Our method was shown successful and robust when applied to two synthetic datasets (datasets of two-dimensions and ten-dimensions containing eight dimensions of pure noise), two standard biological datasets (the Iris dataset and Rat CNS dataset) and two biological datasets (a cell phenotype dataset and an acute myeloid leukemia (AML) reverse phase protein array (RPPA) dataset). Progeny Clustering outperformed some popular clustering evaluation methods in the ten-dimensional synthetic dataset as well as in the cell phenotype dataset, and it was the only method that successfully discovered clinically meaningful patient groupings in the AML RPPA dataset. PMID:26267476
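The co-occurrence probability matrix at the heart of the stability assessment can be sketched directly: given cluster labelings from repeated runs, entry (i, j) is the fraction of runs in which points i and j land in the same cluster. The Progeny Sampling step that generates those runs is omitted here, and the labelings below are hard-coded for illustration.

```python
# Co-occurrence probability matrix for clustering stability: M[i][j] is the
# fraction of runs in which points i and j share a cluster. (The runs here
# are fabricated; Progeny Sampling would generate them in the real method.)
def cooccurrence(labelings):
    n = len(labelings[0])
    M = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    M[i][j] += 1.0 / len(labelings)
    return M

runs = [[0, 0, 1, 1],   # three repeated clusterings of four points
        [0, 0, 1, 1],
        [0, 1, 1, 1]]
M = cooccurrence(runs)
print(M[0][1], M[2][3])  # points 2 and 3 always co-cluster; 0 and 1 in 2/3 runs
```

A stable choice of cluster number drives the entries of M toward 0 or 1; intermediate values signal points whose assignment flips between runs.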
Romero, R; Espinoza, J; Gotsch, F; Kusanovic, J P; Friel, L A; Erez, O; Mazaki-Tovi, S; Than, N G; Hassan, S; Tromp, G
2006-12-01
High-dimensional biology (HDB) refers to the simultaneous study of the genetic variants (DNA variation), transcription (messenger RNA [mRNA]), peptides and proteins, and metabolites of an organ, tissue, or an organism in health and disease. The fundamental premise is that the evolutionary complexity of biological systems renders them difficult to comprehensively understand using only a reductionist approach. Such complexity can become tractable with the use of "omics" research. This term refers to the study of entities in aggregate. The current nomenclature of "omics" sciences includes genomics for DNA variants, transcriptomics for mRNA, proteomics for proteins, and metabolomics for intermediate products of metabolism. Another discipline relevant to medicine is pharmacogenomics. The two major advances that have made HDB possible are technological breakthroughs that allow simultaneous examination of thousands of genes, transcripts, and proteins, etc., with high-throughput techniques and analytical tools to extract information. Hypothesis-driven research and discovery-driven research (through "omic" methodologies) are complementary and synergistic. Here we review data which have been derived from: 1) genomics to examine predisposing factors for preterm birth; 2) transcriptomics to determine changes in mRNA in reproductive tissues associated with preterm labour and preterm prelabour rupture of membranes; 3) proteomics to identify differentially expressed proteins in amniotic fluid of women with preterm labour; and 4) metabolomics to identify the metabolic footprints of women with preterm labour likely to deliver preterm and those who will deliver at term. The complementary nature of discovery science and HDB is emphasised.
Clustering of financial time series
NASA Astrophysics Data System (ADS)
D'Urso, Pierpaolo; Cappelli, Carmela; Di Lallo, Dario; Massari, Riccardo
2013-05-01
This paper addresses the topic of classifying financial time series in a fuzzy framework, proposing two fuzzy clustering models, both based on GARCH models. In general, clustering financial time series, due to their peculiar features, requires the definition of suitable distance measures. To this end, the first fuzzy clustering model exploits the autoregressive representation of GARCH models and employs, in the framework of a partitioning around medoids algorithm, the classical autoregressive metric. The second fuzzy clustering model, also based on the partitioning around medoids algorithm, uses the Caiado distance, a Mahalanobis-like distance based on estimated GARCH parameters and covariances that takes into account the information about the volatility structure of the time series. In order to illustrate the merits of the proposed fuzzy approaches, an application to the problem of classifying 29 time series of Euro exchange rates against international currencies is presented and discussed, also comparing the fuzzy models with their crisp versions.
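The partitioning-around-medoids step that both fuzzy models build on can be sketched in its plain (crisp) form, which works from any precomputed dissimilarity matrix. The toy absolute-difference distance below merely stands in for the AR metric or the Caiado distance computed from fitted GARCH parameters, which are not implemented here:

```python
import numpy as np

def pam(D, k, max_iter=100, seed=0):
    """Plain partitioning around medoids on a precomputed distance matrix D.
    Any dissimilarity can be plugged in, e.g. a distance between fitted
    GARCH parameter vectors (not implemented in this sketch)."""
    rng = np.random.default_rng(seed)
    n = len(D)
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(max_iter):
        labels = np.argmin(D[:, medoids], axis=1)  # assign to nearest medoid
        new = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):
                # the new medoid minimizes total within-cluster distance
                new[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return labels, medoids

# toy "series-level parameters": two clearly separated groups
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D = np.abs(x[:, None] - x[None, :])
labels, medoids = pam(D, k=2)
```

A fuzzy variant would replace the hard `argmin` assignment with membership degrees, but the medoid update logic is the same.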
Cluster randomization and political philosophy.
Chwang, Eric
2012-11-01
In this paper, I will argue that, while the ethical issues raised by cluster randomization can be challenging, they are not new. My thesis divides neatly into two parts. In the first, easier part I argue that many of the ethical challenges posed by cluster randomized human subjects research are clearly present in other types of human subjects research, and so are not novel. In the second, more difficult part I discuss the thorniest ethical challenge for cluster randomized research--cases where consent is genuinely impractical to obtain. I argue that once again these cases require no new analytic insight; instead, we should look to political philosophy for guidance. In other words, the most serious ethical problem that arises in cluster randomized research also arises in political philosophy.
NASA Astrophysics Data System (ADS)
Horiuchi, Hisashi; Ikeda, Kiyomi
The following sections are included:
* INTRODUCTION
* CLUSTER STRUCTURE OF NUCLEI
* Typical Clustering States
* Molecule-like Structure of Nuclei and the Threshold-energy Rule
* Microscopic Cluster Model
* Characteristic Points of the Structure Study by the Microscopic Cluster Model
* Several Subjects Related to the Study of the Cluster Structure
* Connection with the Neighbouring Fields in Nuclear Physics
* MANY-CENTER MODEL AND ROTATIONAL STATES
* Brink Model
* Relation of the Brink Model Wave Function with the Shell Model Wave Function
* Molecular Orbital Method
* Projection of Parity and Angular Momentum
* DESCRIPTION OF THE INTER-CLUSTER RELATIVE MOTION BY THE GENERATOR COORDINATE METHOD
* Griffin-Hill-Wheeler Equation
* GCM Kernel of Two-spinless-cluster System
* DESCRIPTION OF THE INTER-CLUSTER RELATIVE MOTION BY THE RESONATING GROUP METHOD
* Formulation
* Equivalence of RGM and GCM
* Calculation of RGM Kernels by using GCM Kernels
* Calculation of Direct Potential
* IMPOSITION OF THE SCATTERING BOUNDARY CONDITION
* Brief Survey of the Calculational Methods of the Scattering Matrix in RGM and GCM
* Wave Functions in the Outside Region and Matrix Elements in the Interaction Region
* R-matrix Type Method
* Variational Method
* CLUSTER MODEL SPACE - RGM NORM KERNEL AND PAULI-FORBIDDEN STATES
* Orthonormal Basis Wave Functions of the System and the Eigenvalue Problem of the Norm Kernel
* Explicit Solution of the Eigenvalue Problem of the RGM Norm Kernel
* Relation between Cluster Model States with Shell Model States
* Almost Forbidden States
* ORTHOGONALITY CONDITION MODEL
* Inner Oscillation of the Inter-cluster Relative Wave Function
* Formulation of the OCM
* FEW EXAMPLES OF THE MICROSCOPIC CLUSTER MODEL STUDY
* α+16O Model for 20Ne
* α+12C Model for 16O
* 3α Model for 12C
* ACKNOWLEDGEMENTS
* APPENDIX I SEPARATION OF THE CENTER-OF-MASS COORDINATE IN THE CASE OF THE HARMONIC OSCILLATOR SHELL MODEL
* APPENDIX II ANTISYMMETRIZER, LAPLACE EXPANSION OF
SAIL: Summation-bAsed Incremental Learning for Information-Theoretic Text Clustering.
Cao, Jie; Wu, Zhiang; Wu, Junjie; Xiong, Hui
2013-04-01
Information-theoretic clustering aims to exploit information-theoretic measures as the clustering criteria. A common practice on this topic is the so-called Info-Kmeans, which performs K-means clustering with KL-divergence as the proximity function. While expert efforts on Info-Kmeans have shown promising results, a remaining challenge is to deal with high-dimensional sparse data such as text corpora. Indeed, it is possible that the centroids contain many zero-value features for high-dimensional text vectors, which leads to infinite KL-divergence values and creates a dilemma in assigning objects to centroids during the iteration process of Info-Kmeans. To meet this challenge, in this paper, we propose a Summation-bAsed Incremental Learning (SAIL) algorithm for Info-Kmeans clustering. Specifically, by using an equivalent objective function, SAIL replaces the computation of KL-divergence by the incremental computation of Shannon entropy. This can avoid the zero-feature dilemma caused by the use of KL-divergence. To improve the clustering quality, we further introduce the variable neighborhood search scheme and propose the V-SAIL algorithm, which is then accelerated by a multithreaded scheme in PV-SAIL. Our experimental results on various real-world text collections have shown that, with SAIL as a booster, the clustering performance of Info-Kmeans can be significantly improved. Also, V-SAIL and PV-SAIL indeed help improve the clustering quality at a lower cost of computation.
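The zero-feature dilemma described above is easy to reproduce. The minimal sketch below shows only the failure mode that SAIL is designed to avoid (it is not the SAIL algorithm itself, whose entropy-based reformulation is in the paper):

```python
import numpy as np

def kl(p, q):
    """KL-divergence D(p||q); it becomes infinite as soon as q has a
    zero-valued feature where p does not -- the assignment dilemma for
    Info-Kmeans on sparse text vectors."""
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, p * np.log(p / q), 0.0)
    return float(terms.sum())

doc = np.array([0.5, 0.5, 0.0])       # a document's term distribution
centroid = np.array([1.0, 0.0, 0.0])  # sparse centroid: feature 1 is zero
d = kl(doc, centroid)                 # infinite distance to this centroid
```

Because every sparse centroid tends to have some zero where some document does not, a naive Info-Kmeans iteration sees infinite proximities everywhere; SAIL's summation-based entropy computation sidesteps the division by zero entirely.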
Estimating the number of clusters via system evolution for cluster analysis of gene expression data.
Wang, Kaijun; Zheng, Jie; Zhang, Junying; Dong, Jiyang
2009-09-01
The estimation of the number of clusters (NC) is one of the crucial problems in the cluster analysis of gene expression data. Most available approaches give their answers without intuitive information about the degree of separation between clusters. However, this information is useful for understanding cluster structures. To provide this information, we propose the system evolution (SE) method to estimate NC, based on the partitioning around medoids (PAM) clustering algorithm. SE analyzes the cluster structures of a dataset from the viewpoint of a pseudothermodynamic system. The system goes to its stable equilibrium state, at which the optimal NC is found, via its partitioning process and merging process. The experimental results on simulated and real gene expression data demonstrate that SE works well both on data with well-separated clusters and on data with slightly overlapping clusters. PMID:19527960
Feature Clustering for Accelerating Parallel Coordinate Descent
Scherrer, Chad; Tewari, Ambuj; Halappanavar, Mahantesh; Haglin, David J.
2012-12-06
We demonstrate an approach for accelerating calculation of the regularization path for L1 sparse logistic regression problems. We show the benefit of feature clustering as a preconditioning step for parallel block-greedy coordinate descent algorithms.
Learning regularized LDA by clustering.
Pang, Yanwei; Wang, Shuang; Yuan, Yuan
2014-12-01
As a supervised dimensionality reduction technique, linear discriminant analysis has a serious overfitting problem when the number of training samples per class is small. The main reason is that the between- and within-class scatter matrices computed from the limited number of training samples deviate greatly from the underlying ones. To overcome the problem without increasing the number of training samples, we propose making use of the structure of the given training data to simultaneously regularize the between- and within-class scatter matrices by between- and within-cluster scatter matrices, respectively. The within- and between-cluster matrices are computed from unsupervised clustered data. The within-cluster scatter matrix contributes to encoding the possible intraclass variations, and the between-cluster scatter matrix is useful for separating classes. The contributions are inversely proportional to the number of training samples per class, so the advantages of the proposed method become more remarkable as the number of training samples per class decreases. Experimental results on the AR and FERET face databases demonstrate the effectiveness of the proposed method.
The Second Parameter Problem(s)
NASA Astrophysics Data System (ADS)
Dotter, Aaron
The Second Parameter (2ndP) Problem recognizes the remarkable role played by horizontal branch (HB) morphology in the development of our understanding of globular clusters, and the Galaxy, over the last 50 years. I will describe the historical development of the 2ndP and discuss recent advances that are finally providing some answers. I will discuss how the controversies surrounding the nature of the 2ndP can be reconciled if we acknowledge that there are actually two distinct problems with entirely different solutions.
Electrodynamic properties of fractal clusters
NASA Astrophysics Data System (ADS)
Maksimenko, V. V.; Zagaynov, V. A.; Agranovski, I. E.
2014-07-01
The influence of interference on the character of light interaction both with an individual fractal cluster (FC) consisting of nanoparticles and with agglomerates of such clusters is investigated. Using methods of multiple scattering theory, the effective dielectric permeability of a micron-size FC composed of non-absorbing nanoparticles is calculated. The cluster can be characterized by a set of effective dielectric permeabilities, whose number coincides with the number of particles whose spatial arrangement in the cluster is correlated. If the fractal dimension is less than some critical value and the frequency lies in the visible spectrum, then the absolute value of the effective dielectric permeability becomes very large. This results in strong renormalization (decrease) of the incident radiation wavelength inside the cluster. The renormalized photons are cycled or trapped inside the system of multi-scaled cavities in the cluster. The lifetime of a photon localized inside an agglomerate of FCs is a macroscopic value, making it possible to observe stimulated emission of the localized light. The latter opens up the possibility of creating lasers without an inverse population of energy levels. Moreover, this allows problems of optical cloaking of macroscopic objects to be reconsidered. One more feature of fractal structures is the possibility of unimpeded propagation of light when any resistance associated with scattering disappears.
Some properties of ion and cluster plasma
Gudzenko, L.I.; Derzhiev, V.I.; Yakovlenko, S.I.
1982-11-01
The aggregate of problems connected with the physics of ion and cluster plasma is qualitatively considered. Such a plasma can exist when a dense gas is ionized by a hard ionizer. The conditions for the formation of an ion plasma and the difference between its characteristics and those of an ordinary electron plasma are discussed; a solvated-ion model and the distribution of the clusters with respect to the number of solvated molecules are considered. The recombination rate of the positively and negatively charged clusters is roughly estimated. The parameters of a ball-lightning plasma are estimated on the basis of the cluster model.
Hu, Xiaohua; Park, E K; Zhang, Xiaodan
2009-09-01
Generating high-quality gene clusters and identifying the underlying biological mechanism of the gene clusters are the important goals of clustering gene expression analysis. To get high-quality cluster results, most current approaches rely on choosing the best cluster algorithm, in which the design biases and assumptions meet the underlying distribution of the dataset. There are two issues with this approach: 1) usually, the underlying data distribution of the gene expression datasets is unknown, and 2) there are so many clustering algorithms available that it is very challenging to choose the proper one. To provide a textual summary of the gene clusters, the most explored approach is the extractive approach, which essentially builds upon techniques borrowed from information retrieval, in which the objective is to provide terms to be used for query expansion, and not to act as a stand-alone summary for the entire document sets. Another drawback is that the clustering quality and cluster interpretation are treated as two isolated research problems and are studied separately. In this paper, we design and develop a unified system, Gene Expression Miner, to address these challenging issues in a principled and general manner by integrating cluster ensemble, text clustering, and multidocument summarization, and provide an environment for comprehensive gene expression data analysis. We present a novel cluster ensemble approach to generate high-quality gene clusters. In our text summarization module, given a gene cluster, our expectation-maximization based algorithm can automatically identify subtopics and extract the most probable terms for each subtopic. Then, the extracted top k topical terms from each subtopic are combined to form the biological explanation of each gene cluster. Experimental results demonstrate that our system can obtain high-quality clusters and provide informative key terms for the gene clusters.
Self consistency grouping: a stringent clustering method
2012-01-01
Background Numerous types of clustering, like single linkage and K-means, have been widely studied and applied to a variety of scientific problems. However, the existing methods are not readily applicable for problems that demand high stringency. Methods Our method, self consistency grouping (SCG), yields clusters whose members are closer in rank to each other than to any member outside the cluster. We do not define a distance metric; we use the best known distance metric and presume that it measures the correct distance. SCG does not impose any restriction on the size or the number of the clusters that it finds. The boundaries of clusters are determined by the inconsistencies in the ranks. In addition to the direct implementation that finds the complete structure of the (sub)clusters, we implemented two faster versions. The fastest version is guaranteed to find only the clusters that are not subclusters of any other clusters, and the other version yields the same output as the direct implementation but does so more efficiently. Results Our tests, in which errors were deliberately introduced into the distance measurements, demonstrated that SCG yields very few false positives. Clustering of protein domain representatives by structural similarity showed that SCG could recover homologous groups with high precision. Conclusions SCG has potential for finding biological relationships under stringent conditions. PMID:23320864
On the clustering of multidimensional pictorial data
NASA Technical Reports Server (NTRS)
Bryant, J. D. (Principal Investigator)
1979-01-01
Obvious approaches to reducing the cost (in computer resources) of applying current clustering techniques to the problem of remote sensing are discussed. The use of spatial information in finding fields and in classifying mixture pixels is examined, and the AMOEBA clustering program is described. Internally a pattern recognition program, AMOEBA appears from without to be an unsupervised clustering program. It is fast and automatic. No choices (such as arbitrary thresholds to set split/combine sequences) need be made. The problem of finding the number of clusters is solved automatically. At the conclusion of the program, all points in the scene are classified; however, a provision is included for a reject classification of some points which, within the theoretical framework, cannot rationally be assigned to any cluster.
Swarm Intelligence in Text Document Clustering
Cui, Xiaohui; Potok, Thomas E
2008-01-01
Social animals or insects in nature often exhibit a form of emergent collective behavior. The research field that attempts to design algorithms or distributed problem-solving devices inspired by the collective behavior of social insect colonies is called Swarm Intelligence. Compared to traditional algorithms, swarm algorithms are usually flexible, robust, decentralized and self-organized. These characteristics make swarm algorithms suitable for solving complex problems, such as document collection clustering. The major challenge in today's information society is that users are overwhelmed with information on any topic they search for. Fast and high-quality document clustering algorithms play an important role in helping users to effectively navigate, summarize, and organize this overwhelming information. In this chapter, we introduce three nature-inspired swarm intelligence clustering approaches for document clustering analysis. These clustering algorithms use stochastic and heuristic principles discovered from observing bird flocks, fish schools and ant food foraging.
Bipartite graph partitioning and data clustering
Zha, Hongyuan; He, Xiaofeng; Ding, Chris; Gu, Ming; Simon, Horst D.
2001-05-07
Many data types arising from data mining applications can be modeled as bipartite graphs; examples include terms and documents in a text corpus, customers and purchased items in market basket analysis, and reviewers and movies in a movie recommender system. In this paper, the authors propose a new data clustering method based on partitioning the underlying bipartite graph. The partition is constructed by minimizing a normalized sum of edge weights between unmatched pairs of vertices of the bipartite graph. They show that an approximate solution to the minimization problem can be obtained by computing a partial singular value decomposition (SVD) of the associated edge weight matrix of the bipartite graph. They point out the connection of their clustering algorithm to correspondence analysis used in multivariate analysis. They also briefly discuss the issue of assigning data objects to multiple clusters. In the experimental results, they apply their clustering algorithm to the problem of document clustering to illustrate its effectiveness and efficiency.
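The SVD-based relaxation can be sketched as follows. The degree normalization and the stacking of row- and column-vertex embeddings follow the usual spectral co-clustering recipe; the toy document-term matrix is invented for illustration and is not from the paper:

```python
import numpy as np

def bipartite_spectral_embedding(W, k):
    """Embed both vertex sets of a bipartite graph (edge-weight matrix W,
    rows = documents, columns = terms) via a partial SVD of the degree-
    normalized matrix -- the spectral relaxation of the normalized cut.
    Running k-means on the stacked embedding co-clusters docs and terms."""
    d1 = np.maximum(W.sum(axis=1), 1e-12)   # row-vertex degrees
    d2 = np.maximum(W.sum(axis=0), 1e-12)   # column-vertex degrees
    Wn = W / np.sqrt(d1)[:, None] / np.sqrt(d2)[None, :]
    U, s, Vt = np.linalg.svd(Wn, full_matrices=False)
    # singular vectors 2..k give the relaxed partition indicators
    return np.vstack([U[:, 1:k] / np.sqrt(d1)[:, None],
                      Vt.T[:, 1:k] / np.sqrt(d2)[:, None]])

# two document-term blocks with weak cross-links
W = np.array([[2.0, 2.0, 0.1, 0.0],
              [2.0, 2.0, 0.0, 0.1],
              [0.1, 0.0, 3.0, 3.0],
              [0.0, 0.1, 3.0, 3.0]])
Z = bipartite_spectral_embedding(W, k=2)  # 4 document rows, then 4 term rows
```

For this near-block-diagonal W, the second singular vector separates the two document groups (and the two term groups) by sign, so documents and their characteristic terms fall into the same cluster.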
Open-box spectral clustering: applications to medical image analysis.
Schultz, Thomas; Kindlmann, Gordon L
2013-12-01
Spectral clustering is a powerful and versatile technique, whose broad range of applications includes 3D image analysis. However, its practical use often involves a tedious and time-consuming process of tuning parameters and making application-specific choices. In the absence of training data with labeled clusters, help from a human analyst is required to decide the number of clusters, to determine whether hierarchical clustering is needed, and to define the appropriate distance measures, parameters of the underlying graph, and type of graph Laplacian. We propose to simplify this process via an open-box approach, in which an interactive system visualizes the involved mathematical quantities, suggests parameter values, and provides immediate feedback to support the required decisions. Our framework focuses on applications in 3D image analysis, and links the abstract high-dimensional feature space used in spectral clustering to the three-dimensional data space. This provides a better understanding of the technique, and helps the analyst predict how well specific parameter settings will generalize to similar tasks. In addition, our system supports filtering outliers and labeling the final clusters in such a way that user actions can be recorded and transferred to different data in which the same structures are to be found. Our system supports a wide range of inputs, including triangular meshes, regular grids, and point clouds. We use our system to develop segmentation protocols in chest CT and brain MRI that are then successfully applied to other datasets in an automated manner.
Analysis of Massive Emigration from Poland: The Model-Based Clustering Approach
NASA Astrophysics Data System (ADS)
Witek, Ewa
The model-based approach assumes that the data are generated by a finite mixture of probability distributions, such as multivariate normal distributions. In finite mixture models, each component probability distribution corresponds to a cluster. The problems of determining the number of clusters and of choosing an appropriate clustering method become the problem of statistical model choice. Hence, the model-based approach provides a key advantage over heuristic clustering algorithms, because it selects both the correct model and the number of clusters.
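A minimal sketch of this model choice, assuming Gaussian components and using BIC as the selection criterion (one common instantiation of model-based clustering, not necessarily the exact variant used in the study; the data are synthetic):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Each mixture component corresponds to one cluster; the number of clusters
# is chosen by statistical model selection (lowest BIC) rather than by a
# heuristic stopping rule.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-4.0, 1.0, size=(150, 2)),
               rng.normal(4.0, 1.0, size=(150, 2))])

bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
       for k in range(1, 5)}
best_k = min(bic, key=bic.get)  # model choice doubles as cluster-number choice
```

The BIC penalty makes the extra parameters of a three- or four-component model unprofitable for data that genuinely come from two well-separated components.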
Large scale cluster computing workshop
Dane Skow; Alan Silverman
2002-12-23
Recent revolutions in computer hardware and software technologies have paved the way for the large-scale deployment of clusters of commodity computers to address problems heretofore the domain of tightly coupled SMP processors. Near-term projects within High Energy Physics and other computing communities will deploy clusters of thousands of processors, to be used by hundreds to thousands of independent users. This will expand the reach in both dimensions by an order of magnitude from the current successful production facilities. The goals of this workshop were: (1) to determine what tools exist which can scale up to the cluster sizes foreseen for the next generation of HENP experiments (several thousand nodes) and, by implication, to identify areas where some investment of money or effort is likely to be needed; (2) to compare and record experiences gained with such tools; (3) to produce a practical guide to all stages of planning, installing, building and operating a large computing cluster in HENP; (4) to identify and connect groups with similar interests within HENP and the larger clustering community.
Performance Comparison Of Evolutionary Algorithms For Image Clustering
NASA Astrophysics Data System (ADS)
Civicioglu, P.; Atasever, U. H.; Ozkan, C.; Besdok, E.; Karkinli, A. E.; Kesikoglu, A.
2014-09-01
Evolutionary computation tools are able to process real-valued numerical sets in order to extract suboptimal solutions of a designed problem. Data clustering algorithms have been intensively used for image segmentation in remote sensing applications. Despite the wide usage of evolutionary algorithms for data clustering, their clustering performance has been scarcely studied by using clustering validation indexes. In this paper, recently proposed evolutionary algorithms (i.e., Artificial Bee Colony Algorithm (ABC), Gravitational Search Algorithm (GSA), Cuckoo Search Algorithm (CS), Adaptive Differential Evolution Algorithm (JADE), Differential Search Algorithm (DSA) and Backtracking Search Optimization Algorithm (BSA)) and some classical image clustering techniques (i.e., k-means, FCM, SOM networks) have been used to cluster images, and their performances have been compared by using four clustering validation indexes. Experimental test results showed that evolutionary algorithms give more reliable cluster centers than classical clustering techniques, but their convergence time is quite long.
A vector reconstruction based clustering algorithm particularly for large-scale text collection.
Liu, Ming; Wu, Chong; Chen, Lei
2015-03-01
Along with the fast evolution of internet technology, internet users face a large amount of textual data every day. Organizing texts into categories can help users dig useful information out of large-scale text collections. Clustering is one of the most promising tools for categorizing texts due to its unsupervised characteristic. Unfortunately, most traditional clustering algorithms lose their high quality on large-scale text collections, which is mainly attributable to the high-dimensional vector space and semantic similarity among texts. To effectively and efficiently cluster large-scale text collections, this paper puts forward a vector reconstruction based clustering algorithm. Only the features that can represent the cluster are preserved in the cluster's representative vector. The algorithm alternately repeats two sub-processes until it converges. One is the partial tuning sub-process, where each feature's weight is fine-tuned by an iterative process similar to the self-organizing map (SOM) algorithm. To accelerate clustering, an intersection-based similarity measurement and its corresponding neuron adjustment function are proposed and implemented in this sub-process. The other is the overall tuning sub-process, where the features are reallocated among different clusters. In this sub-process, the features useless for representing the cluster are removed from the cluster's representative vector. Experimental results on three text collections (two small-scale and one large-scale) demonstrate that our algorithm obtains high-quality performance on both small-scale and large-scale text collections.
PREFACE: Nuclear Cluster Conference; Cluster'07
NASA Astrophysics Data System (ADS)
Freer, Martin
2008-05-01
The Cluster Conference is a long-running conference series dating back to the 1960s, the first being initiated by Wildermuth in Bochum, Germany, in 1969. The most recent meeting was held in Nara, Japan, in 2003, and in 2007 the 9th Cluster Conference was held in Stratford-upon-Avon, UK. As the name suggests, the town of Stratford lies upon the River Avon, and shortly before the conference, due to unprecedented rainfall in the area (approximately 10 cm within half a day), lay in the River Avon! Stratford is the birthplace of the 'Bard of Avon' William Shakespeare, and this formed an intriguing conference backdrop. The meeting was attended by some 90 delegates, and the programme contained 65-70 oral presentations; it was opened by a historical perspective presented by Professor Brink (Oxford) and closed by Professor Horiuchi (RCNP) with an overview of the conference and future perspectives. In between, the conference covered aspects of clustering in exotic nuclei (both neutron- and proton-rich), molecular structures in which valence neutrons are exchanged between cluster cores, condensates in nuclei, neutron clusters, superheavy nuclei, clusters in nuclear astrophysical processes, and exotic cluster decays such as 2p and ternary cluster decay. The field of nuclear clustering has become strongly influenced by the physics of radioactive beam facilities (reflected in the programme), and by the excitement that clustering may have an important impact on the structure of nuclei at the neutron drip-line. It was clear that since Nara the field had progressed substantially, and that new themes had emerged and others had crystallized. Two particular topics resonated strongly: condensates and nuclear molecules. These topics are thus likely to be central in the next Cluster Conference, which will be held in 2011 in the Hungarian city of Debrecen. Martin Freer
Properties and Formation of Star Clusters
NASA Astrophysics Data System (ADS)
Sharina, M. E.
2016-03-01
Many key problems in astrophysics involve research on the properties of star clusters, for example: stellar evolution and nucleosynthesis, the history of star formation in galaxies, the formation dynamics of galaxies and their subsystems, the calibration of the fundamental distance scale in the universe, and the luminosity functions of stars and star clusters. This review is intended to familiarize the reader with modern observational and theoretical data on the formation and evolution of star clusters in our galaxy and others. Unsolved problems in this area are formulated and research on ways to solve them is discussed. In particular, some of the most important current observational and theoretical problems include: (1) a more complete explanation of the physical processes in molecular clouds leading to the formation and evolution of massive star clusters; (2) observation of these objects in different stages of evolution, including protoclusters, at wavelengths where interstellar absorption is minimal; and (3) comparison of the properties of massive star clusters in different galaxies and of galaxies during the most active star formation phase at different redshifts. The main goal in solving these problems is to explain the variations in the abundance of chemical elements and in the multiple populations of stars in clusters discovered at the end of the twentieth century.
Cluster-localized sparse logistic regression for SNP data.
Binder, Harald; Müller, Tina; Schwender, Holger; Golka, Klaus; Steffens, Michael; Hengstler, Jan G; Ickstadt, Katja; Schumacher, Martin
2012-08-14
The task of analyzing high-dimensional single nucleotide polymorphism (SNP) data in a case-control design using multivariable techniques has only recently been tackled. While many available approaches investigate only main effects in a high-dimensional setting, we propose a more flexible technique, cluster-localized regression (CLR), based on localized logistic regression models, that allows different SNPs to have an effect for different groups of individuals. Separate multivariable regression models are fitted for the different groups of individuals by incorporating weights into componentwise boosting, which provides simultaneous variable selection, hence sparse fits. For model fitting, these groups of individuals are identified using a clustering approach, where each group may be defined via different SNPs. This allows for representing complex interaction patterns, such as compositional epistasis, that might not be detected by a single main effects model. In a simulation study, the CLR approach results in improved prediction performance, compared to the main effects approach, and identification of important SNPs in several scenarios. Improved prediction performance is also obtained for an application example considering urinary bladder cancer. Some of the identified SNPs are predictive for all individuals, while others are only relevant for a specific group. Together with the sets of SNPs that define the groups, potential interaction patterns are uncovered.
Misty Mountain clustering: application to fast unsupervised flow cytometry gating
2010-01-01
Background There are many important clustering questions in computational biology for which no satisfactory method exists. Automated clustering algorithms, when applied to large, multidimensional datasets, such as flow cytometry data, prove unsatisfactory in terms of speed, problems with local minima or cluster shape bias. Model-based approaches are restricted by the assumptions of the fitting functions. Furthermore, model-based clustering requires serial clustering for all cluster numbers within a user-defined interval. The final cluster number is then selected by various criteria. These supervised serial clustering methods are time consuming, and frequently different criteria result in different optimal cluster numbers. Various unsupervised heuristic approaches that have been developed, such as affinity propagation, are too expensive to be applied to datasets on the order of 10^6 points that are often generated by high-throughput experiments. Results To circumvent these limitations, we developed a new, unsupervised density contour clustering algorithm, called Misty Mountain, that is based on percolation theory and that efficiently analyzes large data sets. The approach can be envisioned as a progressive top-down removal of clouds covering a data histogram relief map to identify clusters by the appearance of statistically distinct peaks and ridges. This is a parallel clustering method that finds every cluster after analyzing the cross sections of the histogram only once. The overall run time for the composite steps of the algorithm increases linearly with the number of data points. The clustering of 10^6 data points in 2D data space takes about 15 seconds on a standard laptop PC. Comparison of the performance of this algorithm with other state-of-the-art automated flow cytometry gating methods indicates that Misty Mountain provides substantial improvements in both run time and in the accuracy of cluster assignment. Conclusions Misty Mountain is fast, unbiased
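The top-down "cloud removal" idea is easy to sketch in one dimension. The following toy example (invented data, bin count, and threshold schedule; not the authors' implementation) lowers a threshold over a histogram until distinct peaks appear as separate runs of bins:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for flow cytometry data: two well-separated 1-D modes.
data = np.concatenate([rng.normal(-3, 0.5, 5000), rng.normal(3, 0.5, 5000)])
counts, edges = np.histogram(data, bins=100)

def runs_above(counts, level):
    """Connected runs of histogram bins whose count exceeds `level`."""
    above = counts > level
    starts = np.flatnonzero(above & ~np.r_[False, above[:-1]])
    ends = np.flatnonzero(above & ~np.r_[above[1:], False])
    return list(zip(starts, ends))

# Top-down sweep: lower the "cloud" level until distinct peaks emerge.
level = counts.max()
peaks = []
while level > 0 and len(peaks) < 2:
    level -= 5
    peaks = runs_above(counts, level)
```

Each run of bins that surfaces during the sweep corresponds to one density peak, i.e. one candidate cluster; the real algorithm additionally tests peaks for statistical significance and works in higher dimensions.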
Improved Ant Colony Clustering Algorithm and Its Performance Study.
Gao, Wei
2016-01-01
Clustering analysis is used in many disciplines and applications; it is an important tool that descriptively identifies homogeneous groups of objects based on attribute values. The ant colony clustering algorithm is a swarm-intelligent method used for clustering problems that is inspired by the behavior of ant colonies that cluster their corpses and sort their larvae. A new abstraction ant colony clustering algorithm using a data combination mechanism is proposed to improve the computational efficiency and accuracy of the ant colony clustering algorithm. The abstraction ant colony clustering algorithm is used to cluster benchmark problems, and its performance is compared with that of the ant colony clustering algorithm and other methods from the existing literature. For problems of similar computational difficulty and complexity, the results show that the abstraction ant colony clustering algorithm produces results that are not only more accurate but also more efficiently determined than the ant colony clustering algorithm and the other methods. Thus, the abstraction ant colony clustering algorithm can be used for efficient multivariate data clustering. PMID:26839533
A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data.
Shirkhorshidi, Ali Seyed; Aghabozorgi, Saeed; Wah, Teh Ying
2015-01-01
Similarity or distance measures are core components used by distance-based clustering algorithms to cluster similar data points into the same clusters, while dissimilar or distant data points are placed into different clusters. The performance of similarity measures is mostly addressed in two or three-dimensional spaces, beyond which, to the best of our knowledge, there is no empirical study that has revealed the behavior of similarity measures when dealing with high-dimensional datasets. To fill this gap, a technical framework is proposed in this study to analyze, compare and benchmark the influence of different similarity measures on the results of distance-based clustering algorithms. For reproducibility purposes, fifteen publicly available datasets were used for this study, and consequently, future distance measures can be evaluated and compared with the results of the measures discussed in this work. These datasets were classified as low and high-dimensional categories to study the performance of each measure against each category. This research should help the research community to identify suitable distance measures for datasets and also to facilitate a comparison and evaluation of the newly proposed similarity or distance measures with traditional ones. PMID:26658987
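The high-dimensional effect motivating this study, distance concentration, can be demonstrated directly. This generic sketch (not one of the paper's fifteen benchmark datasets) measures how the relative contrast between the nearest and farthest points collapses as dimensionality grows:

```python
import numpy as np

rng = np.random.default_rng(1)

def relative_contrast(dim, n=500):
    """(d_max - d_min) / d_min of Euclidean distances from the origin
    to n uniform random points in the unit hypercube of dimension `dim`."""
    d = np.linalg.norm(rng.random((n, dim)), axis=1)
    return (d.max() - d.min()) / d.min()

low_d = relative_contrast(2)
high_d = relative_contrast(1000)
# In high dimensions the contrast collapses, so Euclidean-distance-based
# cluster assignments carry far less discriminative information.
```

This is one reason the choice of measure matters more in high dimensions: measures degrade at different rates as the contrast shrinks.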
Cluster automorphism groups of cluster algebras with coefficients
NASA Astrophysics Data System (ADS)
Chang, Wen; Zhu, Bin
2016-10-01
We study the cluster automorphism group of a skew-symmetric cluster algebra with geometric coefficients. For this, we introduce the notion of gluing free cluster algebra, and show that under a weak condition the cluster automorphism group of a gluing free cluster algebra is a subgroup of the cluster automorphism group of its principal part cluster algebra (i.e. the corresponding cluster algebra without coefficients). We show that several classes of cluster algebras with coefficients are gluing free, for example, cluster algebras with principal coefficients, cluster algebras with universal geometric coefficients, and cluster algebras from surfaces (except a 4-gon) with coefficients from boundaries. Moreover, except four kinds of surfaces, the cluster automorphism group of a cluster algebra from a surface with coefficients from boundaries is isomorphic to the cluster automorphism group of its principal part cluster algebra; for a cluster algebra with principal coefficients, its cluster automorphism group is isomorphic to the automorphism group of its initial quiver.
NASA Astrophysics Data System (ADS)
Tran, Binh; Xue, Bing; Zhang, Mengjie; Nguyen, Su
2016-07-01
Feature selection is an essential step in classification tasks with a large number of features, such as in gene expression data. Recent research has shown that particle swarm optimisation (PSO) is a promising approach to feature selection. However, it also has a potential limitation of getting stuck in local optima, especially for gene selection problems with a huge search space. Therefore, we developed a PSO algorithm (PSO-LSRG) with a fast "local search" combined with a gbest resetting mechanism as a way to improve the performance of PSO for feature selection. Furthermore, since many existing PSO-based feature selection approaches on gene expression data have feature selection bias, i.e. no unseen test data is used, two sets of experiments on 10 gene expression datasets were designed: with and without feature selection bias. As compared to standard PSO, PSO with gbest resetting only, and PSO with local search only, PSO-LSRG obtained a substantial dimensionality reduction and a significant improvement in classification performance in both sets of experiments. PSO-LSRG outperforms the other three algorithms when feature selection bias exists. When there is no feature selection bias, PSO-LSRG selects the smallest number of features in all cases, but the classification performance is slightly worse in a few cases, which may be caused by overfitting. This shows that feature selection bias should be avoided when designing a feature selection algorithm to ensure its generalisation ability on unseen data.
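A minimal sketch of gbest PSO with a crude resetting step may clarify the mechanism. It optimizes a toy continuous objective rather than performing gene selection, and every parameter value here is illustrative, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(6)

def sphere(x):
    """Toy continuous objective standing in for classification error."""
    return np.sum(x ** 2, axis=-1)

n_particles, dim, iters = 20, 5, 200
pos = rng.uniform(-5, 5, (n_particles, dim))
vel = np.zeros_like(pos)
pbest = pos.copy()                       # per-particle best positions
pbest_val = sphere(pos)
gbest = pbest[pbest_val.argmin()].copy()  # swarm-wide best position

w, c1, c2 = 0.7, 1.5, 1.5  # illustrative inertia and acceleration weights
for it in range(iters):
    r1, r2 = rng.random((2, n_particles, dim))
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    val = sphere(pos)
    improved = val < pbest_val
    pbest[improved] = pos[improved]
    pbest_val[improved] = val[improved]
    gbest = pbest[pbest_val.argmin()].copy()
    # Sketch of a gbest reset: if the swarm seems stuck far from any
    # optimum, try re-seeding the global best from a random point.
    if it % 50 == 49 and sphere(gbest) > 1.0:
        candidate = rng.uniform(-1, 1, dim)
        if sphere(candidate) < sphere(gbest):
            gbest = candidate
```

PSO-LSRG additionally runs a local search around gbest; for feature selection, positions would be binary feature masks rather than real vectors.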
Unsupervised fuzzy clustering using Weighted Incremental Neural Networks.
Muhammed, Hamed Hamid
2004-12-01
A new, more efficient variant of a recently developed algorithm for unsupervised fuzzy clustering is introduced. A Weighted Incremental Neural Network (WINN) is introduced and used for this purpose. The new approach is called FC-WINN (Fuzzy Clustering using WINN). The WINN algorithm produces a net of nodes connected by edges, which reflects and preserves the topology of the input data set. Additional weights, proportional to the local densities in input space, are associated with the resulting nodes and edges to store useful information about the topological relations in the given input data set. A fuzziness factor, proportional to the connectedness of the net, is introduced in the system. A watershed-like procedure is used to cluster the resulting net, and the number of resulting clusters is determined by this procedure. Only two parameters must be chosen by the user of the FC-WINN algorithm, to determine the resolution and the connectedness of the net. The only other parameters to be specified are those required by the underlying incremental neural network, which is a modified version of the Growing Neural Gas algorithm (GNG). The FC-WINN algorithm is computationally efficient when compared to other approaches for clustering large high-dimensional data sets. PMID:15714603
Large-Scale Multi-Dimensional Document Clustering on GPU Clusters
Cui, Xiaohui; Mueller, Frank; Zhang, Yongpeng; Potok, Thomas E
2010-01-01
Document clustering plays an important role in data mining systems. Recently, a flocking-based document clustering algorithm has been proposed to solve the problem through simulation resembling the flocking behavior of birds in nature. This method is superior to other clustering algorithms, including k-means, in the sense that the outcome is not sensitive to the initial state. One limitation of this approach is that the algorithmic complexity is inherently quadratic in the number of documents. As a result, execution time becomes a bottleneck with a large number of documents. In this paper, we assess the benefits of exploiting the computational power of Beowulf-like clusters equipped with contemporary Graphics Processing Units (GPUs) as a means to significantly reduce the runtime of flocking-based document clustering. Our framework scales up to over one million documents processed simultaneously in a sixteen-node GPU cluster. Results are also compared to a four-node cluster with higher-end GPUs. On these clusters, we observe 30X-50X speedups, which demonstrates the potential of GPU clusters to efficiently solve massive data mining problems. Such speedups combined with the scalability potential and accelerator-based parallelization are unique in the domain of document-based data mining, to the best of our knowledge.
Web document clustering using hyperlink structures
He, Xiaofeng; Zha, Hongyuan; Ding, Chris H.Q; Simon, Horst D.
2001-05-07
With the exponential growth of information on the World Wide Web, there is great demand for developing efficient and effective methods for organizing and retrieving the information available. Document clustering plays an important role in information retrieval and taxonomy management for the World Wide Web and remains an interesting and challenging problem in the field of web computing. In this paper we consider document clustering methods exploring textual information, hyperlink structure, and co-citation relations. In particular, we apply the normalized-cut clustering method developed in computer vision to the task of hyperdocument clustering. We also explore some theoretical connections of the normalized-cut method to the K-means method. We then experiment with the normalized-cut method in the context of clustering query result sets for web search engines.
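The normalized-cut relaxation the authors apply reduces to an eigenproblem on the graph Laplacian: the sign pattern of the second ("Fiedler") eigenvector of the symmetrically normalized Laplacian bipartitions the graph. A minimal sketch on an invented six-node "co-citation" graph:

```python
import numpy as np

# Adjacency of a tiny hypothetical co-citation graph: two triangles
# (documents 0-2 and 3-5) joined by one weak link.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1.0
A[2, 3] = A[3, 2] = 0.1  # weak inter-community edge

deg = A.sum(axis=1)
L = np.diag(deg) - A                       # unnormalized graph Laplacian
Dinv = np.diag(1.0 / np.sqrt(deg))
Lsym = Dinv @ L @ Dinv                     # normalized Laplacian

vals, vecs = np.linalg.eigh(Lsym)          # eigenvalues in ascending order
fiedler = vecs[:, 1]                       # second-smallest eigenvector
part = fiedler > 0                         # sign split = relaxed normalized cut
```

The weak 0.1 edge is the one the cut severs, separating the two document communities.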
Webster, Clayton; Tempone, Raul; Nobile, Fabio
2007-12-01
This work describes the convergence analysis of a Smolyak-type sparse grid stochastic collocation method for the approximation of statistical quantities related to the solution of partial differential equations with random coefficients and forcing terms (input data of the model). To compute solution statistics, the sparse grid stochastic collocation method uses approximate solutions, produced here by finite elements, corresponding to a deterministic set of points in the random input space. This naturally requires solving uncoupled deterministic problems and, as such, the derived strong error estimates for the fully discrete solution are used to compare the computational efficiency of the proposed method with the Monte Carlo method. Numerical examples illustrate the theoretical results and are used to compare this approach with several others, including the standard Monte Carlo.
HCsnip: An R Package for Semi-supervised Snipping of the Hierarchical Clustering Tree
Obulkasim, Askar; van de Wiel, Mark A
2015-01-01
Hierarchical clustering (HC) is one of the most frequently used methods in computational biology in the analysis of high-dimensional genomics data. Given a data set, HC outputs a binary tree whose leaves are the data points and whose internal nodes represent clusters of various sizes. Normally, a fixed-height cut on the HC tree is chosen, and each contiguous branch of data points below that height is considered a separate cluster. However, the fixed-height branch cut may not be ideal in situations where one expects a complicated tree structure with nested clusters. Furthermore, due to lack of utilization of related background information in selecting the cutoff, induced clusters are often difficult to interpret. This paper describes a novel procedure that aims to automatically extract meaningful clusters from the HC tree in a semi-supervised way. The procedure is implemented in the R package HCsnip, available from Bioconductor. Rather than cutting the HC tree at a fixed height, HCsnip probes various ways of snipping, possibly at variable heights, to tease out hidden clusters ensconced deep down in the tree. The cluster extraction process utilizes, along with the data set from which the HC tree is derived, commonly available background information. Consequently, the extracted clusters are highly reproducible and robust against the various sources of variation that “haunt” high-dimensional genomics data. Since the clustering process is guided by the background information, clusters are easy to interpret. Unlike existing packages, no constraint is placed on the data type on which clustering is desired. In particular, the package accepts patient follow-up data for guiding the cluster extraction process. To our knowledge, HCsnip is the first package that is able to decompose the HC tree into clusters with piecewise snipping under the guidance of patient time-to-event information. Our implementation of the semi-supervised HC tree snipping framework is generic, and
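The effect of cutting at different heights can be seen in a small sketch. The naive single-linkage implementation and toy data below are illustrative, not the HCsnip code; HCsnip additionally lets the cut height vary across branches under the guidance of background information:

```python
import numpy as np

# Toy 1-D data with nested structure: two tight pairs inside one loose
# group, plus a distant pair.
data = np.array([0.0, 0.1, 1.0, 1.1, 10.0, 10.2])

def single_linkage(points):
    """Naive O(n^3) single-linkage agglomeration; returns merge heights."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(abs(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((d, clusters[a] + clusters[b]))
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merges[-1][1])
    return merges

def cut(points, height):
    """Clusters obtained by snipping all merges above `height`."""
    clusters = [[i] for i in range(len(points))]
    for d, merged in single_linkage(points):
        if d <= height:
            clusters = [c for c in clusters if not set(c) <= set(merged)]
            clusters.append(merged)
    return sorted(tuple(sorted(c)) for c in clusters)
```

Here `cut(data, 0.5)` keeps the tight pairs apart (three clusters) while `cut(data, 2.0)` merges the nested pairs (two clusters): no single fixed height reveals both levels of structure at once.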
[Autism Spectrum Disorder and DSM-5: Spectrum or Cluster?].
Kienle, Xaver; Freiberger, Verena; Greulich, Heide; Blank, Rainer
2015-01-01
Within the new DSM-5, the previously differentiated subgroups of "Autistic Disorder" (299.0), "Asperger's Disorder" (299.80) and "Pervasive Developmental Disorder" (299.80) are replaced by the more general "Autism Spectrum Disorder". With regard to patient-oriented and efficient therapy planning, however, the question of an empirically reproducible and clinically feasible differentiation into subgroups must still be raised. Based on two autism rating scales (ASDS and FSK), an exploratory two-step cluster analysis was conducted with N=103 children (age: 5-18) seen in our social-pediatric health care centre to examine potentially autistic symptoms. In the two-cluster solution of both rating scales, mainly the problems in social communication grouped the children into a cluster "with communication problems" (51% and 41%) and a cluster "without communication problems". Within the three-cluster solution of the ASDS, sensory hypersensitivity, cleaving to routines and social-communicative problems generated an "autistic" subgroup (22%). The children of the second cluster ("communication problems", 35%) were described only by social-communicative problems, and the third group did not show any problems (38%). In the three-cluster solution of the FSK, the "autistic cluster" of the two-cluster solution differentiated into a subgroup with mainly social-communicative problems (cluster 1) and a second subgroup described by restrictive, repetitive behavior. The different cluster solutions are discussed with a view to the new DSM-5 diagnostic criteria; for subsequent studies, a further specification of some of the ASDS and FSK items could be helpful. PMID:26289149
Thermodynamics of confined gallium clusters
NASA Astrophysics Data System (ADS)
Chandrachud, Prachi
2015-11-01
We report the results of ab initio molecular dynamics simulations of Ga13 and Ga17 clusters confined inside carbon nanotubes with different diameters. The cluster-tube interaction is simulated by the Lennard-Jones (LJ) potential. We discuss the geometries, the nature of the bonding and the thermodynamics under confinement. The geometries as well as the isomer spectra of both the clusters are significantly affected. The degree of confinement decides the dimensionality of the clusters. We observe that a number of low-energy isomers appear under moderate confinement while some isomers seen in the free space disappear. Our finite-temperature simulations bring out interesting aspects, namely that the heat capacity curve is flat, even though the ground state is symmetric. Such a flat nature indicates that the phase change is continuous. This effect is due to the restricted phase space available to the system. These observations are supported by the mean square displacement of individual atoms, which are significantly smaller than in free space. The nature of the bonding is found to be approximately jellium-like. Finally we note the relevance of the work to the problem of single file diffusion for the case of the highest confinement.
Matlab Cluster Ensemble Toolbox v. 1.0
2009-04-27
This is a Matlab toolbox for investigating the application of cluster ensembles to data classification, with the objective of improving the accuracy and/or speed of clustering. The toolbox divides the cluster ensemble problem into four areas, providing functionality for each. These include, (1) synthetic data generation, (2) clustering to generate individual data partitions and similarity matrices, (3) consensus function generation and final clustering to generate ensemble data partitioning, and (4) implementation of accuracy metrics. With regard to data generation, Gaussian data of arbitrary dimension can be generated. The kcenters algorithm can then be used to generate individual data partitions by either, (a) subsampling the data and clustering each subsample, or by (b) randomly initializing the algorithm and generating a clustering for each initialization. In either case an overall similarity matrix can be computed using a consensus function operating on the individual similarity matrices. A final clustering can be performed and performance metrics are provided for evaluation purposes.
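Step (3), consensus from the individual partitions, is often implemented via a co-association matrix. A Python sketch of that idea (the toolbox itself is Matlab; the data, the tiny Lloyd-iteration clusterer, and the run count here are all invented):

```python
import numpy as np

rng = np.random.default_rng(2)
# Two well-separated 1-D blobs of 20 points each.
data = np.concatenate([rng.normal(0, 0.3, 20), rng.normal(5, 0.3, 20)])
n = len(data)

def kmeans_labels(x, k, rng):
    """A few Lloyd iterations from a random initialization."""
    centers = rng.choice(x, size=k, replace=False)
    for _ in range(10):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean()
    return labels

# Co-association matrix: fraction of runs in which each pair of points
# lands in the same cluster.
runs = 20
co = np.zeros((n, n))
for _ in range(runs):
    labels = kmeans_labels(data, 2, rng)
    co += labels[:, None] == labels[None, :]
co /= runs

# Consensus: pairs co-clustered in a majority of runs belong together.
consensus = co > 0.5
```

A final clustering of the `co` matrix (e.g. by thresholding or another clusterer) yields the ensemble partition; the toolbox's consensus functions generalize this.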
Star clusters as simple stellar populations.
Bruzual A, Gustavo
2010-02-28
In this paper, I review to what extent we can understand the photometric properties of star clusters, and of low-mass, unresolved galaxies, in terms of population-synthesis models designed to describe 'simple stellar populations' (SSPs), i.e. groups of stars born at the same time, in the same volume of space and from a gas cloud of homogeneous chemical composition. The photometric properties predicted by these models do not readily match the observations of most star clusters, unless we properly take into account the expected variation in the number of stars occupying sparsely populated evolutionary stages, owing to stochastic fluctuations in the stellar initial mass function. In this case, population-synthesis models reproduce remarkably well the full ranges of observed integrated colours and absolute magnitudes of star clusters of various ages and metallicities. The disagreement between the model predictions and observations of cluster colours and magnitudes may indicate problems with or deficiencies in the modelling, and does not necessarily tell us that star clusters do not behave like SSPs. Matching the photometric properties of star clusters using SSP models is a necessary (but not sufficient) condition for clusters to be considered SSPs. Composite models, characterized by complex star-formation histories, also match the observed cluster colours.
Collaborative Clustering for Sensor Networks
NASA Technical Reports Server (NTRS)
Wagstaff, Kiri L.; Green, Jillian; Lane, Terran
2011-01-01
Traditionally, nodes in a sensor network simply collect data and then pass it on to a centralized node that archives, distributes, and possibly analyzes the data. However, analysis at the individual nodes could enable faster detection of anomalies or other interesting events, as well as faster responses such as sending out alerts or increasing the data collection rate. There is an additional opportunity for increased performance if individual nodes can communicate directly with their neighbors. Previously, a method was developed by which machine learning classification algorithms could collaborate to achieve high performance autonomously (without requiring human intervention). This method worked for supervised learning algorithms, in which labeled data is used to train models. The learners collaborated by exchanging labels describing the data. The new advance enables clustering algorithms, which do not use labeled data, to also collaborate. This is achieved by defining a new language for collaboration that uses pair-wise constraints to encode useful information for other learners. These constraints specify that two items must, or cannot, be placed into the same cluster. Previous work has shown that clustering with these constraints (in isolation) already improves performance. In the problem formulation, each learner resides at a different node in the sensor network and makes observations (collects data) independently of the other learners. Each learner clusters its data and then selects a pair of items about which it is uncertain and uses them to query its neighbors. The resulting feedback (a must and cannot constraint from each neighbor) is combined by the learner into a consensus constraint, and it then reclusters its data while incorporating the new constraint. A strategy was also proposed for cleaning the resulting constraint sets, which may contain conflicting constraints; this improves performance significantly. This approach has been applied to collaborative
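Clustering under must/cannot pair-wise constraints can be sketched with a constrained assignment step, in the spirit of COP-k-means; the data and the consensus constraints below are hypothetical, not from the sensor-network system:

```python
import numpy as np

# Six 1-D observations at one node; constraints stand in for the
# consensus feedback aggregated from neighboring learners.
data = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
must_link = {(2, 3)}     # hypothetical "must" consensus constraint
cannot_link = {(0, 5)}   # hypothetical "cannot" consensus constraint

def violates(item, cluster, assign):
    """Would assigning `item` to `cluster` break any constraint?"""
    for a, b in cannot_link:
        other = b if a == item else a if b == item else None
        if other is not None and assign.get(other) == cluster:
            return True
    for a, b in must_link:
        other = b if a == item else a if b == item else None
        if other is not None and assign.get(other) not in (None, cluster):
            return True
    return False

centers = np.array([0.1, 0.9])
for _ in range(5):  # constrained Lloyd iterations
    assign = {}
    for i in range(len(data)):
        # Try the nearest center first; fall back when a constraint
        # would be violated.
        for c in np.argsort(np.abs(data[i] - centers)):
            if not violates(i, int(c), assign):
                assign[i] = int(c)
                break
    labels = np.array([assign[i] for i in range(len(data))])
    for c in (0, 1):
        if np.any(labels == c):
            centers[c] = data[labels == c].mean()
```

The must-link pulls item 3 away from its nearest center, which is exactly how neighbor feedback reshapes a node's local clustering.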
ERIC Educational Resources Information Center
Brokes, Joy Cunningham
2010-01-01
New Jersey's urban students traditionally don't do well on the high stakes NJ High School Proficiency Assessment. Most current remedial mathematics curricula provide students with a plethora of problems like those traditionally found on the state test. This approach is not working. Finding better ways to teach our urban students may help close…
Improving performance through concept formation and conceptual clustering
NASA Technical Reports Server (NTRS)
Fisher, Douglas H.
1992-01-01
Research from June 1989 through October 1992 focussed on concept formation, clustering, and supervised learning for purposes of improving the efficiency of problem-solving, planning, and diagnosis. These projects resulted in two dissertations on clustering, explanation-based learning, and means-ends planning, and publications in conferences and workshops, several book chapters, and journals; a complete Bibliography of NASA Ames supported publications is included. The following topics are studied: clustering of explanations and problem-solving experiences; clustering and means-end planning; and diagnosis of space shuttle and space station operating modes.
Vesperini, Enrico
2010-02-28
Dynamical evolution plays a key role in shaping the current properties of star clusters and star cluster systems. A detailed understanding of the effects of evolutionary processes is essential to be able to disentangle the properties that result from dynamical evolution from those imprinted at the time of cluster formation. In this review, I focus my attention on globular clusters, and review the main physical ingredients driving their early and long-term evolution, describe the possible evolutionary routes and show how cluster structure and stellar content are affected by dynamical evolution.
NASA Astrophysics Data System (ADS)
Feng, Jian-xin; Tang, Jia-fu; Wang, Guang-xing
2007-04-01
On the basis of an analysis of clustering algorithms previously proposed for MANETs, a novel clustering strategy is proposed in this paper. With trust defined by statistical hypothesis testing in probability theory and the cluster head selected by node trust and node mobility, this strategy can detect malicious nodes, a function neglected by other clustering algorithms, and it overcomes a deficiency of the MOBIC algorithm, which cannot compute the relative mobility metric of corresponding nodes when the receiving power of two consecutive HELLO packets cannot be measured. It is an effective solution for clustering a MANET securely.
NASA Astrophysics Data System (ADS)
Lee, J. H.; Yoon, H.; Kitanidis, P. K.; Werth, C. J.; Valocchi, A. J.
2015-12-01
Characterizing subsurface properties, particularly hydraulic conductivity, is crucial for reliable and cost-effective groundwater supply management, contaminant remediation, and emerging deep subsurface activities such as geologic carbon storage and unconventional resource recovery. With recent advances in sensor technology, a large volume of hydro-geophysical and chemical data can be obtained to achieve high-resolution images of subsurface properties, which can be used for accurate subsurface flow and reactive transport predictions. However, subsurface characterization with a plethora of information requires high, often prohibitive, computational costs associated with "big data" processing and large-scale numerical simulations. As a result, traditional inversion techniques are not well-suited for problems that require coupled multi-physics simulation models with massive data. In this work, we apply a scalable inversion method called the Principal Component Geostatistical Approach (PCGA) for characterizing the heterogeneous hydraulic conductivity (K) distribution in a 3-D sand box. The PCGA is a Jacobian-free geostatistical inversion approach that uses the leading principal components of the prior information to reduce computational costs, sometimes dramatically, and can be easily linked with any simulation software. Sequential images of transient tracer concentrations in the sand box were obtained using the magnetic resonance imaging (MRI) technique, resulting in 6 million tracer-concentration data points [Yoon et al., 2008]. Since each individual tracer observation has little information on the K distribution, the dimension of the data was reduced using temporal moments and the discrete cosine transform (DCT). Consequently, 100,000 unknown K values consistent with the scale of the MRI data (at a scale of 0.25^3 cm^3) were estimated by matching temporal moments and DCT coefficients of the original tracer data. Estimated K fields are close to the true K field, and even small
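The dimension-reduction step, replacing each voxel's tracer time series by its temporal moments and a few leading DCT coefficients, can be sketched as follows (a synthetic breakthrough curve, not the MRI data; the DCT-II is written out explicitly to stay dependency-free):

```python
import numpy as np

# Synthetic breakthrough curve: tracer concentration vs. time at one voxel.
t = np.linspace(0.0, 10.0, 200)
dt = t[1] - t[0]
c = np.exp(-0.5 * (t - 4.0) ** 2)  # hypothetical plume passage

# Temporal moments compress the series to a few physically meaningful numbers.
m0 = np.sum(c) * dt                        # zeroth moment (total mass)
m1 = np.sum(t * c) * dt / m0               # mean arrival time
m2 = np.sum((t - m1) ** 2 * c) * dt / m0   # temporal spread

# Leading DCT-II coefficients provide a further low-dimensional summary.
N = len(c)
k = np.arange(N)
dct8 = np.array([np.sum(c * np.cos(np.pi * (2 * k + 1) * j / (2 * N)))
                 for j in range(8)])
```

Matching `m0, m1, m2` and `dct8` instead of all 200 samples is the kind of reduction that shrinks 6 million observations to a tractable data vector for the inversion.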
Image segmentation using fuzzy LVQ clustering networks
NASA Technical Reports Server (NTRS)
Tsao, Eric Chen-Kuo; Bezdek, James C.; Pal, Nikhil R.
1992-01-01
In this note we formulate image segmentation as a clustering problem. Feature vectors extracted from a raw image are clustered into subregions, thereby segmenting the image. A fuzzy generalization of a Kohonen learning vector quantization (LVQ) which integrates the Fuzzy c-Means (FCM) model with the learning rate and updating strategies of the LVQ is used for this task. This network, which segments images in an unsupervised manner, is thus related to the FCM optimization problem. Numerical examples on photographic and magnetic resonance images are given to illustrate this approach to image segmentation.
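The FCM half of this scheme can be sketched in a few lines. Below is a minimal, illustrative 1-D fuzzy c-means (not the paper's fuzzy LVQ network); the fuzzifier m, the deterministic min/max seeding, and the toy data are all assumptions made for the sketch.

```python
def fuzzy_c_means(xs, c=2, m=2.0, iters=50):
    """Minimal 1-D fuzzy c-means: alternate membership and centroid updates."""
    vs = [min(xs), max(xs)]  # crude deterministic initial centroids (matches c=2)
    for _ in range(iters):
        # membership update: u[k][i] proportional to 1 / d(x_k, v_i)^(2/(m-1))
        u = []
        for x in xs:
            d = [abs(x - v) + 1e-12 for v in vs]
            u.append([1.0 / sum((d[i] / d[j]) ** (2.0 / (m - 1.0)) for j in range(c))
                      for i in range(c)])
        # centroid update: u^m-weighted mean of the data
        vs = [sum(u[k][i] ** m * xs[k] for k in range(len(xs))) /
              sum(u[k][i] ** m for k in range(len(xs)))
              for i in range(c)]
    return vs, u

xs = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
centroids, memberships = fuzzy_c_means(xs)
```

Each membership row sums to one, and the two centroids settle near the two data groups; the fuzzy LVQ network in the paper replaces this batch update with LVQ-style learning rates.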
Hierarchical modeling of cluster size in wildlife surveys
Royle, J. Andrew
2008-01-01
Clusters or groups of individuals are the fundamental unit of observation in many wildlife sampling problems, including aerial surveys of waterfowl, marine mammals, and ungulates. Explicit accounting of cluster size in models for estimating abundance is necessary because detection of individuals within clusters is not independent and detectability of clusters is likely to increase with cluster size. This induces a cluster size bias in which the average cluster size in the sample is larger than in the population at large. Thus, failure to account for the relationship between detectability and cluster size will tend to yield a positive bias in estimates of abundance or density. I describe a hierarchical modeling framework for accounting for cluster-size bias in animal sampling. The hierarchical model consists of models for the observation process conditional on the cluster size distribution and the cluster size distribution conditional on the total number of clusters. Optionally, a spatial model can be specified that describes variation in the total number of clusters per sample unit. Parameter estimation, model selection, and criticism may be carried out using conventional likelihood-based methods. An extension of the model is described for the situation where measurable covariates at the level of the sample unit are available. Several candidate models within the proposed class are evaluated for aerial survey data on mallard ducks (Anas platyrhynchos).
Unconventional methods for clustering
NASA Astrophysics Data System (ADS)
Kotyrba, Martin
2016-06-01
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is the main task of exploratory data mining and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. The topic of this paper is one of the modern methods of clustering, namely the SOM (Self-Organising Map). The paper presents the theory needed to understand the principle of clustering, together with descriptions of the algorithms used in our clustering experiments.
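The SOM principle can be illustrated with a minimal 1-D map: draw a sample, find the best-matching unit, and pull it and its grid neighbours toward the sample. This is a generic sketch, not the variant used in the paper; the unit count, hard neighbourhood, learning-rate schedule, and toy data are assumptions.

```python
import random

def train_som(data, n_units=4, iters=300, lr0=0.5, radius=1):
    """Minimal 1-D SOM: pull the best-matching unit (BMU) and its
    grid neighbours toward each randomly drawn sample."""
    rng = random.Random(0)
    w = [rng.uniform(min(data), max(data)) for _ in range(n_units)]
    for t in range(iters):
        lr = lr0 * (1.0 - t / iters)                 # decaying learning rate
        x = rng.choice(data)
        bmu = min(range(n_units), key=lambda i: abs(x - w[i]))
        for i in range(n_units):
            if abs(i - bmu) <= radius:               # hard neighbourhood on the grid
                w[i] += lr * (x - w[i])
    return w

weights = train_som([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
```

Because every update is a convex step toward a data point, the weights stay within the data range while the map organises itself; real SOMs shrink the neighbourhood radius over time as well.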
Alleviating Comprehension Problems in Movies. Working Paper.
ERIC Educational Resources Information Center
Tatsuki, Donna
This paper describes the various barriers to comprehension that learners may encounter when viewing feature films in a second language. Two clusters of interfacing factors that may contribute to comprehension hot spots emerged from a quantitative analysis of problems noted in student logbooks. One cluster had a strong acoustic basis, whereas the…
Full text clustering and relationship network analysis of biomedical publications.
Guan, Renchu; Yang, Chen; Marchese, Maurizio; Liang, Yanchun; Shi, Xiaohu
2014-01-01
Rapid developments in the biomedical sciences have increased the demand for automatic clustering of biomedical publications. In contrast to current approaches to text clustering, which focus exclusively on the contents of abstracts, a novel method is proposed for clustering and analysis of complete biomedical article texts. To reduce dimensionality, Cosine Coefficient is used on a sub-space of only two vectors, instead of computing the Euclidean distance within the space of all vectors. Then a strategy and algorithm is introduced for Semi-supervised Affinity Propagation (SSAP) to improve analysis efficiency, using biomedical journal names as an evaluation background. Experimental results show that by avoiding high-dimensional sparse matrix computations, SSAP outperforms conventional k-means methods and improves upon the standard Affinity Propagation algorithm. In constructing a directed relationship network and distribution matrix for the clustering results, it can be noted that overlaps in scope and interests among BioMed publications can be easily identified, providing a valuable analytical tool for editors, authors and readers.
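The cosine coefficient the authors favour over Euclidean distance reduces to a normalized dot product over term-frequency vectors; a sketch on raw word counts (the toy documents are assumptions):

```python
import math
from collections import Counter

def cosine_coefficient(doc_a, doc_b):
    """Cosine similarity between two documents as term-frequency vectors."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(cnt * b[t] for t, cnt in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

# identical texts score 1.0; texts with no shared terms score 0.0
s = cosine_coefficient("gene expression profiling", "gene expression analysis")
```

Affinity propagation then operates on such pairwise similarities directly, which is what lets the method avoid the high-dimensional sparse matrix computations of k-means.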
Robust Multi-Network Clustering via Joint Cross-Domain Cluster Alignment
Liu, Rui; Cheng, Wei; Tong, Hanghang; Wang, Wei; Zhang, Xiang
2016-01-01
Network clustering is an important problem that has recently drawn a lot of attention. Most existing work focuses on clustering nodes within a single network. In many applications, however, there exist multiple related networks, in which each network may be constructed from a different domain and instances in one domain may be related to instances in other domains. In this paper, we propose a robust algorithm, MCA, for multi-network clustering that takes into account cross-domain relationships between instances. MCA has several advantages over existing single-network clustering methods. First, it is able to detect associations between clusters from different domains, something not addressed by any existing method. Second, it achieves more consistent clustering results on multiple networks by leveraging the duality between clustering individual networks and inferring cross-network cluster alignment. Finally, it provides a multi-network clustering solution that is more robust to noise and errors. We perform extensive experiments on a variety of real and synthetic networks to demonstrate the effectiveness and efficiency of MCA. PMID:27239167
NASA Astrophysics Data System (ADS)
Khare, Y. P.; Martinez, C. J.; Munoz-Carpena, R.
2015-12-01
Improved knowledge about fundamental physical processes, advances in computing power, and a focus on integrated modeling have resulted in complex environmental and water resources models. However, the high dimensionality of these models adds to overall uncertainty and poses issues when evaluating them for sensitivity, parameter identification, and optimization through rigorous computer experiments. The parameter screening method of elementary effects (EE) offers a blend of useful properties inherited from inexpensive one-at-a-time methods and expensive global techniques. Since its development, EE has undergone improvements largely on the sampling side, with over seven sampling strategies developed during the last decade. These strategies can broadly be classified into trajectory-based and polytope-based schemes. Trajectory-based strategies are more widely used, conceptually simple, and generally follow the principle of spreading the sample points in the input hyper-space as widely as possible through oversampling. Because of this, their implementation has been found to be impractically time-consuming for high-dimensional cases (when the number of input factors exceeds, say, 50). Here, we enhanced Sampling for Uniformity (SU) (Khare et al., 2015), a trajectory-based EE sampling scheme founded on the dual principles of spread and uniformity. This new scheme, enhanced SU (eSU), is the same as SU except for the manner in which intermediate trajectory points are formed. It was tested for sample uniformity, spread, sampling time, and screening efficiency. Experiments were repeated with combinations of the number of trajectories and oversampling size. Preliminary results indicate that eSU is superior to SU by some margin with respect to all four criteria. Interestingly, in the case of eSU the oversampling size had no impact on any of the evaluation criteria except for a linear increase in sampling time. Pending further investigation, this has opened a new avenue to substantially bring down the
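The trajectory-based designs discussed here share a common skeleton: start at a point on a regular grid and perturb one factor at a time by a fixed jump delta. A generic sketch of one such trajectory (this is the classic elementary-effects construction, not eSU itself; the level count and seed are assumptions):

```python
import random

def morris_trajectory(k, levels=4, seed=0):
    """Build one elementary-effects trajectory: a grid base point followed
    by k one-at-a-time moves of size delta, in randomized factor order."""
    rng = random.Random(seed)
    delta = levels / (2.0 * (levels - 1))            # standard EE jump size
    # draw the base point from the lower half of the grid so x + delta <= 1
    x = [rng.randrange(levels // 2) / (levels - 1) for _ in range(k)]
    traj = [x[:]]
    for i in rng.sample(range(k), k):                # randomized move order
        x = x[:]
        x[i] += delta                                # perturb exactly one factor
        traj.append(x)
    return traj

traj = morris_trajectory(5)  # 6 points in the 5-D unit hypercube
```

Sampling schemes such as SU and eSU differ precisely in how many candidate trajectories like this are generated (oversampling) and which subset is kept to maximize spread and uniformity.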
Clusters of polyhedra in spherical confinement.
Teich, Erin G; van Anders, Greg; Klotsa, Daphne; Dshemuchadse, Julia; Glotzer, Sharon C
2016-02-01
Dense particle packing in a confining volume remains a rich, largely unexplored problem, despite applications in blood clotting, plasmonics, industrial packaging and transport, colloidal molecule design, and information storage. Here, we report densest found clusters of the Platonic solids in spherical confinement, for up to N=60 constituent polyhedral particles. We examine the interplay between anisotropic particle shape and isotropic 3D confinement. Densest clusters exhibit a wide variety of symmetry point groups and form in up to three layers at higher N. For many N values, icosahedra and dodecahedra form clusters that resemble sphere clusters. These common structures are layers of optimal spherical codes in most cases, a surprising fact given the significant faceting of the icosahedron and dodecahedron. We also investigate cluster density as a function of N for each particle shape. We find that, in contrast to what happens in bulk, polyhedra often pack less densely than spheres. We also find especially dense clusters at so-called magic numbers of constituent particles. Our results showcase the structural diversity and experimental utility of families of solutions to the packing in confinement problem. PMID:26811458
ERIC Educational Resources Information Center
Hale, Norman; Lindelow, John
Chapter 12 in a volume on school leadership, this chapter cites the work of several authorities concerning problem-solving or decision-making techniques based on the belief that group problem-solving effort is preferable to individual effort. The first technique, force-field analysis, is described as a means of dissecting complex problems into…
Formation and Assembly of Massive Star Clusters
NASA Astrophysics Data System (ADS)
McMillan, Stephen
The formation of stars and star clusters is a major unresolved problem in astrophysics. It is central to modeling stellar populations and understanding galaxy luminosity distributions in cosmological models. Young massive clusters are major components of starburst galaxies, while globular clusters are cornerstones of the cosmic distance scale and represent vital laboratories for studies of stellar dynamics and stellar evolution. Yet how these clusters form and how rapidly and efficiently they expel their natal gas remain unclear, as do the consequences of this gas expulsion for cluster structure and survival. Also unclear is how the properties of low-mass clusters, which form from small-scale instabilities in galactic disks and inform much of our understanding of cluster formation and star-formation efficiency, differ from those of more massive clusters, which probably formed in starburst events driven by fast accretion at high redshift, or colliding gas flows in merging galaxies. Modeling cluster formation requires simulating many simultaneous physical processes, placing stringent demands on both software and hardware. Simulations of galaxies evolving in cosmological contexts usually lack the numerical resolution to simulate star formation in detail. They do not include detailed treatments of important physical effects such as magnetic fields, radiation pressure, ionization, and supernova feedback. Simulations of smaller clusters include these effects, but fall far short of the mass of even single young globular clusters. With major advances in computing power and software, we can now directly address this problem. We propose to model the formation of massive star clusters by integrating the FLASH adaptive mesh refinement magnetohydrodynamics (MHD) code into the Astrophysical Multi-purpose Software Environment (AMUSE) framework, to work with existing stellar-dynamical and stellar evolution modules in AMUSE. All software will be freely distributed on-line, allowing
Growth of Pt Clusters from Mixture Film of Pt-C and Dynamics of Pt Clusters
NASA Astrophysics Data System (ADS)
Shintaku, Masayuki; Kumamoto, Akihito; Suzuki, Hitoshi; Kaito, Chihiro
2007-06-01
A complete mixture film of carbon and platinum produced by coevaporation in a vacuum was directly heated in a transmission electron microscope. It was found that the diffusion and crystal growth of Pt clusters in the mixture film take place at approximately 500 °C. Pt clusters with a size of 2-5 nm were connected with each other in a parallel orientation or twin-crystal configuration in the mixture film. The growth of onion-like carbon with a hole at the center also occurred. The grown Pt clusters with twin-crystal structures appeared on and in the carbon film. The diffusion of Pt atoms in carbon was discussed in relation to the problem of Pt elution in fuel cells. Direct observation of the movement of Pt clusters on and in the carbon film was carried out. The difference in movement between Pt clusters in and on the carbon film was directly demonstrated.
Convalescing Cluster Configuration Using a Superlative Framework
Sabitha, R.; Karthik, S.
2015-01-01
Competent data mining methods are vital to discover knowledge from the databases built as a result of the enormous growth of data. Various data mining techniques are applied to obtain knowledge from these databases. Data clustering is one such descriptive data mining technique that guides the partitioning of data objects into disjoint segments. The K-means algorithm is a versatile algorithm among the various approaches used in data clustering. The algorithm and its diverse adaptations suffer from certain performance problems. To overcome these issues, a superlative algorithm is proposed in this paper to perform data clustering. The specific features of the proposed algorithm are discretizing the dataset, thereby improving the accuracy of clustering, and adopting the binary search initialization method to generate cluster centroids. The generated centroids are fed as input to the K-means approach, which iteratively segments the data objects into their respective clusters. The clustered results are measured for accuracy and validity. Experiments conducted by testing the approach on datasets from the UC Irvine Machine Learning Repository show that its accuracy and validity measures are higher than those of the other two approaches, namely simple K-means and the Binary Search method. Thus, the proposed approach demonstrates that the discretization process improves the efficacy of descriptive data mining tasks. PMID:26543895
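The K-means loop such a pipeline feeds into is Lloyd's iteration. The sketch below seeds centroids deterministically from sorted-data quantiles as a stand-in for the paper's binary search initialization (the seeding rule, the 1-D restriction, and the toy data are assumptions of this sketch).

```python
def kmeans_1d(xs, k=2, iters=20):
    """Lloyd's iteration on 1-D data with deterministic quantile seeding."""
    s = sorted(xs)
    cents = [s[(2 * i + 1) * len(s) // (2 * k)] for i in range(k)]  # quantile seeds
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:  # assignment step: each point joins its nearest centroid
            groups[min(range(k), key=lambda i: abs(x - cents[i]))].append(x)
        # update step: each centroid moves to its group mean
        cents = [sum(g) / len(g) if g else cents[i] for i, g in enumerate(groups)]
    return sorted(cents)

cents = kmeans_1d([1.0, 1.2, 0.8, 9.0, 9.2, 8.8])
```

Deterministic seeding of this kind removes the run-to-run variability of random initialization, which is the practical motivation behind initialization schemes like the one proposed in the paper.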
Some Basic Elements in Clustering and Classification
NASA Astrophysics Data System (ADS)
Grégoire, G.
2016-05-01
This chapter deals with basic tools useful in clustering and classification and presents some commonly used approaches for these two problems. Since several chapters in these proceedings are devoted to classification approaches, we give more attention here to clustering issues. We are first concerned with notions of distances or dissimilarities between the objects to be grouped into clusters. Then, based on these inter-object distances, we define distances between sets of objects, such as single linkage, complete linkage, or the Ward distance. Three clustering algorithms are presented in some detail and compared: the K-means, Ascendant Hierarchical, and DBSCAN algorithms. The comparison between partitions and the issue of choosing the correct number of clusters are investigated, and the proposed procedures are tested on two data sets. We emphasize that the results provided by the numerous indices available in the literature for selecting the number of clusters depend largely upon the shape and dispersion assumed for these clusters. Finally, the last section is devoted to classification. Some basic notions such as training sets, test sets, and cross-validation are discussed. Two particular approaches are detailed, the K-nearest neighbors method and logistic regression, and comparisons with LDA (Linear Discriminant Analysis) and QDA (Quadratic Discriminant Analysis) are analyzed.
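The set-to-set distances mentioned above reduce to a min or max over pairwise distances between cluster members; a 1-D sketch with assumed toy clusters:

```python
def single_linkage(A, B):
    """Single linkage: distance between the closest pair across clusters."""
    return min(abs(a - b) for a in A for b in B)

def complete_linkage(A, B):
    """Complete linkage: distance between the farthest pair across clusters."""
    return max(abs(a - b) for a in A for b in B)

A, B = [0.0, 1.0], [3.0, 5.0]
# single_linkage(A, B) == 2.0, complete_linkage(A, B) == 5.0
```

Agglomerative (Ascendant Hierarchical) clustering repeatedly merges the pair of clusters with the smallest such distance, so the choice of linkage directly shapes the dendrogram and hence the partitions compared in the chapter.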
Topic modeling for cluster analysis of large biological and medical datasets
2014-01-01
Background The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracy and effectiveness of traditional clustering methods diminish for large and hyper-dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. Results In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: a Salmonella pulsed-field gel electrophoresis (PFGE) dataset, a lung cancer dataset, and a breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the efficacy/effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Conclusion Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than
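The simplest of the three methods, highest probable topic assignment, amounts to an argmax over each document's topic distribution. A sketch of that assignment step; the document-topic matrix below is a hypothetical stand-in for the output of a fitted topic model (e.g. LDA), not real data from the study.

```python
def cluster_by_top_topic(doc_topic):
    """Assign each document to the cluster of its most probable topic."""
    clusters = {}
    for d, probs in enumerate(doc_topic):
        top = max(range(len(probs)), key=probs.__getitem__)
        clusters.setdefault(top, []).append(d)
    return clusters

# rows: documents; columns: topic probabilities from some fitted topic model
doc_topic = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9]]
clusters = cluster_by_top_topic(doc_topic)  # {0: [0, 1], 1: [2, 3]}
```

The feature-selection and feature-extraction variants instead feed the latent topic variables into a conventional clustering algorithm, which is where the dimensionality reduction pays off.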
NASA Astrophysics Data System (ADS)
Vikhlinin, A. A.; Kravtsov, A. V.; Markevich, M. L.; Sunyaev, R. A.; Churazov, E. M.
2014-04-01
Galaxy clusters are formed via nonlinear growth of primordial density fluctuations and are the most massive gravitationally bound objects in the present Universe. Their number density at different epochs and their properties depend strongly on the properties of dark matter and dark energy, making clusters a powerful tool for observational cosmology. Observations of the hot gas filling the gravitational potential well of a cluster allows studying gasdynamic and plasma effects and the effect of supermassive black holes on the heating and cooling of gas on cluster scales. The work of Yakov Borisovich Zeldovich has had a profound impact on virtually all cosmological and astrophysical studies of galaxy clusters, introducing concepts such as the Harrison-Zeldovich spectrum, the Zeldovich approximation, baryon acoustic peaks, and the Sunyaev-Zeldovich effect. Here, we review the most basic properties of clusters and their role in modern astrophysics and cosmology.
Cluster identification based on correlations
NASA Astrophysics Data System (ADS)
Schulman, L. S.
2012-04-01
The problem addressed is the identification of cooperating agents based on correlations created as a result of the joint action of these and other agents. A systematic method for using correlations beyond second moments is developed. The technique is applied to a didactic example, the identification of alphabet letters based on correlations among the pixels used in an image of the letter. As in this example, agents can belong to more than one cluster. Moreover, the identification scheme does not require that the patterns be known ahead of time.
Zhang, Zhaoyang; Fang, Hua; Wang, Honggang
2016-06-01
Web-delivered trials are an important component of eHealth services. These trials, mostly behavior-based, generate big heterogeneous data that are longitudinal and high dimensional, with missing values. Unsupervised learning methods have been widely applied in this area; however, validating the optimal number of clusters has been challenging. Built upon our multiple imputation (MI) based fuzzy clustering, MIfuzzy, we propose a new multiple imputation based validation (MIV) framework and corresponding MIV algorithms for clustering big longitudinal eHealth data with missing values, and more generally for fuzzy-logic based clustering methods. Specifically, we detect the optimal number of clusters by auto-searching and -synthesizing a suite of MI-based validation methods and indices, including conventional (bootstrap or cross-validation based) and emerging (modularity-based) validation indices for general clustering methods, as well as the specific one (Xie and Beni) for fuzzy clustering. The MIV performance was demonstrated on a big longitudinal dataset from a real web-delivered trial and using simulation. The results indicate that the MI-based Xie and Beni index for fuzzy clustering is more appropriate for detecting the optimal number of clusters for such complex data. The MIV concept and algorithms could be easily adapted to different types of clustering that could process big incomplete longitudinal trial data in eHealth services.
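The Xie-Beni index used here as the fuzzy-clustering criterion is the ratio of fuzzy within-cluster scatter to centroid separation (lower is better). A 1-D sketch; the crisp memberships and toy data are illustrative assumptions, not the MI-based version used in the paper.

```python
def xie_beni(xs, centroids, u, m=2.0):
    """Xie-Beni index: fuzzy within-cluster scatter divided by
    n times the minimum squared centroid separation."""
    n = len(xs)
    compact = sum(u[k][i] ** m * (xs[k] - v) ** 2
                  for i, v in enumerate(centroids) for k in range(n))
    sep = min((a - b) ** 2
              for i, a in enumerate(centroids) for b in centroids[i + 1:])
    return compact / (n * sep)

xs = [0.0, 0.1, 5.0, 5.1]
u = [[1, 0], [1, 0], [0, 1], [0, 1]]   # crisp memberships for illustration
good = xie_beni(xs, [0.05, 5.05], u)   # centroids matching the data groups
bad = xie_beni(xs, [2.0, 3.0], u)      # misplaced centroids
```

Scanning this index over candidate cluster counts, and synthesizing it across multiply imputed datasets, is the essence of the MIV search for the optimal number of clusters.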
Open cluster evolutions in binary system: How they dissolved
NASA Astrophysics Data System (ADS)
Priyatikanto, R.; Arifyanto, M. I.; Wulandari, H. R. T.
2014-03-01
Binarity among stellar clusters in galaxies has long been recognized, but it still poses several unsolved questions and problems. Some binary star clusters are formed by close encounters, while others are born together from the same womb. Some of them undergo a separation process, while others are in the middle of a merger toward a common future. The products of binary-cluster mergers have characteristics that differ from those of solo clusters, especially in their spatial distribution and the kinematics of their member stars. On the other hand, these merger products still have to face dissolution processes triggered by both internal and external factors. In this study, we performed N-body simulations of merging binary clusters with different initial conditions. After merging, these clusters dissolve with a greater mass-loss rate because of their angular momentum. These rotating clusters also experience more deceleration caused by the external tidal field.
NASA Technical Reports Server (NTRS)
Chinellato, J. A.; Dobrigkeit, C.; Bellandifilho, J.; Lattes, C. M. G.; Menon, M. J.; Navia, C. E.; Pamilaju, A.; Sawayanagi, K.; Shibuya, E. H.; Turtelli, A., Jr.
1985-01-01
Experimental results on mini-clusters observed in Chacaltaya emulsion chamber no. 19 are summarized. The study was made on 54 single-core showers and 91 shower clusters of E(gamma) 10 TeV from 30 families whose visible energy is greater than 80 TeV and which penetrate through both the upper and lower detectors of the two-story chamber. The association of hadrons in mini-clusters is made clear from their penetrative nature and from microscopic observation of shower continuation in the lower chamber. The small P_t(gamma) of hadrons in mini-clusters remains a puzzle.
Reactions of intermetallic clusters
NASA Astrophysics Data System (ADS)
Farley, R. W.; Castleman, A. W., Jr.
1990-02-01
Reaction of bismuth-alkali clusters with closed-shell HX acids provides insight into the structures, formation, and stabilities of these intermetallic species. HCl and HI are observed to quantitatively strip BixNay and BixKy, respectively, of their alkali component, leaving bare bismuth clusters as the only bismuth-containing species detected. Product bismuth clusters exhibit the same distribution observed when pure bismuth is evaporated in the source. Although the two metals are evaporated simultaneously from the same crucible, this suggests that alkali atoms condense onto existing bismuth clusters and have negligible effect on their formation and consequent distribution. The indistinguishability of reacted and pure bismuth cluster distributions further argues against the simple replacement of alkali atoms with hydrogen in these reactions. This is considered further evidence that the alkali atoms are external to the stable bismuth Zintl anionic structures. Reactivities of BixNay clusters with HCl are estimated to range from 3×10^-13 for Bi4Na to greater than 4×10^-11 for clusters possessing large numbers of alkali atoms. Bare bismuth clusters are observed in separate experiments to react significantly more slowly, with rates of 1-9×10^-14, and exhibit little variation of reactivity with size. The bismuth clusters may thus be considered a relatively inert substrate upon which the alkali overlayer reacts.
The youngest globular clusters
NASA Astrophysics Data System (ADS)
Beck, Sara
2015-11-01
It is likely that all stars are born in clusters, but most clusters are not bound and disperse. None of the many protoclusters in our Galaxy are likely to develop into long-lived bound clusters. The super star clusters (SSCs) seen in starburst galaxies are more massive and compact and have better chances of survival. The birth and early development of SSCs takes place deep in molecular clouds, and during this crucial stage the embedded clusters are invisible to optical or UV observations but are studied via the radio-infrared supernebulae (RISN) they excite. We review observations of embedded clusters and identify RISN within 10 Mpc whose exciting clusters have ≈ 10^6 M⊙ or more in volumes of a few pc^3 and which are likely to not only survive as bound clusters, but to evolve into objects as massive and compact as Galactic globulars. These clusters are distinguished by very high star formation efficiency η, at least a factor of 10 higher than the few percent seen in the Galaxy, probably due to the violent disturbances their host galaxies have undergone. We review recent observations of the kinematics of the ionized gas in RISN showing outflows through low-density channels in the ambient molecular cloud; this may protect the cloud from feedback by the embedded H II region.
Automatic Clustering Using Multi-objective Particle Swarm and Simulated Annealing
Abubaker, Ahmad; Baharum, Adam; Alrefaei, Mahmoud
2015-01-01
This paper puts forward a new automatic clustering algorithm based on multi-objective particle swarm optimization and simulated annealing, “MOPSOSA”. The proposed algorithm automatically partitions a dataset into a suitable number of clusters. MOPSOSA combines features of multi-objective particle swarm optimization (PSO) and multi-objective simulated annealing (MOSA). Three cluster validity indices are optimized simultaneously to establish the suitable number of clusters and the appropriate clustering for a dataset: the first is centred on Euclidean distance, the second on the point symmetry distance, and the third on short distance. A number of algorithms were compared with MOPSOSA in resolving clustering problems by determining the actual number of clusters and the optimal clustering. Computational experiments were carried out on fourteen artificial and five real-life datasets. PMID:26132309
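As an illustration of the kind of Euclidean-distance validity index such multi-objective clustering methods optimise, the sketch below computes a simple compactness-to-separation ratio. The specific indices used by MOPSOSA are not reproduced here; this index, its name, and the toy data are assumptions for illustration only.

```python
# Illustrative cluster validity sketch (NOT the MOPSOSA indices):
# mean within-cluster distance divided by the minimum distance
# between cluster centroids. Lower values indicate compact,
# well-separated clusters.
from math import dist

def centroid(pts):
    n = len(pts)
    return tuple(sum(c) / n for c in zip(*pts))

def euclidean_validity(points, labels):
    """Compactness-to-separation ratio (smaller is better)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    cents = {l: centroid(pts) for l, pts in clusters.items()}
    # average distance of each point to its own cluster centroid
    within = sum(dist(p, cents[l]) for p, l in zip(points, labels)) / len(points)
    # smallest separation between any two centroids
    cs = list(cents.values())
    between = min(dist(a, b) for i, a in enumerate(cs) for b in cs[i + 1:])
    return within / between

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
good = euclidean_validity(pts, [0, 0, 0, 1, 1, 1])  # natural grouping
bad = euclidean_validity(pts, [0, 1, 0, 1, 0, 1])   # scrambled grouping
```

A multi-objective optimiser would minimise several such indices at once, trading them off over candidate partitions and candidate values of k.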
Clustering versus non-clustering phase synchronizations
Liu, Shuai; Zhan, Meng
2014-03-15
Clustering phase synchronization (CPS) is a common route to the global phase synchronization of coupled dynamical systems. In this work, a novel scenario, non-clustering phase synchronization (NPS), is reported. It is found that the coupled systems do not transit to global synchronization until a sufficiently large coupling is attained, and that there is no clustering prior to the global synchronization. To reveal the relationship between CPS and NPS, we further analyze the effect of noise on coupled phase oscillators and find that the coupled oscillator system can change from CPS to NPS as the noise intensity or system disorder increases. These findings are expected to shed light on the mechanism of various intriguing self-organized behaviors in coupled systems.
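The transition to global phase synchronization discussed above can be illustrated with a minimal mean-field Kuramoto sketch. This is the standard textbook model, not the authors' specific system, and all parameter values are illustrative.

```python
# Mean-field Kuramoto sketch: the global coherence order parameter
# r = |<exp(i*theta)>| stays small below the critical coupling and
# approaches 1 once the coupling K is large enough to lock the
# oscillators' phases.
import cmath, math, random

def order_parameter(thetas):
    return abs(sum(cmath.exp(1j * t) for t in thetas) / len(thetas))

def simulate(K, n=50, steps=2000, dt=0.05, seed=1):
    rng = random.Random(seed)
    omega = [rng.gauss(0.0, 0.2) for _ in range(n)]      # natural frequencies
    theta = [rng.uniform(0, 2 * math.pi) for _ in range(n)]
    for _ in range(steps):
        mean = sum(cmath.exp(1j * t) for t in theta) / n
        r, psi = abs(mean), cmath.phase(mean)
        # d(theta_i)/dt = omega_i + K * r * sin(psi - theta_i), Euler step
        theta = [t + dt * (w + K * r * math.sin(psi - t))
                 for t, w in zip(theta, omega)]
    return order_parameter(theta)

r_weak, r_strong = simulate(K=0.05), simulate(K=2.0)
```

Adding a noise term to the phase update, or widening the frequency distribution, is the kind of perturbation the abstract describes as moving the system between the CPS and NPS scenarios.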
Comparative analysis of fuzzy ART and ART-2A network clustering performance.
Frank, T; Kraiss, K F; Kuhlen, T
1998-01-01
Adaptive resonance theory (ART) describes a family of self-organizing neural networks capable of clustering arbitrary sequences of input patterns into stable recognition codes. Many different types of ART networks have been developed to improve clustering capabilities. In this paper we compare the clustering performance of different types of ART networks: Fuzzy ART, ART 2A with and without complement-encoded input patterns, and a Euclidean ART 2A variation. All types are tested with two- and high-dimensional input patterns in order to illustrate general capabilities and characteristics in different system environments. Based on our simulation results, Fuzzy ART appears less appropriate whenever input signals are corrupted by additional noise, while ART 2A-type networks remain stable in all inspected environments. Together with the other features examined, these results allow ART architectures suited to particular applications to be selected. PMID:18252478
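A compact Fuzzy ART sketch follows, assuming the standard category-choice, vigilance, and fast-learning (beta = 1) equations with complement-coded inputs. This is not the authors' benchmark code; parameter values are illustrative.

```python
# Minimal Fuzzy ART: complement coding, category choice
# T_j = |i ^ w_j| / (alpha + |w_j|), vigilance test |i ^ w_j| / |i| >= rho,
# and fast learning w_j <- i ^ w_j (^ is the fuzzy AND, element-wise min).
def complement_code(x):
    return x + [1.0 - v for v in x]

class FuzzyART:
    def __init__(self, alpha=0.001, rho=0.75):
        self.alpha, self.rho = alpha, rho
        self.w = []                                  # category weight vectors

    def _choice(self, i, j):
        m = sum(min(a, b) for a, b in zip(i, self.w[j]))
        return m / (self.alpha + sum(self.w[j]))

    def train(self, x):
        i = complement_code(list(x))
        # search categories in decreasing order of the choice function
        for j in sorted(range(len(self.w)), key=lambda j: -self._choice(i, j)):
            m = [min(a, b) for a, b in zip(i, self.w[j])]
            if sum(m) / sum(i) >= self.rho:          # vigilance test passed
                self.w[j] = m                        # fast learning
                return j
        self.w.append(i)                             # all reset: new category
        return len(self.w) - 1

net = FuzzyART(rho=0.8)
a = net.train([0.1, 0.1])    # creates category 0
b = net.train([0.12, 0.1])   # close pattern: reuses category 0
c = net.train([0.9, 0.9])    # distant pattern: creates category 1
```

The vigilance parameter rho directly controls cluster granularity, which is one reason the paper's comparison across ART variants and noise conditions matters in practice.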
Contributions to "k"-Means Clustering and Regression via Classification Algorithms
ERIC Educational Resources Information Center
Salman, Raied
2012-01-01
The dissertation deals with clustering algorithms and transforming regression problems into classification problems. The main contributions of the dissertation are twofold; first, to improve (speed up) the clustering algorithms and second, to develop a strict learning environment for solving regression problems as classification tasks by using…
Associations between Early Childhood Temperament Clusters and Later Psychosocial Adjustment
ERIC Educational Resources Information Center
Sanson, Ann; Letcher, Primrose; Smart, Diana; Prior, Margot; Toumbourou, John W.; Oberklaid, Frank
2009-01-01
The study adopted a person-centered approach to examine whether clusters of children could be identified on the basis of temperament profiles assessed on four occasions from infancy to early childhood, and if so whether differing temperament clusters were associated with subsequent differences in behavior problems, social skills, and school…
A nonparametric clustering technique which estimates the number of clusters
NASA Technical Reports Server (NTRS)
Ramey, D. B.
1983-01-01
In applications of cluster analysis, one usually needs to determine the number of clusters, K, and the assignment of observations to each cluster. A clustering technique based on recursive application of a multivariate test of bimodality which automatically estimates both K and the cluster assignments is presented.
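A toy one-dimensional analogue of such a recursive splitting scheme is sketched below. Ramey's multivariate bimodality test is not reproduced; the gap-based splitting criterion, its threshold, and the sample data are assumptions for illustration only.

```python
# Toy recursive cluster-count estimator: split a sorted 1-D sample at
# its largest gap whenever that gap is much wider than the typical
# (median) spacing, and count the resulting leaves as the estimate of K.
def estimate_clusters(xs, gap_factor=5.0):
    xs = sorted(xs)
    if len(xs) < 3:
        return [xs]
    gaps = [b - a for a, b in zip(xs, xs[1:])]
    widest = max(range(len(gaps)), key=gaps.__getitem__)
    typical = sorted(gaps)[len(gaps) // 2]           # median spacing
    if typical == 0 or gaps[widest] < gap_factor * typical:
        return [xs]                                  # looks unimodal: stop
    left, right = xs[:widest + 1], xs[widest + 1:]
    return estimate_clusters(left, gap_factor) + estimate_clusters(right, gap_factor)

data = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 9.0, 9.1, 9.3]
clusters = estimate_clusters(data)                   # both K and assignments
```

As in the technique the abstract describes, the recursion yields the number of clusters K and the cluster assignments simultaneously, with no K supplied in advance.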
High-dimensional entanglement certification
Huang, Zixin; Maccone, Lorenzo; Karim, Akib; Macchiavello, Chiara; Chapman, Robert J.; Peruzzo, Alberto
2016-01-01
Quantum entanglement is the ability of joint quantum systems to possess global properties (correlations among systems) even when the subsystems have no definite individual property. Whilst the 2-dimensional (qubit) case is well understood, tools to characterise entanglement in high dimensions are currently limited. We experimentally demonstrate a new procedure for entanglement certification that is suitable for large systems and based entirely on information-theoretic principles. It scales more efficiently than Bell-inequality and entanglement-witness approaches. The method works for arbitrarily large system dimension d and employs only two local measurements of complementary properties. This procedure can also certify whether the system is maximally entangled. We illustrate the protocol for families of bipartite states of qudits with dimension up to 32, composed of polarisation-entangled photon pairs. PMID:27311935
Bregman Clustering for Separable Instances
NASA Astrophysics Data System (ADS)
Ackermann, Marcel R.; Blömer, Johannes
The Bregman k-median problem is defined as follows. Given a Bregman divergence D_φ and a finite set P ⊆ ℝ^d of size n, our goal is to find a set C of size k such that the sum of errors cost(P, C) = Σ_{p ∈ P} min_{c ∈ C} D_φ(p, c) is minimized. The Bregman k-median problem plays an important role in many applications, e.g., information theory, statistics, text classification, and speech processing. We study a generalization of the k-means++ seeding of Arthur and Vassilvitskii (SODA '07). We prove for an almost arbitrary Bregman divergence that if the input set consists of k well separated clusters, then with probability 2^{-O(k)} this seeding step alone finds an O(1)-approximate solution. Thereby, we generalize an earlier result of Ostrovsky et al. (FOCS '06) from the case of the Euclidean k-means problem to the Bregman k-median problem. Additionally, this result leads to a constant factor approximation algorithm for the Bregman k-median problem using at most 2^{O(k)} n arithmetic operations, including evaluations of the Bregman divergence D_φ.
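The seeding step that the authors generalise can be sketched as D²-style sampling: the first centre is chosen uniformly, and each subsequent centre is chosen with probability proportional to its divergence from the nearest centre picked so far. The sketch below uses the squared Euclidean distance (the Bregman divergence generating k-means) for concreteness; the data and parameters are illustrative.

```python
# k-means++-style seeding for a Bregman divergence: sample each new
# centre with probability proportional to min_{c in C} D_phi(p, c).
import random

def sq_euclid(p, c):
    # the Bregman divergence D_phi for phi(x) = ||x||^2
    return sum((a - b) ** 2 for a, b in zip(p, c))

def bregman_seed(points, k, div=sq_euclid, seed=0):
    rng = random.Random(seed)
    centres = [rng.choice(points)]                  # first centre: uniform
    while len(centres) < k:
        # divergence of each point to its nearest chosen centre
        d = [min(div(p, c) for c in centres) for p in points]
        r = rng.uniform(0, sum(d))
        acc = 0.0
        for p, w in zip(points, d):                 # weighted sampling
            acc += w
            if acc >= r:
                centres.append(p)
                break
    return centres

pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.1, 10.0)]
seeds = bregman_seed(pts, 2)
```

On well separated input like `pts`, the weighting concentrates nearly all probability on the far cluster once one centre is placed, which is the intuition behind the paper's O(1)-approximation guarantee for this seeding alone.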
Cluster geometry and inclinations from deprojection uncertainties
NASA Astrophysics Data System (ADS)
Chakrabarty, D.; de Filippis, E.; Russell, H.
2008-08-01
Context: The determination of cluster masses is a complex problem that would be aided by information about the cluster shape and orientation (with respect to the line of sight). Aims: It is in this context that we have developed a scheme for identifying the intrinsic morphology and inclination of a cluster, by looking for the signature of the true cluster characteristics in the inter-comparison of the different deprojected emissivity profiles (which all project to the same X-ray brightness distribution) and complementing this with Sunyaev-Zel'dovich effect (SZE) data when available. Methods: We deproject the cluster X-ray surface brightness profile under assumptions about geometry and inclination that correspond to four extreme scenarios; the deprojection is performed with the non-parametric algorithm DOPING. The formalism is tested with model clusters and is then applied to a sample of 24 clusters. While the shape determination is possible from the X-ray brightness alone, the estimation of the inclination is usually markedly improved by the use of the SZE data available for the sample considered. Results: We identify 8 of the clusters in our sample as prolate, 1 as oblate, and 15 as triaxial. For the systems identified as triaxial, we are able to discern how the three semi-axis lengths compare with each other. This, combined with information about the line-of-sight extent, allows us to constrain the intrinsic axial ratios and the inclination quite tightly.
Multiple populations in globular clusters. Lessons learned from the Milky Way globular clusters
NASA Astrophysics Data System (ADS)
Gratton, Raffaele G.; Carretta, Eugenio; Bragaglia, Angela
2012-02-01
Recent progress in studies of globular clusters has shown that they are not simple stellar populations, but rather are made up of multiple generations. Evidence stems both from photometry and from spectroscopy. A new paradigm is arising for the formation of massive star clusters, which includes several episodes of star formation. While this provides an explanation for several features of globular clusters, including the second-parameter problem, it also opens new perspectives on the relation between globular clusters and the halo of our Galaxy, and by extension on all populations with a high specific frequency of globular clusters, such as giant elliptical galaxies. We review progress in this area, focussing on the most recent studies. Several points remain to be properly understood, in particular the nature of the polluters producing the abundance pattern in the clusters and the typical timescale, the range of cluster masses where this phenomenon is active, and the relation between globular clusters and other satellites of our Galaxy.
Perualila-Tan, Nolen Joy; Shkedy, Ziv; Talloen, Willem; Göhlmann, Hinrich W H; Moerbeke, Marijke Van; Kasim, Adetayo
2016-08-01
The modern process of discovering candidate molecules in the early drug discovery phase includes a wide range of approaches to extract vital information from the intersection of biology and chemistry. A typical strategy in compound selection involves compound clustering based on chemical similarity to obtain representative, chemically diverse compounds (not incorporating potency information). In this paper, we propose an integrative clustering approach that makes use of both biological (compound efficacy) and chemical (structural features) data sources for the purpose of discovering a subset of compounds with aligned structural and biological properties. The datasets are integrated at the similarity level by assigning complementary weights to produce a weighted similarity matrix, which serves as a generic input to any clustering algorithm. This new analysis workflow is a semi-supervised method: after the clusters are determined, a secondary analysis identifies differentially expressed genes associated with the derived integrated cluster(s) to further explain the compound-induced biological effects inside the cell. Datasets from two drug development oncology projects are used to illustrate the usefulness of the weighted similarity-based clustering approach to integrate multi-source high-dimensional information to aid drug discovery. Compounds that are structurally and biologically similar to the reference compounds are discovered using this proposed integrative approach. PMID:27312313
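The similarity-level integration described above can be sketched as a weighted combination of a chemical and a biological similarity matrix. The threshold-based grouping below is an illustrative stand-in for the clustering step; the weights, threshold, and matrices are assumptions, not the authors' data.

```python
# Weighted similarity integration: combine two similarity matrices with
# complementary weights w and (1 - w), then cluster the combined matrix.
def combine(sim_chem, sim_bio, w):
    n = len(sim_chem)
    return [[w * sim_chem[i][j] + (1 - w) * sim_bio[i][j]
             for j in range(n)] for i in range(n)]

def threshold_clusters(sim, t):
    # connected components of the graph {(i, j): sim[i][j] >= t},
    # via a small union-find structure
    n = len(sim)
    label = list(range(n))
    def find(i):
        while label[i] != i:
            label[i] = label[label[i]]               # path compression
            i = label[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= t:
                label[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# compounds 0 and 1 are similar both chemically and biologically
chem = [[1.0, 0.9, 0.2], [0.9, 1.0, 0.1], [0.2, 0.1, 1.0]]
bio  = [[1.0, 0.8, 0.3], [0.8, 1.0, 0.2], [0.3, 0.2, 1.0]]
clusters = threshold_clusters(combine(chem, bio, w=0.5), t=0.5)
```

Because the integration happens at the similarity level, the combined matrix can be handed to any similarity-based clustering algorithm, which is what makes the weighted matrix a generic input in the authors' workflow.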