Clustering high dimensional data using RIA
Aziz, Nazrina
2015-05-15
Clustering may simply represent a convenient method for organizing a large data set so that it can easily be understood and information can efficiently be retrieved. However, identifying cluster in high dimensionality data sets is a difficult task because of the curse of dimensionality. Another challenge in clustering is some traditional functions cannot capture the pattern dissimilarity among objects. In this article, we used an alternative dissimilarity measurement called Robust Influence Angle (RIA) in the partitioning method. RIA is developed using eigenstructure of the covariance matrix and robust principal component score. We notice that, it can obtain cluster easily and hence avoid the curse of dimensionality. It is also manage to cluster large data sets with mixed numeric and categorical value.
Clustering high dimensional data using RIA
NASA Astrophysics Data System (ADS)
Aziz, Nazrina
2015-05-01
Clustering may simply represent a convenient method for organizing a large data set so that it can easily be understood and information can efficiently be retrieved. However, identifying cluster in high dimensionality data sets is a difficult task because of the curse of dimensionality. Another challenge in clustering is some traditional functions cannot capture the pattern dissimilarity among objects. In this article, we used an alternative dissimilarity measurement called Robust Influence Angle (RIA) in the partitioning method. RIA is developed using eigenstructure of the covariance matrix and robust principal component score. We notice that, it can obtain cluster easily and hence avoid the curse of dimensionality. It is also manage to cluster large data sets with mixed numeric and categorical value.
NASA Technical Reports Server (NTRS)
Srivastava, Ashok, N.; Akella, Ram; Diev, Vesselin; Kumaresan, Sakthi Preethi; McIntosh, Dawn M.; Pontikakis, Emmanuel D.; Xu, Zuobing; Zhang, Yi
2006-01-01
This paper describes the results of a significant research and development effort conducted at NASA Ames Research Center to develop new text mining techniques to discover anomalies in free-text reports regarding system health and safety of two aerospace systems. We discuss two problems of significant importance in the aviation industry. The first problem is that of automatic anomaly discovery about an aerospace system through the analysis of tens of thousands of free-text problem reports that are written about the system. The second problem that we address is that of automatic discovery of recurring anomalies, i.e., anomalies that may be described m different ways by different authors, at varying times and under varying conditions, but that are truly about the same part of the system. The intent of recurring anomaly identification is to determine project or system weakness or high-risk issues. The discovery of recurring anomalies is a key goal in building safe, reliable, and cost-effective aerospace systems. We address the anomaly discovery problem on thousands of free-text reports using two strategies: (1) as an unsupervised learning problem where an algorithm takes free-text reports as input and automatically groups them into different bins, where each bin corresponds to a different unknown anomaly category; and (2) as a supervised learning problem where the algorithm classifies the free-text reports into one of a number of known anomaly categories. We then discuss the application of these methods to the problem of discovering recurring anomalies. In fact the special nature of recurring anomalies (very small cluster sizes) requires incorporating new methods and measures to enhance the original approach for anomaly detection. ?& pant 0-
Adaptive dimension reduction for clustering high dimensional data
Ding, Chris; He, Xiaofeng; Zha, Hongyuan; Simon, Horst
2002-10-01
It is well-known that for high dimensional data clustering, standard algorithms such as EM and the K-means are often trapped in local minimum. many initialization methods were proposed to tackle this problem, but with only limited success. In this paper they propose a new approach to resolve this problem by repeated dimension reductions such that K-means or EM are performed only in very low dimensions. Cluster membership is utilized as a bridge between the reduced dimensional sub-space and the original space, providing flexibility and ease of implementation. Clustering analysis performed on highly overlapped Gaussians, DNA gene expression profiles and internet newsgroups demonstrate the effectiveness of the proposed algorithm.
Semi-supervised high-dimensional clustering by tight wavelet frames
NASA Astrophysics Data System (ADS)
Dong, Bin; Hao, Ning
2015-08-01
High-dimensional clustering arises frequently from many areas in natural sciences, technical disciplines and social medias. In this paper, we consider the problem of binary clustering of high-dimensional data, i.e. classification of a data set into 2 classes. We assume that the correct (or mostly correct) classification of a small portion of the given data is known. Based on such partial classification, we design optimization models that complete the clustering of the entire data set using the recently introduced tight wavelet frames on graphs.1 Numerical experiments of the proposed models applied to some real data sets are conducted. In particular, the performance of the models on some very high-dimensional data sets are examined; and combinations of the models with some existing dimension reduction techniques are also considered.
Modification of DIRECT for high-dimensional design problems
NASA Astrophysics Data System (ADS)
Tavassoli, Arash; Haji Hajikolaei, Kambiz; Sadeqi, Soheil; Wang, G. Gary; Kjeang, Erik
2014-06-01
DIviding RECTangles (DIRECT), as a well-known derivative-free global optimization method, has been found to be effective and efficient for low-dimensional problems. When facing high-dimensional black-box problems, however, DIRECT's performance deteriorates. This work proposes a series of modifications to DIRECT for high-dimensional problems (dimensionality d>10). The principal idea is to increase the convergence speed by breaking its single initialization-to-convergence approach into several more intricate steps. Specifically, starting with the entire feasible area, the search domain will shrink gradually and adaptively to the region enclosing the potential optimum. Several stopping criteria have been introduced to avoid premature convergence. A diversification subroutine has also been developed to prevent the algorithm from being trapped in local minima. The proposed approach is benchmarked using nine standard high-dimensional test functions and one black-box engineering problem. All these tests show a significant efficiency improvement over the original DIRECT for high-dimensional design problems.
Model-based Clustering of High-Dimensional Data in Astrophysics
NASA Astrophysics Data System (ADS)
Bouveyron, C.
2016-05-01
The nature of data in Astrophysics has changed, as in other scientific fields, in the past decades due to the increase of the measurement capabilities. As a consequence, data are nowadays frequently of high dimensionality and available in mass or stream. Model-based techniques for clustering are popular tools which are renowned for their probabilistic foundations and their flexibility. However, classical model-based techniques show a disappointing behavior in high-dimensional spaces which is mainly due to their dramatical over-parametrization. The recent developments in model-based classification overcome these drawbacks and allow to efficiently classify high-dimensional data, even in the "small n / large p" situation. This work presents a comprehensive review of these recent approaches, including regularization-based techniques, parsimonious modeling, subspace classification methods and classification methods based on variable selection. The use of these model-based methods is also illustrated on real-world classification problems in Astrophysics using R packages.
Visualization of high-dimensional clusters using nonlinear magnification
Keahey, T.A.
1998-12-31
This paper describes a cluster visualization system used for data-mining fraud detection. The system can simultaneously show 6 dimensions of data, and a unique technique of 3D nonlinear magnification allows individual clusters of data points to be magnified while still maintaining a view of the global context. The author first describes the fraud detection problem, along with the data which is to be visualized. Then he describes general characteristics of the visualization system, and shows how nonlinear magnification can be used in this system. Finally he concludes and describes options for further work.
Xie, Benhuai; Pan, Wei; Shen, Xiaotong
2010-01-01
Motivation: Model-based clustering has been widely used, e.g. in microarray data analysis. Since for high-dimensional data variable selection is necessary, several penalized model-based clustering methods have been proposed tørealize simultaneous variable selection and clustering. However, the existing methods all assume that the variables are independent with the use of diagonal covariance matrices. Results: To model non-independence of variables (e.g. correlated gene expressions) while alleviating the problem with the large number of unknown parameters associated with a general non-diagonal covariance matrix, we generalize the mixture of factor analyzers to that with penalization, which, among others, can effectively realize variable selection. We use simulated data and real microarray data to illustrate the utility and advantages of the proposed method over several existing ones. Contact: weip@biostat.umn.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:20031967
A multistage mathematical approach to automated clustering of high-dimensional noisy data
Friedman, Alexander; Keselman, Michael D.; Gibb, Leif G.; Graybiel, Ann M.
2015-01-01
A critical problem faced in many scientific fields is the adequate separation of data derived from individual sources. Often, such datasets require analysis of multiple features in a highly multidimensional space, with overlap of features and sources. The datasets generated by simultaneous recording from hundreds of neurons emitting phasic action potentials have produced the challenge of separating the recorded signals into independent data subsets (clusters) corresponding to individual signal-generating neurons. Mathematical methods have been developed over the past three decades to achieve such spike clustering, but a complete solution with fully automated cluster identification has not been achieved. We propose here a fully automated mathematical approach that identifies clusters in multidimensional space through recursion, which combats the multidimensionality of the data. Recursion is paired with an approach to dimensional evaluation, in which each dimension of a dataset is examined for its informational importance for clustering. The dimensions offering greater informational importance are given added weight during recursive clustering. To combat strong background activity, our algorithm takes an iterative approach of data filtering according to a signal-to-noise ratio metric. The algorithm finds cluster cores, which are thereafter expanded to include complete clusters. This mathematical approach can be extended from its prototype context of spike sorting to other datasets that suffer from high dimensionality and background activity. PMID:25831512
Srinivasan, Thenmozhi; Palanisamy, Balasubramanie
2015-01-01
Clusters of high-dimensional data techniques are emerging, according to data noisy and poor quality challenges. This paper has been developed to cluster data using high-dimensional similarity based PCM (SPCM), with ant colony optimization intelligence which is effective in clustering nonspatial data without getting knowledge about cluster number from the user. The PCM becomes similarity based by using mountain method with it. Though this is efficient clustering, it is checked for optimization using ant colony algorithm with swarm intelligence. Thus the scalable clustering technique is obtained and the evaluation results are checked with synthetic datasets. PMID:26495413
Srinivasan, Thenmozhi; Palanisamy, Balasubramanie
2015-01-01
Clusters of high-dimensional data techniques are emerging, according to data noisy and poor quality challenges. This paper has been developed to cluster data using high-dimensional similarity based PCM (SPCM), with ant colony optimization intelligence which is effective in clustering nonspatial data without getting knowledge about cluster number from the user. The PCM becomes similarity based by using mountain method with it. Though this is efficient clustering, it is checked for optimization using ant colony algorithm with swarm intelligence. Thus the scalable clustering technique is obtained and the evaluation results are checked with synthetic datasets. PMID:26495413
NASA Astrophysics Data System (ADS)
Manukyan, N.; Eppstein, M. J.; Rizzo, D. M.
2011-12-01
data to demonstrate how the proposed methods facilitate automatic identification and visualization of clusters in real-world, high-dimensional biogeochemical data with complex relationships. The proposed methods are quite general and are applicable to a wide range of geophysical problems. [1] Pearce, A., Rizzo, D., and Mouser, P., "Subsurface characterization of groundwater contaminated by landfill leachate using microbial community profile data and a nonparametric decision-making process", Water Resources Research, 47:W06511, 11 pp, 2011. [2] Mouser, P., Rizzo, D., Druschel, G., Morales, S, O'Grady, P., Hayden, N., Stevens, L., "Enhanced detection of groundwater contamination from a leaking waste disposal site by microbial community profiles", Water Resources Research, 46:W12506, 12 pp., 2010.
Visualization of high-dimensional clusters using nonlinear magnification
NASA Astrophysics Data System (ADS)
Keahey, T. A.
1999-03-01
This paper describes a visualization system which has been used as part of a data-mining effort to detect fraud and abuse within state medicare programs. The data-mining process generates a set of N attributes for each medicare provider and beneficiary in the state; these attributes can be numeric, categorical, or derived from the scoring proces of the data- mining routines. The attribute list can be considered as an N- dimensional space, which is subsequently partitioned into some fixed number of cluster partitions. The sparse nature of the clustered space provides room for the simultaneous visualization of more than 3 dimensions; examples in the paper will show 6-dimensional visualization. This ability to view higher dimensional data allows the data-mining researcher to compare the clustering effectiveness of the different attributes. Transparency based rendering is also used in conjunction with filtering techniques to provide selective rendering of only those data which are of greatest interest. Nonlinear magnification techniques are used to stretch the N- dimensional space to allow focus on one or more regions of interest while still allowing a view of the global context. The magnification can either be applied globally, or in a constrained fashion to expand individual clusters within the space.
High dimensional data clustering by partitioning the hypergraphs using dense subgraph partition
NASA Astrophysics Data System (ADS)
Sun, Xili; Tian, Shoucai; Lu, Yonggang
2015-12-01
Due to the curse of dimensionality, traditional clustering methods usually fail to produce meaningful results for the high dimensional data. Hypergraph partition is believed to be a promising method for dealing with this challenge. In this paper, we first construct a graph G from the data by defining an adjacency relationship between the data points using Shared Reverse k Nearest Neighbors (SRNN). Then a hypergraph is created from the graph G by defining the hyperedges to be all the maximal cliques in the graph G. After the hypergraph is produced, a powerful hypergraph partitioning method called dense subgraph partition (DSP) combined with the k-medoids method is used to produce the final clustering results. The proposed method is evaluated on several real high-dimensional datasets, and the experimental results show that the proposed method can improve the clustering results of the high dimensional data compared with applying k-medoids method directly on the original data.
Variational Bayesian strategies for high-dimensional, stochastic design problems
NASA Astrophysics Data System (ADS)
Koutsourelakis, P. S.
2016-03-01
This paper is concerned with a lesser-studied problem in the context of model-based, uncertainty quantification (UQ), that of optimization/design/control under uncertainty. The solution of such problems is hindered not only by the usual difficulties encountered in UQ tasks (e.g. the high computational cost of each forward simulation, the large number of random variables) but also by the need to solve a nonlinear optimization problem involving large numbers of design variables and potentially constraints. We propose a framework that is suitable for a class of such problems and is based on the idea of recasting them as probabilistic inference tasks. To that end, we propose a Variational Bayesian (VB) formulation and an iterative VB-Expectation-Maximization scheme that is capable of identifying a local maximum as well as a low-dimensional set of directions in the design space, along which, the objective exhibits the largest sensitivity. We demonstrate the validity of the proposed approach in the context of two numerical examples involving thousands of random and design variables. In all cases considered the cost of the computations in terms of calls to the forward model was of the order of 100 or less. The accuracy of the approximations provided is assessed by information-theoretic metrics.
Nuclear Potential Clustering As a New Tool to Detect Patterns in High Dimensional Datasets
NASA Astrophysics Data System (ADS)
Tonkova, V.; Paulus, D.; Neeb, H.
2013-02-01
We present a new approach for the clustering of high dimensional data without prior assumptions about the structure of the underlying distribution. The proposed algorithm is based on a concept adapted from nuclear physics. To partition the data, we model the dynamic behaviour of nucleons interacting in an N-dimensional space. An adaptive nuclear potential, comprised of a short-range attractive (strong interaction) and a long-range repulsive term (Coulomb force) is assigned to each data point. By modelling the dynamics, nucleons that are densely distributed in space fuse to build nuclei (clusters) whereas single point clusters repel each other. The formation of clusters is completed when the system reaches the state of minimal potential energy. The data are then grouped according to the particles' final effective potential energy level. The performance of the algorithm is tested with several synthetic datasets showing that the proposed method can robustly identify clusters even when complex configurations are present. Furthermore, quantitative MRI data from 43 multiple sclerosis patients were analyzed, showing a reasonable splitting into subgroups according to the individual patients' disease grade. The good performance of the algorithm on such highly correlated non-spherical datasets, which are typical for MRI derived image features, shows that Nuclear Potential Clustering is a valuable tool for automated data analysis, not only in the MRI domain.
Clustering High-Dimensional Landmark-based Two-dimensional Shape Data‡
Huang, Chao; Styner, Martin; Zhu, Hongtu
2015-01-01
An important goal in image analysis is to cluster and recognize objects of interest according to the shapes of their boundaries. Clustering such objects faces at least four major challenges including a curved shape space, a high-dimensional feature space, a complex spatial correlation structure, and shape variation associated with some covariates (e.g., age or gender). The aim of this paper is to develop a penalized model-based clustering framework to cluster landmark-based planar shape data, while explicitly addressing these challenges. Specifically, a mixture of offset-normal shape factor analyzers (MOSFA) is proposed with mixing proportions defined through a regression model (e.g., logistic) and an offset-normal shape distribution in each component for data in the curved shape space. A latent factor analysis model is introduced to explicitly model the complex spatial correlation. A penalized likelihood approach with both adaptive pairwise fusion Lasso penalty function and L2 penalty function is used to automatically realize variable selection via thresholding and deliver a sparse solution. Our real data analysis has confirmed the excellent finite-sample performance of MOSFA in revealing meaningful clusters in the corpus callosum shape data obtained from the Attention Deficit Hyperactivity Disorder-200 (ADHD-200) study. PMID:26604425
NASA Astrophysics Data System (ADS)
Wang, Wei; Yang, Jiong
With the rapid growth of computational biology and e-commerce applications, high-dimensional data becomes very common. Thus, mining high-dimensional data is an urgent problem of great practical importance. However, there are some unique challenges for mining data of high dimensions, including (1) the curse of dimensionality and more crucial (2) the meaningfulness of the similarity measure in the high dimension space. In this chapter, we present several state-of-art techniques for analyzing high-dimensional data, e.g., frequent pattern mining, clustering, and classification. We will discuss how these methods deal with the challenges of high dimensionality.
CHARACTERIZATION OF DISCONTINUITIES IN HIGH-DIMENSIONAL STOCHASTIC PROBLEMS ON ADAPTIVE SPARSE GRIDS
Jakeman, John D; Archibald, Richard K; Xiu, Dongbin
2011-01-01
In this paper we present a set of efficient algorithms for detection and identification of discontinuities in high dimensional space. The method is based on extension of polynomial annihilation for edge detection in low dimensions. Compared to the earlier work, the present method poses significant improvements for high dimensional problems. The core of the algorithms relies on adaptive refinement of sparse grids. It is demonstrated that in the commonly encountered cases where a discontinuity resides on a small subset of the dimensions, the present method becomes optimal , in the sense that the total number of points required for function evaluations depends linearly on the dimensionality of the space. The details of the algorithms will be presented and various numerical examples are utilized to demonstrate the efficacy of the method.
NASA Astrophysics Data System (ADS)
Choo, Jaegul; Lee, Hanseung; Liu, Zhicheng; Stasko, John; Park, Haesun
2013-01-01
Many of the modern data sets such as text and image data can be represented in high-dimensional vector spaces and have benefited from computational methods that utilize advanced computational methods. Visual analytics approaches have contributed greatly to data understanding and analysis due to their capability of leveraging humans' ability for quick visual perception. However, visual analytics targeting large-scale data such as text and image data has been challenging due to the limited screen space in terms of both the numbers of data points and features to represent. Among various computational methods supporting visual analytics, dimension reduction and clustering have played essential roles by reducing these numbers in an intelligent way to visually manageable sizes. Given numerous dimension reduction and clustering methods available, however, the decision on the choice of algorithms and their parameters becomes difficult. In this paper, we present an interactive visual testbed system for dimension reduction and clustering in a large-scale high-dimensional data analysis. The testbed system enables users to apply various dimension reduction and clustering methods with different settings, visually compare the results from different algorithmic methods to obtain rich knowledge for the data and tasks at hand, and eventually choose the most appropriate path for a collection of algorithms and parameters. Using various data sets such as documents, images, and others that are already encoded in vectors, we demonstrate how the testbed system can support these tasks.
NASA Astrophysics Data System (ADS)
Schütze, Niels; Wöhling, Thomas; de Play, Michael
2010-05-01
Some real-world optimization problems in water resources have a high-dimensional space of decision variables and more than one objective function. In this work, we compare three general-purpose, multi-objective simulation optimization algorithms, namely NSGA-II, AMALGAM, and CMA-ES-MO when solving three real case Multi-objective Optimization Problems (MOPs): (i) a high-dimensional soil hydraulic parameter estimation problem; (ii) a multipurpose multi-reservoir operation problem; and (iii) a scheduling problem in deficit irrigation. We analyze the behaviour of the three algorithms on these test problems considering their formulations ranging from 40 up to 120 decision variables and 2 to 4 objectives. The computational effort required by each algorithm in order to reach the true Pareto front is also analyzed.
Mosmann, Tim R; Naim, Iftekhar; Rebhahn, Jonathan; Datta, Suprakash; Cavenaugh, James S; Weaver, Jason M; Sharma, Gaurav
2014-01-01
A multistage clustering and data processing method, SWIFT (detailed in a companion manuscript), has been developed to detect rare subpopulations in large, high-dimensional flow cytometry datasets. An iterative sampling procedure initially fits the data to multidimensional Gaussian distributions, then splitting and merging stages use a criterion of unimodality to optimize the detection of rare subpopulations, to converge on a consistent cluster number, and to describe non-Gaussian distributions. Probabilistic assignment of cells to clusters, visualization, and manipulation of clusters by their cluster medians, facilitate application of expert knowledge using standard flow cytometry programs. The dual problems of rigorously comparing similar complex samples, and enumerating absent or very rare cell subpopulations in negative controls, were solved by assigning cells in multiple samples to a cluster template derived from a single or combined sample. Comparison of antigen-stimulated and control human peripheral blood cell samples demonstrated that SWIFT could identify biologically significant subpopulations, such as rare cytokine-producing influenza-specific T cells. A sensitivity of better than one part per million was attained in very large samples. Results were highly consistent on biological replicates, yet the analysis was sensitive enough to show that multiple samples from the same subject were more similar than samples from different subjects. A companion manuscript (Part 1) details the algorithmic development of SWIFT. © 2014 The Authors. Published by Wiley Periodicals Inc. PMID:24532172
Semi-Supervised Clustering for High-Dimensional and Sparse Features
ERIC Educational Resources Information Center
Yan, Su
2010-01-01
Clustering is one of the most common data mining tasks, used frequently for data organization and analysis in various application domains. Traditional machine learning approaches to clustering are fully automated and unsupervised where class labels are unknown a priori. In real application domains, however, some "weak" form of side…
L2-Boosting algorithm applied to high-dimensional problems in genomic selection.
González-Recio, Oscar; Weigel, Kent A; Gianola, Daniel; Naya, Hugo; Rosa, Guilherme J M
2010-06-01
The L(2)-Boosting algorithm is one of the most promising machine-learning techniques that has appeared in recent decades. It may be applied to high-dimensional problems such as whole-genome studies, and it is relatively simple from a computational point of view. In this study, we used this algorithm in a genomic selection context to make predictions of yet to be observed outcomes. Two data sets were used: (1) productive lifetime predicted transmitting abilities from 4702 Holstein sires genotyped for 32 611 single nucleotide polymorphisms (SNPs) derived from the Illumina BovineSNP50 BeadChip, and (2) progeny averages of food conversion rate, pre-corrected by environmental and mate effects, in 394 broilers genotyped for 3481 SNPs. Each of these data sets was split into training and testing sets, the latter comprising dairy or broiler sires whose ancestors were in the training set. Two weak learners, ordinary least squares (OLS) and non-parametric (NP) regression were used for the L2-Boosting algorithm, to provide a stringent evaluation of the procedure. This algorithm was compared with BL [Bayesian LASSO (least absolute shrinkage and selection operator)] and BayesA regression. Learning tasks were carried out in the training set, whereas validation of the models was performed in the testing set. Pearson correlations between predicted and observed responses in the dairy cattle (broiler) data set were 0.65 (0.33), 0.53 (0.37), 0.66 (0.26) and 0.63 (0.27) for OLS-Boosting, NP-Boosting, BL and BayesA, respectively. The smallest bias and mean-squared errors (MSEs) were obtained with OLS-Boosting in both the dairy cattle (0.08 and 1.08, respectively) and broiler (-0.011 and 0.006) data sets, respectively. In the dairy cattle data set, the BL was more accurate (bias=0.10 and MSE=1.10) than BayesA (bias=1.26 and MSE=2.81), whereas no differences between these two methods were found in the broiler data set. L2-Boosting with a suitable learner was found to be a competitive
Haplotyping Problem, A Clustering Approach
Eslahchi, Changiz; Sadeghi, Mehdi; Pezeshk, Hamid; Kargar, Mehdi; Poormohammadi, Hadi
2007-09-06
Construction of two haplotypes from a set of Single Nucleotide Polymorphism (SNP) fragments is called haplotype reconstruction problem. One of the most popular computational model for this problem is Minimum Error Correction (MEC). Since MEC is an NP-hard problem, here we propose a novel heuristic algorithm based on clustering analysis in data mining for haplotype reconstruction problem. Based on hamming distance and similarity between two fragments, our iterative algorithm produces two clusters of fragments; then, in each iteration, the algorithm assigns a fragment to one of the clusters. Our results suggest that the algorithm has less reconstruction error rate in comparison with other algorithms.
Wang, Xueyi
2011-01-01
The k-nearest neighbors (k-NN) algorithm is a widely used machine learning method that finds nearest neighbors of a test object in a feature space. We present a new exact k-NN algorithm called kMkNN (k-Means for k-Nearest Neighbors) that uses the k-means clustering and the triangle inequality to accelerate the searching for nearest neighbors in a high dimensional space. The kMkNN algorithm has two stages. In the buildup stage, instead of using complex tree structures such as metric trees, kd-trees, or ball-tree, kMkNN uses a simple k-means clustering method to preprocess the training dataset. In the searching stage, given a query object, kMkNN finds nearest training objects starting from the nearest cluster to the query object and uses the triangle inequality to reduce the distance calculations. Experiments show that the performance of kMkNN is surprisingly good compared to the traditional k-NN algorithm and tree-based k-NN algorithms such as kd-trees and ball-trees. On a collection of 20 datasets with up to 106 records and 104 dimensions, kMkNN shows a 2-to 80-fold reduction of distance calculations and a 2- to 60-fold speedup over the traditional k-NN algorithm for 16 datasets. Furthermore, kMkNN performs significant better than a kd-tree based k-NN algorithm for all datasets and performs better than a ball-tree based k-NN algorithm for most datasets. The results show that kMkNN is effective for searching nearest neighbors in high dimensional spaces. PMID:22247818
Cluster expression in fission and fusion in high-dimensional macroscopic-microscopic calculations
Iwamoto, A.; Ichikawa, T.; Moller, P.; Sierk, A. J.
2004-01-01
We discuss the relation between the fission-fusion potential-energy surfaces of very heavy nuclei and the formation process of these nuclei in cold-fusion reactions. In the potential-energy surfaces, we find a pronounced valley structure, with one valley corresponding to the cold-fusion reaction, the other to fission. As the touching point is approached in the cold-fusion entrance channel, an instability towards dynamical deformation of the projectile occurs, which enhances the fusion cross section. These two 'cluster effects' enhance the production of superheavy nuclei in cold-fusion reactions, in addition to the effect of the low compound-system excitation energy in these reactions. Heavy-ion fusion reactions have been used extensively to synthesize heavy elements beyond actinide nuclei. In order to proceed further in this direction, we need to understand the formation process more precisely, not just the decay process. The dynamics of the formation process are considerably more complex than the dynamics necessary to interpret the spontaneous-fission decay of heavy elements. However, before implementing a full dynamical description it is useful to understand the basic properties of the potential-energy landscape encountered in the initial stages of the collision. The collision process and entrance-channel landscape can conveniently be separated into two parts, namely the early-stage separated system before touching and the late-stage composite system after touching. The transition between these two stages is particularly important, but not very well understood until now. To understand better the transition between the two stages we analyze here in detail the potential energy landscape or 'collision surface' of the system both outside and inside the touching configuration of the target and projectile. In Sec. 2, we discuss calculated five-dimensional potential-energy landscapes inside touching and identify major features. In Sec. 3, we present calculated
Naim, Iftekhar; Datta, Suprakash; Rebhahn, Jonathan; Cavenaugh, James S; Mosmann, Tim R; Sharma, Gaurav
2014-05-01
We present a model-based clustering method, SWIFT (Scalable Weighted Iterative Flow-clustering Technique), for digesting high-dimensional large-sized datasets obtained via modern flow cytometry into more compact representations that are well-suited for further automated or manual analysis. Key attributes of the method include the following: (a) the analysis is conducted in the multidimensional space retaining the semantics of the data, (b) an iterative weighted sampling procedure is utilized to maintain modest computational complexity and to retain discrimination of extremely small subpopulations (hundreds of cells from datasets containing tens of millions), and (c) a splitting and merging procedure is incorporated in the algorithm to preserve distinguishability between biologically distinct populations, while still providing a significant compaction relative to the original data. This article presents a detailed algorithmic description of SWIFT, outlining the application-driven motivations for the different design choices, a discussion of computational complexity of the different steps, and results obtained with SWIFT for synthetic data and relatively simple experimental data that allow validation of the desirable attributes. A companion paper (Part 2) highlights the use of SWIFT, in combination with additional computational tools, for more challenging biological problems. PMID:24677621
NASA Astrophysics Data System (ADS)
Regis, Rommel G.
2014-02-01
This article develops two new algorithms for constrained expensive black-box optimization that use radial basis function surrogates for the objective and constraint functions. These algorithms are called COBRA and Extended ConstrLMSRBF and, unlike previous surrogate-based approaches, they can be used for high-dimensional problems where all initial points are infeasible. They both follow a two-phase approach where the first phase finds a feasible point while the second phase improves this feasible point. COBRA and Extended ConstrLMSRBF are compared with alternative methods on 20 test problems and on the MOPTA08 benchmark automotive problem (D.R. Jones, Presented at MOPTA 2008), which has 124 decision variables and 68 black-box inequality constraints. The alternatives include a sequential penalty derivative-free algorithm, a direct search method with kriging surrogates, and two multistart methods. Numerical results show that COBRA algorithms are competitive with Extended ConstrLMSRBF and they generally outperform the alternatives on the MOPTA08 problem and most of the test problems.
Fournier, René; Orel, Slava
2013-12-21
We present a method for fitting high-dimensional potential energy surfaces that is almost fully automated, can be applied to systems with various chemical compositions, and involves no particular choice of function form. We tested it on four systems: Ag20, Sn6Pb6, Si10, and Li8. The cost for energy evaluation is smaller than the cost of a density functional theory (DFT) energy evaluation by a factor of 1500 for Li8, and 60,000 for Ag20. We achieved intermediate accuracy (errors of 0.4 to 0.8 eV on atomization energies, or, 1% to 3% on cohesive energies) with rather small datasets (between 240 and 1400 configurations). We demonstrate that this accuracy is sufficient to correctly screen the configurations with lowest DFT energy, making this function potentially very useful in a hybrid global optimization strategy. We show that, as expected, the accuracy of the function improves with an increase in the size of the fitting dataset. PMID:24359355
On the complexity of some quadratic Euclidean 2-clustering problems
NASA Astrophysics Data System (ADS)
Kel'manov, A. V.; Pyatkin, A. V.
2016-03-01
Some problems of partitioning a finite set of points of Euclidean space into two clusters are considered. In these problems, the following criteria are minimized: (1) the sum over both clusters of the sums of squared pairwise distances between the elements of the cluster and (2) the sum of the (multiplied by the cardinalities of the clusters) sums of squared distances from the elements of the cluster to its geometric center, where the geometric center (or centroid) of a cluster is defined as the mean value of the elements in that cluster. Additionally, another problem close to (2) is considered, where the desired center of one of the clusters is given as input, while the center of the other cluster is unknown (is the variable to be optimized) as in problem (2). Two variants of the problems are analyzed, in which the cardinalities of the clusters are (1) parts of the input or (2) optimization variables. It is proved that all the considered problems are strongly NP-hard and that, in general, there is no fully polynomial-time approximation scheme for them (unless P = NP).
A facility for using cluster research to study environmental problems
Not Available
1991-11-01
This report begins by describing the general application of cluster based research to environmental chemistry and the development of a Cluster Structure and Dynamics Research Facility (CSDRF). Next, four important areas of cluster research are described in more detail, including how they can impact environmental problems. These are: surface-supported clusters, water and contaminant interactions, time-resolved dynamic studies in clusters, and cluster structures and reactions. These facilities and equipment required for each area of research are then presented. The appendices contain workshop agenda and a listing of the researchers who participated in the workshop discussions that led to this report.
Haitian adolescent personality clusters and their problem area correlates.
McMahon, Robert C; Bryant, Vaughn E; Dévieux, Jessy G; Jean-Gilles, Michèle; Rosenberg, Rhonda; Malow, Robert M
2013-04-01
This study identified personality clusters among a community sample of adolescents of Haitian decent and related cluster subgroup membership to problems in the areas of substance abuse, mental and physical health, family and peer relationships, educational and vocational status, social skills, leisure and recreational pursuits, aggressive behavior-delinquency, and to sexual risk activity. Three cluster subgroups were identified: dependent/conforming (N = 68), high pathology (N = 30); and confident/extroverted/conforming (N = 111). Although the overall sample was relatively healthy based on low average endorsement of problems across areas of expressed concern, significant physical health, mental health, relationship, educational, and HIV risk problems were identified in a MACI identified high psychopathology cluster subgroup. A confident/extraverted/conforming cluster subgroup revealed few problems and appears to reflect a protective style. PMID:22362195
ICANP2: Isoenergetic cluster algorithm for NP-complete Problems
NASA Astrophysics Data System (ADS)
Zhu, Zheng; Fang, Chao; Katzgraber, Helmut G.
NP-complete optimization problems with Boolean variables are of fundamental importance in computer science, mathematics and physics. Most notably, the minimization of general spin-glass-like Hamiltonians remains a difficult numerical task. There has been a great interest in designing efficient heuristics to solve these computationally difficult problems. Inspired by the rejection-free isoenergetic cluster algorithm developed for Ising spin glasses, we present a generalized cluster update that can be applied to different NP-complete optimization problems with Boolean variables. The cluster updates allow for a wide-spread sampling of phase space, thus speeding up optimization. By carefully tuning the pseudo-temperature (needed to randomize the configurations) of the problem, we show that the method can efficiently tackle problems on topologies with a large site-percolation threshold. We illustrate the ICANP2 heuristic on paradigmatic optimization problems, such as the satisfiability problem and the vertex cover problem.
The ordered clustered travelling salesman problem: a hybrid genetic algorithm.
Ahmed, Zakir Hussain
2014-01-01
The ordered clustered travelling salesman problem is a variation of the usual travelling salesman problem in which a set of vertices (except the starting vertex) of the network is divided into some prespecified clusters. The objective is to find the least cost Hamiltonian tour in which vertices of any cluster are visited contiguously and the clusters are visited in the prespecified order. The problem is NP-hard, and it arises in practical transportation and sequencing problems. This paper develops a hybrid genetic algorithm using sequential constructive crossover, 2-opt search, and a local search for obtaining heuristic solution to the problem. The efficiency of the algorithm has been examined against two existing algorithms for some asymmetric and symmetric TSPLIB instances of various sizes. The computational results show that the proposed algorithm is very effective in terms of solution quality and computational time. Finally, we present solution to some more symmetric TSPLIB instances. PMID:24701148
Solving global optimization problems on GPU cluster
NASA Astrophysics Data System (ADS)
Barkalov, Konstantin; Gergel, Victor; Lebedev, Ilya
2016-06-01
The paper contains the results of investigation of a parallel global optimization algorithm combined with a dimension reduction scheme. This allows solving multidimensional problems by means of reducing to data-independent subproblems with smaller dimension solved in parallel. The new element implemented in the research consists in using several graphic accelerators at different computing nodes. The paper also includes results of solving problems of well-known multiextremal test class GKLS on Lobachevsky supercomputer using tens of thousands of GPU cores.
NASA Astrophysics Data System (ADS)
Lacoin, Hubert
2014-03-01
Let be the number of self-avoiding paths of length starting from the origin on the infinite cluster obtained after performing Bernoulli percolation on with parameter . The object of this paper is to study the connective constant of the dilute lattice , which is a non-random quantity. We want to investigate if the inequality obtained with the Borel-Cantelli Lemma is strict or not. In other words, we want to know if the quenched and annealed versions of the connective constant are equal. On a heuristic level, this indicates whether or not localization of the trajectories occurs. We prove that when is sufficiently large there exists such that the inequality is strict for.
CARE: Finding Local Linear Correlations in High Dimensional Data
Zhang, Xiang; Pan, Feng; Wang, Wei
2010-01-01
Finding latent patterns in high dimensional data is an important research problem with numerous applications. Existing approaches can be summarized into 3 categories: feature selection, feature transformation (or feature projection) and projected clustering. Being widely used in many applications, these methods aim to capture global patterns and are typically performed in the full feature space. In many emerging biomedical applications, however, scientists are interested in the local latent patterns held by feature subsets, which may be invisible via any global transformation. In this paper, we investigate the problem of finding local linear correlations in high dimensional data. Our goal is to find the latent pattern structures that may exist only in some subspaces. We formalize this problem as finding strongly correlated feature subsets which are supported by a large portion of the data points. Due to the combinatorial nature of the problem and lack of monotonicity of the correlation measurement, it is prohibitively expensive to exhaustively explore the whole search space. In our algorithm, CARE, we utilize spectrum properties and effective heuristic to prune the search space. Extensive experimental results show that our approach is effective in finding local linear correlations that may not be identified by existing methods. PMID:20419037
An agglomerative hierarchical clustering approach to visualisation in Bayesian clustering problems
Dawson, Kevin J.; Belkhir, Khalid
2009-01-01
Clustering problems (including the clustering of individuals into outcrossing populations, hybrid generations, full-sib families and selfing lines) have recently received much attention in population genetics. In these clustering problems, the parameter of interest is a partition of the set of sampled individuals, - the sample partition. In a fully Bayesian approach to clustering problems of this type, our knowledge about the sample partition is represented by a probability distribution on the space of possible sample partitions. Since the number of possible partitions grows very rapidly with the sample size, we can not visualise this probability distribution in its entirety, unless the sample is very small. As a solution to this visualisation problem, we recommend using an agglomerative hierarchical clustering algorithm, which we call the exact linkage algorithm. This algorithm is a special case of the maximin clustering algorithm that we introduced previously. The exact linkage algorithm is now implemented in our software package Partition View. The exact linkage algorithm takes the posterior co-assignment probabilities as input, and yields as output a rooted binary tree, - or more generally, a forest of such trees. Each node of this forest defines a set of individuals, and the node height is the posterior co-assignment probability of this set. This provides a useful visual representation of the uncertainty associated with the assignment of individuals to categories. It is also a useful starting point for a more detailed exploration of the posterior distribution in terms of the co-assignment probabilities. PMID:19337306
The Heterogeneous P-Median Problem for Categorization Based Clustering
ERIC Educational Resources Information Center
Blanchard, Simon J.; Aloise, Daniel; DeSarbo, Wayne S.
2012-01-01
The p-median offers an alternative to centroid-based clustering algorithms for identifying unobserved categories. However, existing p-median formulations typically require data aggregation into a single proximity matrix, resulting in masked respondent heterogeneity. A proposed three-way formulation of the p-median problem explicitly considers…
Automated High-Dimensional Flow Cytometric Data Analysis
NASA Astrophysics Data System (ADS)
Pyne, Saumyadipta; Hu, Xinli; Wang, Kui; Rossin, Elizabeth; Lin, Tsung-I.; Maier, Lisa; Baecher-Allan, Clare; McLachlan, Geoffrey; Tamayo, Pablo; Hafler, David; de Jager, Philip; Mesirov, Jill
Flow cytometry is widely used for single cell interrogation of surface and intracellular protein expression by measuring fluorescence intensity of fluorophore-conjugated reagents. We focus on the recently developed procedure of Pyne et al. (2009, Proceedings of the National Academy of Sciences USA 106, 8519-8524) for automated high- dimensional flow cytometric analysis called FLAME (FLow analysis with Automated Multivariate Estimation). It introduced novel finite mixture models of heavy-tailed and asymmetric distributions to identify and model cell populations in a flow cytometric sample. This approach robustly addresses the complexities of flow data without the need for transformation or projection to lower dimensions. It also addresses the critical task of matching cell populations across samples that enables downstream analysis. It thus facilitates application of flow cytometry to new biological and clinical problems. To facilitate pipelining with standard bioinformatic applications such as high-dimensional visualization, subject classification or outcome prediction, FLAME has been incorporated with the GenePattern package of the Broad Institute. Thereby analysis of flow data can be approached similarly as other genomic platforms. We also consider some new work that proposes a rigorous and robust solution to the registration problem by a multi-level approach that allows us to model and register cell populations simultaneously across a cohort of high-dimensional flow samples. This new approach is called JCM (Joint Clustering and Matching). It enables direct and rigorous comparisons across different time points or phenotypes in a complex biological study as well as for classification of new patient samples in a more clinical setting.
Optimization of the K-means algorithm for the solution of high dimensional instances
NASA Astrophysics Data System (ADS)
Pérez, Joaquín; Pazos, Rodolfo; Olivares, Víctor; Hidalgo, Miguel; Ruiz, Jorge; Martínez, Alicia; Almanza, Nelva; González, Moisés
2016-06-01
This paper addresses the problem of clustering instances with a high number of dimensions. In particular, a new heuristic for reducing the complexity of the K-means algorithm is proposed. Traditionally, there are two approaches that deal with the clustering of instances with high dimensionality. The first executes a preprocessing step to remove those attributes of limited importance. The second, called divide and conquer, creates subsets that are clustered separately and later their results are integrated through post-processing. In contrast, this paper proposes a new solution which consists of the reduction of distance calculations from the objects to the centroids at the classification step. This heuristic is derived from the visual observation of the clustering process of K-means, in which it was found that the objects can only migrate to adjacent clusters without crossing distant clusters. Therefore, this heuristic can significantly reduce the number of distance calculations from an object to the centroids of the potential clusters that it may be classified to. To validate the proposed heuristic, it was designed a set of experiments with synthetic and high dimensional instances. One of the most notable results was obtained for an instance of 25,000 objects and 200 dimensions, where its execution time was reduced up to 96.5% and the quality of the solution decreased by only 0.24% when compared to the K-means algorithm.
Manifold learning to interpret JET high-dimensional operational space
NASA Astrophysics Data System (ADS)
Cannas, B.; Fanni, A.; Murari, A.; Pau, A.; Sias, G.; JET EFDA Contributors, the
2013-04-01
In this paper, the problem of visualization and exploration of JET high-dimensional operational space is considered. The data come from plasma discharges selected from JET campaigns from C15 (year 2005) up to C27 (year 2009). The aim is to learn the possible manifold structure embedded in the data and to create some representations of the plasma parameters on low-dimensional maps, which are understandable and which preserve the essential properties owned by the original data. A crucial issue for the design of such mappings is the quality of the dataset. This paper reports the details of the criteria used to properly select suitable signals downloaded from JET databases in order to obtain a dataset of reliable observations. Moreover, a statistical analysis is performed to recognize the presence of outliers. Finally data reduction, based on clustering methods, is performed to select a limited and representative number of samples for the operational space mapping. The high-dimensional operational space of JET is mapped using a widely used manifold learning method, the self-organizing maps. The results are compared with other data visualization methods. The obtained maps can be used to identify characteristic regions of the plasma scenario, allowing to discriminate between regions with high risk of disruption and those with low risk of disruption.
Scalable Nearest Neighbor Algorithms for High Dimensional Data.
Muja, Marius; Lowe, David G
2014-11-01
For many computer vision and machine learning problems, large training sets are key for good performance. However, the most computationally expensive part of many computer vision and machine learning algorithms consists of finding nearest neighbor matches to high dimensional vectors that represent the training data. We propose new algorithms for approximate nearest neighbor matching and evaluate and compare them with previous algorithms. For matching high dimensional features, we find two algorithms to be the most efficient: the randomized k-d forest and a new algorithm proposed in this paper, the priority search k-means tree. We also propose a new algorithm for matching binary features by searching multiple hierarchical clustering trees and show it outperforms methods typically used in the literature. We show that the optimal nearest neighbor algorithm and its parameters depend on the data set characteristics and describe an automated configuration procedure for finding the best algorithm to search a particular data set. In order to scale to very large data sets that would otherwise not fit in the memory of a single machine, we propose a distributed nearest neighbor matching framework that can be used with any of the algorithms described in the paper. All this research has been released as an open source library called fast library for approximate nearest neighbors (FLANN), which has been incorporated into OpenCV and is now one of the most popular libraries for nearest neighbor matching. PMID:26353063
Information technology of clustering problem situations in computing and office equipment
NASA Astrophysics Data System (ADS)
Savchuk, T. O.; Petrishyn, S. I.; Kisała, Piotr; Imanbek, Baglan; Smailova, Saule
2015-12-01
The article contains information technology of clustering problem situations in computing and office equipment, which is based on an information model of clustering and modified clustering methods FOREL and K-MEANS such situations.
Distributed Computation of the knn Graph for Large High-Dimensional Point Sets.
Plaku, Erion; Kavraki, Lydia E
2007-03-01
High-dimensional problems arising from robot motion planning, biology, data mining, and geographic information systems often require the computation of k nearest neighbor (knn) graphs. The knn graph of a data set is obtained by connecting each point to its k closest points. As the research in the above-mentioned fields progressively addresses problems of unprecedented complexity, the demand for computing knn graphs based on arbitrary distance metrics and large high-dimensional data sets increases, exceeding resources available to a single machine. In this work we efficiently distribute the computation of knn graphs for clusters of processors with message passing. Extensions to our distributed framework include the computation of graphs based on other proximity queries, such as approximate knn or range queries. Our experiments show nearly linear speedup with over one hundred processors and indicate that similar speedup can be obtained with several hundred processors. PMID:19847318
Distributed Computation of the knn Graph for Large High-Dimensional Point Sets
Plaku, Erion; Kavraki, Lydia E.
2009-01-01
High-dimensional problems arising from robot motion planning, biology, data mining, and geographic information systems often require the computation of k nearest neighbor (knn) graphs. The knn graph of a data set is obtained by connecting each point to its k closest points. As the research in the above-mentioned fields progressively addresses problems of unprecedented complexity, the demand for computing knn graphs based on arbitrary distance metrics and large high-dimensional data sets increases, exceeding resources available to a single machine. In this work we efficiently distribute the computation of knn graphs for clusters of processors with message passing. Extensions to our distributed framework include the computation of graphs based on other proximity queries, such as approximate knn or range queries. Our experiments show nearly linear speedup with over one hundred processors and indicate that similar speedup can be obtained with several hundred processors. PMID:19847318
Analysis of data separation and recovery problems using clustered sparsity
NASA Astrophysics Data System (ADS)
King, Emily J.; Kutyniok, Gitta; Zhuang, Xiaosheng
2011-09-01
Data often have two or more fundamental components, like cartoon-like and textured elements in images; point, filament, and sheet clusters in astronomical data; and tonal and transient layers in audio signals. For many applications, separating these components is of interest. Another issue in data analysis is that of incomplete data, for example a photograph with scratches or seismic data collected with fewer than necessary sensors. There exists a unified approach to solving these problems which is minimizing the l1 norm of the analysis coefficients with respect to particular frame(s). This approach using the concept of clustered sparsity leads to similar theoretical bounds and results, which are presented here. Furthermore, necessary conditions for the frames to lead to sufficiently good solutions are also shown.
Statistical Physics of High Dimensional Inference
NASA Astrophysics Data System (ADS)
Advani, Madhu; Ganguli, Surya
To model modern large-scale datasets, we need efficient algorithms to infer a set of P unknown model parameters from N noisy measurements. What are fundamental limits on the accuracy of parameter inference, given limited measurements, signal-to-noise ratios, prior information, and computational tractability requirements? How can we combine prior information with measurements to achieve these limits? Classical statistics gives incisive answers to these questions as the measurement density α =N/P --> ∞ . However, modern high-dimensional inference problems, in fields ranging from bio-informatics to economics, occur at finite α. We formulate and analyze high-dimensional inference analytically by applying the replica and cavity methods of statistical physics where data serves as quenched disorder and inferred parameters play the role of thermal degrees of freedom. Our analysis reveals that widely cherished Bayesian inference algorithms such as maximum likelihood and maximum a posteriori are suboptimal in the modern setting, and yields new tractable, optimal algorithms to replace them as well as novel bounds on the achievable accuracy of a large class of high-dimensional inference algorithms. Thanks to Stanford Graduate Fellowship and Mind Brain Computation IGERT grant for support.
High dimensional feature reduction via projection pursuit
NASA Technical Reports Server (NTRS)
Jimenez, Luis; Landgrebe, David
1994-01-01
The recent development of more sophisticated remote sensing systems enables the measurement of radiation in many more spectral intervals than previously possible. An example of that technology is the AVIRIS system, which collects image data in 220 bands. As a result of this, new algorithms must be developed in order to analyze the more complex data effectively. Data in a high dimensional space presents a substantial challenge, since intuitive concepts valid in a 2-3 dimensional space to not necessarily apply in higher dimensional spaces. For example, high dimensional space is mostly empty. This results from the concentration of data in the corners of hypercubes. Other examples may be cited. Such observations suggest the need to project data to a subspace of a much lower dimension on a problem specific basis in such a manner that information is not lost. Projection Pursuit is a technique that will accomplish such a goal. Since it processes data in lower dimensions, it should avoid many of the difficulties of high dimensional spaces. In this paper, we begin the investigation of some of the properties of Projection Pursuit for this purpose.
Problem decomposition by mutual information and force-based clustering
NASA Astrophysics Data System (ADS)
Otero, Richard Edward
The scale of engineering problems has sharply increased over the last twenty years. Larger coupled systems, increasing complexity, and limited resources create a need for methods that automatically decompose problems into manageable sub-problems by discovering and leveraging problem structure. The ability to learn the coupling (inter-dependence) structure and reorganize the original problem could lead to large reductions in the time to analyze complex problems. Such decomposition methods could also provide engineering insight on the fundamental physics driving problem solution. This work forwards the current state of the art in engineering decomposition through the application of techniques originally developed within computer science and information theory. The work describes the current state of automatic problem decomposition in engineering and utilizes several promising ideas to advance the state of the practice. Mutual information is a novel metric for data dependence and works on both continuous and discrete data. Mutual information can measure both the linear and non-linear dependence between variables without the limitations of linear dependence measured through covariance. Mutual information is also able to handle data that does not have derivative information, unlike other metrics that require it. The value of mutual information to engineering design work is demonstrated on a planetary entry problem. This study utilizes a novel tool developed in this work for planetary entry system synthesis. A graphical method, force-based clustering, is used to discover related sub-graph structure as a function of problem structure and links ranked by their mutual information. This method does not require the stochastic use of neural networks and could be used with any link ranking method currently utilized in the field. Application of this method is demonstrated on a large, coupled low-thrust trajectory problem. Mutual information also serves as the basis for an
Application of clustering global optimization to thin film design problems.
Lemarchand, Fabien
2014-03-10
Refinement techniques usually calculate an optimized local solution, which is strongly dependent on the initial formula used for the thin film design. In the present study, a clustering global optimization method is used which can iteratively change this initial formula, thereby progressing further than in the case of local optimization techniques. A wide panel of local solutions is found using this procedure, resulting in a large range of optical thicknesses. The efficiency of this technique is illustrated by two thin film design problems, in particular an infrared antireflection coating, and a solar-selective absorber coating. PMID:24663856
Statistical challenges of high-dimensional data
Johnstone, Iain M.; Titterington, D. Michael
2009-01-01
Modern applications of statistical theory and methods can involve extremely large datasets, often with huge numbers of measurements on each of a comparatively small number of experimental units. New methodology and accompanying theory have emerged in response: the goal of this Theme Issue is to illustrate a number of these recent developments. This overview article introduces the difficulties that arise with high-dimensional data in the context of the very familiar linear statistical model: we give a taste of what can nevertheless be achieved when the parameter vector of interest is sparse, that is, contains many zero elements. We describe other ways of identifying low-dimensional subspaces of the data space that contain all useful information. The topic of classification is then reviewed along with the problem of identifying, from within a very large set, the variables that help to classify observations. Brief mention is made of the visualization of high-dimensional data and ways to handle computational problems in Bayesian analysis are described. At appropriate points, reference is made to the other papers in the issue. PMID:19805443
Visual Exploration of High Dimensional Scalar Functions
Gerber, Samuel; Bremer, Peer-Timo; Pascucci, Valerio; Whitaker, Ross
2011-01-01
An important goal of scientific data analysis is to understand the behavior of a system or process based on a sample of the system. In many instances it is possible to observe both input parameters and system outputs, and characterize the system as a high-dimensional function. Such data sets arise, for instance, in large numerical simulations, as energy landscapes in optimization problems, or in the analysis of image data relating to biological or medical parameters. This paper proposes an approach to analyze and visualizing such data sets. The proposed method combines topological and geometric techniques to provide interactive visualizations of discretely sampled high-dimensional scalar fields. The method relies on a segmentation of the parameter space using an approximate Morse-Smale complex on the cloud of point samples. For each crystal of the Morse-Smale complex, a regression of the system parameters with respect to the output yields a curve in the parameter space. The result is a simplified geometric representation of the Morse-Smale complex in the high dimensional input domain. Finally, the geometric representation is embedded in 2D, using dimension reduction, to provide a visualization platform. The geometric properties of the regression curves enable the visualization of additional information about each crystal such as local and global shape, width, length, and sampling densities. The method is illustrated on several synthetic examples of two dimensional functions. Two use cases, using data sets from the UCI machine learning repository, demonstrate the utility of the proposed approach on real data. Finally, in collaboration with domain experts the proposed method is applied to two scientific challenges. The analysis of parameters of climate simulations and their relationship to predicted global energy flux and the concentrations of chemical species in a combustion simulation and their integration with temperature. PMID:20975167
A facility for using cluster research to study environmental problems. Workshop proceedings
Not Available
1991-11-01
This report begins by describing the general application of cluster based research to environmental chemistry and the development of a Cluster Structure and Dynamics Research Facility (CSDRF). Next, four important areas of cluster research are described in more detail, including how they can impact environmental problems. These are: surface-supported clusters, water and contaminant interactions, time-resolved dynamic studies in clusters, and cluster structures and reactions. These facilities and equipment required for each area of research are then presented. The appendices contain workshop agenda and a listing of the researchers who participated in the workshop discussions that led to this report.
Six clustering algorithms applied to the WAIS-R: the problem of dissimilar cluster results.
Fraboni, M; Cooper, D
1989-11-01
Clusterings of the Wechsler Adult Intelligence Scale-Revised subtests were obtained from the application of six hierarchical clustering methods (N = 113). These sets of clusters were compared for similarities using the Rand index. The calculated indices suggested similarities of cluster group membership between the Complete Linkage and Centroid methods; Complete Linkage and Ward's methods; Centroid and Ward's methods; and Single Linkage and Average Linkage Between Groups methods. Cautious use of single clustering methods is implied, though the authors suggest some advantages of knowing specific similarities and differences. If between-method comparisons consistently reveal similar cluster membership, a choice could be made from those algorithms that tend to produce similar partitions, thereby enhancing cluster interpretation. PMID:2613904
Random rotation survival forest for high dimensional censored data.
Zhou, Lifeng; Wang, Hong; Xu, Qingsong
2016-01-01
Recently, rotation forest has been extended to regression and survival analysis problems. However, due to intensive computation incurred by principal component analysis, rotation forest often fails when high-dimensional or big data are confronted. In this study, we extend rotation forest to high dimensional censored time-to-event data analysis by combing random subspace, bagging and rotation forest. Supported by proper statistical analysis, we show that the proposed method random rotation survival forest outperforms state-of-the-art survival ensembles such as random survival forest and popular regularized Cox models. PMID:27625979
An approximation polynomial-time algorithm for a sequence bi-clustering problem
NASA Astrophysics Data System (ADS)
Kel'manov, A. V.; Khamidullin, S. A.
2015-06-01
We consider a strongly NP-hard problem of partitioning a finite sequence of vectors in Euclidean space into two clusters using the criterion of the minimal sum of the squared distances from the elements of the clusters to the centers of the clusters. The center of one of the clusters is to be optimized and is determined as the mean value over all vectors in this cluster. The center of the other cluster is fixed at the origin. Moreover, the partition is such that the difference between the indices of two successive vectors in the first cluster is bounded above and below by prescribed constants. A 2-approximation polynomial-time algorithm is proposed for this problem.
Optimal M-estimation in high-dimensional regression.
Bean, Derek; Bickel, Peter J; El Karoui, Noureddine; Yu, Bin
2013-09-01
We consider, in the modern setting of high-dimensional statistics, the classic problem of optimizing the objective function in regression using M-estimates when the error distribution is assumed to be known. We propose an algorithm to compute this optimal objective function that takes into account the dimensionality of the problem. Although optimality is achieved under assumptions on the design matrix that will not always be satisfied, our analysis reveals generally interesting families of dimension-dependent objective functions. PMID:23954907
Optimal M-estimation in high-dimensional regression
Bean, Derek; Bickel, Peter J.; El Karoui, Noureddine; Yu, Bin
2013-01-01
We consider, in the modern setting of high-dimensional statistics, the classic problem of optimizing the objective function in regression using M-estimates when the error distribution is assumed to be known. We propose an algorithm to compute this optimal objective function that takes into account the dimensionality of the problem. Although optimality is achieved under assumptions on the design matrix that will not always be satisfied, our analysis reveals generally interesting families of dimension-dependent objective functions. PMID:23954907
An Examination of Indexes for Determining the Number of Clusters in Binary Data Sets.
ERIC Educational Resources Information Center
Dimitriadou, Evgenia; Dolnicar, Sara; Weingessel, Andreas
2002-01-01
Explored the problem of choosing the correct number of clusters in cluster analysis of high dimensional empirical binary data. Findings from a simulation that included 162 binary data sets resulted in recommendations about the number of clusters for each index under consideration. Compared and analyzed the performance of index results. (SLD)
Sparse High Dimensional Models in Economics
Fan, Jianqing; Lv, Jinchi; Qi, Lei
2010-01-01
This paper reviews the literature on sparse high dimensional models and discusses some applications in economics and finance. Recent developments of theory, methods, and implementations in penalized least squares and penalized likelihood methods are highlighted. These variable selection methods are proved to be effective in high dimensional sparse modeling. The limits of dimensionality that regularization methods can handle, the role of penalty functions, and their statistical properties are detailed. Some recent advances in ultra-high dimensional sparse modeling are also briefly discussed. PMID:22022635
Clusters of primordial black holes and reionization problem
Belotsky, K. M. Kirillov, A. A. Rubin, S. G.
2015-05-15
Clusters of primordial black holes may cause the formation of quasars in the early Universe. In turn, radiation from these quasars may lead to the reionization of the Universe. However, the evaporation of primordial black holes via Hawking’s mechanism may also contribute to the ionization of matter. The possibility of matter ionization via the evaporation of primordial black holes with allowance for existing constraints on their density is discussed. The contribution to ionization from the evaporation of primordial black holes characterized by their preset mass spectrum can roughly be estimated at about 10{sup −3}.
Numerical methods for high-dimensional probability density function equations
NASA Astrophysics Data System (ADS)
Cho, H.; Venturi, D.; Karniadakis, G. E.
2016-01-01
In this paper we address the problem of computing the numerical solution to kinetic partial differential equations involving many phase variables. These types of equations arise naturally in many different areas of mathematical physics, e.g., in particle systems (Liouville and Boltzmann equations), stochastic dynamical systems (Fokker-Planck and Dostupov-Pugachev equations), random wave theory (Malakhov-Saichev equations) and coarse-grained stochastic systems (Mori-Zwanzig equations). We propose three different classes of new algorithms addressing high-dimensionality: The first one is based on separated series expansions resulting in a sequence of low-dimensional problems that can be solved recursively and in parallel by using alternating direction methods. The second class of algorithms relies on truncation of interaction in low-orders that resembles the Bogoliubov-Born-Green-Kirkwood-Yvon (BBGKY) framework of kinetic gas theory and it yields a hierarchy of coupled probability density function equations. The third class of algorithms is based on high-dimensional model representations, e.g., the ANOVA method and probabilistic collocation methods. A common feature of all these approaches is that they are reducible to the problem of computing the solution to high-dimensional equations via a sequence of low-dimensional problems. The effectiveness of the new algorithms is demonstrated in numerical examples involving nonlinear stochastic dynamical systems and partial differential equations, with up to 120 variables.
Feature extraction and classification algorithms for high dimensional data
NASA Technical Reports Server (NTRS)
Lee, Chulhee; Landgrebe, David
1993-01-01
Feature extraction and classification algorithms for high dimensional data are investigated. Developments with regard to sensors for Earth observation are moving in the direction of providing much higher dimensional multispectral imagery than is now possible. In analyzing such high dimensional data, processing time becomes an important factor. With large increases in dimensionality and the number of classes, processing time will increase significantly. To address this problem, a multistage classification scheme is proposed which reduces the processing time substantially by eliminating unlikely classes from further consideration at each stage. Several truncation criteria are developed and the relationship between thresholds and the error caused by the truncation is investigated. Next an approach to feature extraction for classification is proposed based directly on the decision boundaries. It is shown that all the features needed for classification can be extracted from decision boundaries. A characteristic of the proposed method arises by noting that only a portion of the decision boundary is effective in discriminating between classes, and the concept of the effective decision boundary is introduced. The proposed feature extraction algorithm has several desirable properties: it predicts the minimum number of features necessary to achieve the same classification accuracy as in the original space for a given pattern recognition problem; and it finds the necessary feature vectors. The proposed algorithm does not deteriorate under the circumstances of equal means or equal covariances as some previous algorithms do. In addition, the decision boundary feature extraction algorithm can be used both for parametric and non-parametric classifiers. Finally, some problems encountered in analyzing high dimensional data are studied and possible solutions are proposed. First, the increased importance of the second order statistics in analyzing high dimensional data is recognized
An Extended Membrane System with Active Membranes to Solve Automatic Fuzzy Clustering Problems.
Peng, Hong; Wang, Jun; Shi, Peng; Pérez-Jiménez, Mario J; Riscos-Núñez, Agustín
2016-05-01
This paper focuses on automatic fuzzy clustering problem and proposes a novel automatic fuzzy clustering method that employs an extended membrane system with active membranes that has been designed as its computing framework. The extended membrane system has a dynamic membrane structure; since membranes can evolve, it is particularly suitable for processing the automatic fuzzy clustering problem. A modification of a differential evolution (DE) mechanism was developed as evolution rules for objects according to membrane structure and object communication mechanisms. Under the control of both the object's evolution-communication mechanism and the membrane evolution mechanism, the extended membrane system can effectively determine the most appropriate number of clusters as well as the corresponding optimal cluster centers. The proposed method was evaluated over 13 benchmark problems and was compared with four state-of-the-art automatic clustering methods, two recently developed clustering methods and six classification techniques. The comparison results demonstrate the superiority of the proposed method in terms of effectiveness and robustness. PMID:26790484
Sparse representation approaches for the classification of high-dimensional biological data
2013-01-01
Background High-throughput genomic and proteomic data have important applications in medicine including prevention, diagnosis, treatment, and prognosis of diseases, and molecular biology, for example pathway identification. Many of such applications can be formulated to classification and dimension reduction problems in machine learning. There are computationally challenging issues with regards to accurately classifying such data, and which due to dimensionality, noise and redundancy, to name a few. The principle of sparse representation has been applied to analyzing high-dimensional biological data within the frameworks of clustering, classification, and dimension reduction approaches. However, the existing sparse representation methods are inefficient. The kernel extensions are not well addressed either. Moreover, the sparse representation techniques have not been comprehensively studied yet in bioinformatics. Results In this paper, a Bayesian treatment is presented on sparse representations. Various sparse coding and dictionary learning models are discussed. We propose fast parallel active-set optimization algorithm for each model. Kernel versions are devised based on their dimension-free property. These models are applied for classifying high-dimensional biological data. Conclusions In our experiment, we compared our models with other methods on both accuracy and computing time. It is shown that our models can achieve satisfactory accuracy, and their performance are very efficient. PMID:24565287
Problem-Solving Environments (PSEs) to Support Innovation Clustering
NASA Technical Reports Server (NTRS)
Gill, Zann
1999-01-01
This paper argues that there is need for high level concepts to inform the development of Problem-Solving Environment (PSE) capability. A traditional approach to PSE implementation is to: (1) assemble a collection of tools; (2) integrate the tools; and (3) assume that collaborative work begins after the PSE is assembled. I argue for the need to start from the opposite premise, that promoting human collaboration and observing that process comes first, followed by the development of supporting tools, and finally evolution of PSE capability through input from collaborating project teams.
Identifying the number of population clusters with structure: problems and solutions.
Gilbert, Kimberly J
2016-05-01
The program structure has been used extensively to understand and visualize population genetic structure. It is one of the most commonly used clustering algorithms, cited over 11 500 times in Web of Science since its introduction in 2000. The method estimates ancestry proportions to assign individuals to clusters, and post hoc analyses of results may indicate the most likely number of clusters, or populations, on the landscape. However, as has been shown in this issue of Molecular Ecology Resources by Puechmaille (), when sampling is uneven across populations or across hierarchical levels of population structure, these post hoc analyses can be inaccurate and identify an incorrect number of population clusters. To solve this problem, Puechmaille () presents strategies for subsampling and new analysis methods that are robust to uneven sampling to improve inferences of the number of population clusters. PMID:27062588
Automated high-dimensional flow cytometric data analysis
Pyne, Saumyadipta; Hu, Xinli; Wang, Kui; Rossin, Elizabeth; Lin, Tsung-I; Maier, Lisa M.; Baecher-Allan, Clare; McLachlan, Geoffrey J.; Tamayo, Pablo; Hafler, David A.; De Jager, Philip L.; Mesirov, Jill P.
2009-01-01
Flow cytometric analysis allows rapid single cell interrogation of surface and intracellular determinants by measuring fluorescence intensity of fluorophore-conjugated reagents. The availability of new platforms, allowing detection of increasing numbers of cell surface markers, has challenged the traditional technique of identifying cell populations by manual gating and resulted in a growing need for the development of automated, high-dimensional analytical methods. We present a direct multivariate finite mixture modeling approach, using skew and heavy-tailed distributions, to address the complexities of flow cytometric analysis and to deal with high-dimensional cytometric data without the need for projection or transformation. We demonstrate its ability to detect rare populations, to model robustly in the presence of outliers and skew, and to perform the critical task of matching cell populations across samples that enables downstream analysis. This advance will facilitate the application of flow cytometry to new, complex biological and clinical problems. PMID:19443687
HIGH DIMENSIONAL COVARIANCE MATRIX ESTIMATION IN APPROXIMATE FACTOR MODELS
Fan, Jianqing; Liao, Yuan; Mincheva, Martina
2012-01-01
The variance covariance matrix plays a central role in the inferential theories of high dimensional factor models in finance and economics. Popular regularization methods of directly exploiting sparsity are not directly applicable to many financial problems. Classical methods of estimating the covariance matrices are based on the strict factor models, assuming independent idiosyncratic components. This assumption, however, is restrictive in practical applications. By assuming sparse error covariance matrix, we allow the presence of the cross-sectional correlation even after taking out common factors, and it enables us to combine the merits of both methods. We estimate the sparse covariance using the adaptive thresholding technique as in Cai and Liu (2011), taking into account the fact that direct observations of the idiosyncratic components are unavailable. The impact of high dimensionality on the covariance matrix estimation based on the factor structure is then studied. PMID:22661790
ANISOTROPIC THERMAL CONDUCTION AND THE COOLING FLOW PROBLEM IN GALAXY CLUSTERS
Parrish, Ian J.; Sharma, Prateek; Quataert, Eliot
2009-09-20
We examine the long-standing cooling flow problem in galaxy clusters with three-dimensional magnetohydrodynamics simulations of isolated clusters including radiative cooling and anisotropic thermal conduction along magnetic field lines. The central regions of the intracluster medium (ICM) can have cooling timescales of {approx}200 Myr or shorter-in order to prevent a cooling catastrophe the ICM must be heated by some mechanism such as active galactic nucleus feedback or thermal conduction from the thermal reservoir at large radii. The cores of galaxy clusters are linearly unstable to the heat-flux-driven buoyancy instability (HBI), which significantly changes the thermodynamics of the cluster core. The HBI is a convective, buoyancy-driven instability that rearranges the magnetic field to be preferentially perpendicular to the temperature gradient. For a wide range of parameters, our simulations demonstrate that in the presence of the HBI, the effective radial thermal conductivity is reduced to {approx}<10% of the full Spitzer conductivity. With this suppression of conductive heating, the cooling catastrophe occurs on a timescale comparable to the central cooling time of the cluster. Thermal conduction alone is thus unlikely to stabilize clusters with low central entropies and short central cooling timescales. High central entropy clusters have sufficiently long cooling times that conduction can help stave off the cooling catastrophe for cosmologically interesting timescales.
NASA Astrophysics Data System (ADS)
Masood, Tabasum
2016-07-01
The distribution of galaxies in the universe can be well understood by correlation function analysis. The lowest order two point auto correlation function has remained a successful tool for understanding the galaxy clustering phenomena. The two point correlation function is a probability of finding two galaxies in a given volume separated by some particular distance. Given a random galaxy in a location, the correlation function describes the probability that another galaxy will be found within a given distance .The correlation function tool is important for theoretical models of physical cosmology because it provides means of testing models which assume different things about the contents of the universe Correlation function is one of the way to characterize the distribution of galaxies in the space . This can be done by observations and can be extracted from numerical N-body experiments. Correlation function is a natural quantity in theoretical dynamical description of gravitating systems. These correlations can answer many interesting questions about the evolution and the distribution of galaxies.
High dimensional cohomology of discrete groups.
Brown, K S
1976-06-01
For a large class of discrete groups Gamma, relations are established between the high dimensional cohomology of Gamma and the cohomology of the normalizers of the finite subgroups of Gamma. The results are stated in terms of a generalization of Tate cohomology recently constructed by F. T. Farrell. As an illustration of these results, it is shown that one can recover a cohomology calculation of Lee and Szczarba, which they used to calculate the odd torsion in K(3)(Z). PMID:16592322
High dimensional cohomology of discrete groups
Brown, Kenneth S.
1976-01-01
For a large class of discrete groups Γ, relations are established between the high dimensional cohomology of Γ and the cohomology of the normalizers of the finite subgroups of Γ. The results are stated in terms of a generalization of Tate cohomology recently constructed by F. T. Farrell. As an illustration of these results, it is shown that one can recover a cohomology calculation of Lee and Szczarba, which they used to calculate the odd torsion in K3(Z). PMID:16592322
ERIC Educational Resources Information Center
Brusco, Michael J.; Kohn, Hans-Friedrich
2009-01-01
The clique partitioning problem (CPP) requires the establishment of an equivalence relation for the vertices of a graph such that the sum of the edge costs associated with the relation is minimized. The CPP has important applications for the social sciences because it provides a framework for clustering objects measured on a collection of nominal…
Locating landmarks on high-dimensional free energy surfaces.
Chen, Ming; Yu, Tang-Qing; Tuckerman, Mark E
2015-03-17
Coarse graining of complex systems possessing many degrees of freedom can often be a useful approach for analyzing and understanding key features of these systems in terms of just a few variables. The relevant energy landscape in a coarse-grained description is the free energy surface as a function of the coarse-grained variables, which, despite the dimensional reduction, can still be an object of high dimension. Consequently, navigating and exploring this high-dimensional free energy surface is a nontrivial task. In this paper, we use techniques from multiscale modeling, stochastic optimization, and machine learning to devise a strategy for locating minima and saddle points (termed "landmarks") on a high-dimensional free energy surface "on the fly" and without requiring prior knowledge of or an explicit form for the surface. In addition, we propose a compact graph representation of the landmarks and connections between them, and we show that the graph nodes can be subsequently analyzed and clustered based on key attributes that elucidate important properties of the system. Finally, we show that knowledge of landmark locations allows for the efficient determination of their relative free energies via enhanced sampling techniques. PMID:25737545
NASA Astrophysics Data System (ADS)
Stewart, John; Miller, Mayo; Audo, Christine; Stewart, Gay
2012-12-01
This study examined the evolution of student responses to seven contextually different versions of two Force Concept Inventory questions in an introductory physics course at the University of Arkansas. The consistency in answering the closely related questions evolved little over the seven-question exam. A model for the state of student knowledge involving the probability of selecting one of the multiple-choice answers was developed. Criteria for using clustering algorithms to extract model parameters were explored and it was found that the overlap between the probability distributions of the model vectors was an important parameter in characterizing the cluster models. The course data were then clustered and the extracted model showed that students largely fit into two groups both pre- and postinstruction: one that answered all questions correctly with high probability and one that selected the distracter representing the same misconception with high probability. For the course studied, 14% of the students were left with persistent misconceptions post instruction on a static force problem and 30% on a dynamic Newton’s third law problem. These students selected the answer representing the predominant misconception slightly more consistently postinstruction, indicating that the course studied had been ineffective at moving this subgroup of students nearer a Newtonian force concept and had instead moved them slightly farther away from a correct conceptual understanding of these two problems. The consistency in answering pairs of problems with varied physical contexts is shown to be an important supplementary statistic to the score on the problems and suggests that the inclusion of such problem pairs in future conceptual inventories would be efficacious. Multiple, contextually varied questions further probe the structure of students’ knowledge. To allow working instructors to make use of the additional insight gained from cluster analysis, it is our hope that the
Technical Report: Scalable Parallel Algorithms for High Dimensional Numerical Integration
Masalma, Yahya; Jiao, Yu
2010-10-01
We implemented a scalable parallel quasi-Monte Carlo numerical high-dimensional integration for tera-scale data points. The implemented algorithm uses the Sobol s quasi-sequences to generate random samples. Sobol s sequence was used to avoid clustering effects in the generated random samples and to produce low-discrepancy random samples which cover the entire integration domain. The performance of the algorithm was tested. Obtained results prove the scalability and accuracy of the implemented algorithms. The implemented algorithm could be used in different applications where a huge data volume is generated and numerical integration is required. We suggest using the hyprid MPI and OpenMP programming model to improve the performance of the algorithms. If the mixed model is used, attention should be paid to the scalability and accuracy.
NASA Astrophysics Data System (ADS)
Konno, Yohko; Suzuki, Keiji
This paper describes an approach to development of a solution algorithm of a general-purpose for large scale problems using “Local Clustering Organization (LCO)” as a new solution for Job-shop scheduling problem (JSP). Using a performance effective large scale scheduling in the study of usual LCO, a solving JSP keep stability induced better solution is examined. In this study for an improvement of a performance of a solution for JSP, processes to a optimization by LCO is examined, and a scheduling solution-structure is extended to a new solution-structure based on machine-division. A solving method introduced into effective local clustering for the solution-structure is proposed as an extended LCO. An extended LCO has an algorithm which improves scheduling evaluation efficiently by clustering of parallel search which extends over plural machines. A result verified by an application of extended LCO on various scale of problems proved to conduce to minimizing make-span and improving on the stable performance.
Mode Estimation for High Dimensional Discrete Tree Graphical Models
Chen, Chao; Liu, Han; Metaxas, Dimitris N.; Zhao, Tianqi
2014-01-01
This paper studies the following problem: given samples from a high dimensional discrete distribution, we want to estimate the leading (δ, ρ)-modes of the underlying distributions. A point is defined to be a (δ, ρ)-mode if it is a local optimum of the density within a δ-neighborhood under metric ρ. As we increase the “scale” parameter δ, the neighborhood size increases and the total number of modes monotonically decreases. The sequence of the (δ, ρ)-modes reveal intrinsic topographical information of the underlying distributions. Though the mode finding problem is generally intractable in high dimensions, this paper unveils that, if the distribution can be approximated well by a tree graphical model, mode characterization is significantly easier. An efficient algorithm with provable theoretical guarantees is proposed and is applied to applications like data analysis and multiple predictions. PMID:25620859
A cluster-analytic study of substance problems and mental health among street youths.
Adlaf, E M; Zdanowicz, Y M
1999-11-01
Based on a cluster analysis of 211 street youths aged 13-24 years interviewed in 1992 in Toronto, Ontario, Canada, we describe the configuration of mental health and substance use outcomes. Eight clusters were suggested: Entrepreneurs (n = 19) were frequently involved in delinquent activity and were highly entrenched in the street lifestyle; Drifters (n = 35) had infrequent social contact, displayed lower than average family dysfunction, and were not highly entrenched in the street lifestyle; Partiers (n = 40) were distinguished by their recreational motivation for alcohol and drug use and their below average entrenchment in the street lifestyle; Retreatists (n = 32) were distinguished by their high coping motivation for substance use; Fringers (n = 48) were involved marginally in the street lifestyle and showed lower than average family dysfunction; Transcenders (n = 21), despite above average physical and sexual abuse, reported below average mental health or substance use problems; Vulnerables (n = 12) were characterized by high family dysfunction (including physical and sexual abuse), elevated mental health outcomes, and use of alcohol and other drugs motivated by coping and escapism; Sex Workers (n = 4) were highly entrenched in the street lifestyle and reported frequent commercial sexual work, above average sexual abuse, and extensive use of crack cocaine. The results showed that distress, self-esteem, psychotic thoughts, attempted suicide, alcohol problems, drug problems, dual substance problems, and dual disorders varied significantly among the eight clusters. Overall, the findings suggest the need for differential programming. The data showed that risk factors, mental health, and substance use outcomes vary among this population. Also, for some the web of mental health and substance use problems is inseparable. PMID:10548440
GLOBALLY ADAPTIVE QUANTILE REGRESSION WITH ULTRA-HIGH DIMENSIONAL DATA
Zheng, Qi; Peng, Limin; He, Xuming
2015-01-01
Quantile regression has become a valuable tool to analyze heterogeneous covaraite-response associations that are often encountered in practice. The development of quantile regression methodology for high dimensional covariates primarily focuses on examination of model sparsity at a single or multiple quantile levels, which are typically prespecified ad hoc by the users. The resulting models may be sensitive to the specific choices of the quantile levels, leading to difficulties in interpretation and erosion of confidence in the results. In this article, we propose a new penalization framework for quantile regression in the high dimensional setting. We employ adaptive L1 penalties, and more importantly, propose a uniform selector of the tuning parameter for a set of quantile levels to avoid some of the potential problems with model selection at individual quantile levels. Our proposed approach achieves consistent shrinkage of regression quantile estimates across a continuous range of quantiles levels, enhancing the flexibility and robustness of the existing penalized quantile regression methods. Our theoretical results include the oracle rate of uniform convergence and weak convergence of the parameter estimators. We also use numerical studies to confirm our theoretical findings and illustrate the practical utility of our proposal. PMID:26604424
Spatially Weighted Principal Component Regression for High-dimensional Prediction
Shen, Dan; Zhu, Hongtu
2015-01-01
We consider the problem of using high dimensional data residing on graphs to predict a low-dimensional outcome variable, such as disease status. Examples of data include time series and genetic data measured on linear graphs and imaging data measured on triangulated graphs (or lattices), among many others. Many of these data have two key features including spatial smoothness and intrinsically low dimensional structure. We propose a simple solution based on a general statistical framework, called spatially weighted principal component regression (SWPCR). In SWPCR, we introduce two sets of weights including importance score weights for the selection of individual features at each node and spatial weights for the incorporation of the neighboring pattern on the graph. We integrate the importance score weights with the spatial weights in order to recover the low dimensional structure of high dimensional data. We demonstrate the utility of our methods through extensive simulations and a real data analysis based on Alzheimer’s disease neuroimaging initiative data. PMID:26213452
Saltstone, R; Fraboni, M
1990-11-01
This study utilized the four most commonly employed clustering techniques (CLINK, SLINK, UPGMA, and Ward's) to illustrate the dissimilarity of cluster group membership (based upon short-form MMPI scale scores and a measure of alcohol dependency) between partitions in a sample of 113 impaired driving offenders. Results, examined with the Rand index of cluster comparison, demonstrated that cluster group membership can be so different between alternative clustering methods as to equal chance assignment. Cautions are given with regard to the use of cluster analysis for other than exploratory work. In particular, psychologists are cautioned against attempting to use cluster analysis based upon personality inventory scores (which can never be wholly reliable or discrete) for patient classification. PMID:2286695
A Selective Overview of Variable Selection in High Dimensional Feature Space
Fan, Jianqing
2010-01-01
High dimensional statistical problems arise from diverse fields of scientific research and technological development. Variable selection plays a pivotal role in contemporary statistical learning and scientific discoveries. The traditional idea of best subset selection methods, which can be regarded as a specific form of penalized likelihood, is computationally too expensive for many modern statistical applications. Other forms of penalized likelihood methods have been successfully developed over the last decade to cope with high dimensionality. They have been widely applied for simultaneously selecting important variables and estimating their effects in high dimensional statistical inference. In this article, we present a brief account of the recent developments of theory, methods, and implementations for high dimensional variable selection. What limits of the dimensionality such methods can handle, what the role of penalty functions is, and what the statistical properties are rapidly drive the advances of the field. The properties of non-concave penalized likelihood and its roles in high dimensional statistical modeling are emphasized. We also review some recent advances in ultra-high dimensional variable selection, with emphasis on independence screening and two-scale methods. PMID:21572976
A Selective Overview of Variable Selection in High Dimensional Feature Space.
Fan, Jianqing; Lv, Jinchi
2010-01-01
High dimensional statistical problems arise from diverse fields of scientific research and technological development. Variable selection plays a pivotal role in contemporary statistical learning and scientific discoveries. The traditional idea of best subset selection methods, which can be regarded as a specific form of penalized likelihood, is computationally too expensive for many modern statistical applications. Other forms of penalized likelihood methods have been successfully developed over the last decade to cope with high dimensionality. They have been widely applied for simultaneously selecting important variables and estimating their effects in high dimensional statistical inference. In this article, we present a brief account of the recent developments of theory, methods, and implementations for high dimensional variable selection. What limits of the dimensionality such methods can handle, what the role of penalty functions is, and what the statistical properties are rapidly drive the advances of the field. The properties of non-concave penalized likelihood and its roles in high dimensional statistical modeling are emphasized. We also review some recent advances in ultra-high dimensional variable selection, with emphasis on independence screening and two-scale methods. PMID:21572976
NASA Astrophysics Data System (ADS)
Heggie, D.; Hut, P.
2003-10-01
focus on N = 106 for two main reasons: first, direct numerical integrations of N-body systems are beginning to approach this threshold, and second, globular star clusters provide remarkably accurate physical instantiations of the idealized N-body problem with N = 105 - 106. The authors are distinguished contributors to the study of star-cluster dynamics and the gravitational N-body problem. The book contains lucid and concise descriptions of most of the important tools in the subject, with only a modest bias towards the authors' own interests. These tools include the two-body relaxation approximation, the Vlasov and Fokker-Planck equations, regularization of close encounters, conducting fluid models, Hill's approximation, Heggie's law for binary star evolution, symplectic integration algorithms, Liapunov exponents, and so on. The book also provides an up-to-date description of the principal processes that drive the evolution of idealized N-body systems - two-body relaxation, mass segregation, escape, core collapse and core bounce, binary star hardening, gravothermal oscillations - as well as additional processes such as stellar collisions and tidal shocks that affect real star clusters but not idealized N-body systems. In a relatively short (300 pages plus appendices) book such as this, many topics have to be omitted. The reader who is hoping to learn about the phenomenology of star clusters will be disappointed, as the description of their properties is limited to only a page of text; there is also almost no discussion of other, equally interesting N-body systems such as galaxies(N approx 106 - 1012), open clusters (N simeq 102 - 104), planetary systems, or the star clusters surrounding black holes that are found in the centres of most galaxies. All of these omissions are defensible decisions. Less defensible is the uneven set of references in the text; for example, nowhere is the reader informed that the classic predecessor to this work was Spitzer's 1987 monograph
Kennedy, Angie C; Adams, Adrienne E
2016-04-01
Using a cluster analysis approach with a sample of 205 young mothers recruited from community sites in an urban Midwestern setting, we examined the effects of cumulative violence exposure (community violence exposure, witnessing intimate partner violence, physical abuse by a caregiver, and sexual victimization, all with onset prior to age 13) on school participation, as mediated by attention and behavior problems in school. We identified five clusters of cumulative exposure, and found that the HiAll cluster (high levels of exposure to all four types) consistently fared the worst, with significantly higher attention and behavior problems, and lower school participation, in comparison with the LoAll cluster (low levels of exposure to all types). Behavior problems were a significant mediator of the effects of cumulative violence exposure on school participation, but attention problems were not. PMID:25538121
Graphics Processing Units and High-Dimensional Optimization
Zhou, Hua; Lange, Kenneth; Suchard, Marc A.
2011-01-01
This paper discusses the potential of graphics processing units (GPUs) in high-dimensional optimization problems. A single GPU card with hundreds of arithmetic cores can be inserted in a personal computer and dramatically accelerates many statistical algorithms. To exploit these devices fully, optimization algorithms should reduce to multiple parallel tasks, each accessing a limited amount of data. These criteria favor EM and MM algorithms that separate parameters and data. To a lesser extent block relaxation and coordinate descent and ascent also qualify. We demonstrate the utility of GPUs in nonnegative matrix factorization, PET image reconstruction, and multidimensional scaling. Speedups of 100 fold can easily be attained. Over the next decade, GPUs will fundamentally alter the landscape of computational statistics. It is time for more statisticians to get on-board. PMID:21847315
High-dimensional bolstered error estimation
Sima, Chao; Braga-Neto, Ulisses M.; Dougherty, Edward R.
2011-01-01
Motivation: In small-sample settings, bolstered error estimation has been shown to perform better than cross-validation and competitively with bootstrap with regard to various criteria. The key issue for bolstering performance is the variance setting for the bolstering kernel. Heretofore, this variance has been determined in a non-parametric manner from the data. Although bolstering based on this variance setting works well for small feature sets, results can deteriorate for high-dimensional feature spaces. Results: This article computes an optimal kernel variance depending on the classification rule, sample size, model and feature space, both the original number and the number remaining after feature selection. A key point is that the optimal variance is robust relative to the model. This allows us to develop a method for selecting a suitable variance to use in real-world applications where the model is not known, but the other factors in determining the optimal kernel are known. Availability: Companion website at http://compbio.tgen.org/paper_supp/high_dim_bolstering Contact: edward@mail.ece.tamu.edu PMID:21914630
Solution of relativistic quantum optics problems using clusters of graphical processing units
Gordon, D.F. Hafizi, B.; Helle, M.H.
2014-06-15
Numerical solution of relativistic quantum optics problems requires high performance computing due to the rapid oscillations in a relativistic wavefunction. Clusters of graphical processing units are used to accelerate the computation of a time dependent relativistic wavefunction in an arbitrary external potential. The stationary states in a Coulomb potential and uniform magnetic field are determined analytically and numerically, so that they can used as initial conditions in fully time dependent calculations. Relativistic energy levels in extreme magnetic fields are recovered as a means of validation. The relativistic ionization rate is computed for an ion illuminated by a laser field near the usual barrier suppression threshold, and the ionizing wavefunction is displayed.
Anomaly Detection in Large Sets of High-Dimensional Symbol Sequences
NASA Technical Reports Server (NTRS)
Budalakoti, Suratna; Srivastava, Ashok N.; Akella, Ram; Turkov, Eugene
2006-01-01
This paper addresses the problem of detecting and describing anomalies in large sets of high-dimensional symbol sequences. The approach taken uses unsupervised clustering of sequences using the normalized longest common subsequence (LCS) as a similarity measure, followed by detailed analysis of outliers to detect anomalies. As the LCS measure is expensive to compute, the first part of the paper discusses existing algorithms, such as the Hunt-Szymanski algorithm, that have low time-complexity. We then discuss why these algorithms often do not work well in practice and present a new hybrid algorithm for computing the LCS that, in our tests, outperforms the Hunt-Szymanski algorithm by a factor of five. The second part of the paper presents new algorithms for outlier analysis that provide comprehensible indicators as to why a particular sequence was deemed to be an outlier. The algorithms provide a coherent description to an analyst of the anomalies in the sequence, compared to more normal sequences. The algorithms we present are general and domain-independent, so we discuss applications in related areas such as anomaly detection.
Optimal control problem for the three-sector economic model of a cluster
NASA Astrophysics Data System (ADS)
Murzabekov, Zainel; Aipanov, Shamshi; Usubalieva, Saltanat
2016-08-01
The problem of optimal control for the three-sector economic model of a cluster is considered. Task statement is to determine the optimal distribution of investment and manpower in moving the system from a given initial state to desired final state. To solve the optimal control problem with finite-horizon planning, in case of fixed ends of trajectories, with box constraints, the method of Lagrange multipliers of a special type is used. This approach allows to represent the desired control in the form of synthesis control, depending on state of the system and current time. The results of numerical calculations for an instance of three-sector model of the economy show the effectiveness of the proposed method.
Solving the inverse Ising problem by mean-field methods in a clustered phase space with many states
NASA Astrophysics Data System (ADS)
Decelle, Aurélien; Ricci-Tersenghi, Federico
2016-07-01
In this work we explain how to properly use mean-field methods to solve the inverse Ising problem when the phase space is clustered, that is, many states are present. The clustering of the phase space can occur for many reasons, e.g., when a system undergoes a phase transition, but also when data are collected in different regimes (e.g., quiescent and spiking regimes in neural networks). Mean-field methods for the inverse Ising problem are typically used without taking into account the eventual clustered structure of the input configurations and may lead to very poor inference (e.g., in the low-temperature phase of the Curie-Weiss model). In this work we explain how to modify mean-field approaches when the phase space is clustered and we illustrate the effectiveness of our method on different clustered structures (low-temperature phases of Curie-Weiss and Hopfield models).
Solving the inverse Ising problem by mean-field methods in a clustered phase space with many states.
Decelle, Aurélien; Ricci-Tersenghi, Federico
2016-07-01
In this work we explain how to properly use mean-field methods to solve the inverse Ising problem when the phase space is clustered, that is, many states are present. The clustering of the phase space can occur for many reasons, e.g., when a system undergoes a phase transition, but also when data are collected in different regimes (e.g., quiescent and spiking regimes in neural networks). Mean-field methods for the inverse Ising problem are typically used without taking into account the eventual clustered structure of the input configurations and may lead to very poor inference (e.g., in the low-temperature phase of the Curie-Weiss model). In this work we explain how to modify mean-field approaches when the phase space is clustered and we illustrate the effectiveness of our method on different clustered structures (low-temperature phases of Curie-Weiss and Hopfield models). PMID:27575082
NASA Astrophysics Data System (ADS)
Kel'manov, A. V.; Khandeev, V. I.
2016-02-01
The strongly NP-hard problem of partitioning a finite set of points of Euclidean space into two clusters of given sizes (cardinalities) minimizing the sum (over both clusters) of the intracluster sums of squared distances from the elements of the clusters to their centers is considered. It is assumed that the center of one of the sought clusters is specified at the desired (arbitrary) point of space (without loss of generality, at the origin), while the center of the other one is unknown and determined as the mean value over all elements of this cluster. It is shown that unless P = NP, there is no fully polynomial-time approximation scheme for this problem, and such a scheme is substantiated in the case of a fixed space dimension.
High dimensional decision dilemmas in climate models
NASA Astrophysics Data System (ADS)
Bracco, A.; Neelin, J. D.; Luo, H.; McWilliams, J. C.; Meyerson, J. E.
2013-05-01
An important source of uncertainty in climate models is linked to the calibration of model parameters. Interest in systematic and automated parameter optimization procedures stems from the desire to improve the model climatology and to quantify the average sensitivity associated with potential changes in the climate system. Neelin et al. (2010) used a quadratic metamodel to objectively calibrate an atmospheric circulation model (AGCM) around four adjustable parameters. The metamodel accurately estimates global spatial averages of common fields of climatic interest, from precipitation, to low and high level winds, from temperature at various levels to sea level pressure and geopotential height, while providing a computationally cheap strategy to explore the influence of parameter settings. Here, guided by the metamodel, the ambiguities or dilemmas related to the decision making process in relation to model sensitivity and optimization are examined. Simulations of current climate are subject to considerable regional-scale biases. Those biases may vary substantially depending on the climate variable considered, and/or on the performance metric adopted. Common dilemmas are associated with model revisions yielding improvement in one field or regional pattern or season, but degradation in another, or improvement in the model climatology but degradation in the interannual variability representation. Challenges are posed to the modeler by the high dimensionality of the model output fields and by the large number of adjustable parameters. The use of the metamodel in the optimization strategy helps visualize trade-offs at a regional level, e.g. how mismatches between sensitivity and error spatial fields yield regional errors under minimization of global objective functions.
High dimensional decision dilemmas in climate models
NASA Astrophysics Data System (ADS)
Bracco, A.; Neelin, J. D.; Luo, H.; McWilliams, J. C.; Meyerson, J. E.
2013-10-01
An important source of uncertainty in climate models is linked to the calibration of model parameters. Interest in systematic and automated parameter optimization procedures stems from the desire to improve the model climatology and to quantify the average sensitivity associated with potential changes in the climate system. Building upon on the smoothness of the response of an atmospheric circulation model (AGCM) to changes of four adjustable parameters, Neelin et al. (2010) used a quadratic metamodel to objectively calibrate the AGCM. The metamodel accurately estimates global spatial averages of common fields of climatic interest, from precipitation, to low and high level winds, from temperature at various levels to sea level pressure and geopotential height, while providing a computationally cheap strategy to explore the influence of parameter settings. Here, guided by the metamodel, the ambiguities or dilemmas related to the decision making process in relation to model sensitivity and optimization are examined. Simulations of current climate are subject to considerable regional-scale biases. Those biases may vary substantially depending on the climate variable considered, and/or on the performance metric adopted. Common dilemmas are associated with model revisions yielding improvement in one field or regional pattern or season, but degradation in another, or improvement in the model climatology but degradation in the interannual variability representation. Challenges are posed to the modeler by the high dimensionality of the model output fields and by the large number of adjustable parameters. The use of the metamodel in the optimization strategy helps visualize trade-offs at a regional level, e.g., how mismatches between sensitivity and error spatial fields yield regional errors under minimization of global objective functions.
Toward a nonlinear ensemble filter for high-dimensional systems
NASA Astrophysics Data System (ADS)
Bengtsson, Thomas; Snyder, Chris; Nychka, Doug
2003-12-01
Many geophysical problems are characterized by high-dimensional, nonlinear systems and pose difficult challenges for real-time data assimilation (updating) and forecasting. The present work builds on the ensemble Kalman filter (EnsKF), with the goal of producing ensemble filtering techniques applicable to non-Gaussian densities and high-dimensional systems. Three filtering algorithms, based on representing the prior density as a Gaussian mixture, are presented. The first, referred to as a mixture ensemble Kalman filter (XEnsF), models local covariance structures adaptively using nearest neighbors. The XEnsF is effective in a three-dimensional system, but the required ensemble grows rapidly with the dimension and, even in a 40-dimensional system, we find the XEnsF to be unstable and inferior to the EnsKF for all computationally feasible ensemble sizes. A second algorithm, the local-local ensemble filter (LLEnsF), combines localizations in physical as well as phase space, allowing the update step in high-dimensional systems to be decomposed into a sequence of lower-dimensional updates tractable by the XEnsF. Given the same prior forecasts in a 40-dimensional system, the LLEnsF update produces more accurate state estimates than the EnsKF if the forecast distributions are sufficiently non-Gaussian. Cycling the LLEnsF for long times, however, produces results inferior to the EnsKF because the LLEnsF ignores spatial continuity or smoothness between local state estimates. To address this weakness of the LLEnsF, we consider ways of enforcing spatial smoothness by conditioning the local updates on the prior estimates outside the localization in physical space. These considerations yield a third algorithm, which is a hybrid of the LLEnsF and the EnsKF. The hybrid uses information from the EnsKF to ensure spatial continuity of local updates and outperforms the EnsKF by 5.7% in RMS error in the 40-dimensional system.
Visualization of High-Dimensionality Data Using Virtual Reality
NASA Astrophysics Data System (ADS)
Djorgovski, S. G.; Donalek, C.; Davidoff, S.; Lombeyda, S.
2015-12-01
An effective visualization of complex and high-dimensionality data sets is now a critical bottleneck on the path from data to discovery in all fields. Visual pattern recognition is the bridge between human intuition and understanding, and the quantitative content of the data and the relationships present there (correlations, outliers, clustering, etc.). We are developing a novel platform for visualization of complex, multi-dimensional data, using immersive virtual reality (VR), that leverages the recent rapid developments in the availability of commodity hardware and development software. VR immersion has been shown to significantly increase the effective visual perception and intuition, compared to the traditional flat-screen tools. This allows to more easily perceive higher dimensional spaces, with an advantage for a visual exploration of complex data compared to the traditional visualization methods. Immersive VR also offers a natural way for a collaborative visual exploration of data, with multiple users interacting with each other and with their data in the same perceptive data space.
Engineering two-photon high-dimensional states through quantum interference
Zhang, Yingwen; Roux, Filippus S.; Konrad, Thomas; Agnew, Megan; Leach, Jonathan; Forbes, Andrew
2016-01-01
Many protocols in quantum science, for example, linear optical quantum computing, require access to large-scale entangled quantum states. Such systems can be realized through many-particle qubits, but this approach often suffers from scalability problems. An alternative strategy is to consider a lesser number of particles that exist in high-dimensional states. The spatial modes of light are one such candidate that provides access to high-dimensional quantum states, and thus they increase the storage and processing potential of quantum information systems. We demonstrate the controlled engineering of two-photon high-dimensional states entangled in their orbital angular momentum through Hong-Ou-Mandel interference. We prepare a large range of high-dimensional entangled states and implement precise quantum state filtering. We characterize the full quantum state before and after the filter, and are thus able to determine that only the antisymmetric component of the initial state remains. This work paves the way for high-dimensional processing and communication of multiphoton quantum states, for example, in teleportation beyond qubits. PMID:26933685
Engineering two-photon high-dimensional states through quantum interference.
Zhang, Yingwen; Roux, Filippus S; Konrad, Thomas; Agnew, Megan; Leach, Jonathan; Forbes, Andrew
2016-02-01
Many protocols in quantum science, for example, linear optical quantum computing, require access to large-scale entangled quantum states. Such systems can be realized through many-particle qubits, but this approach often suffers from scalability problems. An alternative strategy is to consider a lesser number of particles that exist in high-dimensional states. The spatial modes of light are one such candidate that provides access to high-dimensional quantum states, and thus they increase the storage and processing potential of quantum information systems. We demonstrate the controlled engineering of two-photon high-dimensional states entangled in their orbital angular momentum through Hong-Ou-Mandel interference. We prepare a large range of high-dimensional entangled states and implement precise quantum state filtering. We characterize the full quantum state before and after the filter, and are thus able to determine that only the antisymmetric component of the initial state remains. This work paves the way for high-dimensional processing and communication of multiphoton quantum states, for example, in teleportation beyond qubits. PMID:26933685
Collaborative Care Outcomes for Pediatric Behavioral Health Problems: A Cluster Randomized Trial
Campo, John; Kilbourne, Amy M.; Hart, Jonathan; Sakolsky, Dara; Wisniewski, Stephen
2014-01-01
OBJECTIVE: To assess the efficacy of collaborative care for behavior problems, attention-deficit/hyperactivity disorder (ADHD), and anxiety in pediatric primary care (Doctor Office Collaborative Care; DOCC). METHODS: Children and their caregivers participated from 8 pediatric practices that were cluster randomized to DOCC (n = 160) or enhanced usual care (EUC; n = 161). In DOCC, a care manager delivered a personalized, evidence-based intervention. EUC patients received psychoeducation and a facilitated specialty care referral. Care processes measures were collected after the 6-month intervention period. Family outcome measures included the Vanderbilt ADHD Diagnostic Parent Rating Scale, Parenting Stress Index-Short Form, Individualized Goal Attainment Ratings, and Clinical Global Impression-Improvement Scale. Most measures were collected at baseline, and 6-, 12-, and 18-month assessments. Provider outcome measures examined perceived treatment change, efficacy, and obstacles, and practice climate. RESULTS: DOCC (versus EUC) was associated with higher rates of treatment initiation (99.4% vs 54.2%; P < .001) and completion (76.6% vs 11.6%, P < .001), improvement in behavior problems, hyperactivity, and internalizing problems (P < .05 to .01), and parental stress (P < .05–.001), remission in behavior and internalizing problems (P < .01, .05), goal improvement (P < .05 to .001), treatment response (P < .05), and consumer satisfaction (P < .05). DOCC pediatricians reported greater perceived practice change, efficacy, and skill use to treat ADHD (P < .05 to .01). CONCLUSIONS: Implementing a collaborative care intervention for behavior problems in community pediatric practices is feasible and broadly effective, supporting the utility of integrated behavioral health care services. PMID:24664093
NASA Astrophysics Data System (ADS)
Li, Weixuan; Lin, Guang; Li, Bing
2016-09-01
Many uncertainty quantification (UQ) approaches suffer from the curse of dimensionality, that is, their computational costs become intractable for problems involving a large number of uncertainty parameters. In these situations, the classic Monte Carlo often remains the preferred method of choice because its convergence rate O (n - 1 / 2), where n is the required number of model simulations, does not depend on the dimension of the problem. However, many high-dimensional UQ problems are intrinsically low-dimensional, because the variation of the quantity of interest (QoI) is often caused by only a few latent parameters varying within a low-dimensional subspace, known as the sufficient dimension reduction (SDR) subspace in the statistics literature. Motivated by this observation, we propose two inverse regression-based UQ algorithms (IRUQ) for high-dimensional problems. Both algorithms use inverse regression to convert the original high-dimensional problem to a low-dimensional one, which is then efficiently solved by building a response surface for the reduced model, for example via the polynomial chaos expansion. The first algorithm, which is for the situations where an exact SDR subspace exists, is proved to converge at rate O (n-1), hence much faster than MC. The second algorithm, which doesn't require an exact SDR, employs the reduced model as a control variate to reduce the error of the MC estimate. The accuracy gain could still be significant, depending on how well the reduced model approximates the original high-dimensional one. IRUQ also provides several additional practical advantages: it is non-intrusive; it does not require computing the high-dimensional gradient of the QoI; and it reports an error bar so the user knows how reliable the result is.
Extremely high-dimensional feature selection via feature generating samplings.
Li, Shutao; Wei, Dan
2014-06-01
To select informative features on extremely high-dimensional problems, in this paper, a sampling scheme is proposed to enhance the efficiency of recently developed feature generating machines (FGMs). Note that in FGMs O(mlogr) time complexity should be taken to order the features by their scores; the entire computational cost of feature ordering will become unbearable when m is very large, for example, m > 10(11) , where m is the feature dimensionality and r is the size of the selected feature subset. To solve this problem, in this paper, we propose a feature generating sampling method, which can reduce this computational complexity to O(Gslog(G)+G(G+log(G))) while preserving the most informative features in a feature buffer, where Gs is the maximum number of nonzero features for each instance and G is the buffer size. Moreover, we show that our proposed sampling scheme can be deemed as the birth-death process based on random processes theory, which guarantees to include most of the informative features for feature selections. Empirical studies on real-world datasets show the effectiveness of the proposed sampling method. PMID:23864272
Avoiding common pitfalls when clustering biological data.
Ronan, Tom; Qi, Zhijie; Naegle, Kristen M
2016-01-01
Clustering is an unsupervised learning method, which groups data points based on similarity, and is used to reveal the underlying structure of data. This computational approach is essential to understanding and visualizing the complex data that are acquired in high-throughput multidimensional biological experiments. Clustering enables researchers to make biological inferences for further experiments. Although a powerful technique, inappropriate application can lead biological researchers to waste resources and time in experimental follow-up. We review common pitfalls identified from the published molecular biology literature and present methods to avoid them. Commonly encountered pitfalls relate to the high-dimensional nature of biological data from high-throughput experiments, the failure to consider more than one clustering method for a given problem, and the difficulty in determining whether clustering has produced meaningful results. We present concrete examples of problems and solutions (clustering results) in the form of toy problems and real biological data for these issues. We also discuss ensemble clustering as an easy-to-implement method that enables the exploration of multiple clustering solutions and improves robustness of clustering solutions. Increased awareness of common clustering pitfalls will help researchers avoid overinterpreting or misinterpreting the results and missing valuable insights when clustering biological data. PMID:27303057
NASA Astrophysics Data System (ADS)
Hill, C.
2008-12-01
Low cost graphic cards today use many, relatively simple, compute cores to deliver support for memory bandwidth of more than 100GB/s and theoretical floating point performance of more than 500 GFlop/s. Right now this performance is, however, only accessible to highly parallel algorithm implementations that, (i) can use a hundred or more, 32-bit floating point, concurrently executing cores, (ii) can work with graphics memory that resides on the graphics card side of the graphics bus and (iii) can be partially expressed in a language that can be compiled by a graphics programming tool. In this talk we describe our experiences implementing a complete, but relatively simple, time dependent shallow-water equations simulation targeting a cluster of 30 computers each hosting one graphics card. The implementation takes into account the considerations (i), (ii) and (iii) listed previously. We code our algorithm as a series of numerical kernels. Each kernel is designed to be executed by multiple threads of a single process. Kernels are passed memory blocks to compute over which can be persistent blocks of memory on a graphics card. Each kernel is individually implemented using the NVidia CUDA language but driven from a higher level supervisory code that is almost identical to a standard model driver. The supervisory code controls the overall simulation timestepping, but is written to minimize data transfer between main memory and graphics memory (a massive performance bottle-neck on current systems). Using the recipe outlined we can boost the performance of our cluster by nearly an order of magnitude, relative to the same algorithm executing only on the cluster CPU's. Achieving this performance boost requires that many threads are available to each graphics processor for execution within each numerical kernel and that the simulations working set of data can fit into the graphics card memory. As we describe, this puts interesting upper and lower bounds on the problem sizes
High-dimensional mode analyzers for spatial quantum entanglement
Oemrawsingh, S. S. R.; Jong, J. A. de; Ma, X.; Aiello, A.; Eliel, E. R.; Hooft, G. W. 't; Woerdman, J. P.
2006-03-15
By analyzing entangled photon states in terms of high-dimensional spatial mode superpositions, it becomes feasible to expose high-dimensional entanglement, and even the nonlocality of twin photons. To this end, a proper analyzer should be designed that is capable of handling a large number of spatial modes, while still being convenient to use in an experiment. We compare two variants of a high-dimensional spatial mode analyzer on the basis of classical and quantum considerations. These analyzers have been tested in classical optical experiments.
Statistical mechanics of complex neural systems and high dimensional data
NASA Astrophysics Data System (ADS)
Advani, Madhu; Lahiri, Subhaneil; Ganguli, Surya
2013-03-01
Recent experimental advances in neuroscience have opened new vistas into the immense complexity of neuronal networks. This proliferation of data challenges us on two parallel fronts. First, how can we form adequate theoretical frameworks for understanding how dynamical network processes cooperate across widely disparate spatiotemporal scales to solve important computational problems? Second, how can we extract meaningful models of neuronal systems from high dimensional datasets? To aid in these challenges, we give a pedagogical review of a collection of ideas and theoretical methods arising at the intersection of statistical physics, computer science and neurobiology. We introduce the interrelated replica and cavity methods, which originated in statistical physics as powerful ways to quantitatively analyze large highly heterogeneous systems of many interacting degrees of freedom. We also introduce the closely related notion of message passing in graphical models, which originated in computer science as a distributed algorithm capable of solving large inference and optimization problems involving many coupled variables. We then show how both the statistical physics and computer science perspectives can be applied in a wide diversity of contexts to problems arising in theoretical neuroscience and data analysis. Along the way we discuss spin glasses, learning theory, illusions of structure in noise, random matrices, dimensionality reduction and compressed sensing, all within the unified formalism of the replica method. Moreover, we review recent conceptual connections between message passing in graphical models, and neural computation and learning. Overall, these ideas illustrate how statistical physics and computer science might provide a lens through which we can uncover emergent computational functions buried deep within the dynamical complexities of neuronal networks.
NASA Astrophysics Data System (ADS)
Evangelista, Francesco A.
2011-06-01
We report a general implementation of alternative formulations of single-reference coupled cluster theory (extended, unitary, and variational) with arbitrary-order truncation of the cluster operator. These methods are applied to compute the energy of Ne and the equilibrium properties of HF and C2. Potential energy curves for the dissociation of HF and the BeH2 model computed with the extended, variational, and unitary coupled cluster approaches are compared to those obtained from the multireference coupled cluster approach of Mukherjee et al. [J. Chem. Phys. 110, 6171 (1999)] and the internally contracted multireference coupled cluster approach [F. A. Evangelista and J. Gauss, J. Chem. Phys. 134, 114102 (2011), 10.1063/1.3559149]. In the case of Ne, HF, and C2, the alternative coupled cluster approaches yield almost identical bond length, harmonic vibrational frequency, and anharmonic constant, which are more accurate than those from traditional coupled cluster theory. For potential energy curves, the alternative coupled cluster methods are found to be more accurate than traditional coupled cluster theory, but are three to ten times less accurate than multireference coupled cluster approaches. The most challenging benchmark, the BeH2 model, highlights the strong dependence of the alternative coupled cluster theories on the choice of the Fermi vacuum. When evaluated by the accuracy to cost ratio, the alternative coupled cluster methods are not competitive with respect to traditional CC theory, in other words, the simplest theory is found to be the most effective one.
HYPOTHESIS TESTING FOR HIGH-DIMENSIONAL SPARSE BINARY REGRESSION
Mukherjee, Rajarshi; Pillai, Natesh S.; Lin, Xihong
2015-01-01
In this paper, we study the detection boundary for minimax hypothesis testing in the context of high-dimensional, sparse binary regression models. Motivated by genetic sequencing association studies for rare variant effects, we investigate the complexity of the hypothesis testing problem when the design matrix is sparse. We observe a new phenomenon in the behavior of detection boundary which does not occur in the case of Gaussian linear regression. We derive the detection boundary as a function of two components: a design matrix sparsity index and signal strength, each of which is a function of the sparsity of the alternative. For any alternative, if the design matrix sparsity index is too high, any test is asymptotically powerless irrespective of the magnitude of signal strength. For binary design matrices with the sparsity index that is not too high, our results are parallel to those in the Gaussian case. In this context, we derive detection boundaries for both dense and sparse regimes. For the dense regime, we show that the generalized likelihood ratio is rate optimal; for the sparse regime, we propose an extended Higher Criticism Test and show it is rate optimal and sharp. We illustrate the finite sample properties of the theoretical results using simulation studies. PMID:26246645
Optimization of High-Dimensional Functions through Hypercube Evaluation
Abiyev, Rahib H.; Tunay, Mustafa
2015-01-01
A novel learning algorithm for solving global numerical optimization problems is proposed. The proposed learning algorithm is intense stochastic search method which is based on evaluation and optimization of a hypercube and is called the hypercube optimization (HO) algorithm. The HO algorithm comprises the initialization and evaluation process, displacement-shrink process, and searching space process. The initialization and evaluation process initializes initial solution and evaluates the solutions in given hypercube. The displacement-shrink process determines displacement and evaluates objective functions using new points, and the search area process determines next hypercube using certain rules and evaluates the new solutions. The algorithms for these processes have been designed and presented in the paper. The designed HO algorithm is tested on specific benchmark functions. The simulations of HO algorithm have been performed for optimization of functions of 1000-, 5000-, or even 10000 dimensions. The comparative simulation results with other approaches demonstrate that the proposed algorithm is a potential candidate for optimization of both low and high dimensional functions. PMID:26339237
Blöchliger, Nicolas; Caflisch, Amedeo; Vitalis, Andreas
2015-11-10
Data mining techniques depend strongly on how the data are represented and how distance between samples is measured. High-dimensional data often contain a large number of irrelevant dimensions (features) for a given query. These features act as noise and obfuscate relevant information. Unsupervised approaches to mine such data require distance measures that can account for feature relevance. Molecular dynamics simulations produce high-dimensional data sets describing molecules observed in time. Here, we propose to globally or locally weight simulation features based on effective rates. This emphasizes, in a data-driven manner, slow degrees of freedom that often report on the metastable states sampled by the molecular system. We couple this idea to several unsupervised learning protocols. Our approach unmasks slow side chain dynamics within the native state of a miniprotein and reveals additional metastable conformations of a protein. The approach can be combined with most algorithms for clustering or dimensionality reduction. PMID:26574336
High-dimensional genomic data bias correction and data integration using MANCIE
Zang, Chongzhi; Wang, Tao; Deng, Ke; Li, Bo; Hu, Sheng'en; Qin, Qian; Xiao, Tengfei; Zhang, Shihua; Meyer, Clifford A.; He, Housheng Hansen; Brown, Myles; Liu, Jun S.; Xie, Yang; Liu, X. Shirley
2016-01-01
High-dimensional genomic data analysis is challenging due to noises and biases in high-throughput experiments. We present a computational method matrix analysis and normalization by concordant information enhancement (MANCIE) for bias correction and data integration of distinct genomic profiles on the same samples. MANCIE uses a Bayesian-supported principal component analysis-based approach to adjust the data so as to achieve better consistency between sample-wise distances in the different profiles. MANCIE can improve tissue-specific clustering in ENCODE data, prognostic prediction in Molecular Taxonomy of Breast Cancer International Consortium and The Cancer Genome Atlas data, copy number and expression agreement in Cancer Cell Line Encyclopedia data, and has broad applications in cross-platform, high-dimensional data integration. PMID:27072482
High-dimensional genomic data bias correction and data integration using MANCIE.
Zang, Chongzhi; Wang, Tao; Deng, Ke; Li, Bo; Hu, Sheng'en; Qin, Qian; Xiao, Tengfei; Zhang, Shihua; Meyer, Clifford A; He, Housheng Hansen; Brown, Myles; Liu, Jun S; Xie, Yang; Liu, X Shirley
2016-01-01
High-dimensional genomic data analysis is challenging due to noises and biases in high-throughput experiments. We present a computational method matrix analysis and normalization by concordant information enhancement (MANCIE) for bias correction and data integration of distinct genomic profiles on the same samples. MANCIE uses a Bayesian-supported principal component analysis-based approach to adjust the data so as to achieve better consistency between sample-wise distances in the different profiles. MANCIE can improve tissue-specific clustering in ENCODE data, prognostic prediction in Molecular Taxonomy of Breast Cancer International Consortium and The Cancer Genome Atlas data, copy number and expression agreement in Cancer Cell Line Encyclopedia data, and has broad applications in cross-platform, high-dimensional data integration. PMID:27072482
Bit-table based biclustering and frequent closed itemset mining in high-dimensional binary data.
Király, András; Gyenesei, Attila; Abonyi, János
2014-01-01
During the last decade various algorithms have been developed and proposed for discovering overlapping clusters in high-dimensional data. The two most prominent application fields in this research, proposed independently, are frequent itemset mining (developed for market basket data) and biclustering (applied to gene expression data analysis). The common limitation of both methodologies is the limited applicability for very large binary data sets. In this paper we propose a novel and efficient method to find both frequent closed itemsets and biclusters in high-dimensional binary data. The method is based on simple but very powerful matrix and vector multiplication approaches that ensure that all patterns can be discovered in a fast manner. The proposed algorithm has been implemented in the commonly used MATLAB environment and freely available for researchers. PMID:24616651
Visual Exploration of High-Dimensional Data through Subspace Analysis and Dynamic Projections
Liu, S.; Wang, B.; Thiagarajan, Jayaraman J.; Bremer, Peer -Timo; Pascucci, Valerio
2015-06-01
We introduce a novel interactive framework for visualizing and exploring high-dimensional datasets based on subspace analysis and dynamic projections. We assume the high-dimensional dataset can be represented by a mixture of low-dimensional linear subspaces with mixed dimensions, and provide a method to reliably estimate the intrinsic dimension and linear basis of each subspace extracted from the subspace clustering. Subsequently, we use these bases to define unique 2D linear projections as viewpoints from which to visualize the data. To understand the relationships among the different projections and to discover hidden patterns, we connect these projections through dynamic projections that create smooth animated transitions between pairs of projections. We introduce the view transition graph, which provides flexible navigation among these projections to facilitate an intuitive exploration. Finally, we provide detailed comparisons with related systems, and use real-world examples to demonstrate the novelty and usability of our proposed framework.
ClusterSculptor: Software for Expert-Steered Classification of Single Particle Mass Spectra
Zelenyuk, Alla; Imre, Dan G.; Nam, Eun Ju; Han, Yiping; Mueller, Klaus
2008-08-01
To take full advantage of the vast amount of highly detailed data acquired by single particle mass spectrometers requires that the data be organized according to some rules that have the potential to be insightful. Most commonly statistical tools are used to cluster the individual particle mass spectra on the basis of their similarity. Cluster analysis is a powerful strategy for the exploration of high-dimensional data in the absence of a-priori hypotheses or data classification models, and the results of cluster analysis can then be used to form such models. More often than not, when examining the data clustering results we find that many clusters contain particles of different types and that many particles of one type end up in a number of separate clusters. Our experience with cluster analysis shows that we have a vast amount of non-compiled knowledge and intuition that should be brought to bear in this effort. We will present new software we call ClusterSculptor that provides comprehensive and intuitive framework to aid scientists in data classification. ClusterSculptor uses k-means as the overall clustering engine, but allows tuning its parameters interactively, based on a non-distorted compact visual presentation of the inherent characteristics of the data in high-dimensional space. ClusterSculptor provides all the tools necessary for a high-dimensional activity we call cluster sculpting. ClusterSculptor is designed to be coupled to SpectraMiner, our data mining and visualization software package. The data are first visualized with SpectraMiner and identified problems are exported to ClusterSculptor, where the user steers the reclassification and recombination of clusters of tens of thousands particle mass spectra in real-time. The resulting sculpted clusters can be then imported back into SpectraMiner. Here we will greatly improved single particle chemical speciation in an example of application of this new tool to a number of particle types of atmospheric
Convex Clustering: An Attractive Alternative to Hierarchical Clustering
Chen, Gary K.; Chi, Eric C.; Ranola, John Michael O.; Lange, Kenneth
2015-01-01
The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering operates on multiple scales simultaneously. This is essential, for instance, in transcriptome data, where one may be interested in making qualitative inferences about how lower-order relationships like gene modules lead to higher-order relationships like pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. The current paper derives and tests a novel proximal distance algorithm for minimizing the objective function of convex clustering. The algorithm separates parameters, accommodates missing data, and supports prior information on relationships. Our program CONVEXCLUSTER incorporating the algorithm is implemented on ATI and nVidia graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems. CONVEXCLUSTER can be freely downloaded from the UCLA Human Genetics web site at http://www.genetics.ucla.edu/software/ PMID:25965340
An Overview of Air Pollution Problem in Megacities and City Clusters in China
NASA Astrophysics Data System (ADS)
Tang, X.
2007-05-01
China has experienced the rapid economic growth in last twenty years. City clusters, which consist of one or several mega cities in close vicinity and many satellite cities and towns, are playing a leading role in Chinese economic growth, owing to their collective economic capacity and interdependency. However, accompanying with the economic boom, population growth and increased energy consumption, the air quality has been degrading in the past two decades. Air pollution in those areas is characterized by concurrent occurrence of high concentrations of multiple primary pollutants leading to form complex secondary pollution problem. After decades long efforts to control air pollution, both the government and scientific communities have realized that to control regional scale air pollution, regional efforts are needed. Field experiments covering the regions like Pearl River Delta region and Beijing City with surrounding areas are critical to understand the chemical and physical processes leading to the formation of regional scale air pollution. In order to formulate policy suggestions for air quality attainment during 2008 Beijing Olympic game and to propose objectives of air quality attainment in 2010 in Beijing, CAREBEIJING (Campaigns of Air Quality Research in Beijing and Surrounding Region) was organized by Peking University in 2006 to learn current air pollution situation of the region, and to identify the transport and transformation processes that lead to the impact of the surrounding area on air quality in Beijing. Same as the purpose for understanding the chemical and physical processes happened in regional scale, the fall and summer campaigns in 2004 and 2006 were carried out in Pearl River Delta. More than 16 domestic and foreign institutions were involved in these campaigns. The background, current status, problems, and some results of these campaigns will be introduced in this presentation.
Lee, Hyun Jung; McDonnell, Kevin T.; Zelenyuk, Alla; Imre, D.; Mueller, Klaus
2014-03-01
Although the Euclidean distance does well in measuring data distances within high-dimensional clusters, it does poorly when it comes to gauging inter-cluster distances. This significantly impacts the quality of global, low-dimensional space embedding procedures such as the popular multi-dimensional scaling (MDS) where one can often observe non-intuitive layouts. We were inspired by the perceptual processes evoked in the method of parallel coordinates which enables users to visually aggregate the data by the patterns the polylines exhibit across the dimension axes. We call the path of such a polyline its structure and suggest a metric that captures this structure directly in high-dimensional space. This allows us to better gauge the distances of spatially distant data constellations and so achieve data aggregations in MDS plots that are more cognizant of existing high-dimensional structure similarities. Our MDS plots also exhibit similar visual relationships as the method of parallel coordinates which is often used alongside to visualize the high-dimensional data in raw form. We then cast our metric into a bi-scale framework which distinguishes far-distances from near-distances. The coarser scale uses the structural similarity metric to separate data aggregates obtained by prior classification or clustering, while the finer scale employs the appropriate Euclidean distance.
Choosing ℓp norms in high-dimensional spaces based on hub analysis
Flexer, Arthur; Schnitzer, Dominik
2015-01-01
The hubness phenomenon is a recently discovered aspect of the curse of dimensionality. Hub objects have a small distance to an exceptionally large number of data points while anti-hubs lie far from all other data points. A closely related problem is the concentration of distances in high-dimensional spaces. Previous work has already advocated the use of fractional ℓp norms instead of the ubiquitous Euclidean norm to avoid the negative effects of distance concentration. However, which exact fractional norm to use is a largely unsolved problem. The contribution of this work is an empirical analysis of the relation of different ℓp norms and hubness. We propose an unsupervised approach for choosing an ℓp norm which minimizes hubs while simultaneously maximizing nearest neighbor classification. Our approach is evaluated on seven high-dimensional data sets and compared to three approaches that re-scale distances to avoid hubness. PMID:26640321
A quasi-Newton acceleration for high-dimensional optimization algorithms
Alexander, David; Lange, Kenneth
2010-01-01
In many statistical problems, maximum likelihood estimation by an EM or MM algorithm suffers from excruciatingly slow convergence. This tendency limits the application of these algorithms to modern high-dimensional problems in data mining, genomics, and imaging. Unfortunately, most existing acceleration techniques are ill-suited to complicated models involving large numbers of parameters. The squared iterative methods (SQUAREM) recently proposed by Varadhan and Roland constitute one notable exception. This paper presents a new quasi-Newton acceleration scheme that requires only modest increments in computation per iteration and overall storage and rivals or surpasses the performance of SQUAREM on several representative test problems. PMID:21359052
Autonomous mental development in high dimensional context and action spaces.
Joshi, Ameet; Weng, Juyang
2003-01-01
Autonomous Mental Development (AMD) of robots opened a new paradigm for developing machine intelligence, using neural network type of techniques and it fundamentally changed the way an intelligent machine is developed from manual to autonomous. The work presented here is a part of SAIL (Self-Organizing Autonomous Incremental Learner) project which deals with autonomous development of humanoid robot with vision, audition, manipulation and locomotion. The major issue addressed here is the challenge of high dimensional action space (5-10) in addition to the high dimensional context space (hundreds to thousands and beyond), typically required by an AMD machine. This is the first work that studies a high dimensional (numeric) action space in conjunction with a high dimensional perception (context state) space, under the AMD mode. Two new learning algorithms, Direct Update on Direction Cosines (DUDC) and High-Dimensional Conjugate Gradient Search (HCGS), are developed, implemented and tested. The convergence properties of both the algorithms and their targeted applications are discussed. Autonomous learning of speech production under reinforcement learning is studied as an example. PMID:12850025
Harnessing high-dimensional hyperentanglement through a biphoton frequency comb
NASA Astrophysics Data System (ADS)
Xie, Zhenda; Zhong, Tian; Shrestha, Sajan; Xu, Xinan; Liang, Junlin; Gong, Yan-Xiao; Bienfang, Joshua C.; Restelli, Alessandro; Shapiro, Jeffrey H.; Wong, Franco N. C.; Wei Wong, Chee
2015-08-01
Quantum entanglement is a fundamental resource for secure information processing and communications, and hyperentanglement or high-dimensional entanglement has been separately proposed for its high data capacity and error resilience. The continuous-variable nature of the energy-time entanglement makes it an ideal candidate for efficient high-dimensional coding with minimal limitations. Here, we demonstrate the first simultaneous high-dimensional hyperentanglement using a biphoton frequency comb to harness the full potential in both the energy and time domain. Long-postulated Hong-Ou-Mandel quantum revival is exhibited, with up to 19 time-bins and 96.5% visibilities. We further witness the high-dimensional energy-time entanglement through Franson revivals, observed periodically at integer time-bins, with 97.8% visibility. This qudit state is observed to simultaneously violate the generalized Bell inequality by up to 10.95 standard deviations while observing recurrent Clauser-Horne-Shimony-Holt S-parameters up to 2.76. Our biphoton frequency comb provides a platform for photon-efficient quantum communications towards the ultimate channel capacity through energy-time-polarization high-dimensional encoding.
Optimally splitting cases for training and testing high dimensional classifiers
2011-01-01
Background We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set. How does this proportion impact the mean squared error (MSE) of the prediction accuracy estimate? Results We develop a non-parametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study for the purpose of better understanding the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive component parts. Conclusions By applying these approaches to a number of synthetic and real microarray datasets we show that for linear classifiers the optimal proportion depends on the overall number of samples available and the degree of differential expression between the classes. The optimal proportion was found to depend on the full dataset size (n) and classification accuracy - with higher accuracy and smaller n resulting in more assigned to the training set. The commonly used strategy of allocating 2/3rd of cases for training was close to optimal for reasonable sized datasets (n ≥ 100) with strong signals (i.e. 85% or greater full dataset accuracy). In general, we recommend use of our nonparametric resampling approach for determing the optimal split. This approach can be applied to any dataset, using any predictor development method, to determine the best split. PMID:21477282
An Effective Parameter Screening Strategy for High Dimensional Watershed Models
NASA Astrophysics Data System (ADS)
Khare, Y. P.; Martinez, C. J.; Munoz-Carpena, R.
2014-12-01
Watershed simulation models can assess the impacts of natural and anthropogenic disturbances on natural systems. These models have become important tools for tackling a range of water resources problems through their implementation in the formulation and evaluation of Best Management Practices, Total Maximum Daily Loads, and Basin Management Action Plans. For accurate applications of watershed models they need to be thoroughly evaluated through global uncertainty and sensitivity analyses (UA/SA). However, due to the high dimensionality of these models such evaluation becomes extremely time- and resource-consuming. Parameter screening, the qualitative separation of important parameters, has been suggested as an essential step before applying rigorous evaluation techniques such as the Sobol' and Fourier Amplitude Sensitivity Test (FAST) methods in the UA/SA framework. The method of elementary effects (EE) (Morris, 1991) is one of the most widely used screening methodologies. Some of the common parameter sampling strategies for EE, e.g. Optimized Trajectories [OT] (Campolongo et al., 2007) and Modified Optimized Trajectories [MOT] (Ruano et al., 2012), suffer from inconsistencies in the generated parameter distributions, infeasible sample generation time, etc. In this work, we have formulated a new parameter sampling strategy - Sampling for Uniformity (SU) - for parameter screening which is based on the principles of the uniformity of the generated parameter distributions and the spread of the parameter sample. A rigorous multi-criteria evaluation (time, distribution, spread and screening efficiency) of OT, MOT, and SU indicated that SU is superior to other sampling strategies. Comparison of the EE-based parameter importance rankings with those of Sobol' helped to quantify the qualitativeness of the EE parameter screening approach, reinforcing the fact that one should use EE only to reduce the resource burden required by FAST/Sobol' analyses but not to replace it.
Understanding 3D human torso shape via manifold clustering
NASA Astrophysics Data System (ADS)
Li, Sheng; Li, Peng; Fu, Yun
2013-05-01
Discovering the variations in human torso shape plays a key role in many design-oriented applications, such as suit designing. With recent advances in 3D surface imaging technologies, people can obtain 3D human torso data that provide more information than traditional measurements. However, how to find different human shapes from 3D torso data is still an open problem. In this paper, we propose to use spectral clustering approach on torso manifold to address this problem. We first represent high-dimensional torso data in a low-dimensional space using manifold learning algorithm. Then the spectral clustering method is performed to get several disjoint clusters. Experimental results show that the clusters discovered by our approach can describe the discrepancies in both genders and human shapes, and our approach achieves better performance than the compared clustering method.
NASA Astrophysics Data System (ADS)
Bastian, Nate; Cabrera-Ziri, Ivan; Salaris, Maurizio
2015-05-01
A number of stellar sources have been advocated as the origin of the enriched material required to explain the abundance anomalies seen in ancient globular clusters (GCs). Most studies to date have compared the yields from potential sources [asymptotic giant branch stars (AGBs), fast rotating massive stars (FRMS), high-mass interacting binaries (IBs), and very massive stars (VMS)] with observations of specific elements that are observed to vary from star-to-star in GCs, focusing on extreme GCs such as NGC 2808, which display large He variations. However, a consistency check between the results of fitting extreme cases with the requirements of more typical clusters, has rarely been done. Such a check is particularly timely given the constraints on He abundances in GCs now available. Here, we show that all of the popular enrichment sources fail to reproduce the observed trends in GCs, focusing primarily on Na, O and He. In particular, we show that any model that can fit clusters like NGC 2808, will necessarily fail (by construction) to fit more typical clusters like 47 Tuc or NGC 288. All sources severely overproduce He for most clusters. Additionally, given the large differences in He spreads between clusters, but similar spreads observed in Na-O, only sources with large degrees of stochasticity in the resulting yields will be able to fit the observations. We conclude that no enrichment source put forward so far (AGBs, FRMS, IBs, VMS - or combinations thereof) is consistent with the observations of GCs. Finally, the observed trends of increasing [N/Fe] and He spread with increasing cluster mass cannot be resolved within a self-enrichment framework, without further exacerbating the mass-budget problem.
Hypergraph-based anomaly detection of high-dimensional co-occurrences.
Silva, Jorge; Willett, Rebecca
2009-03-01
This paper addresses the problem of detecting anomalous multivariate co-occurrences using a limited number of unlabeled training observations. A novel method based on using a hypergraph representation of the data is proposed to deal with this very high-dimensional problem. Hypergraphs constitute an important extension of graphs which allow edges to connect more than two vertices simultaneously. A variational Expectation-Maximization algorithm for detecting anomalies directly on the hypergraph domain without any feature selection or dimensionality reduction is presented. The resulting estimate can be used to calculate a measure of anomalousness based on the False Discovery Rate. The algorithm has O(np) computational complexity, where n is the number of training observations and p is the number of potential participants in each co-occurrence event. This efficiency makes the method ideally suited for very high-dimensional settings, and requires no tuning, bandwidth or regularization parameters. The proposed approach is validated on both high-dimensional synthetic data and the Enron email database, where p > 75,000, and it is shown that it can outperform other state-of-the-art methods. PMID:19147882
High-Dimensional Statistical Learning: Roots, Justifications, and Potential Machineries
Zollanvari, Amin
2015-01-01
High-dimensional data generally refer to data in which the number of variables is larger than the sample size. Analyzing such datasets poses great challenges for classical statistical learning because the finite-sample performance of methods developed within classical statistical learning does not live up to classical asymptotic premises in which the sample size unboundedly grows for a fixed dimensionality of observations. Much work has been done in developing mathematical–statistical techniques for analyzing high-dimensional data. Despite remarkable progress in this field, many practitioners still utilize classical methods for analyzing such datasets. This state of affairs can be attributed, in part, to a lack of knowledge and, in part, to the ready-to-use computational and statistical software packages that are well developed for classical techniques. Moreover, many scientists working in a specific field of high-dimensional statistical learning are either not aware of other existing machineries in the field or are not willing to try them out. The primary goal in this work is to bring together various machineries of high-dimensional analysis, give an overview of the important results, and present the operating conditions upon which they are grounded. When appropriate, readers are referred to relevant review articles for more information on a specific subject. PMID:27081307
Querying Patterns in High-Dimensional Heterogenous Datasets
ERIC Educational Resources Information Center
Singh, Vishwakarma
2012-01-01
The recent technological advancements have led to the availability of a plethora of heterogenous datasets, e.g., images tagged with geo-location and descriptive keywords. An object in these datasets is described by a set of high-dimensional feature vectors. For example, a keyword-tagged image is represented by a color-histogram and a…
High-Dimensional Statistical Learning: Roots, Justifications, and Potential Machineries.
Zollanvari, Amin
2015-01-01
High-dimensional data generally refer to data in which the number of variables is larger than the sample size. Analyzing such datasets poses great challenges for classical statistical learning because the finite-sample performance of methods developed within classical statistical learning does not live up to classical asymptotic premises in which the sample size unboundedly grows for a fixed dimensionality of observations. Much work has been done in developing mathematical-statistical techniques for analyzing high-dimensional data. Despite remarkable progress in this field, many practitioners still utilize classical methods for analyzing such datasets. This state of affairs can be attributed, in part, to a lack of knowledge and, in part, to the ready-to-use computational and statistical software packages that are well developed for classical techniques. Moreover, many scientists working in a specific field of high-dimensional statistical learning are either not aware of other existing machineries in the field or are not willing to try them out. The primary goal in this work is to bring together various machineries of high-dimensional analysis, give an overview of the important results, and present the operating conditions upon which they are grounded. When appropriate, readers are referred to relevant review articles for more information on a specific subject. PMID:27081307
Partially supervised speaker clustering.
Tang, Hao; Chu, Stephen Mingyu; Hasegawa-Johnson, Mark; Huang, Thomas S
2012-05-01
Content-based multimedia indexing, retrieval, and processing as well as multimedia databases demand the structuring of the media content (image, audio, video, text, etc.), one significant goal being to associate the identity of the content to the individual segments of the signals. In this paper, we specifically address the problem of speaker clustering, the task of assigning every speech utterance in an audio stream to its speaker. We offer a complete treatment to the idea of partially supervised speaker clustering, which refers to the use of our prior knowledge of speakers in general to assist the unsupervised speaker clustering process. By means of an independent training data set, we encode the prior knowledge at the various stages of the speaker clustering pipeline via 1) learning a speaker-discriminative acoustic feature transformation, 2) learning a universal speaker prior model, and 3) learning a discriminative speaker subspace, or equivalently, a speaker-discriminative distance metric. We study the directional scattering property of the Gaussian mixture model (GMM) mean supervector representation of utterances in the high-dimensional space, and advocate exploiting this property by using the cosine distance metric instead of the euclidean distance metric for speaker clustering in the GMM mean supervector space. We propose to perform discriminant analysis based on the cosine distance metric, which leads to a novel distance metric learning algorithm—linear spherical discriminant analysis (LSDA). We show that the proposed LSDA formulation can be systematically solved within the elegant graph embedding general dimensionality reduction framework. Our speaker clustering experiments on the GALE database clearly indicate that 1) our speaker clustering methods based on the GMM mean supervector representation and vector-based distance metrics outperform traditional speaker clustering methods based on the “bag of acoustic features” representation and statistical
Lin, Wei; Feng, Rui; Li, Hongzhe
2014-01-01
In genetical genomics studies, it is important to jointly analyze gene expression data and genetic variants in exploring their associations with complex traits, where the dimensionality of gene expressions and genetic variants can both be much larger than the sample size. Motivated by such modern applications, we consider the problem of variable selection and estimation in high-dimensional sparse instrumental variables models. To overcome the difficulty of high dimensionality and unknown optimal instruments, we propose a two-stage regularization framework for identifying and estimating important covariate effects while selecting and estimating optimal instruments. The methodology extends the classical two-stage least squares estimator to high dimensions by exploiting sparsity using sparsity-inducing penalty functions in both stages. The resulting procedure is efficiently implemented by coordinate descent optimization. For the representative L1 regularization and a class of concave regularization methods, we establish estimation, prediction, and model selection properties of the two-stage regularized estimators in the high-dimensional setting where the dimensionality of co-variates and instruments are both allowed to grow exponentially with the sample size. The practical performance of the proposed method is evaluated by simulation studies and its usefulness is illustrated by an analysis of mouse obesity data. Supplementary materials for this article are available online. PMID:26392642
Ma Xiang; Zabaras, Nicholas
2010-05-20
A computational methodology is developed to address the solution of high-dimensional stochastic problems. It utilizes high-dimensional model representation (HDMR) technique in the stochastic space to represent the model output as a finite hierarchical correlated function expansion in terms of the stochastic inputs starting from lower-order to higher-order component functions. HDMR is efficient at capturing the high-dimensional input-output relationship such that the behavior for many physical systems can be modeled to good accuracy only by the first few lower-order terms. An adaptive version of HDMR is also developed to automatically detect the important dimensions and construct higher-order terms using only the important dimensions. The newly developed adaptive sparse grid collocation (ASGC) method is incorporated into HDMR to solve the resulting sub-problems. By integrating HDMR and ASGC, it is computationally possible to construct a low-dimensional stochastic reduced-order model of the high-dimensional stochastic problem and easily perform various statistic analysis on the output. Several numerical examples involving elementary mathematical functions and fluid mechanics problems are considered to illustrate the proposed method. The cases examined show that the method provides accurate results for stochastic dimensionality as high as 500 even with large-input variability. The efficiency of the proposed method is examined by comparing with Monte Carlo (MC) simulation.
High-dimensional statistical inference: From vector to matrix
NASA Astrophysics Data System (ADS)
Zhang, Anru
Statistical inference for sparse signals or low-rank matrices in high-dimensional settings is of significant interest in a range of contemporary applications. It has attracted significant recent attention in many fields including statistics, applied mathematics and electrical engineering. In this thesis, we consider several problems in including sparse signal recovery (compressed sensing under restricted isometry) and low-rank matrix recovery (matrix recovery via rank-one projections and structured matrix completion). The first part of the thesis discusses compressed sensing and affine rank minimization in both noiseless and noisy cases and establishes sharp restricted isometry conditions for sparse signal and low-rank matrix recovery. The analysis relies on a key technical tool which represents points in a polytope by convex combinations of sparse vectors. The technique is elementary while leads to sharp results. It is shown that, in compressed sensing, delta kA < 1/3, deltak A+ thetak,kA < 1, or deltatkA < √( t - 1)/t for any given constant t ≥ 4/3 guarantee the exact recovery of all k sparse signals in the noiseless case through the constrained ℓ1 minimization, and similarly in affine rank minimization delta rM < 1/3, deltar M + thetar, rM < 1, or deltatrM< √( t - 1)/t ensure the exact reconstruction of all matrices with rank at most r in the noiseless case via the constrained nuclear norm minimization. Moreover, for any epsilon > 0, delta kA < 1/3 + epsilon, deltak A + thetak,kA < 1 + epsilon, or deltatkA< √(t - 1) / t + epsilon are not sufficient to guarantee the exact recovery of all k-sparse signals for large k. Similar result also holds for matrix recovery. In addition, the conditions delta kA<1/3, deltak A+ thetak,kA<1, delta tkA < √(t - 1)/t and deltarM<1/3, delta rM+ thetar,rM<1, delta trM< √(t - 1)/ t are also shown to be sufficient respectively for stable recovery of approximately sparse signals and low-rank matrices in the noisy case
Censored Rank Independence Screening for High-dimensional Survival Data
Song, Rui; Lu, Wenbin; Ma, Shuangge; Jeng, X. Jessie
2014-01-01
Summary In modern statistical applications, the dimension of covariates can be much larger than the sample size. In the context of linear models, correlation screening (Fan and Lv, 2008) has been shown to reduce the dimension of such data effectively while achieving the sure screening property, i.e., all of the active variables can be retained with high probability. However, screening based on the Pearson correlation does not perform well when applied to contaminated covariates and/or censored outcomes. In this paper, we study censored rank independence screening of high-dimensional survival data. The proposed method is robust to predictors that contain outliers, works for a general class of survival models, and enjoys the sure screening property. Simulations and an analysis of real data demonstrate that the proposed method performs competitively on survival data sets of moderate size and high-dimensional predictors, even when these are contaminated. PMID:25663709
Ensemble of sparse classifiers for high-dimensional biological data.
Kim, Sunghan; Scalzo, Fabien; Telesca, Donatello; Hu, Xiao
2015-01-01
Biological data are often high in dimension while the number of samples is small. In such cases, the performance of classification can be improved by reducing the dimension of data, which is referred to as feature selection. Recently, a novel feature selection method has been proposed utilising the sparsity of high-dimensional biological data where a small subset of features accounts for most variance of the dataset. In this study we propose a new classification method for high-dimensional biological data, which performs both feature selection and classification within a single framework. Our proposed method utilises a sparse linear solution technique and the bootstrap aggregating algorithm. We tested its performance on four public mass spectrometry cancer datasets along with two other conventional classification techniques such as Support Vector Machines and Adaptive Boosting. The results demonstrate that our proposed method performs more accurate classification across various cancer datasets than those conventional classification techniques. PMID:26510301
Why neurons mix: high dimensionality for higher cognition.
Fusi, Stefano; Miller, Earl K; Rigotti, Mattia
2016-04-01
Neurons often respond to diverse combinations of task-relevant variables. This form of mixed selectivity plays an important computational role which is related to the dimensionality of the neural representations: high-dimensional representations with mixed selectivity allow a simple linear readout to generate a huge number of different potential responses. In contrast, neural representations based on highly specialized neurons are low dimensional and they preclude a linear readout from generating several responses that depend on multiple task-relevant variables. Here we review the conceptual and theoretical framework that explains the importance of mixed selectivity and the experimental evidence that recorded neural representations are high-dimensional. We end by discussing the implications for the design of future experiments. PMID:26851755
Some Unsolved Problems, Questions, and Applications of the Brightsen Nucleon Cluster Model
NASA Astrophysics Data System (ADS)
Smarandache, Florentin
2010-10-01
Brightsen Model is opposite to the Standard Model, and it was build on John Weeler's Resonating Group Structure Model and on Linus Pauling's Close-Packed Spheron Model. Among Brightsen Model's predictions and applications we cite the fact that it derives the average number of prompt neutrons per fission event, it provides a theoretical way for understanding the low temperature / low energy reactions and for approaching the artificially induced fission, it predicts that forces within nucleon clusters are stronger than forces between such clusters within isotopes; it predicts the unmatter entities inside nuclei that result from stable and neutral union of matter and antimatter, and so on. But these predictions have to be tested in the future at the new CERN laboratory.
Quantum Teleportation of High-dimensional Atomic Momenta State
NASA Astrophysics Data System (ADS)
Qurban, Misbah; Abbas, Tasawar; Rameez-ul-Islam; Ikram, Manzoor
2016-06-01
Atomic momenta states of the neutral atoms are known to be decoherence resistant and therefore present a viable solution for most of the quantum information tasks including the quantum teleportation. We present a systematic protocol for the teleportation of high-dimensional quantized momenta atomic states to the field state inside the cavities by applying standard cavity QED techniques. The proposal can be executed under prevailing experimental scenario.
NASA Astrophysics Data System (ADS)
Denissenkov, P. A.; VandenBerg, D. A.; Hartwick, F. D. A.; Herwig, F.; Weiss, A.; Paxton, B.
2015-04-01
We demonstrate that among the potential sources of the primordial abundance variations of the proton-capture elements in globular-cluster stars proposed so far, such as the hot-bottom burning in massive asymptotic giant branch stars and H burning in the convective cores of supermassive and fast-rotating massive main-sequence (MS) stars, only the supermassive MS stars with M > 104 M⊙ can explain all the observed abundance correlations without any fine-tuning of model parameters. We use our assumed chemical composition for the pristine gas in M13 (NGC 6205) and its mixtures with 50 and 90 per cent of the material partially processed in H burning in the 6 × 104 M⊙ MS model star as the initial compositions for the normal, intermediate, and extreme populations of low-mass stars in this globular cluster, as suggested by its O-Na anticorrelation. We evolve these stars from the zero-age MS to the red giant branch (RGB) tip with the thermohaline and parametric prescriptions for the RGB extra mixing. We find that the 3He-driven thermohaline convection cannot explain the evolutionary decline of [C/Fe] in M13 RGB stars, which, on the other hand, is well reproduced with the universal values for the mixing depth and rate calibrated using the observed decrease of [C/Fe] with MV in the globular cluster NGC5466 that does not have the primordial abundance variations.
High dimensional model representation method for fuzzy structural dynamics
NASA Astrophysics Data System (ADS)
Adhikari, S.; Chowdhury, R.; Friswell, M. I.
2011-03-01
Uncertainty propagation in multi-parameter complex structures possess significant computational challenges. This paper investigates the possibility of using the High Dimensional Model Representation (HDMR) approach when uncertain system parameters are modeled using fuzzy variables. In particular, the application of HDMR is proposed for fuzzy finite element analysis of linear dynamical systems. The HDMR expansion is an efficient formulation for high-dimensional mapping in complex systems if the higher order variable correlations are weak, thereby permitting the input-output relationship behavior to be captured by the terms of low-order. The computational effort to determine the expansion functions using the α-cut method scales polynomically with the number of variables rather than exponentially. This logic is based on the fundamental assumption underlying the HDMR representation that only low-order correlations among the input variables are likely to have significant impacts upon the outputs for most high-dimensional complex systems. The proposed method is first illustrated for multi-parameter nonlinear mathematical test functions with fuzzy variables. The method is then integrated with a commercial finite element software (ADINA). Modal analysis of a simplified aircraft wing with fuzzy parameters has been used to illustrate the generality of the proposed approach. In the numerical examples, triangular membership functions have been used and the results have been validated against direct Monte Carlo simulations. It is shown that using the proposed HDMR approach, the number of finite element function calls can be reduced without significantly compromising the accuracy.
McGrath, L M; Mustanski, B; Metzger, A; Pine, D S; Kistner-Griffin, E; Cook, E; Wakschlag, L S
2012-08-01
This study illustrates the application of a latent modeling approach to genotype-phenotype relationships and gene × environment interactions, using a novel, multidimensional model of adult female problem behavior, including maternal prenatal smoking. The gene of interest is the monoamine oxidase A (MAOA) gene which has been well studied in relation to antisocial behavior. Participants were adult women (N = 192) who were sampled from a prospective pregnancy cohort of non-Hispanic, white individuals recruited from a neighborhood health clinic. Structural equation modeling was used to model a female problem behavior phenotype, which included conduct problems, substance use, impulsive-sensation seeking, interpersonal aggression, and prenatal smoking. All of the female problem behavior dimensions clustered together strongly, with the exception of prenatal smoking. A main effect of MAOA genotype and a MAOA × physical maltreatment interaction were detected with the Conduct Problems factor. Our phenotypic model showed that prenatal smoking is not simply a marker of other maternal problem behaviors. The risk variant in the MAOA main effect and interaction analyses was the high activity MAOA genotype, which is discrepant from consensus findings in male samples. This result contributes to an emerging literature on sex-specific interaction effects for MAOA. PMID:22610759
TreeSOM: Cluster analysis in the self-organizing map.
Samsonova, Elena V; Kok, Joost N; Ijzerman, Ad P
2006-01-01
Clustering problems arise in various domains of science and engineering. A large number of methods have been developed to date. The Kohonen self-organizing map (SOM) is a popular tool that maps a high-dimensional space onto a small number of dimensions by placing similar elements close together, forming clusters. Cluster analysis is often left to the user. In this paper we present the method TreeSOM and a set of tools to perform unsupervised SOM cluster analysis, determine cluster confidence and visualize the result as a tree facilitating comparison with existing hierarchical classifiers. We also introduce a distance measure for cluster trees that allows one to select a SOM with the most confident clusters. PMID:16781116
Improving clustering by imposing network information
Gerber, Susanne; Horenko, Illia
2015-01-01
Cluster analysis is one of the most popular data analysis tools in a wide range of applied disciplines. We propose and justify a computationally efficient and straightforward-to-implement way of imposing the available information from networks/graphs (a priori available in many application areas) on a broad family of clustering methods. The introduced approach is illustrated on the problem of a noninvasive unsupervised brain signal classification. This task is faced with several challenging difficulties such as nonstationary noisy signals and a small sample size, combined with a high-dimensional feature space and huge noise-to-signal ratios. Applying this approach results in an exact unsupervised classification of very short signals, opening new possibilities for clustering methods in the area of a noninvasive brain-computer interface. PMID:26601225
NASA Astrophysics Data System (ADS)
Wu, Xinyuan; Liu, Changying
2016-02-01
In this paper, we are concerned with the initial boundary value problem of arbitrarily high-dimensional Klein-Gordon equations, posed on a bounded domain Ω ⊂ ℝd for d ≥ 1 and equipped with the requirement of boundary conditions. We derive and analyze an integral formula which is proved to be adapted to different boundary conditions for general Klein-Gordon equations in arbitrarily high-dimensional spaces. The formula gives a closed-form solution to arbitrarily high-dimensional homogeneous linear Klein-Gordon equations, which is totally different from the well-known D'Alembert, Poisson, and Kirchhoff formulas. Some applications are included as well.
Clustering as a tool of reinforced rejecting in pattern recognition problem
NASA Astrophysics Data System (ADS)
Ciecierski, Jakub; Dybisz, Bartlomiej; Homenda, Wladyslaw; Jastrzebska, Agnieszka
2016-06-01
In this paper pattern recognition problem with rejecting option is discussed. The problem is aimed at classification patterns from given classes (native patterns) and rejecting ones not belonging to these classes (foreign patterns). In practice the characteristics of the native patters are given, while no information about foreign ones is known. A rejecting tool is aimed at enclosing native patterns in compact geometrical figures and excluding foreign ones from them.
Ma, Huanfei; Lin, Wei; Lai, Ying-Cheng
2013-05-01
Detecting unstable periodic orbits (UPOs) in chaotic systems based solely on time series is a fundamental but extremely challenging problem in nonlinear dynamics. Previous approaches were applicable but mostly for low-dimensional chaotic systems. We develop a framework, integrating approximation theory of neural networks and adaptive synchronization, to address the problem of time-series-based detection of UPOs in high-dimensional chaotic systems. An example of finding UPOs from the classic Mackey-Glass equation is presented. PMID:23767476
Suh, C.; Glynn, S.; Scharf, J.; Contreras, M. A.; Noufi, R.; Jones, W. B.; Biagioni, D.
2011-07-01
We demonstrate how advanced exploratory data analysis coupled to data-mining techniques can be used to scrutinize the high-dimensional data space of photovoltaics in the context of thin films of Al-doped ZnO (AZO), which are essential materials as a transparent conducting oxide (TCO) layer in CuInxGa1-xSe2 (CIGS) solar cells. AZO data space, wherein each sample is synthesized from a different process history and assessed with various characterizations, is transformed, reorganized, and visualized in order to extract optimal process conditions. The data-analysis methods used include parallel coordinates, diffusion maps, and hierarchical agglomerative clustering algorithms combined with diffusion map embedding.
Exploring High-Dimensional Data Space: Identifying Optimal Process Conditions in Photovoltaics
Suh, C.; Biagioni, D.; Glynn, S.; Scharf, J.; Contreras, M. A.; Noufi, R.; Jones, W. B.
2011-01-01
We demonstrate how advanced exploratory data analysis coupled to data-mining techniques can be used to scrutinize the high-dimensional data space of photovoltaics in the context of thin films of Al-doped ZnO (AZO), which are essential materials as a transparent conducting oxide (TCO) layer in CuIn{sub x}Ga{sub 1-x}Se{sub 2} (CIGS) solar cells. AZO data space, wherein each sample is synthesized from a different process history and assessed with various characterizations, is transformed, reorganized, and visualized in order to extract optimal process conditions. The data-analysis methods used include parallel coordinates, diffusion maps, and hierarchical agglomerative clustering algorithms combined with diffusion map embedding.
Rupp, Matthias; Schneider, Petra; Schneider, Gisbert
2009-11-15
Measuring the (dis)similarity of molecules is important for many cheminformatics applications like compound ranking, clustering, and property prediction. In this work, we focus on real-valued vector representations of molecules (as opposed to the binary spaces of fingerprints). We demonstrate the influence which the choice of (dis)similarity measure can have on results, and provide recommendations for such choices. We review the mathematical concepts used to measure (dis)similarity in vector spaces, namely norms, metrics, inner products, and, similarity coefficients, as well as the relationships between them, employing (dis)similarity measures commonly used in cheminformatics as examples. We present several phenomena (empty space phenomenon, sphere volume related phenomena, distance concentration) in high-dimensional descriptor spaces which are not encountered in two and three dimensions. These phenomena are theoretically characterized and illustrated on both artificial and real (bioactivity) data. PMID:19266481
Grellmann, Claudia; Neumann, Jane; Bitzer, Sebastian; Kovacs, Peter; Tönjes, Anke; Westlye, Lars T.; Andreassen, Ole A.; Stumvoll, Michael; Villringer, Arno; Horstmann, Annette
2016-01-01
In recent years, the advent of great technological advances has produced a wealth of very high-dimensional data, and combining high-dimensional information from multiple sources is becoming increasingly important in an extending range of scientific disciplines. Partial Least Squares Correlation (PLSC) is a frequently used method for multivariate multimodal data integration. It is, however, computationally expensive in applications involving large numbers of variables, as required, for example, in genetic neuroimaging. To handle high-dimensional problems, dimension reduction might be implemented as pre-processing step. We propose a new approach that incorporates Random Projection (RP) for dimensionality reduction into PLSC to efficiently solve high-dimensional multimodal problems like genotype-phenotype associations. We name our new method PLSC-RP. Using simulated and experimental data sets containing whole genome SNP measures as genotypes and whole brain neuroimaging measures as phenotypes, we demonstrate that PLSC-RP is drastically faster than traditional PLSC while providing statistically equivalent results. We also provide evidence that dimensionality reduction using RP is data type independent. Therefore, PLSC-RP opens up a wide range of possible applications. It can be used for any integrative analysis that combines information from multiple sources. PMID:27375677
Hawking radiation of a high-dimensional rotating black hole
NASA Astrophysics Data System (ADS)
Ren, Zhao; Lichun, Zhang; Huaifan, Li; Yueqin, Wu
2010-01-01
We extend the classical Damour-Ruffini method and discuss Hawking radiation spectrum of high-dimensional rotating black hole using Tortoise coordinate transformation defined by taking the reaction of the radiation to the spacetime into consideration. Under the condition that the energy and angular momentum are conservative, taking self-gravitation action into account, we derive Hawking radiation spectrums which satisfy unitary principle in quantum mechanics. It is shown that the process that the black hole radiates particles with energy ω is a continuous tunneling process. We provide a theoretical basis for further studying the physical mechanism of black-hole radiation.
TimeSeer: Scagnostics for high-dimensional time series.
Dang, Tuan Nhon; Anand, Anushka; Wilkinson, Leland
2013-03-01
We introduce a method (Scagnostic time series) and an application (TimeSeer) for organizing multivariate time series and for guiding interactive exploration through high-dimensional data. The method is based on nine characterizations of the 2D distributions of orthogonal pairwise projections on a set of points in multidimensional euclidean space. These characterizations include measures, such as, density, skewness, shape, outliers, and texture. Working directly with these Scagnostic measures, we can locate anomalous or interesting subseries for further analysis. Our application is designed to handle the types of doubly multivariate data series that are often found in security, financial, social, and other sectors. PMID:23307611
An Adaptive ANOVA-based PCKF for High-Dimensional Nonlinear Inverse Modeling
LI, Weixuan; Lin, Guang; Zhang, Dongxiao
2014-02-01
The probabilistic collocation-based Kalman filter (PCKF) is a recently developed approach for solving inverse problems. It resembles the ensemble Kalman filter (EnKF) in every aspect—except that it represents and propagates model uncertainty by polynomial chaos expansion (PCE) instead of an ensemble of model realizations. Previous studies have shown PCKF is a more efficient alternative to EnKF for many data assimilation problems. However, the accuracy and efficiency of PCKF depends on an appropriate truncation of the PCE series. Having more polynomial chaos bases in the expansion helps to capture uncertainty more accurately but increases computational cost. Bases selection is particularly important for high-dimensional stochastic problems because the number of polynomial chaos bases required to represent model uncertainty grows dramatically as the number of input parameters (random dimensions) increases. In classic PCKF algorithms, the PCE bases are pre-set based on users’ experience. Also, for sequential data assimilation problems, the bases kept in PCE expression remain unchanged in different Kalman filter loops, which could limit the accuracy and computational efficiency of classic PCKF algorithms. To address this issue, we present a new algorithm that adaptively selects PCE bases for different problems and automatically adjusts the number of bases in different Kalman filter loops. The algorithm is based on adaptive functional ANOVA (analysis of variance) decomposition, which approximates a high-dimensional function with the summation of a set of low-dimensional functions. Thus, instead of expanding the original model into PCE, we implement the PCE expansion on these low-dimensional functions, which is much less costly. We also propose a new adaptive criterion for ANOVA that is more suited for solving inverse problems. The new algorithm is tested with different examples and demonstrated great effectiveness in comparison with non-adaptive PCKF and En
An adaptive ANOVA-based PCKF for high-dimensional nonlinear inverse modeling
Li, Weixuan; Lin, Guang; Zhang, Dongxiao
2014-02-01
The probabilistic collocation-based Kalman filter (PCKF) is a recently developed approach for solving inverse problems. It resembles the ensemble Kalman filter (EnKF) in every aspect—except that it represents and propagates model uncertainty by polynomial chaos expansion (PCE) instead of an ensemble of model realizations. Previous studies have shown PCKF is a more efficient alternative to EnKF for many data assimilation problems. However, the accuracy and efficiency of PCKF depends on an appropriate truncation of the PCE series. Having more polynomial chaos basis functions in the expansion helps to capture uncertainty more accurately but increases computational cost. Selection of basis functions is particularly important for high-dimensional stochastic problems because the number of polynomial chaos basis functions required to represent model uncertainty grows dramatically as the number of input parameters (random dimensions) increases. In classic PCKF algorithms, the PCE basis functions are pre-set based on users' experience. Also, for sequential data assimilation problems, the basis functions kept in PCE expression remain unchanged in different Kalman filter loops, which could limit the accuracy and computational efficiency of classic PCKF algorithms. To address this issue, we present a new algorithm that adaptively selects PCE basis functions for different problems and automatically adjusts the number of basis functions in different Kalman filter loops. The algorithm is based on adaptive functional ANOVA (analysis of variance) decomposition, which approximates a high-dimensional function with the summation of a set of low-dimensional functions. Thus, instead of expanding the original model into PCE, we implement the PCE expansion on these low-dimensional functions, which is much less costly. We also propose a new adaptive criterion for ANOVA that is more suited for solving inverse problems. The new algorithm was tested with different examples and demonstrated
Node Detection Using High-Dimensional Fuzzy Parcellation Applied to the Insular Cortex
Vercelli, Ugo; Diano, Matteo; Costa, Tommaso; Nani, Andrea; Duca, Sergio; Geminiani, Giuliano; Vercelli, Alessandro; Cauda, Franco
2016-01-01
Several functional connectivity approaches require the definition of a set of regions of interest (ROIs) that act as network nodes. Different methods have been developed to define these nodes and to derive their functional and effective connections, most of which are rather complex. Here we aim to propose a relatively simple “one-step” border detection and ROI estimation procedure employing the fuzzy c-mean clustering algorithm. To test this procedure and to explore insular connectivity beyond the two/three-region model currently proposed in the literature, we parcellated the insular cortex of 20 healthy right-handed volunteers scanned in a resting state. By employing a high-dimensional functional connectivity-based clustering process, we confirmed the two patterns of connectivity previously described. This method revealed a complex pattern of functional connectivity where the two previously detected insular clusters are subdivided into several other networks, some of which are not commonly associated with the insular cortex, such as the default mode network and parts of the dorsal attentional network. Furthermore, the detection of nodes was reliable, as demonstrated by the confirmative analysis performed on a replication group of subjects. PMID:26881093
Node Detection Using High-Dimensional Fuzzy Parcellation Applied to the Insular Cortex.
Vercelli, Ugo; Diano, Matteo; Costa, Tommaso; Nani, Andrea; Duca, Sergio; Geminiani, Giuliano; Vercelli, Alessandro; Cauda, Franco
2016-01-01
Several functional connectivity approaches require the definition of a set of regions of interest (ROIs) that act as network nodes. Different methods have been developed to define these nodes and to derive their functional and effective connections, most of which are rather complex. Here we aim to propose a relatively simple "one-step" border detection and ROI estimation procedure employing the fuzzy c-mean clustering algorithm. To test this procedure and to explore insular connectivity beyond the two/three-region model currently proposed in the literature, we parcellated the insular cortex of 20 healthy right-handed volunteers scanned in a resting state. By employing a high-dimensional functional connectivity-based clustering process, we confirmed the two patterns of connectivity previously described. This method revealed a complex pattern of functional connectivity where the two previously detected insular clusters are subdivided into several other networks, some of which are not commonly associated with the insular cortex, such as the default mode network and parts of the dorsal attentional network. Furthermore, the detection of nodes was reliable, as demonstrated by the confirmative analysis performed on a replication group of subjects. PMID:26881093
ERIC Educational Resources Information Center
Goetschel, Roy, Jr.
1987-01-01
Multivalent relations, inferred as relationships with an added dimension of discernment, are realized as weighted graphs with multivalued edges. A unified treatment of the threshold problem is discussed and a reliability measure is produced to judge various partitions. (Author/EM)
High dimensional biological data retrieval optimization with NoSQL technology
2014-01-01
Background High-throughput transcriptomic data generated by microarray experiments is the most abundant and frequently stored kind of data currently used in translational medicine studies. Although microarray data is supported in data warehouses such as tranSMART, when querying relational databases for hundreds of different patient gene expression records queries are slow due to poor performance. Non-relational data models, such as the key-value model implemented in NoSQL databases, hold promise to be more performant solutions. Our motivation is to improve the performance of the tranSMART data warehouse with a view to supporting Next Generation Sequencing data. Results In this paper we introduce a new data model better suited for high-dimensional data storage and querying, optimized for database scalability and performance. We have designed a key-value pair data model to support faster queries over large-scale microarray data and implemented the model using HBase, an implementation of Google's BigTable storage system. An experimental performance comparison was carried out against the traditional relational data model implemented in both MySQL Cluster and MongoDB, using a large publicly available transcriptomic data set taken from NCBI GEO concerning Multiple Myeloma. Our new key-value data model implemented on HBase exhibits an average 5.24-fold increase in high-dimensional biological data query performance compared to the relational model implemented on MySQL Cluster, and an average 6.47-fold increase on query performance on MongoDB. Conclusions The performance evaluation found that the new key-value data model, in particular its implementation in HBase, outperforms the relational model currently implemented in tranSMART. We propose that NoSQL technology holds great promise for large-scale data management, in particular for high-dimensional biological data such as that demonstrated in the performance evaluation described in this paper. We aim to use this new data
Arif, Muhammad
2012-06-01
In pattern classification problems, feature extraction is an important step. Quality of features in discriminating different classes plays an important role in pattern classification problems. In real life, pattern classification may require high dimensional feature space and it is impossible to visualize the feature space if the dimension of feature space is greater than four. In this paper, we have proposed a Similarity-Dissimilarity plot which can project high dimensional space to a two dimensional space while retaining important characteristics required to assess the discrimination quality of the features. Similarity-dissimilarity plot can reveal information about the amount of overlap of features of different classes. Separable data points of different classes will also be visible on the plot which can be classified correctly using appropriate classifier. Hence, approximate classification accuracy can be predicted. Moreover, it is possible to know about whom class the misclassified data points will be confused by the classifier. Outlier data points can also be located on the similarity-dissimilarity plot. Various examples of synthetic data are used to highlight important characteristics of the proposed plot. Some real life examples from biomedical data are also used for the analysis. The proposed plot is independent of number of dimensions of the feature space. PMID:20734222
Reconstructing high-dimensional two-photon entangled states via compressive sensing
Tonolini, Francesco; Chan, Susan; Agnew, Megan; Lindsay, Alan; Leach, Jonathan
2014-01-01
Accurately establishing the state of large-scale quantum systems is an important tool in quantum information science; however, the large number of unknown parameters hinders the rapid characterisation of such states, and reconstruction procedures can become prohibitively time-consuming. Compressive sensing, a procedure for solving inverse problems by incorporating prior knowledge about the form of the solution, provides an attractive alternative to the problem of high-dimensional quantum state characterisation. Using a modified version of compressive sensing that incorporates the principles of singular value thresholding, we reconstruct the density matrix of a high-dimensional two-photon entangled system. The dimension of each photon is equal to d = 17, corresponding to a system of 83521 unknown real parameters. Accurate reconstruction is achieved with approximately 2500 measurements, only 3% of the total number of unknown parameters in the state. The algorithm we develop is fast, computationally inexpensive, and applicable to a wide range of quantum states, thus demonstrating compressive sensing as an effective technique for measuring the state of large-scale quantum systems. PMID:25306850
Likelihood-Free Inference in High-Dimensional Models.
Kousathanas, Athanasios; Leuenberger, Christoph; Helfer, Jonas; Quinodoz, Mathieu; Foll, Matthieu; Wegmann, Daniel
2016-06-01
Methods that bypass analytical evaluations of the likelihood function have become an indispensable tool for statistical inference in many fields of science. These so-called likelihood-free methods rely on accepting and rejecting simulations based on summary statistics, which limits them to low-dimensional models for which the value of the likelihood is large enough to result in manageable acceptance rates. To get around these issues, we introduce a novel, likelihood-free Markov chain Monte Carlo (MCMC) method combining two key innovations: updating only one parameter per iteration and accepting or rejecting this update based on subsets of statistics approximately sufficient for this parameter. This increases acceptance rates dramatically, rendering this approach suitable even for models of very high dimensionality. We further derive that for linear models, a one-dimensional combination of statistics per parameter is sufficient and can be found empirically with simulations. Finally, we demonstrate that our method readily scales to models of very high dimensionality, using toy models as well as by jointly inferring the effective population size, the distribution of fitness effects (DFE) of segregating mutations, and selection coefficients for each locus from data of a recent experiment on the evolution of drug resistance in influenza. PMID:27052569
Power Enhancement in High Dimensional Cross-Sectional Tests
Fan, Jianqing; Liao, Yuan; Yao, Jiawei
2016-01-01
We propose a novel technique to boost the power of testing a high-dimensional vector H : θ = 0 against sparse alternatives where the null hypothesis is violated only by a couple of components. Existing tests based on quadratic forms such as the Wald statistic often suffer from low powers due to the accumulation of errors in estimating high-dimensional parameters. More powerful tests for sparse alternatives such as thresholding and extreme-value tests, on the other hand, require either stringent conditions or bootstrap to derive the null distribution and often suffer from size distortions due to the slow convergence. Based on a screening technique, we introduce a “power enhancement component”, which is zero under the null hypothesis with high probability, but diverges quickly under sparse alternatives. The proposed test statistic combines the power enhancement component with an asymptotically pivotal statistic, and strengthens the power under sparse alternatives. The null distribution does not require stringent regularity conditions, and is completely determined by that of the pivotal statistic. As specific applications, the proposed methods are applied to testing the factor pricing models and validating the cross-sectional independence in panel data models. PMID:26778846
Asymptotic Stability of High-dimensional Zakharov-Kuznetsov Solitons
NASA Astrophysics Data System (ADS)
Côte, Raphaël; Muñoz, Claudio; Pilod, Didier; Simpson, Gideon
2016-05-01
We prove that solitons (or solitary waves) of the Zakharov-Kuznetsov (ZK) equation, a physically relevant high dimensional generalization of the Korteweg-de Vries (KdV) equation appearing in Plasma Physics, and having mixed KdV and nonlinear Schrödinger (NLS) dynamics, are strongly asymptotically stable in the energy space. We also prove that the sum of well-arranged solitons is stable in the same space. Orbital stability of ZK solitons is well-known since the work of de Bouard [Proc R Soc Edinburgh 126:89-112, 1996]. Our proofs follow the ideas of Martel [SIAM J Math Anal 157:759-781, 2006] and Martel and Merle [Math Ann 341:391-427, 2008], applied for generalized KdV equations in one dimension. In particular, we extend to the high dimensional case several monotonicity properties for suitable half-portions of mass and energy; we also prove a new Liouville type property that characterizes ZK solitons, and a key Virial identity for the linear and nonlinear part of the ZK dynamics, obtained independently of the mixed KdV-NLS dynamics. This last Virial identity relies on a simple sign condition which is numerically tested for the two and three dimensional cases with no additional spectral assumptions required. Possible extensions to higher dimensions and different nonlinearities could be obtained after a suitable local well-posedness theory in the energy space, and the verification of a corresponding sign condition.
Sample size requirements for training high-dimensional risk predictors
Dobbin, Kevin K.; Song, Xiao
2013-01-01
A common objective of biomarker studies is to develop a predictor of patient survival outcome. Determining the number of samples required to train a predictor from survival data is important for designing such studies. Existing sample size methods for training studies use parametric models for the high-dimensional data and cannot handle a right-censored dependent variable. We present a new training sample size method that is non-parametric with respect to the high-dimensional vectors, and is developed for a right-censored response. The method can be applied to any prediction algorithm that satisfies a set of conditions. The sample size is chosen so that the expected performance of the predictor is within a user-defined tolerance of optimal. The central method is based on a pilot dataset. To quantify uncertainty, a method to construct a confidence interval for the tolerance is developed. Adequacy of the size of the pilot dataset is discussed. An alternative model-based version of our method for estimating the tolerance when no adequate pilot dataset is available is presented. The model-based method requires a covariance matrix be specified, but we show that the identity covariance matrix provides adequate sample size when the user specifies three key quantities. Application of the sample size method to two microarray datasets is discussed. PMID:23873895
New data assimilation system DNDAS for high-dimensional models
NASA Astrophysics Data System (ADS)
Qun-bo, Huang; Xiao-qun, Cao; Meng-bin, Zhu; Wei-min, Zhang; Bai-nian, Liu
2016-05-01
The tangent linear (TL) models and adjoint (AD) models have brought great difficulties for the development of variational data assimilation system. It might be impossible to develop them perfectly without great efforts, either by hand, or by automatic differentiation tools. In order to break these limitations, a new data assimilation system, dual-number data assimilation system (DNDAS), is designed based on the dual-number automatic differentiation principles. We investigate the performance of DNDAS with two different optimization schemes and subsequently give a discussion on whether DNDAS is appropriate for high-dimensional forecast models. The new data assimilation system can avoid the complicated reverse integration of the adjoint model, and it only needs the forward integration in the dual-number space to obtain the cost function and its gradient vector concurrently. To verify the correctness and effectiveness of DNDAS, we implemented DNDAS on a simple ordinary differential model and the Lorenz-63 model with different optimization methods. We then concentrate on the adaptability of DNDAS to the Lorenz-96 model with high-dimensional state variables. The results indicate that whether the system is simple or nonlinear, DNDAS can accurately reconstruct the initial condition for the forecast model and has a strong anti-noise characteristic. Given adequate computing resource, the quasi-Newton optimization method performs better than the conjugate gradient method in DNDAS. Project supported by the National Natural Science Foundation of China (Grant Nos. 41475094 and 41375113).
Nam, Julia EunJu; Mueller, Klaus
2013-02-01
Gaining a true appreciation of high-dimensional space remains difficult since all of the existing high-dimensional space exploration techniques serialize the space travel in some way. This is not so foreign to us since we, when traveling, also experience the world in a serial fashion. But we typically have access to a map to help with positioning, orientation, navigation, and trip planning. Here, we propose a multivariate data exploration tool that compares high-dimensional space navigation with a sightseeing trip. It decomposes this activity into five major tasks: 1) Identify the sights: use a map to identify the sights of interest and their location; 2) Plan the trip: connect the sights of interest along a specifyable path; 3) Go on the trip: travel along the route; 4) Hop off the bus: experience the location, look around, zoom into detail; and 5) Orient and localize: regain bearings in the map. We describe intuitive and interactive tools for all of these tasks, both global navigation within the map and local exploration of the data distributions. For the latter, we describe a polygonal touchpad interface which enables users to smoothly tilt the projection plane in high-dimensional space to produce multivariate scatterplots that best convey the data relationships under investigation. Motion parallax and illustrative motion trails aid in the perception of these transient patterns. We describe the use of our system within two applications: 1) the exploratory discovery of data configurations that best fit a personal preference in the presence of tradeoffs and 2) interactive cluster analysis via cluster sculpting in N-D. PMID:22350201
GX-Means: A model-based divide and merge algorithm for geospatial image clustering
Vatsavai, Raju; Symons, Christopher T; Chandola, Varun; Jun, Goo
2011-01-01
One of the practical issues in clustering is the specification of the appropriate number of clusters, which is not obvious when analyzing geospatial datasets, partly because they are huge (both in size and spatial extent) and high dimensional. In this paper we present a computationally efficient model-based split and merge clustering algorithm that incrementally finds model parameters and the number of clusters. Additionally, we attempt to provide insights into this problem and other data mining challenges that are encountered when clustering geospatial data. The basic algorithm we present is similar to the G-means and X-means algorithms; however, our proposed approach avoids certain limitations of these well-known clustering algorithms that are pertinent when dealing with geospatial data. We compare the performance of our approach with the G-means and X-means algorithms. Experimental evaluation on simulated data and on multispectral and hyperspectral remotely sensed image data demonstrates the effectiveness of our algorithm.
NASA Technical Reports Server (NTRS)
Pinsonneault, Marc H.; Stauffer, John; Soderblom, David R.; King, Jeremy R.; Hanson, Robert B.
1998-01-01
Parallax data from the Hipparcos mission allow the direct distance to open clusters to be compared with the distance inferred from main-sequence (MS) fitting. There are surprising differences between the two distance measurements. indicating either the need for changes in the cluster compositions or reddening, underlying problems with the technique of MS fitting, or systematic errors in the Hipparcos parallaxes at the 1 mas level. We examine the different possibilities, focusing on MS fitting in both metallicity-sensitive B-V and metallicity-insensitive V-I for five well-studied systems (the Hyades, Pleiades, alpha Per, Praesepe, and Coma Ber). The Hipparcos distances to the Hyades and alpha Per are within 1 sigma of the MS-fitting distance in B-V and V-I, while the Hipparcos distances to Coma Ber and the Pleiades are in disagreement with the MS-fitting distance at more than the 3 sigma level. There are two Hipparcos measurements of the distance to Praesepe; one is in good agreement with the MS-fitting distance and the other disagrees at the 2 sigma level. The distance estimates from the different colors are in conflict with one another for Coma but in agreement for the Pleiades. Changes in the relative cluster metal abundances, age related effects, helium, and reddening are shown to be unlikely to explain the puzzling behavior of the Pleiades. We present evidence for spatially dependent systematic errors at the 1 mas level in the parallaxes of Pleiades stars. The implications of this result are discussed.
Sparse subspace clustering: algorithm, theory, and applications.
Elhamifar, Ehsan; Vidal, René
2013-11-01
Many real-world problems deal with collections of high-dimensional data, such as images, videos, text, and web documents, DNA microarray data, and more. Often, such high-dimensional data lie close to low-dimensional structures corresponding to several classes or categories to which the data belong. In this paper, we propose and study an algorithm, called sparse subspace clustering, to cluster data points that lie in a union of low-dimensional subspaces. The key idea is that, among the infinitely many possible representations of a data point in terms of other points, a sparse representation corresponds to selecting a few points from the same subspace. This motivates solving a sparse optimization program whose solution is used in a spectral clustering framework to infer the clustering of the data into subspaces. Since solving the sparse optimization program is in general NP-hard, we consider a convex relaxation and show that, under appropriate conditions on the arrangement of the subspaces and the distribution of the data, the proposed minimization program succeeds in recovering the desired sparse representations. The proposed algorithm is efficient and can handle data points near the intersections of subspaces. Another key advantage of the proposed algorithm with respect to the state of the art is that it can deal directly with data nuisances, such as noise, sparse outlying entries, and missing entries, by incorporating the model of the data into the sparse optimization program. We demonstrate the effectiveness of the proposed algorithm through experiments on synthetic data as well as the two real-world problems of motion segmentation and face clustering. PMID:24051734
2012-01-01
Background Externalising and internalising problems affect one in seven school-aged children and are the single strongest predictor of mental health problems into early adolescence. As the burden of mental health problems persists globally, childhood prevention of mental health problems is paramount. Prevention can be offered to all children (universal) or to children at risk of developing mental health problems (targeted). The relative effectiveness and costs of a targeted only versus combined universal and targeted approach are unknown. This study aims to determine the effectiveness, costs and uptake of two approaches to early childhood prevention of mental health problems ie: a Combined universal-targeted approach, versus a Targeted only approach, in comparison to current primary care services (Usual care). Methods/design Three armed, population-level cluster randomised trial (2010–2014) within the universal, well child Maternal Child Health system, attended by more than 80% of families in Victoria, Australia at infant age eight months. Participants were families of eight month old children from nine participating local government areas. Randomised to one of three groups: Combined, Targeted or Usual care. The interventions comprises (a) the Combined universal and targeted program where all families are offered the universal Toddlers Without Tears group parenting program followed by the targeted Family Check-Up one-on-one program or (b) the Targeted Family Check-Up program. The Family Check-Up program is only offered to children at risk of behavioural problems. Participants will be analysed according to the trial arm to which they were randomised, using logistic and linear regression models to compare primary and secondary outcomes. An economic evaluation (cost consequences analysis) will compare incremental costs to all incremental outcomes from a societal perspective. Discussion This trial will inform public health policy by making recommendations about the
Nakano, Takashi; Otsuka, Makoto; Yoshimoto, Junichiro; Doya, Kenji
2015-01-01
A theoretical framework of reinforcement learning plays an important role in understanding action selection in animals. Spiking neural networks provide a theoretically grounded means to test computational hypotheses on neurally plausible algorithms of reinforcement learning through numerical simulation. However, most of these models cannot handle observations which are noisy, or occurred in the past, even though these are inevitable and constraining features of learning in real environments. This class of problem is formally known as partially observable reinforcement learning (PORL) problems. It provides a generalization of reinforcement learning to partially observable domains. In addition, observations in the real world tend to be rich and high-dimensional. In this work, we use a spiking neural network model to approximate the free energy of a restricted Boltzmann machine and apply it to the solution of PORL problems with high-dimensional observations. Our spiking network model solves maze tasks with perceptually ambiguous high-dimensional observations without knowledge of the true environment. An extended model with working memory also solves history-dependent tasks. The way spiking neural networks handle PORL problems may provide a glimpse into the underlying laws of neural information processing which can only be discovered through such a top-down approach. PMID:25734662
Williams, Kristine; Herman, Ruth; Bontempo, Daniel
2014-01-01
Purpose of the study Assisted living (AL) residents are at risk for cognitive and functional declines that eventually reduce their ability to care for themselves, thereby triggering nursing home placement. In developing a method to slow this decline, the efficacy of Reasoning Exercises in Assisted Living (REAL), a cognitive training intervention that teaches everyday reasoning and problem-solving skills to AL residents, was tested. Design and methods At thirteen randomized Midwestern facilities, AL residents whose Mini Mental State Examination scores ranged from 19–29 either were trained in REAL or a vitamin education attention control program or received no treatment at all. For 3 weeks, treated groups received personal training in their respective programs. Results Scores on the Every Day Problems Test for Cognitively Challenged Elders (EPCCE) and on the Direct Assessment of Functional Status (DAFS) showed significant increases only for the REAL group. For EPCCE, change from baseline immediately postintervention was +3.10 (P<0.01), and there was significant retention at the 3-month follow-up (d=2.71; P<0.01). For DAFS, change from baseline immediately postintervention was +3.52 (P<0.001), although retention was not as strong. Neither the attention nor the no-treatment control groups had significant gains immediately postintervention or at follow-up assessments. Post hoc across-group comparison of baseline change also highlights the benefits of REAL training. For EPCCE, the magnitude of gain was significantly larger in the REAL group versus the no-treatment control group immediately postintervention (d=3.82; P<0.01) and at the 3-month follow-up (d=3.80; P<0.01). For DAFS, gain magnitude immediately postintervention for REAL was significantly greater compared with in the attention control group (d=4.73; P<0.01). Implications REAL improves skills in everyday problem solving, which may allow AL residents to maintain self-care and extend AL residency. This benefit
Algorithmic tools for mining high-dimensional cytometry data
Chester, Cariad; Maecker, Holden T.
2015-01-01
The advent of mass cytometry has lead to an unprecedented increase in the number of analytes measured in individual cells, thereby increasing the complexity and information content of cytometric data. While this technology is ideally suited to detailed examination of the immune system, the applicability of the different methods for analyzing such complex data are less clear. Conventional data analysis by ‘manual’ gating of cells in biaxial dotplots is often subjective, time consuming, and neglectful of much of the information contained in a highly dimensional cytometric dataset. Algorithmic data mining has the promise to eliminate these concerns and several such tools have been recently applied to mass cytometry data. Herein, we review computational data mining tools that have been used to analyze mass cytometry data, outline their differences, and comment on their strengths and limitations. This review will help immunologists identify suitable algorithmic tools for their particular projects. PMID:26188071
High-dimensional quantum nature of ghost angular Young's diffraction
Chen Lixiang; Leach, Jonathan; Jack, Barry; Padgett, Miles J.; Franke-Arnold, Sonja; She Weilong
2010-09-15
We propose a technique to characterize the dimensionality of entangled sources affected by any environment, including phase and amplitude masks or atmospheric turbulence. We illustrate this technique on the example of angular ghost diffraction using the orbital angular momentum (OAM) spectrum generated by a nonlocal double slit. We realize a nonlocal angular double slit by placing single angular slits in the paths of the signal and idler modes of the entangled light field generated by parametric down-conversion. Based on the observed OAM spectrum and the measured Shannon dimensionality spectrum of the possible quantum channels that contribute to Young's ghost diffraction, we calculate the associated dimensionality D{sub total}. The measured D{sub total} ranges between 1 and 2.74 depending on the opening angle of the angular slits. The ability to quantify the nature of high-dimensional entanglement is vital when considering quantum information protocols.
Future of High-Dimensional Data-Driven Exoplanet Science
NASA Astrophysics Data System (ADS)
Ford, Eric B.
2016-03-01
The detection and characterization of exoplanets has come a long way since the 1990’s. For example, instruments specifically designed for Doppler planet surveys feature environmental controls to minimize instrumental effects and advanced calibration systems. Combining these instruments with powerful telescopes, astronomers have detected thousands of exoplanets. The application of Bayesian algorithms has improved the quality and reliability with which astronomers characterize the mass and orbits of exoplanets. Thanks to continued improvements in instrumentation, now the detection of extrasolar low-mass planets is limited primarily by stellar activity, rather than observational uncertainties. This presents a new set of challenges which will require cross-disciplinary research to combine improved statistical algorithms with an astrophysical understanding of stellar activity and the details of astronomical instrumentation. I describe these challenges and outline the roles of parameter estimation over high-dimensional parameter spaces, marginalizing over uncertainties in stellar astrophysics and machine learning for the next generation of Doppler planet searches.
Parsimonious description for predicting high-dimensional dynamics
Hirata, Yoshito; Takeuchi, Tomoya; Horai, Shunsuke; Suzuki, Hideyuki; Aihara, Kazuyuki
2015-01-01
When we observe a system, we often cannot observe all its variables and may have some of its limited measurements. Under such a circumstance, delay coordinates, vectors made of successive measurements, are useful to reconstruct the states of the whole system. Although the method of delay coordinates is theoretically supported for high-dimensional dynamical systems, practically there is a limitation because the calculation for higher-dimensional delay coordinates becomes more expensive. Here, we propose a parsimonious description of virtually infinite-dimensional delay coordinates by evaluating their distances with exponentially decaying weights. This description enables us to predict the future values of the measurements faster because we can reuse the calculated distances, and more accurately because the description naturally reduces the bias of the classical delay coordinates toward the stable directions. We demonstrate the proposed method with toy models of the atmosphere and real datasets related to renewable energy. PMID:26510518
Statistical validation of high-dimensional models of growing networks
NASA Astrophysics Data System (ADS)
Medo, Matúš
2014-03-01
The abundance of models of complex networks and the current insufficient validation standards make it difficult to judge which models are strongly supported by data and which are not. We focus here on likelihood maximization methods for models of growing networks with many parameters and compare their performance on artificial and real datasets. While high dimensionality of the parameter space harms the performance of direct likelihood maximization on artificial data, this can be improved by introducing a suitable penalization term. Likelihood maximization on real data shows that the presented approach is able to discriminate among available network models. To make large-scale datasets accessible to this kind of analysis, we propose a subset sampling technique and show that it yields substantial model evidence in a fraction of time necessary for the analysis of the complete data.
High-dimensional quantum key distribution using dispersive optics
NASA Astrophysics Data System (ADS)
Mower, Jacob; Zhang, Zheshen; Desjardins, Pierre; Lee, Catherine; Shapiro, Jeffrey H.; Englund, Dirk
2013-06-01
We propose a high-dimensional quantum key distribution (QKD) protocol that employs temporal correlations of entangled photons. The security of the protocol relies on measurements by Alice and Bob in one of two conjugate bases, implemented using dispersive optics. We show that this dispersion-based approach is secure against collective attacks. The protocol, which represents a QKD analog of pulse position modulation, is compatible with standard fiber telecommunications channels and wavelength division multiplexers. We describe several physical implementations to enhance the transmission rate and describe a heralded qudit source that is easy to implement and enables secret-key generation at >4 bits per character of distilled key across over 200 km of fiber.
High dimensional reflectance analysis of soil organic matter
NASA Technical Reports Server (NTRS)
Henderson, T. L.; Baumgardner, M. F.; Franzmeier, D. P.; Stott, D. E.; Coster, D. C.
1992-01-01
Recent breakthroughs in remote-sensing technology have led to the development of high spectral resolution imaging sensors for observation of earth surface features. This research was conducted to evaluate the effects of organic matter content and composition on narrowband soil reflectance across the visible and reflective infrared spectral ranges. Organic matter from four Indiana agricultural soils, ranging in organic C content from 0.99 to 1.72 percent, was extracted, fractionated, and purified. Six components of each soil were isolated and prepared for spectral analysis. Reflectance was measured in 210 narrow bands in the 400- to 2500-nm wavelength range. Statistical analysis of reflectance values indicated the potential of high dimensional reflectance data in specific visible, near-infrared, and middle-infrared bands to provide information about soil organic C content, but not organic matter composition. These bands also responded significantly to Fe- and Mn-oxide content.
Modeling for Process Control: High-Dimensional Systems
Lev S. Tsimring
2008-09-15
Most of other technologically important systems (among them, powders and other granular systems) are intrinsically nonlinear. This project is focused on building the dynamical models for granular systems as a prototype for nonlinear high-dimensional systems exhibiting complex non-equilibrium phenomena. Granular materials present a unique opportunity to study these issues in a technologically important and yet fundamentally interesting setting. Granular systems exhibit a rich variety of regimes from gas-like to solid-like depending on the external excitation. Based the combination of the rigorous asymptotic analysis, available experimental data and nonlinear signal processing tools, we developed a multi-scale approach to the modeling of granular systems from detailed description of grain-grain interaction on a micro-scale to continuous modeling of large-scale granular flows with important geophysical applications.
Building high dimensional imaging database for content based image search
NASA Astrophysics Data System (ADS)
Sun, Qinpei; Sun, Jianyong; Ling, Tonghui; Wang, Mingqing; Yang, Yuanyuan; Zhang, Jianguo
2016-03-01
In medical imaging informatics, content-based image retrieval (CBIR) techniques are employed to aid radiologists in the retrieval of images with similar image contents. CBIR uses visual contents, normally called as image features, to search images from large scale image databases according to users' requests in the form of a query image. However, most of current CBIR systems require a distance computation of image character feature vectors to perform query, and the distance computations can be time consuming when the number of image character features grows large, and thus this limits the usability of the systems. In this presentation, we propose a novel framework which uses a high dimensional database to index the image character features to improve the accuracy and retrieval speed of a CBIR in integrated RIS/PACS.
Spectral feature design in high dimensional multispectral data
NASA Technical Reports Server (NTRS)
Chen, Chih-Chien Thomas; Landgrebe, David A.
1988-01-01
The High resolution Imaging Spectrometer (HIRIS) is designed to acquire images simultaneously in 192 spectral bands in the 0.4 to 2.5 micrometers wavelength region. It will make possible the collection of essentially continuous reflectance spectra at a spectral resolution sufficient to extract significantly enhanced amounts of information from return signals as compared to existing systems. The advantages of such high dimensional data come at a cost of increased system and data complexity. For example, since the finer the spectral resolution, the higher the data rate, it becomes impractical to design the sensor to be operated continuously. It is essential to find new ways to preprocess the data which reduce the data rate while at the same time maintaining the information content of the high dimensional signal produced. Four spectral feature design techniques are developed from the Weighted Karhunen-Loeve Transforms: (1) non-overlapping band feature selection algorithm; (2) overlapping band feature selection algorithm; (3) Walsh function approach; and (4) infinite clipped optimal function approach. The infinite clipped optimal function approach is chosen since the features are easiest to find and their classification performance is the best. After the preprocessed data has been received at the ground station, canonical analysis is further used to find the best set of features under the criterion that maximal class separability is achieved. Both 100 dimensional vegetation data and 200 dimensional soil data were used to test the spectral feature design system. It was shown that the infinite clipped versions of the first 16 optimal features had excellent classification performance. The overall probability of correct classification is over 90 percent while providing for a reduced downlink data rate by a factor of 10.
Shaffer, Patrick; Valsson, Omar; Parrinello, Michele
2016-02-01
The capabilities of molecular simulations have been greatly extended by a number of widely used enhanced sampling methods that facilitate escaping from metastable states and crossing large barriers. Despite these developments there are still many problems which remain out of reach for these methods which has led to a vigorous effort in this area. One of the most important problems that remains unsolved is sampling high-dimensional free-energy landscapes and systems that are not easily described by a small number of collective variables. In this work we demonstrate a new way to compute free-energy landscapes of high dimensionality based on the previously introduced variationally enhanced sampling, and we apply it to the miniprotein chignolin. PMID:26787868
Shaffer, Patrick; Valsson, Omar; Parrinello, Michele
2016-01-01
The capabilities of molecular simulations have been greatly extended by a number of widely used enhanced sampling methods that facilitate escaping from metastable states and crossing large barriers. Despite these developments there are still many problems which remain out of reach for these methods which has led to a vigorous effort in this area. One of the most important problems that remains unsolved is sampling high-dimensional free-energy landscapes and systems that are not easily described by a small number of collective variables. In this work we demonstrate a new way to compute free-energy landscapes of high dimensionality based on the previously introduced variationally enhanced sampling, and we apply it to the miniprotein chignolin. PMID:26787868
A method for analysis of phenotypic change for phenotypes described by high-dimensional data.
Collyer, M L; Sekora, D J; Adams, D C
2015-10-01
The analysis of phenotypic change is important for several evolutionary biology disciplines, including phenotypic plasticity, evolutionary developmental biology, morphological evolution, physiological evolution, evolutionary ecology and behavioral evolution. It is common for researchers in these disciplines to work with multivariate phenotypic data. When phenotypic variables exceed the number of research subjects--data called 'high-dimensional data'--researchers are confronted with analytical challenges. Parametric tests that require high observation to variable ratios present a paradox for researchers, as eliminating variables potentially reduces effect sizes for comparative analyses, yet test statistics require more observations than variables. This problem is exacerbated with data that describe 'multidimensional' phenotypes, whereby a description of phenotype requires high-dimensional data. For example, landmark-based geometric morphometric data use the Cartesian coordinates of (potentially) many anatomical landmarks to describe organismal shape. Collectively such shape variables describe organism shape, although the analysis of each variable, independently, offers little benefit for addressing biological questions. Here we present a nonparametric method of evaluating effect size that is not constrained by the number of phenotypic variables, and motivate its use with example analyses of phenotypic change using geometric morphometric data. Our examples contrast different characterizations of body shape for a desert fish species, associated with measuring and comparing sexual dimorphism between two populations. We demonstrate that using more phenotypic variables can increase effect sizes, and allow for stronger inferences. PMID:25204302
Smart sampling and incremental function learning for very large high dimensional data.
Loyola R, Diego G; Pedergnana, Mattia; Gimeno García, Sebastián
2016-06-01
Very large high dimensional data are common nowadays and they impose new challenges to data-driven and data-intensive algorithms. Computational Intelligence techniques have the potential to provide powerful tools for addressing these challenges, but the current literature focuses mainly on handling scalability issues related to data volume in terms of sample size for classification tasks. This work presents a systematic and comprehensive approach for optimally handling regression tasks with very large high dimensional data. The proposed approach is based on smart sampling techniques for minimizing the number of samples to be generated by using an iterative approach that creates new sample sets until the input and output space of the function to be approximated are optimally covered. Incremental function learning takes place in each sampling iteration, the new samples are used to fine tune the regression results of the function learning algorithm. The accuracy and confidence levels of the resulting approximation function are assessed using the probably approximately correct computation framework. The smart sampling and incremental function learning techniques can be easily used in practical applications and scale well in the case of extremely large data. The feasibility and good results of the proposed techniques are demonstrated using benchmark functions as well as functions from real-world problems. PMID:26476936
A Dynamical Clustering Model of Brain Connectivity Inspired by the N -Body Problem.
Prasad, Gautam; Burkart, Josh; Joshi, Shantanu H; Nir, Talia M; Toga, Arthur W; Thompson, Paul M
2013-01-01
We present a method for studying brain connectivity by simulating a dynamical evolution of the nodes of the network. The nodes are treated as particles, and evolved under a simulated force analogous to gravitational acceleration in the well-known N -body problem. The particle nodes correspond to regions of the cortex. The locations of particles are defined as the centers of the respective regions on the cortex and their masses are proportional to each region's volume. The force of attraction is modeled on the gravitational force, and explicitly made proportional to the elements of a connectivity matrix derived from diffusion imaging data. We present experimental results of the simulation on a population of 110 subjects from the Alzheimer's Disease Neuroimaging Initiative (ADNI), consisting of healthy elderly controls, early mild cognitively impaired (eMCI), late MCI (LMCI), and Alzheimer's disease (AD) patients. Results show significant differences in the dynamic properties of connectivity networks in healthy controls, compared to eMCI as well as AD patients. PMID:25340177
A Dynamical Clustering Model of Brain Connectivity Inspired by the N -Body Problem
Prasad, Gautam; Burkart, Josh; Joshi, Shantanu H.; Nir, Talia M.; Toga, Arthur W.; Thompson, Paul M.
2014-01-01
We present a method for studying brain connectivity by simulating a dynamical evolution of the nodes of the network. The nodes are treated as particles, and evolved under a simulated force analogous to gravitational acceleration in the well-known N -body problem. The particle nodes correspond to regions of the cortex. The locations of particles are defined as the centers of the respective regions on the cortex and their masses are proportional to each region’s volume. The force of attraction is modeled on the gravitational force, and explicitly made proportional to the elements of a connectivity matrix derived from diffusion imaging data. We present experimental results of the simulation on a population of 110 subjects from the Alzheimer’s Disease Neuroimaging Initiative (ADNI), consisting of healthy elderly controls, early mild cognitively impaired (eMCI), late MCI (LMCI), and Alzheimer’s disease (AD) patients. Results show significant differences in the dynamic properties of connectivity networks in healthy controls, compared to eMCI as well as AD patients. PMID:25340177
Mining Approximate Order Preserving Clusters in the Presence of Noise
Zhang, Mengsheng; Wang, Wei; Liu, Jinze
2010-01-01
Subspace clustering has attracted great attention due to its capability of finding salient patterns in high dimensional data. Order preserving subspace clusters have been proven to be important in high throughput gene expression analysis, since functionally related genes are often co-expressed under a set of experimental conditions. Such co-expression patterns can be represented by consistent orderings of attributes. Existing order preserving cluster models require all objects in a cluster have identical attribute order without deviation. However, real data are noisy due to measurement technology limitation and experimental variability which prohibits these strict models from revealing true clusters corrupted by noise. In this paper, we study the problem of revealing the order preserving clusters in the presence of noise. We propose a noise-tolerant model called approximate order preserving cluster (AOPC). Instead of requiring all objects in a cluster have identical attribute order, we require that (1) at least a certain fraction of the objects have identical attribute order; (2) other objects in the cluster may deviate from the consensus order by up to a certain fraction of attributes. We also propose an algorithm to mine AOPC. Experiments on gene expression data demonstrate the efficiency and effectiveness of our algorithm. PMID:20689652
Sörensen, Till; Baumgart, Sabine; Durek, Pawel; Grützkau, Andreas; Häupl, Thomas
2015-07-01
Multiparametric fluorescence and mass cytometry offers new perspectives to disclose and to monitor the high diversity of cell populations in the peripheral blood for biomarker research. While high-end cytometric devices are currently available to detect theoretically up to 120 individual parameters at the single cell level, software tools are needed to analyze these complex datasets automatically in acceptable time and without operator bias or knowledge. We developed an automated analysis pipeline, immunoClust, for uncompensated fluorescence and mass cytometry data, which consists of two parts. First, cell events of each sample are grouped into individual clusters. Subsequently, a classification algorithm assorts these cell event clusters into populations comparable between different samples. The clustering of cell events is designed for datasets with large event counts in high dimensions as a global unsupervised method, sensitive to identify rare cell types even when next to large populations. Both parts use model-based clustering with an iterative expectation maximization algorithm and the integrated classification likelihood to obtain the clusters. A detailed description of both algorithms is presented. Testing and validation was performed using 1) blood cell samples of defined composition that were depleted of particular cell subsets by magnetic cell sorting, 2) datasets of the FlowCAP III challenges to identify populations of rare cell types and 3) high-dimensional fluorescence and mass-cytometry datasets for comparison with conventional manual gating procedures. In conclusion, the immunoClust-algorithm is a promising tool to standardize and automate the analysis of high-dimensional cytometric datasets. As a prerequisite for interpretation of such data, it will support our efforts in developing immunological biomarkers for chronic inflammatory disorders and therapy recommendations in personalized medicine. immunoClust is implemented as an R-package and is
A Hyperspherical Adaptive Sparse-Grid Method for High-Dimensional Discontinuity Detection
Zhang, Guannan; Webster, Clayton G.; Gunzburger, Max D.; Burkardt, John V.
2015-06-24
This study proposes and analyzes a hyperspherical adaptive hierarchical sparse-grid method for detecting jump discontinuities of functions in high-dimensional spaces. The method is motivated by the theoretical and computational inefficiencies of well-known adaptive sparse-grid methods for discontinuity detection. Our novel approach constructs a function representation of the discontinuity hypersurface of an N-dimensional discontinuous quantity of interest, by virtue of a hyperspherical transformation. Then, a sparse-grid approximation of the transformed function is built in the hyperspherical coordinate system, whose value at each point is estimated by solving a one-dimensional discontinuity detection problem. Due to the smoothness of the hypersurface, the new technique can identify jump discontinuities with significantly reduced computational cost, compared to existing methods. In addition, hierarchical acceleration techniques are also incorporated to further reduce the overall complexity. Rigorous complexity analyses of the new method are provided as are several numerical examples that illustrate the effectiveness of the approach.
Predicting Viral Infection From High-Dimensional Biomarker Trajectories
Chen, Minhua; Zaas, Aimee; Woods, Christopher; Ginsburg, Geoffrey S.; Lucas, Joseph; Dunson, David; Carin, Lawrence
2013-01-01
There is often interest in predicting an individual’s latent health status based on high-dimensional biomarkers that vary over time. Motivated by time-course gene expression array data that we have collected in two influenza challenge studies performed with healthy human volunteers, we develop a novel time-aligned Bayesian dynamic factor analysis methodology. The time course trajectories in the gene expressions are related to a relatively low-dimensional vector of latent factors, which vary dynamically starting at the latent initiation time of infection. Using a nonparametric cure rate model for the latent initiation times, we allow selection of the genes in the viral response pathway, variability among individuals in infection times, and a subset of individuals who are not infected. As we demonstrate using held-out data, this statistical framework allows accurate predictions of infected individuals in advance of the development of clinical symptoms, without labeled data and even when the number of biomarkers vastly exceeds the number of individuals under study. Biological interpretation of several of the inferred pathways (factors) is provided. PMID:23704802
An efficient chemical kinetics solver using high dimensional model representation
Shorter, J.A.; Ip, P.C.; Rabitz, H.A.
1999-09-09
A high dimensional model representation (HDMR) technique is introduced to capture the input-output behavior of chemical kinetic models. The HDMR expresses the output chemical species concentrations as a rapidly convergent hierarchical correlated function expansion in the input variables. In this paper, the input variables are taken as the species concentrations at time t{sub i} and the output is the concentrations at time t{sub i} + {delta}, where {delta} can be much larger than conventional integration time steps. A specially designed set of model runs is performed to determine the correlated functions making up the HDMR. The resultant HDMR can be used to (1) identify the key input variables acting independently or cooperatively on the output, and (2) create a high speed fully equivalent operational model (FEOM) serving to replace the original kinetic model and its differential equation solver. A demonstration of the HDMR technique is presented for stratospheric chemical kinetics. The FEOM proved to give accurate and stable chemical concentrations out to long times of many years. In addition, the FEOM was found to be orders of magnitude faster than a conventional stiff equation solver. This computational acceleration should have significance in many chemical kinetic applications.
High dimensional data analysis using multivariate generalized spatial quantiles
Mukhopadhyay, Nitai D.; Chatterjee, Snigdhansu
2015-01-01
High dimensional data routinely arises in image analysis, genetic experiments, network analysis, and various other research areas. Many such datasets do not correspond to well-studied probability distributions, and in several applications the data-cloud prominently displays non-symmetric and non-convex shape features. We propose using spatial quantiles and their generalizations, in particular, the projection quantile, for describing, analyzing and conducting inference with multivariate data. Minimal assumptions are made about the nature and shape characteristics of the underlying probability distribution, and we do not require the sample size to be as high as the data-dimension. We present theoretical properties of the generalized spatial quantiles, and an algorithm to compute them quickly. Our quantiles may be used to obtain multidimensional confidence or credible regions that are not required to conform to a pre-determined shape. We also propose a new notion of multidimensional order statistics, which may be used to obtain multidimensional outliers. Many of the features revealed using a generalized spatial quantile-based analysis would be missed if the data was shoehorned into a well-known probabilistic configuration. PMID:26617421
High-dimensional quantum cryptography with twisted light
NASA Astrophysics Data System (ADS)
Mirhosseini, Mohammad; Magaña-Loaiza, Omar S.; O'Sullivan, Malcolm N.; Rodenburg, Brandon; Malik, Mehul; Lavery, Martin P. J.; Padgett, Miles J.; Gauthier, Daniel J.; Boyd, Robert W.
2015-03-01
Quantum key distribution (QKD) systems often rely on polarization of light for encoding, thus limiting the amount of information that can be sent per photon and placing tight bounds on the error rates that such a system can tolerate. Here we describe a proof-of-principle experiment that indicates the feasibility of high-dimensional QKD based on the transverse structure of the light field allowing for the transfer of more than 1 bit per photon. Our implementation uses the orbital angular momentum (OAM) of photons and the corresponding mutually unbiased basis of angular position (ANG). Our experiment uses a digital micro-mirror device for the rapid generation of OAM and ANG modes at 4 kHz, and a mode sorter capable of sorting single photons based on their OAM and ANG content with a separation efficiency of 93%. Through the use of a seven-dimensional alphabet encoded in the OAM and ANG bases, we achieve a channel capacity of 2.05 bits per sifted photon. Our experiment demonstrates that, in addition to having an increased information capacity, multilevel QKD systems based on spatial-mode encoding can be more resilient against intercept-resend eavesdropping attacks.
Kruizinga, Ingrid; Jansen, Wilma; van Sprang, Nicolien C.; Carter, Alice S.; Raat, Hein
2015-01-01
Objective Effective early detection tools are needed in child health care to detect psychosocial problems among young children. This study aimed to evaluate the effectiveness of the Brief Infant-Toddler Social and Emotional Assessment (BITSEA), in reducing psychosocial problems at one year follow-up, compared to care as usual. Method Well-child centers in Rotterdam, the Netherlands, were allocated in a cluster randomized controlled trial to the intervention condition (BITSEA—15 centers), or to the control condition (‘care-as-usual’- 16 centers). Parents of 2610 2-year-old children (1,207 intervention; 1,403 control) provided informed consent and completed the baseline and 1-year follow-up questionnaire. Multilevel regression analyses were used to evaluate the effect of condition on psychosocial problems and health related quality of life (i.e. respectively Child Behavior Checklist and Infant-Toddler Quality of Life). The number of (pursuits of) referrals and acceptability of the BITSEA were also evaluated. Results Children in the intervention condition scored more favourably on the CBCL at follow-up than children in the control condition: B = -2.43 (95% confidence interval [95%CI] = -3.53;-1.33 p<0.001). There were no differences between conditions regarding ITQOL. Child health professionals reported referring fewer children in the intervention condition (n = 56, 5.7%), compared to the control condition (n = 95, 7.9%; p<0.05). There was no intervention effect on parents’ reported number of referrals pursued. It took less time to complete (parents) or work with (child health professional) the BITSEA, compared to care as usual. In the control condition, 84.2% of the parents felt (very) well prepared for the well-child visit, compared to 77.9% in the intervention condition (p<0.001). Conclusion The results support the use of the BITSEA as a tool for child health professionals in the early detection of psychosocial problems in 2-year-olds. We recommend future
Chen, Yi; Jakeman, John; Gittelson, Claude; Xiu, Dongbin
2015-01-08
In this paper we present a localized polynomial chaos expansion for partial differential equations (PDE) with random inputs. In particular, we focus on time independent linear stochastic problems with high dimensional random inputs, where the traditional polynomial chaos methods, and most of the existing methods, incur prohibitively high simulation cost. Furthermore, the local polynomial chaos method employs a domain decomposition technique to approximate the stochastic solution locally. In each subdomain, a subdomain problem is solved independently and, more importantly, in a much lower dimensional random space. In a postprocesing stage, accurate samples of the original stochastic problems are obtained from the samples of the local solutions by enforcing the correct stochastic structure of the random inputs and the coupling conditions at the interfaces of the subdomains. Overall, the method is able to solve stochastic PDEs in very large dimensions by solving a collection of low dimensional local problems and can be highly efficient. In our paper we present the general mathematical framework of the methodology and use numerical examples to demonstrate the properties of the method.
Genuinely high-dimensional nonlocality optimized by complementary measurements
NASA Astrophysics Data System (ADS)
Lim, James; Ryu, Junghee; Yoo, Seokwon; Lee, Changhyoup; Bang, Jeongho; Lee, Jinhyoung
2010-10-01
Qubits exhibit extreme nonlocality when their state is maximally entangled and this is observed by mutually unbiased local measurements. This criterion does not hold for the Bell inequalities of high-dimensional systems (qudits), recently proposed by Collins-Gisin-Linden-Massar-Popescu and Son-Lee-Kim. Taking an alternative approach, called the quantum-to-classical approach, we derive a series of Bell inequalities for qudits that satisfy the criterion as for the qubits. In the derivation each d-dimensional subsystem is assumed to be measured by one of d possible measurements with d being a prime integer. By applying to two qubits (d=2), we find that a derived inequality is reduced to the Clauser-Horne-Shimony-Holt inequality when the degree of nonlocality is optimized over all the possible states and local observables. Further applying to two and three qutrits (d=3), we find Bell inequalities that are violated for the three-dimensionally entangled states but are not violated by any two-dimensionally entangled states. In other words, the inequalities discriminate three-dimensional (3D) entanglement from two-dimensional (2D) entanglement and in this sense they are genuinely 3D. In addition, for the two qutrits we give a quantitative description of the relations among the three degrees of complementarity, entanglement and nonlocality. It is shown that the degree of complementarity jumps abruptly to very close to its maximum as nonlocality starts appearing. These characteristics imply that complementarity plays a more significant role in the present inequality compared with the previously proposed inequality.
NASA Technical Reports Server (NTRS)
Schmidt, Rudolf; Domingo, Vicente; Shawhan, Stanley D.; Bohlin, David
1988-01-01
The NASA/ESA Solar-Terrestrial Science Program, which consists of the four-spacecraft cluster mission and the Solar and Heliospheric Observatory (SOHO), is examined. It is expected that the SOHO spacecraft will be launched in 1995 to study solar interior structure and the physical processes associated with the solar corona. The SOHO design, operation, data, and ground segment are discussed. The Cluster mission is designed to study small-scale structures in the earth's plasma environment. The Soviet Union is expected to contribute two additional spacecraft, which will be similar to Cluster in instrumentation and design. The capabilities, mission strategy, spacecraft design, payload, and ground segment of Cluster are discussed.
NASA Astrophysics Data System (ADS)
Taşkin Kaya, Gülşen
2013-10-01
-output relationships in high-dimensional systems for many problems in science and engineering. The HDMR method is developed to improve the efficiency of the deducing high dimensional behaviors. The method is formed by a particular organization of low dimensional component functions, in which each function is the contribution of one or more input variables to the output variables.
High dimensional spatial modeling of extremes with applications to United States Rainfalls
NASA Astrophysics Data System (ADS)
Zhou, Jie
2007-12-01
Spatial statistical models are used to predict unobserved variables based on observed variables and to estimate unknown model parameters. Extreme value theory(EVT) is used to study large or small observations from a random phenomenon. Both spatial statistics and extreme value theory have been studied in a lot of areas such as agriculture, finance, industry and environmental science. This dissertation proposes two spatial statistical models which concentrate on non-Gaussian probability densities with general spatial covariance structures. The two models are also applied in analyzing United States Rainfalls and especially, rainfall extremes. When the data set is not too large, the first model is used. The model constructs a generalized linear mixed model(GLMM) which can be considered as an extension of Diggle's model-based geostatistical approach(Diggle et al. 1998). The approach improves conventional kriging with a form of generalized linear mixed structure. As for high dimensional problems, two different methods are established to improve the computational efficiency of Markov Chain Monte Carlo(MCMC) implementation. The first method is based on spectral representation of spatial dependence structures which provides good approximations on each MCMC iteration. The other method embeds high dimensional covariance matrices in matrices with block circulant structures. The eigenvalues and eigenvectors of block circulant matrices can be calculated exactly by Fast Fourier Transforms(FFT). The computational efficiency is gained by transforming the posterior matrices into lower dimensional matrices. This method gives us exact update on each MCMC iteration. Future predictions are also made by keeping spatial dependence structures fixed and using the relationship between present days and future days provided by some Global Climate Model(GCM). The predictions are refined by sampling techniques. Both ways of handling high dimensional covariance matrices are novel to analyze large
Fast and accurate probability density estimation in large high dimensional astronomical datasets
NASA Astrophysics Data System (ADS)
Gupta, Pramod; Connolly, Andrew J.; Gardner, Jeffrey P.
2015-01-01
Astronomical surveys will generate measurements of hundreds of attributes (e.g. color, size, shape) on hundreds of millions of sources. Analyzing these large, high dimensional data sets will require efficient algorithms for data analysis. An example of this is probability density estimation that is at the heart of many classification problems such as the separation of stars and quasars based on their colors. Popular density estimation techniques use binning or kernel density estimation. Kernel density estimation has a small memory footprint but often requires large computational resources. Binning has small computational requirements but usually binning is implemented with multi-dimensional arrays which leads to memory requirements which scale exponentially with the number of dimensions. Hence both techniques do not scale well to large data sets in high dimensions. We present an alternative approach of binning implemented with hash tables (BASH tables). This approach uses the sparseness of data in the high dimensional space to ensure that the memory requirements are small. However hashing requires some extra computation so a priori it is not clear if the reduction in memory requirements will lead to increased computational requirements. Through an implementation of BASH tables in C++ we show that the additional computational requirements of hashing are negligible. Hence this approach has small memory and computational requirements. We apply our density estimation technique to photometric selection of quasars using non-parametric Bayesian classification and show that the accuracy of the classification is same as the accuracy of earlier approaches. Since the BASH table approach is one to three orders of magnitude faster than the earlier approaches it may be useful in various other applications of density estimation in astrostatistics.
Multiscale hierarchical support vector clustering
NASA Astrophysics Data System (ADS)
Hansen, Michael Saas; Holm, David Alberg; Sjöstrand, Karl; Ley, Carsten Dan; Rowland, Ian John; Larsen, Rasmus
2008-03-01
Clustering is the preferred choice of method in many applications, and support vector clustering (SVC) has proven efficient for clustering noisy and high-dimensional data sets. A method for multiscale support vector clustering is demonstrated, using the recently emerged method for fast calculation of the entire regularization path of the support vector domain description. The method is illustrated on artificially generated examples, and applied for detecting blood vessels from high resolution time series of magnetic resonance imaging data. The obtained results are robust while the need for parameter estimation is reduced, compared to support vector clustering.
Möbius transformational high dimensional model representation on multi-way arrays
NASA Astrophysics Data System (ADS)
Özay, Evrim Korkmaz
2012-09-01
Transformational High Dimensional Model Representation has been used for continous structures with different transformations before. This work is inventive because not only for the transformation type but also its usage. Möbius Transformational High Dimensional Model Representation has been used at multi-way arrays, by using truncation approximant and inverse transformation an approximation has been obtained for original multi-way array.
The problem of the structure (state of helium) in small He{sub N}-CO clusters
Potapov, A. V. Panfilov, V. A.; Surin, L. A.; Dumesh, B. S.
2010-11-15
A second-order perturbation theory, developed for calculating the energy levels of the He-CO binary complex, is applied to small He{sub N}-CO clusters with N = 2-4, the helium atoms being considered as a single bound object. The interaction potential between the CO molecule and HeN is represented as a linear expansion in Legendre polynomials, in which the free rotation limit is chosen as the zero approximation and the angular dependence of the interaction is considered as a small perturbation. By fitting calculated rotational transitions to experimental values it was possible to determine the optimal parameters of the potential and to achieve good agreement (to within less than 1%) between calculated and experimental energy levels. As a result, the shape of the angular anisotropy of the interaction potential is obtained for various clusters. It turns out that the minimum of the potential energy is smoothly shifted from an angle between the axes of the CO molecule and the cluster of {theta} = 100{sup o} in He-CO to {theta} = 180{sup o} (the oxygen end) in He{sub 3}-CO and He{sub 4}-CO clusters. Under the assumption that the distribution of helium atoms with respect to the cluster axis is cylindrically symmetric, the structure of the cluster can be represented as a pyramid with the CO molecule at the vertex.
NASA Astrophysics Data System (ADS)
Jiang, Boyu; Hu, Xiaoping; Gao, Hao
2016-01-01
Two-dimensional magnetic resonance spectroscopy (2D MRS) is challenging, even with state-of-art compressive sensing methods, such as L1-sparsity method. In this work, using the prior that the 2D MRS can be regarded as a series of Lorentzian functions, we aim to develop a robust Lorentzian-sparsity based spectroscopy reconstruction method for high-dimensional MRS. The proposed method sparsifies 2D MRS in Lorentzian functions. Instead of thousands of pixel-wise variables, this Lorentzian-sparsity method significantly reduces the number of unknowns to several geometric variables, such as the center, magnitude and shape parameters for each Lorentzian function. The spectroscopy reconstruction is formulated as a nonlinear and nonconvex optimization problem, and the simulated annealing algorithm is developed to solve the problem. The proposed method was compared with inverse FFT method and L1-sparsity method, under various undersampling factors. While FFT and L1 results contained severe artifacts, the Lorentzian-sparsity results provided significantly improved spectroscopy. A new 2D MRS reconstruction method is proposed using the Lorentzian sparsity, with significantly improved MRS reconstruction quality, in comparison with standard inverse FFT method or state-of-art L1-sparsity method.
SPReM: Sparse Projection Regression Model For High-dimensional Linear Regression *
Sun, Qiang; Zhu, Hongtu; Liu, Yufeng; Ibrahim, Joseph G.
2014-01-01
The aim of this paper is to develop a sparse projection regression modeling (SPReM) framework to perform multivariate regression modeling with a large number of responses and a multivariate covariate of interest. We propose two novel heritability ratios to simultaneously perform dimension reduction, response selection, estimation, and testing, while explicitly accounting for correlations among multivariate responses. Our SPReM is devised to specifically address the low statistical power issue of many standard statistical approaches, such as the Hotelling’s T2 test statistic or a mass univariate analysis, for high-dimensional data. We formulate the estimation problem of SPREM as a novel sparse unit rank projection (SURP) problem and propose a fast optimization algorithm for SURP. Furthermore, we extend SURP to the sparse multi-rank projection (SMURP) by adopting a sequential SURP approximation. Theoretically, we have systematically investigated the convergence properties of SURP and the convergence rate of SURP estimates. Our simulation results and real data analysis have shown that SPReM out-performs other state-of-the-art methods. PMID:26527844
ERIC Educational Resources Information Center
Jitendra, Asha K.; Harwell, Michael R.; Dupuis, Danielle N.; Karl, Stacy R.; Lein, Amy E.; Simonson, Gregory; Slater, Susan C.
2015-01-01
This experimental study evaluated the effectiveness of a research-based intervention, schema-based instruction (SBI), on students' proportional problem solving. SBI emphasizes the underlying mathematical structure of problems, uses schematic diagrams to represent information in the problem text, provides explicit problem-solving and metacognitive…
NASA Technical Reports Server (NTRS)
Soderblom, David R.; King, Jeremy R.; Hanson, Robert B.; Jones, Burton F.; Fischer, Debra; Stauffer, John R.; Pinsonneault, Marc H.
1998-01-01
This paper examines the discrepancy between distances to nearby open clusters as determined by parallaxes from Hipparcos compared to traditional main-sequence fitting. The biggest difference is seen for the Pleiades, and our hypothesis is that if the Hipparcos distance to the Pleiades is correct, then similar subluminous zero-age main-sequence (ZAMS) stars should exist elsewhere, including in the immediate solar neighborhood. We examine a color-magnitude diagram of very young and nearby solar-type stars and show that none of them lie below the traditional ZAMS, despite the fact that the Hipparcos Pleiades parallax would place its members 0.3 mag below that ZAMS. We also present analyses and observations of solar-type stars that do lie below the ZAMS, and we show that they are subluminous because of low metallicity and that they have the kinematics of old stars.
High-dimensional analysis of the murine myeloid cell system.
Becher, Burkhard; Schlitzer, Andreas; Chen, Jinmiao; Mair, Florian; Sumatoh, Hermi R; Teng, Karen Wei Weng; Low, Donovan; Ruedl, Christiane; Riccardi-Castagnoli, Paola; Poidinger, Michael; Greter, Melanie; Ginhoux, Florent; Newell, Evan W
2014-12-01
Advances in cell-fate mapping have revealed the complexity in phenotype, ontogeny and tissue distribution of the mammalian myeloid system. To capture this phenotypic diversity, we developed a 38-antibody panel for mass cytometry and used dimensionality reduction with machine learning-aided cluster analysis to build a composite of murine (mouse) myeloid cells in the steady state across lymphoid and nonlymphoid tissues. In addition to identifying all previously described myeloid populations, higher-order analysis allowed objective delineation of otherwise ambiguous subsets, including monocyte-macrophage intermediates and an array of granulocyte variants. Using mice that cannot sense granulocyte macrophage-colony stimulating factor GM-CSF (Csf2rb(-/-)), which have discrete alterations in myeloid development, we confirmed differences in barrier tissue dendritic cells, lung macrophages and eosinophils. The methodology further identified variations in the monocyte and innate lymphoid cell compartment that were unexpected, which confirmed that this approach is a powerful tool for unambiguous and unbiased characterization of the myeloid system. PMID:25306126
Finite-key analysis of a practical decoy-state high-dimensional quantum key distribution
NASA Astrophysics Data System (ADS)
Bao, Haize; Bao, Wansu; Wang, Yang; Zhou, Chun; Chen, Ruike
2016-05-01
Compared with two-level quantum key distribution (QKD), high-dimensional QKD enables two distant parties to share a secret key at a higher rate. We provide a finite-key security analysis for the recently proposed practical high-dimensional decoy-state QKD protocol based on time-energy entanglement. We employ two methods to estimate the statistical fluctuation of the postselection probability and give a tighter bound on the secure-key capacity. By numerical evaluation, we show the finite-key effect on the secure-key capacity in different conditions. Moreover, our approach could be used to optimize parameters in practical implementations of high-dimensional QKD.
Linear stability theory as an early warning sign for transitions in high dimensional complex systems
NASA Astrophysics Data System (ADS)
Piovani, Duccio; Grujić, Jelena; Jeldtoft Jensen, Henrik
2016-07-01
We analyse in detail a new approach to the monitoring and forecasting of the onset of transitions in high dimensional complex systems by application to the Tangled Nature model of evolutionary ecology and high dimensional replicator systems with a stochastic element. A high dimensional stability matrix is derived in the mean field approximation to the stochastic dynamics. This allows us to determine the stability spectrum about the observed quasi-stable configurations. From overlap of the instantaneous configuration vector of the full stochastic system with the eigenvectors of the unstable directions of the deterministic mean field approximation, we are able to construct a good early-warning indicator of the transitions occurring intermittently.
Nadeau, R.M.
1995-10-01
This document contains information about the characterization and application of microearthquake clusters and fault zone dynamics. Topics discussed include: Seismological studies; fault-zone dynamics; periodic recurrence; scaling of microearthquakes to large earthquakes; implications of fault mechanics and seismic hazards; and wave propagation and temporal changes.
Sivakumar, Vidyashankar; Banerjee, Arindam; Ravikumar, Pradeep
2016-01-01
We consider the problem of high-dimensional structured estimation with norm-regularized estimators, such as Lasso, when the design matrix and noise are drawn from sub-exponential distributions. Existing results only consider sub-Gaussian designs and noise, and both the sample complexity and non-asymptotic estimation error have been shown to depend on the Gaussian width of suitable sets. In contrast, for the sub-exponential setting, we show that the sample complexity and the estimation error will depend on the exponential width of the corresponding sets, and the analysis holds for any norm. Further, using generic chaining, we show that the exponential width for any set will be at most logp times the Gaussian width of the set, yielding Gaussian width based results even for the sub-exponential case. Further, for certain popular estimators, viz Lasso and Group Lasso, using a VC-dimension based analysis, we show that the sample complexity will in fact be the same order as Gaussian designs. Our general analysis and results are the first in the sub-exponential setting, and are readily applicable to special sub-exponential families such as log-concave and extreme-value distributions. PMID:27563230
Individual-based models for adaptive diversification in high-dimensional phenotype spaces.
Ispolatov, Iaroslav; Madhok, Vaibhav; Doebeli, Michael
2016-02-01
Most theories of evolutionary diversification are based on equilibrium assumptions: they are either based on optimality arguments involving static fitness landscapes, or they assume that populations first evolve to an equilibrium state before diversification occurs, as exemplified by the concept of evolutionary branching points in adaptive dynamics theory. Recent results indicate that adaptive dynamics may often not converge to equilibrium points and instead generate complicated trajectories if evolution takes place in high-dimensional phenotype spaces. Even though some analytical results on diversification in complex phenotype spaces are available, to study this problem in general we need to reconstruct individual-based models from the adaptive dynamics generating the non-equilibrium dynamics. Here we first provide a method to construct individual-based models such that they faithfully reproduce the given adaptive dynamics attractor without diversification. We then show that a propensity to diversify can be introduced by adding Gaussian competition terms that generate frequency dependence while still preserving the same adaptive dynamics. For sufficiently strong competition, the disruptive selection generated by frequency-dependence overcomes the directional evolution along the selection gradient and leads to diversification in phenotypic directions that are orthogonal to the selection gradient. PMID:26598329
A Hyperspherical Adaptive Sparse-Grid Method for High-Dimensional Discontinuity Detection
Zhang, Guannan; Webster, Clayton G.; Gunzburger, Max D.; Burkardt, John V.
2015-06-24
This study proposes and analyzes a hyperspherical adaptive hierarchical sparse-grid method for detecting jump discontinuities of functions in high-dimensional spaces. The method is motivated by the theoretical and computational inefficiencies of well-known adaptive sparse-grid methods for discontinuity detection. Our novel approach constructs a function representation of the discontinuity hypersurface of an N-dimensional discontinuous quantity of interest, by virtue of a hyperspherical transformation. Then, a sparse-grid approximation of the transformed function is built in the hyperspherical coordinate system, whose value at each point is estimated by solving a one-dimensional discontinuity detection problem. Due to the smoothness of the hypersurface, the newmore » technique can identify jump discontinuities with significantly reduced computational cost, compared to existing methods. In addition, hierarchical acceleration techniques are also incorporated to further reduce the overall complexity. Rigorous complexity analyses of the new method are provided as are several numerical examples that illustrate the effectiveness of the approach.« less
A decision-theory approach to interpretable set analysis for high-dimensional data.
Boca, Simina M; Bravo, Héctor Céorrada; Caffo, Brian; Leek, Jeffrey T; Parmigiani, Giovanni
2013-09-01
A key problem in high-dimensional significance analysis is to find pre-defined sets that show enrichment for a statistical signal of interest; the classic example is the enrichment of gene sets for differentially expressed genes. Here, we propose a new decision-theory approach to the analysis of gene sets which focuses on estimating the fraction of non-null variables in a set. We introduce the idea of "atoms," non-overlapping sets based on the original pre-defined set annotations. Our approach focuses on finding the union of atoms that minimizes a weighted average of the number of false discoveries and missed discoveries. We introduce a new false discovery rate for sets, called the atomic false discovery rate (afdr), and prove that the optimal estimator in our decision-theory framework is to threshold the afdr. These results provide a coherent and interpretable framework for the analysis of sets that addresses the key issues of overlapping annotations and difficulty in interpreting p values in both competitive and self-contained tests. We illustrate our method and compare it to a popular existing method using simulated examples, as well as gene-set and brain ROI data analyses. PMID:23909925
Defining and Evaluating Classification Algorithm for High-Dimensional Data Based on Latent Topics
Luo, Le; Li, Li
2014-01-01
Automatic text categorization is one of the key techniques in information retrieval and the data mining field. The classification is usually time-consuming when the training dataset is large and high-dimensional. Many methods have been proposed to solve this problem, but few can achieve satisfactory efficiency. In this paper, we present a method which combines the Latent Dirichlet Allocation (LDA) algorithm and the Support Vector Machine (SVM). LDA is first used to generate reduced dimensional representation of topics as feature in VSM. It is able to reduce features dramatically but keeps the necessary semantic information. The Support Vector Machine (SVM) is then employed to classify the data based on the generated features. We evaluate the algorithm on 20 Newsgroups and Reuters-21578 datasets, respectively. The experimental results show that the classification based on our proposed LDA+SVM model achieves high performance in terms of precision, recall and F1 measure. Further, it can achieve this within a much shorter time-frame. Our process improves greatly upon the previous work in this field and displays strong potential to achieve a streamlined classification process for a wide range of applications. PMID:24416136
Sparse grid discontinuous Galerkin methods for high-dimensional elliptic equations
NASA Astrophysics Data System (ADS)
Wang, Zixuan; Tang, Qi; Guo, Wei; Cheng, Yingda
2016-06-01
This paper constitutes our initial effort in developing sparse grid discontinuous Galerkin (DG) methods for high-dimensional partial differential equations (PDEs). Over the past few decades, DG methods have gained popularity in many applications due to their distinctive features. However, they are often deemed too costly because of the large degrees of freedom of the approximation space, which are the main bottleneck for simulations in high dimensions. In this paper, we develop sparse grid DG methods for elliptic equations with the aim of breaking the curse of dimensionality. Using a hierarchical basis representation, we construct a sparse finite element approximation space, reducing the degrees of freedom from the standard O (h-d) to O (h-1 |log2 h| d - 1) for d-dimensional problems, where h is the uniform mesh size in each dimension. Our method, based on the interior penalty (IP) DG framework, can achieve accuracy of O (hk |log2 h| d - 1) in the energy norm, where k is the degree of polynomials used. Error estimates are provided and confirmed by numerical tests in multi-dimensions.
Thompson, Paul M; Hayashi, Kiralee M; de Zubicaray, Greig; Janke, Andrew L; Rose, Stephen E; Semple, James; Doddrell, David M; Cannon, Tyrone D; Toga, Arthur W
2002-01-01
We briefly describe a set of algorithms to detect and visualize effects of disease and genetic factors on the brain. Extreme variations in cortical anatomy, even among normal subjects, complicate the detection and mapping of systematic effects on brain structure in human populations. We tackle this problem in two stages. First, we develop a cortical pattern matching approach, based on metrically covariant partial differential equations (PDEs), to associate corresponding regions of cortex in an MRI brain image database (N=102 scans). Second, these high-dimensional deformation maps are used to transfer within-subject cortical signals, including measures of gray matter distribution, shape asymmetries, and degenerative rates, to a common anatomic template for statistical analysis. We illustrate these techniques in two applications: (1) mapping dynamic patterns of gray matter loss in longitudinally scanned Alzheimer's disease patients; and (2) mapping genetic influences on brain structure. We extend statistics used widely in behavioral genetics to cortical manifolds. Specifically, we introduce methods based on h-squared distributed random fields to map hereditary influences on brain structure in human populations. PMID:19759832
A hyper-spherical adaptive sparse-grid method for high-dimensional discontinuity detection
Zhang, Guannan; Webster, Clayton G; Gunzburger, Max D; Burkardt, John V
2014-03-01
This work proposes and analyzes a hyper-spherical adaptive hi- erarchical sparse-grid method for detecting jump discontinuities of functions in high-dimensional spaces is proposed. The method is motivated by the the- oretical and computational inefficiencies of well-known adaptive sparse-grid methods for discontinuity detection. Our novel approach constructs a func- tion representation of the discontinuity hyper-surface of an N-dimensional dis- continuous quantity of interest, by virtue of a hyper-spherical transformation. Then, a sparse-grid approximation of the transformed function is built in the hyper-spherical coordinate system, whose value at each point is estimated by solving a one-dimensional discontinuity detection problem. Due to the smooth- ness of the hyper-surface, the new technique can identify jump discontinuities with significantly reduced computational cost, compared to existing methods. Moreover, hierarchical acceleration techniques are also incorporated to further reduce the overall complexity. Rigorous error estimates and complexity anal- yses of the new method are provided as are several numerical examples that illustrate the effectiveness of the approach.
Approximating high-dimensional dynamics by barycentric coordinates with linear programming
Hirata, Yoshito Aihara, Kazuyuki; Suzuki, Hideyuki; Shiro, Masanori; Takahashi, Nozomu; Mas, Paloma
2015-01-15
The increasing development of novel methods and techniques facilitates the measurement of high-dimensional time series but challenges our ability for accurate modeling and predictions. The use of a general mathematical model requires the inclusion of many parameters, which are difficult to be fitted for relatively short high-dimensional time series observed. Here, we propose a novel method to accurately model a high-dimensional time series. Our method extends the barycentric coordinates to high-dimensional phase space by employing linear programming, and allowing the approximation errors explicitly. The extension helps to produce free-running time-series predictions that preserve typical topological, dynamical, and/or geometric characteristics of the underlying attractors more accurately than the radial basis function model that is widely used. The method can be broadly applied, from helping to improve weather forecasting, to creating electronic instruments that sound more natural, and to comprehensively understanding complex biological data.
Hagen, Nathan; Kester, Robert T.; Gao, Liang; Tkaczyk, Tomasz S.
2012-01-01
The snapshot advantage is a large increase in light collection efficiency available to high-dimensional measurement systems that avoid filtering and scanning. After discussing this advantage in the context of imaging spectrometry, where the greatest effort towards developing snapshot systems has been made, we describe the types of measurements where it is applicable. We then generalize it to the larger context of high-dimensional measurements, where the advantage increases geometrically with measurement dimensionality. PMID:22791926
Nonlocality of high-dimensional two-photon orbital angular momentum states
Aiello, A.; Oemrawsingh, S. S. R.; Eliel, E. R.; Woerdman, J. P.
2005-11-15
We propose an interferometric method to investigate the nonlocality of high-dimensional two-photon orbital angular momentum states generated by spontaneous parametric down conversion. We incorporate two half-integer spiral phase plates and a variable-reflectivity output beam splitter into a Mach-Zehnder interferometer to build an orbital angular momentum analyzer. This setup enables testing the nonlocality of high-dimensional two-photon states by repeated use of the Clauser-Horne-Shimony-Holt inequality.
Modeling change from large-scale high-dimensional spatio-temporal array data
NASA Astrophysics Data System (ADS)
Lu, Meng; Pebesma, Edzer
2014-05-01
The massive data that come from Earth observation satellite and other sensors provide significant information for modeling global change. At the same time, the high dimensionality of the data has brought challenges in data acquisition, management, effective querying and processing. In addition, the output of earth system modeling tends to be data intensive and needs methodologies for storing, validation, analyzing and visualization, e.g. as maps. An important proportion of earth system observations and simulated data can be represented as multi-dimensional array data, which has received increasingly attention in big data management and spatial-temporal analysis. Study cases will be developed in natural science such as climate change, hydrological modeling, sediment dynamics, from which the addressing of big data problems is necessary. Multi-dimensional array-based database management and analytics system such as Rasdaman, SciDB, and R will be applied to these cases. From these studies will hope to learn the strengths and weaknesses of these systems, how they might work together or how semantics of array operations differ, through addressing the problems associated with big data. Research questions include: • How can we reduce dimensions spatially and temporally, or thematically? • How can we extend existing GIS functions to work on multidimensional arrays? • How can we combine data sets of different dimensionality or different resolutions? • Can map algebra be extended to an intelligible array algebra? • What are effective semantics for array programming of dynamic data driven applications? • In which sense are space and time special, as dimensions, compared to other properties? • How can we make the analysis of multi-spectral, multi-temporal and multi-sensor earth observation data easy?
NASA Astrophysics Data System (ADS)
Tripathy, Rohit; Bilionis, Ilias; Gonzalez, Marcial
2016-09-01
Uncertainty quantification (UQ) tasks, such as model calibration, uncertainty propagation, and optimization under uncertainty, typically require several thousand evaluations of the underlying computer codes. To cope with the cost of simulations, one replaces the real response surface with a cheap surrogate based, e.g., on polynomial chaos expansions, neural networks, support vector machines, or Gaussian processes (GP). However, the number of simulations required to learn a generic multivariate response grows exponentially as the input dimension increases. This curse of dimensionality can only be addressed, if the response exhibits some special structure that can be discovered and exploited. A wide range of physical responses exhibit a special structure known as an active subspace (AS). An AS is a linear manifold of the stochastic space characterized by maximal response variation. The idea is that one should first identify this low dimensional manifold, project the high-dimensional input onto it, and then link the projection to the output. If the dimensionality of the AS is low enough, then learning the link function is a much easier problem than the original problem of learning a high-dimensional function. The classic approach to discovering the AS requires gradient information, a fact that severely limits its applicability. Furthermore, and partly because of its reliance to gradients, it is not able to handle noisy observations. The latter is an essential trait if one wants to be able to propagate uncertainty through stochastic simulators, e.g., through molecular dynamics codes. In this work, we develop a probabilistic version of AS which is gradient-free and robust to observational noise. Our approach relies on a novel Gaussian process regression with built-in dimensionality reduction. In particular, the AS is represented as an orthogonal projection matrix that serves as yet another covariance function hyper-parameter to be estimated from the data. To train the
A Personalized Collaborative Recommendation Approach Based on Clustering of Customers
NASA Astrophysics Data System (ADS)
Wang, Pu
Collaborative filtering has been known to be the most successful recommender techniques in recommendation systems. Collaborative methods recommend items based on aggregated user ratings of those items and these techniques do not depend on the availability of textual descriptions. They share the common goal of assisting in the users search for items of interest, and thus attempt to address one of the key research problems of the information overload. Collaborative filtering systems can deal with large numbers of customers and with many different products. However there is a problem that the set of ratings is sparse, such that any two customers will most likely have only a few co-rated products. The high dimensional sparsity of the rating matrix and the problem of scalability result in low quality recommendations. In this paper, a personalized collaborative recommendation approach based on clustering of customers is presented. This method uses the clustering technology to form the customers centers. The personalized collaborative filtering approach based on clustering of customers can alleviate the scalability problem in the collaborative recommendations.
Machine learning etudes in astrophysics: selection functions for mock cluster catalogs
Hajian, Amir; Alvarez, Marcelo A.; Bond, J. Richard E-mail: malvarez@cita.utoronto.ca
2015-01-01
Making mock simulated catalogs is an important component of astrophysical data analysis. Selection criteria for observed astronomical objects are often too complicated to be derived from first principles. However the existence of an observed group of objects is a well-suited problem for machine learning classification. In this paper we use one-class classifiers to learn the properties of an observed catalog of clusters of galaxies from ROSAT and to pick clusters from mock simulations that resemble the observed ROSAT catalog. We show how this method can be used to study the cross-correlations of thermal Sunya'ev-Zeldovich signals with number density maps of X-ray selected cluster catalogs. The method reduces the bias due to hand-tuning the selection function and is readily scalable to large catalogs with a high-dimensional space of astrophysical features.
A rough set based rational clustering framework for determining correlated genes.
Jeyaswamidoss, Jeba Emilyn; Thangaraj, Kesavan; Ramar, Kadarkarai; Chitra, Muthusamy
2016-06-01
Cluster analysis plays a foremost role in identifying groups of genes that show similar behavior under a set of experimental conditions. Several clustering algorithms have been proposed for identifying gene behaviors and to understand their significance. The principal aim of this work is to develop an intelligent rough clustering technique, which will efficiently remove the irrelevant dimensions in a high-dimensional space and obtain appropriate meaningful clusters. This paper proposes a novel biclustering technique that is based on rough set theory. The proposed algorithm uses correlation coefficient as a similarity measure to simultaneously cluster both the rows and columns of a gene expression data matrix and mean squared residue to generate the initial biclusters. Furthermore, the biclusters are refined to form the lower and upper boundaries by determining the membership of the genes in the clusters using mean squared residue. The algorithm is illustrated with yeast gene expression data and the experiment proves the effectiveness of the method. The main advantage is that it overcomes the problem of selection of initial clusters and also the restriction of one object belonging to only one cluster by allowing overlapping of biclusters. PMID:27352972
Neugebauer, Romain; Schmittdiel, Julie A; Zhu, Zheng; Rassen, Jeremy A; Seeger, John D; Schneeweiss, Sebastian
2015-02-28
The high-dimensional propensity score (hdPS) algorithm was proposed for automation of confounding adjustment in problems involving large healthcare databases. It has been evaluated in comparative effectiveness research (CER) with point treatments to handle baseline confounding through matching or covariance adjustment on the hdPS. In observational studies with time-varying interventions, such hdPS approaches are often inadequate to handle time-dependent confounding and selection bias. Inverse probability weighting (IPW) estimation to fit marginal structural models can adequately handle these biases under the fundamental assumption of no unmeasured confounders. Upholding of this assumption relies on the selection of an adequate set of covariates for bias adjustment. We describe the application and performance of the hdPS algorithm to improve covariate selection in CER with time-varying interventions based on IPW estimation and explore stabilization of the resulting estimates using Super Learning. The evaluation is based on both the analysis of electronic health records data in a real-world CER study of adults with type 2 diabetes and a simulation study. This report (i) establishes the feasibility of IPW estimation with the hdPS algorithm based on large electronic health records databases, (ii) demonstrates little impact on inferences when supplementing the set of expert-selected covariates using the hdPS algorithm in a setting with extensive background knowledge, (iii) supports the application of the hdPS algorithm in discovery settings with little background knowledge or limited data availability, and (iv) motivates the application of Super Learning to stabilize effect estimates based on the hdPS algorithm. PMID:25488047
Reduced basis ANOVA methods for partial differential equations with high-dimensional random inputs
NASA Astrophysics Data System (ADS)
Liao, Qifeng; Lin, Guang
2016-07-01
In this paper we present a reduced basis ANOVA approach for partial deferential equations (PDEs) with random inputs. The ANOVA method combined with stochastic collocation methods provides model reduction in high-dimensional parameter space through decomposing high-dimensional inputs into unions of low-dimensional inputs. In this work, to further reduce the computational cost, we investigate spatial low-rank structures in the ANOVA-collocation method, and develop efficient spatial model reduction techniques using hierarchically generated reduced bases. We present a general mathematical framework of the methodology, validate its accuracy and demonstrate its efficiency with numerical experiments.
Kandrup, H.E.; Morrison, P.J.
1992-11-01
The Hamiltonian formulation of the Vlasov-Einstein system, which is appropriate for collisionless, self-gravitating systems like clusters of stars that are so dense that gravity must be described by the Einstein equation, is presented. In particular, it is demonstrated explicitly in the context of a 3 + 1 splitting that, for spherically symmetric configurations, the Vlasov-Einstein system can be viewed as a Hamiltonian system, where the dynamics is generated by a noncanonical Poisson bracket, with the Hamiltonian generating the evolution of the distribution function f (a noncanonical variable) being the conserved ADM mass-energy H{sub ADM}. An explicit expression is derived for the energy {delta}({sup 2})H{sub ADM} associated with an arbitrary phase space preserving perturbation of an arbitrary spherical equilibrium, and it is shown that the equilibrium must be linearly stable if {delta}({sup 2})H{sub ADM} is positive semi-definite. Insight into the Hamiltonian reformulation is provided by a description of general finite degree of freedom systems.
Kandrup, H.E. ); Morrison, P.J. . Inst. for Fusion Studies)
1992-11-01
The Hamiltonian formulation of the Vlasov-Einstein system, which is appropriate for collisionless, self-gravitating systems like clusters of stars that are so dense that gravity must be described by the Einstein equation, is presented. In particular, it is demonstrated explicitly in the context of a 3 + 1 splitting that, for spherically symmetric configurations, the Vlasov-Einstein system can be viewed as a Hamiltonian system, where the dynamics is generated by a noncanonical Poisson bracket, with the Hamiltonian generating the evolution of the distribution function f (a noncanonical variable) being the conserved ADM mass-energy H[sub ADM]. An explicit expression is derived for the energy [delta]([sup 2])H[sub ADM] associated with an arbitrary phase space preserving perturbation of an arbitrary spherical equilibrium, and it is shown that the equilibrium must be linearly stable if [delta]([sup 2])H[sub ADM] is positive semi-definite. Insight into the Hamiltonian reformulation is provided by a description of general finite degree of freedom systems.
NASA Astrophysics Data System (ADS)
Ahmad, Farooq; Malik, Manzoor A.; Bhat, M. Maqbool
2016-07-01
We derive the spatial pair correlation function in gravitational clustering for extended structure of galaxies (e.g. galaxies with halos) by using statistical mechanics of cosmological many-body problem. Our results indicate that in the limit of point masses (ɛ=0) the two-point correlation function varies as inverse square of relative separation of two galaxies. The effect of softening parameter `ɛ' on the pair correlation function is also studied and results indicate that two-point correlation function is affected by the softening parameter when the distance between galaxies is small. However, for larger distance between galaxies, the two-point correlation function is not affected at all. The correlation length r0 derived by our method depends on the random dispersion velocities < v2rangle^{1/2} and mean number density bar{n}, which is in agreement with N-body simulations and observations. Further, our results are applicable to the clusters of galaxies for their correlation functions and we apply our results to obtain the correlation length r0 for such systems which again agrees with the data of N-body simulations and observations.
van der Leij, Aryan; Blok, Henk; de Jong, Peter F.
2010-01-01
This study investigated the role of speech perception accuracy and speed in fluent word decoding of reading disabled (RD) children. A same-different phoneme discrimination task with natural speech tested the perception of single consonants and consonant clusters by young but persistent RD children. RD children were slower than chronological age (CA) controls in recognizing identical sounds, suggesting less distinct phonemic categories. In addition, after controlling for phonetic similarity Tallal’s (Brain Lang 9:182–198, 1980) fast transitions account of RD children’s speech perception problems was contrasted with Studdert-Kennedy’s (Read Writ Interdiscip J 15:5–14, 2002) similarity explanation. Results showed no specific RD deficit in perceiving fast transitions. Both phonetic similarity and fast transitions influenced accurate speech perception for RD children as well as CA controls. PMID:20652455
Average Transient Lifetime and Lyapunov Dimension for Transient Chaos in a High-Dimensional System
NASA Astrophysics Data System (ADS)
Chen, Hong; Tang, Jian-Xin; Tang, Shao-Yan; Xiang, Hong; Chen, Xin
2001-11-01
The average transient lifetime of a chaotic transient versus the Lyapunov dimension of a chaotic saddle is studied for high-dimensional nonlinear dynamical systems. Typically the average lifetime depends upon not only the system parameter but also the Lyapunov dimension of the chaotic saddle. The numerical example uses the delayed feedback differential equation.
Controlling chaos in a high dimensional system with periodic parametric perturbations
Mirus, K.A.; Sprott, J.C.
1998-10-01
The effect of applying a periodic perturbation to an accessible parameter of a high-dimensional (coupled-Lorenz) chaotic system is examined. Numerical results indicate that perturbation frequencies near the natural frequencies of the unstable periodic orbits of the chaotic system can result in limit cycles or significantly reduced dimension for relatively small perturbations.
Salient Region Detection via High-Dimensional Color Transform and Local Spatial Support.
Kim, Jiwhan; Han, Dongyoon; Tai, Yu-Wing; Kim, Junmo
2016-01-01
In this paper, we introduce a novel approach to automatically detect salient regions in an image. Our approach consists of global and local features, which complement each other to compute a saliency map. The first key idea of our work is to create a saliency map of an image by using a linear combination of colors in a high-dimensional color space. This is based on an observation that salient regions often have distinctive colors compared with backgrounds in human perception, however, human perception is complicated and highly nonlinear. By mapping the low-dimensional red, green, and blue color to a feature vector in a high-dimensional color space, we show that we can composite an accurate saliency map by finding the optimal linear combination of color coefficients in the high-dimensional color space. To further improve the performance of our saliency estimation, our second key idea is to utilize relative location and color contrast between superpixels as features and to resolve the saliency estimation from a trimap via a learning-based algorithm. The additional local features and learning-based algorithm complement the global estimation from the high-dimensional color transform-based algorithm. The experimental results on three benchmark datasets show that our approach is effective in comparison with the previous state-of-the-art saliency estimation methods. PMID:26529764
High-Dimensional Exploratory Item Factor Analysis by a Metropolis-Hastings Robbins-Monro Algorithm
ERIC Educational Resources Information Center
Cai, Li
2010-01-01
A Metropolis-Hastings Robbins-Monro (MH-RM) algorithm for high-dimensional maximum marginal likelihood exploratory item factor analysis is proposed. The sequence of estimates from the MH-RM algorithm converges with probability one to the maximum likelihood solution. Details on the computer implementation of this algorithm are provided. The…
NASA Astrophysics Data System (ADS)
Denis, Pablo A.
2014-04-01
By means of coupled cluster theory and correlation consistent basis sets we investigated the thermochemistry of dimethyl sulphide (DMS), dimethyl disulphide (DMDS) and four closely related sulphur-containing molecules: CH3SS, CH3S, CH3SH and CH3CH2SH. For the four closed-shell molecules studied, their enthalpies of formation (EOFs) were derived using bomb calorimetry. We found that the deviation of the EOF with respect to experiment was 0.96, 0.65, 1.24 and 1.29 kcal/mol, for CH3SH, CH3CH2SH, DMS and DMDS, respectively, when ΔHf,0 = 65.6 kcal/mol was utilised (JANAF value). However, if the recently proposed ΔHf,0 = 66.2 kcal/mol was used to estimate EOF, the errors dropped to 0.36, 0.05, 0.64 and 0.09 kcal/mol, respectively. In contrast, for the CH3SS radical, a better agreement with experiment was obtained if the 65.6 kcal/mol value was used. To compare with experiment avoiding the problem of the ΔHf,0 (S), we determined the CH3-S and CH3-SS bond dissociation energies (BDEs) in CH3S and CH3SS. At the coupled cluster with singles doubles and perturbative triples correction level of theory, these values are 48.0 and 71.4 kcal/mol, respectively. The latter BDEs are 1.5 and 1.2 kcal/mol larger than the experimental values. The agreement can be considered to be acceptable if we take into consideration that these two radicals present important challenges when determining their EOFs. It is our hope that this work stimulates new studies which help elucidate the problem of the EOF of atomic sulphur.
Xue, Hongqi; Wu, Yichao; Wu, Hulin
2013-01-01
In many regression problems, the relations between the covariates and the response may be nonlinear. Motivated by the application of reconstructing a gene regulatory network, we consider a sparse high-dimensional additive model with the additive components being some known nonlinear functions with unknown parameters. To identify the subset of important covariates, we propose a new method for simultaneous variable selection and parameter estimation by iteratively combining a large-scale variable screening (the nonlinear independence screening, NLIS) and a moderate-scale model selection (the nonnegative garrote, NNG) for the nonlinear additive regressions. We have shown that the NLIS procedure possesses the sure screening property and it is able to handle problems with non-polynomial dimensionality; and for finite dimension problems, the NNG for the nonlinear additive regressions has selection consistency for the unimportant covariates and also estimation consistency for the parameter estimates of the important covariates. The proposed method is applied to simulated data and a real data example for identifying gene regulations to illustrate its numerical performance. PMID:25170239
2009-01-01
Background The Screening Inventory of Psychosocial Problems (SIPP) is a short, validated self-reported questionnaire to identify psychosocial problems in Dutch cancer patients. The one-page 24-item questionnaire assesses physical complaints, psychological complaints and social and sexual problems. Very little is known about the effects of using the SIPP in consultation settings. Our study aims are to test the hypotheses that using the SIPP (a) may contribute to adequate referral to relevant psychosocial caregivers, (b) should facilitate communication between radiotherapists and cancer patients about psychosocial distress and (c) may prevent underdiagnosis of early symptoms reflecting psychosocial problems. This paper presents the design of a cluster randomised controlled trial (CRCT) evaluating the effectiveness of using the SIPP in cancer patients treated with radiotherapy. Methods/Design A CRCT is developed using a Solomon four-group design (two intervention and two control groups) to evaluate the effects of using the SIPP. Radiotherapists, instead of cancer patients, are randomly allocated to the experimental or control groups. Within these groups, all included cancer patients are randomised into two subgroups: with and without pre-measurement. Self-reported assessments are conducted at four times: a pre-test at baseline before the first consultation and a post-test directly following the first consultation, and three and 12 months after baseline measurement. The primary outcome measures are the number and types of referrals of cancer patients with psychosocial problems to relevant (psychosocial) caregivers. The secondary outcome measures are patients' satisfaction with the radiotherapist-patient communication, psychosocial distress and quality of life. Furthermore, a process evaluation will be carried out. Data of the effect-evaluation will be analysed according to the intention-to-treat principle and data regarding the types of referrals to health care
Survey on granularity clustering.
Ding, Shifei; Du, Mingjing; Zhu, Hong
2015-12-01
With the rapid development of uncertain artificial intelligent and the arrival of big data era, conventional clustering analysis and granular computing fail to satisfy the requirements of intelligent information processing in this new case. There is the essential relationship between granular computing and clustering analysis, so some researchers try to combine granular computing with clustering analysis. In the idea of granularity, the researchers expand the researches in clustering analysis and look for the best clustering results with the help of the basic theories and methods of granular computing. Granularity clustering method which is proposed and studied has attracted more and more attention. This paper firstly summarizes the background of granularity clustering and the intrinsic connection between granular computing and clustering analysis, and then mainly reviews the research status and various methods of granularity clustering. Finally, we analyze existing problem and propose further research. PMID:26557926
Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data
Deng, Yi; Chang, Changgee; Ido, Moges Seyoum; Long, Qi
2016-01-01
Multiple imputation (MI) has been widely used for handling missing data in biomedical research. In the presence of high-dimensional data, regularized regression has been used as a natural strategy for building imputation models, but limited research has been conducted for handling general missing data patterns where multiple variables have missing values. Using the idea of multiple imputation by chained equations (MICE), we investigate two approaches of using regularized regression to impute missing values of high-dimensional data that can handle general missing data patterns. We compare our MICE methods with several existing imputation methods in simulation studies. Our simulation results demonstrate the superiority of the proposed MICE approach based on an indirect use of regularized regression in terms of bias. We further illustrate the proposed methods using two data examples. PMID:26868061
A Shell Multi-dimensional Hierarchical Cubing Approach for High-Dimensional Cube
NASA Astrophysics Data System (ADS)
Zou, Shuzhi; Zhao, Li; Hu, Kongfa
The pre-computation of data cubes is critical for improving the response time of OLAP systems and accelerating data mining tasks in large data warehouses. However, as the sizes of data warehouses grow, the time it takes to perform this pre-computation becomes a significant performance bottleneck. In a high dimensional data warehouse, it might not be practical to build all these cuboids and their indices. In this paper, we propose a shell multi-dimensional hierarchical cubing algorithm, based on an extension of the previous minimal cubing approach. This method partitions the high dimensional data cube into low multi-dimensional hierarchical cube. Experimental results show that the proposed method is significantly more efficient than other existing cubing methods.
Buckley-James Boosting for Survival Analysis with High-Dimensional Biomarker Data*
Wang, Zhu; Wang, C.Y.
2010-01-01
There has been increasing interest in predicting patients’ survival after therapy by investigating gene expression microarray data. In the regression and classification models with high-dimensional genomic data, boosting has been successfully applied to build accurate predictive models and conduct variable selection simultaneously. We propose the Buckley-James boosting for the semiparametric accelerated failure time models with right censored survival data, which can be used to predict survival of future patients using the high-dimensional genomic data. In the spirit of adaptive LASSO, twin boosting is also incorporated to fit more sparse models. The proposed methods have a unified approach to fit linear models, non-linear effects models with possible interactions. The methods can perform variable selection and parameter estimation simultaneously. The proposed methods are evaluated by simulations and applied to a recent microarray gene expression data set for patients with diffuse large B-cell lymphoma under the current gold standard therapy. PMID:20597850
Simple, Scalable Proteomic Imaging for High-Dimensional Profiling of Intact Systems.
Murray, Evan; Cho, Jae Hun; Goodwin, Daniel; Ku, Taeyun; Swaney, Justin; Kim, Sung-Yon; Choi, Heejin; Park, Young-Gyun; Park, Jeong-Yoon; Hubbert, Austin; McCue, Margaret; Vassallo, Sara; Bakh, Naveed; Frosch, Matthew P; Wedeen, Van J; Seung, H Sebastian; Chung, Kwanghun
2015-12-01
Combined measurement of diverse molecular and anatomical traits that span multiple levels remains a major challenge in biology. Here, we introduce a simple method that enables proteomic imaging for scalable, integrated, high-dimensional phenotyping of both animal tissues and human clinical samples. This method, termed SWITCH, uniformly secures tissue architecture, native biomolecules, and antigenicity across an entire system by synchronizing the tissue preservation reaction. The heat- and chemical-resistant nature of the resulting framework permits multiple rounds (>20) of relabeling. We have performed 22 rounds of labeling of a single tissue with precise co-registration of multiple datasets. Furthermore, SWITCH synchronizes labeling reactions to improve probe penetration depth and uniformity of staining. With SWITCH, we performed combinatorial protein expression profiling of the human cortex and also interrogated the geometric structure of the fiber pathways in mouse brains. Such integrated high-dimensional information may accelerate our understanding of biological systems at multiple levels. PMID:26638076
Compressively Characterizing High-Dimensional Entangled States with Complementary, Random Filtering
NASA Astrophysics Data System (ADS)
Howland, Gregory A.; Knarr, Samuel H.; Schneeloch, James; Lum, Daniel J.; Howell, John C.
2016-04-01
The resources needed to conventionally characterize a quantum system are overwhelmingly large for high-dimensional systems. This obstacle may be overcome by abandoning traditional cornerstones of quantum measurement, such as general quantum states, strong projective measurement, and assumption-free characterization. Following this reasoning, we demonstrate an efficient technique for characterizing high-dimensional, spatial entanglement with one set of measurements. We recover sharp distributions with local, random filtering of the same ensemble in momentum followed by position—something the uncertainty principle forbids for projective measurements. Exploiting the expectation that entangled signals are highly correlated, we use fewer than 5000 measurements to characterize a 65,536-dimensional state. Finally, we use entropic inequalities to witness entanglement without a density matrix. Our method represents the sea change unfolding in quantum measurement, where methods influenced by the information theory and signal-processing communities replace unscalable, brute-force techniques—a progression previously followed by classical sensing.
Gentry, Amanda Elswick; Jackson-Cook, Colleen K; Lyon, Debra E; Archer, Kellie J
2015-01-01
The pathological description of the stage of a tumor is an important clinical designation and is considered, like many other forms of biomedical data, an ordinal outcome. Currently, statistical methods for predicting an ordinal outcome using clinical, demographic, and high-dimensional correlated features are lacking. In this paper, we propose a method that fits an ordinal response model to predict an ordinal outcome for high-dimensional covariate spaces. Our method penalizes some covariates (high-throughput genomic features) without penalizing others (such as demographic and/or clinical covariates). We demonstrate the application of our method to predict the stage of breast cancer. In our model, breast cancer subtype is a nonpenalized predictor, and CpG site methylation values from the Illumina Human Methylation 450K assay are penalized predictors. The method has been made available in the ordinalgmifs package in the R programming environment. PMID:26052223
Prediction of high-dimensional states subject to respiratory motion: a manifold learning approach
NASA Astrophysics Data System (ADS)
Liu, Wenyang; Sawant, Amit; Ruan, Dan
2016-07-01
The development of high-dimensional imaging systems in image-guided radiotherapy provides important pathways to the ultimate goal of real-time full volumetric motion monitoring. Effective motion management during radiation treatment usually requires prediction to account for system latency and extra signal/image processing time. It is challenging to predict high-dimensional respiratory motion due to the complexity of the motion pattern combined with the curse of dimensionality. Linear dimension reduction methods such as PCA have been used to construct a linear subspace from the high-dimensional data, followed by efficient predictions on the lower-dimensional subspace. In this study, we extend such rationale to a more general manifold and propose a framework for high-dimensional motion prediction with manifold learning, which allows one to learn more descriptive features compared to linear methods with comparable dimensions. Specifically, a kernel PCA is used to construct a proper low-dimensional feature manifold, where accurate and efficient prediction can be performed. A fixed-point iterative pre-image estimation method is used to recover the predicted value in the original state space. We evaluated and compared the proposed method with a PCA-based approach on level-set surfaces reconstructed from point clouds captured by a 3D photogrammetry system. The prediction accuracy was evaluated in terms of root-mean-squared-error. Our proposed method achieved consistent higher prediction accuracy (sub-millimeter) for both 200 ms and 600 ms lookahead lengths compared to the PCA-based approach, and the performance gain was statistically significant.
Prediction of high-dimensional states subject to respiratory motion: a manifold learning approach.
Liu, Wenyang; Sawant, Amit; Ruan, Dan
2016-07-01
The development of high-dimensional imaging systems in image-guided radiotherapy provides important pathways to the ultimate goal of real-time full volumetric motion monitoring. Effective motion management during radiation treatment usually requires prediction to account for system latency and extra signal/image processing time. It is challenging to predict high-dimensional respiratory motion due to the complexity of the motion pattern combined with the curse of dimensionality. Linear dimension reduction methods such as PCA have been used to construct a linear subspace from the high-dimensional data, followed by efficient predictions on the lower-dimensional subspace. In this study, we extend such rationale to a more general manifold and propose a framework for high-dimensional motion prediction with manifold learning, which allows one to learn more descriptive features compared to linear methods with comparable dimensions. Specifically, a kernel PCA is used to construct a proper low-dimensional feature manifold, where accurate and efficient prediction can be performed. A fixed-point iterative pre-image estimation method is used to recover the predicted value in the original state space. We evaluated and compared the proposed method with a PCA-based approach on level-set surfaces reconstructed from point clouds captured by a 3D photogrammetry system. The prediction accuracy was evaluated in terms of root-mean-squared-error. Our proposed method achieved consistent higher prediction accuracy (sub-millimeter) for both 200 ms and 600 ms lookahead lengths compared to the PCA-based approach, and the performance gain was statistically significant. PMID:27299958
Maximal violation of tight Bell inequalities for maximal high-dimensional entanglement
Lee, Seung-Woo; Jaksch, Dieter
2009-07-15
We propose a Bell inequality for high-dimensional bipartite systems obtained by binning local measurement outcomes and show that it is tight. We find a binning method for even d-dimensional measurement outcomes for which this Bell inequality is maximally violated by maximally entangled states. Furthermore, we demonstrate that the Bell inequality is applicable to continuous variable systems and yields strong violations for two-mode squeezed states.
Plurigon: three dimensional visualization and classification of high-dimensionality data
Martin, Bronwen; Chen, Hongyu; Daimon, Caitlin M.; Chadwick, Wayne; Siddiqui, Sana; Maudsley, Stuart
2013-01-01
High-dimensionality data is rapidly becoming the norm for biomedical sciences and many other analytical disciplines. Not only is the collection and processing time for such data becoming problematic, but it has become increasingly difficult to form a comprehensive appreciation of high-dimensionality data. Though data analysis methods for coping with multivariate data are well-documented in technical fields such as computer science, little effort is currently being expended to condense data vectors that exist beyond the realm of physical space into an easily interpretable and aesthetic form. To address this important need, we have developed Plurigon, a data visualization and classification tool for the integration of high-dimensionality visualization algorithms with a user-friendly, interactive graphical interface. Unlike existing data visualization methods, which are focused on an ensemble of data points, Plurigon places a strong emphasis upon the visualization of a single data point and its determining characteristics. Multivariate data vectors are represented in the form of a deformed sphere with a distinct topology of hills, valleys, plateaus, peaks, and crevices. The gestalt structure of the resultant Plurigon object generates an easily-appreciable model. User interaction with the Plurigon is extensive; zoom, rotation, axial and vector display, feature extraction, and anaglyph stereoscopy are currently supported. With Plurigon and its ability to analyze high-complexity data, we hope to see a unification of biomedical and computational sciences as well as practical applications in a wide array of scientific disciplines. Increased accessibility to the analysis of high-dimensionality data may increase the number of new discoveries and breakthroughs, ranging from drug screening to disease diagnosis to medical literature mining. PMID:23885241
Controlling chaos in low and high dimensional systems with periodic parametric perturbations
Mirus, K.A.; Sprott, J.C.
1998-06-01
The effect of applying a periodic perturbation to an accessible parameter of various chaotic systems is examined. Numerical results indicate that perturbation frequencies near the natural frequencies of the unstable periodic orbits of the chaotic systems can result in limit cycles for relatively small perturbations. Such perturbations can also control or significantly reduce the dimension of high-dimensional systems. Initial application to the control of fluctuations in a prototypical magnetic fusion plasma device will be reviewed.
Lessons learned in the analysis of high-dimensional data in vaccinomics.
Oberg, Ann L; McKinney, Brett A; Schaid, Daniel J; Pankratz, V Shane; Kennedy, Richard B; Poland, Gregory A
2015-09-29
The field of vaccinology is increasingly moving toward the generation, analysis, and modeling of extremely large and complex high-dimensional datasets. We have used data such as these in the development and advancement of the field of vaccinomics to enable prediction of vaccine responses and to develop new vaccine candidates. However, the application of systems biology to what has been termed "big data," or "high-dimensional data," is not without significant challenges-chief among them a paucity of gold standard analysis and modeling paradigms with which to interpret the data. In this article, we relate some of the lessons we have learned over the last decade of working with high-dimensional, high-throughput data as applied to the field of vaccinomics. The value of such efforts, however, is ultimately to better understand the immune mechanisms by which protective and non-protective responses to vaccines are generated, and to use this information to support a personalized vaccinology approach in creating better, and safer, vaccines for the public health. PMID:25957070
Unbiased Feature Selection in Learning Random Forests for High-Dimensional Data
Nguyen, Thanh-Tung; Huang, Joshua Zhexue; Nguyen, Thuy Thi
2015-01-01
Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting. This makes RFs have poor accuracy when working with high-dimensional data. Besides that, RFs have bias in the feature selection process where multivalued features are favored. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features in learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees, while allowing one to reduce dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets including image datasets. The experimental results have shown that RFs with the proposed approach outperformed the existing random forests in increasing the accuracy and the AUC measures. PMID:25879059
Algamal, Zakariya Yahya; Lee, Muhammad Hisyam
2015-12-01
Cancer classification and gene selection in high-dimensional data have been popular research topics in genetics and molecular biology. Recently, adaptive regularized logistic regression using the elastic net regularization, which is called the adaptive elastic net, has been successfully applied in high-dimensional cancer classification to tackle both estimating the gene coefficients and performing gene selection simultaneously. The adaptive elastic net originally used elastic net estimates as the initial weight, however, using this weight may not be preferable for certain reasons: First, the elastic net estimator is biased in selecting genes. Second, it does not perform well when the pairwise correlations between variables are not high. Adjusted adaptive regularized logistic regression (AAElastic) is proposed to address these issues and encourage grouping effects simultaneously. The real data results indicate that AAElastic is significantly consistent in selecting genes compared to the other three competitor regularization methods. Additionally, the classification performance of AAElastic is comparable to the adaptive elastic net and better than other regularization methods. Thus, we can conclude that AAElastic is a reliable adaptive regularized logistic regression method in the field of high-dimensional cancer classification. PMID:26520484
Towards reliable multi-pathogen biosensors using high-dimensional encoding and decoding techniques
NASA Astrophysics Data System (ADS)
Chakrabartty, Shantanu; Liu, Yang
2008-08-01
Advances in micro-nano-biosensor fabrication are enabling technology that can integrate a large number of biological recognition elements within a single package. As a result, hundreds to millions of tests can be performed simultaneously and can facilitate rapid detection of multiple pathogens in a given sample. However, it is an open question as to how to exploit the high-dimensional nature of the multi-pathogen testing for improving the detection reliability a typical biosensor system. In this paper, we discuss two complementary high-dimensional encoding/decoding methods for improving the reliability of multi-pathogen detection. The first method uses a support vector machine (SVM) to learn the non-linear detection boundaries in the high-dimensional measurement space. The second method uses a forward error correcting (FEC) technique to synthetically introduce redundant patterns on the biosensor which can then be efficiently decoded. In this paper, experimental and simulation studies are based on a model conductimetric lateral flow immunoassay that uses antigen-antibody interaction in conjunction with a polyaniline transducer to detect presence or absence of pathogen in a given sample. Our results show that both SVM and FEC techniques can improve the detection performance by exploiting cross-reaction amongst multiple recognition sites on the biosensor. This is contrary to many existing methods used in pathogen detection technology where the main emphasis has been reducing the effects of cross-reaction and coupling instead of exploiting them as side information.
NASA Astrophysics Data System (ADS)
Regis, Rommel G.; Shoemaker, Christine A.
2013-05-01
This article presents the DYCORS (DYnamic COordinate search using Response Surface models) framework for surrogate-based optimization of HEB (High-dimensional, Expensive, and Black-box) functions that incorporates an idea from the DDS (Dynamically Dimensioned Search) algorithm. The iterate is selected from random trial solutions obtained by perturbing only a subset of the coordinates of the current best solution. Moreover, the probability of perturbing a coordinate decreases as the algorithm reaches the computational budget. Two DYCORS algorithms that use RBF (Radial Basis Function) surrogates are developed: DYCORS-LMSRBF is a modification of the LMSRBF algorithm while DYCORS-DDSRBF is an RBF-assisted DDS. Numerical results on a 14-D watershed calibration problem and on eleven 30-D and 200-D test problems show that DYCORS algorithms are generally better than EGO, DDS, LMSRBF, MADS with kriging, SQP, an RBF-assisted evolution strategy, and a genetic algorithm. Hence, DYCORS is a promising approach for watershed calibration and for HEB optimization.
Tian, Xinyu; Wang, Xuefeng; Chen, Jun
2014-01-01
Classic multinomial logit model, commonly used in multiclass regression problem, is restricted to few predictors and does not take into account the relationship among variables. It has limited use for genomic data, where the number of genomic features far exceeds the sample size. Genomic features such as gene expressions are usually related by an underlying biological network. Efficient use of the network information is important to improve classification performance as well as the biological interpretability. We proposed a multinomial logit model that is capable of addressing both the high dimensionality of predictors and the underlying network information. Group lasso was used to induce model sparsity, and a network-constraint was imposed to induce the smoothness of the coefficients with respect to the underlying network structure. To deal with the non-smoothness of the objective function in optimization, we developed a proximal gradient algorithm for efficient computation. The proposed model was compared to models with no prior structure information in both simulations and a problem of cancer subtype prediction with real TCGA (the cancer genome atlas) gene expression data. The network-constrained mode outperformed the traditional ones in both cases. PMID:25635165
ERIC Educational Resources Information Center
Dishion, Thomas J.; Ha, Thao; Veronneau, Marie-Helene
2012-01-01
The authors propose that peer relationships should be included in a life history perspective on adolescent problem behavior. Longitudinal analyses were used to examine deviant peer clustering as the mediating link between attenuated family ties, peer marginalization, and social disadvantage in early adolescence and sexual promiscuity in middle…
Matlab Cluster Ensemble Toolbox
Sapio, Vincent De; Kegelmeyer, Philip
2009-04-27
This is a Matlab toolbox for investigating the application of cluster ensembles to data classification, with the objective of improving the accuracy and/or speed of clustering. The toolbox divides the cluster ensemble problem into four areas, providing functionality for each. These include, (1) synthetic data generation, (2) clustering to generate individual data partitions and similarity matrices, (3) consensus function generation and final clustering to generate ensemble data partitioning, and (4) implementation of accuracy metrics. With regard to data generation, Gaussian data of arbitrary dimension can be generated. The kcenters algorithm can then be used to generate individual data partitions by either, (a) subsampling the data and clustering each subsample, or by (b) randomly initializing the algorithm and generating a clustering for each initialization. In either case an overall similarity matrix can be computed using a consensus function operating on the individual similarity matrices. A final clustering can be performed and performance metrics are provided for evaluation purposes.
Optimal cellular preservation for high dimensional flow cytometric analysis of multicentre trials.
Ng, Amanda A P; Lee, Bernett T K; Teo, Timothy S Y; Poidinger, Michael; Connolly, John E
2012-11-30
High dimensional flow cytometry is best served by centralized facilities. However, the difficulties around sample processing, storage and shipment make large scale international studies impractical. We therefore sought to identify optimized fixation procedures which fully leverage the analytical capability of high dimensional flow cytometry without the need for complex cell processing or a sustained cold chain. Whole blood staining procedure was employed to investigate the applicability of fixatives including Cyto-Chex® Blood Collection tube (Streck), Transfix® (Cytomark), 1% and 4% paraformaldehyde to centralized analysis of field trial samples. Samples were subjected to environmental conditions which mimic field studies, without refrigerated shipment and analyzed across 10 days, based on cell count and marker expression. This study showed that Cyto-Chex® demonstrated the least variability in absolute cell count relative to samples analyzed directly from donors in the absence of fixation. Transfix® was better at preserving the marker expression among all fixatives. However, Transfix® caused marked increased cell membrane permeabilization and was detrimental to intracellular marker identification. Paraformaldehyde fixation, at either 1% or 4% concentrations, was unfavorable for cell preservation under the conditions tested and thus not recommended. Using these data, we have created an online interactive tool which enables researchers to evaluate the impact of different fixatives on their panel of interest. In this study, we have identified Cyto-Chex® as the optimal cellular preservative for high dimensional flow cytometry in large scale studies for shipped whole blood samples, even in the absence of a sustained cold chain. PMID:22922462
NASA Astrophysics Data System (ADS)
Chen, Peng; Quarteroni, Alfio
2015-10-01
In this work we develop an adaptive and reduced computational algorithm based on dimension-adaptive sparse grid approximation and reduced basis methods for solving high-dimensional uncertainty quantification (UQ) problems. In order to tackle the computational challenge of "curse of dimensionality" commonly faced by these problems, we employ a dimension-adaptive tensor-product algorithm [16] and propose a verified version to enable effective removal of the stagnation phenomenon besides automatically detecting the importance and interaction of different dimensions. To reduce the heavy computational cost of UQ problems modelled by partial differential equations (PDE), we adopt a weighted reduced basis method [7] and develop an adaptive greedy algorithm in combination with the previous verified algorithm for efficient construction of an accurate reduced basis approximation. The efficiency and accuracy of the proposed algorithm are demonstrated by several numerical experiments.
High-dimensional chaos from self-sustained collisions of solitons
Yildirim, O. Ozgur E-mail: oozgury@gmail.com; Ham, Donhee E-mail: oozgury@gmail.com
2014-06-16
We experimentally demonstrate chaos generation based on collisions of electrical solitons on a nonlinear transmission line. The nonlinear line creates solitons, and an amplifier connected to it provides gain to these solitons for their self-excitation and self-sustenance. Critically, the amplifier also provides a mechanism to enable and intensify collisions among solitons. These collisional interactions are of intrinsically nonlinear nature, modulating the phase and amplitude of solitons, thus causing chaos. This chaos generated by the exploitation of the nonlinear wave phenomena is inherently high-dimensional, which we also demonstrate.
Computational flow cytometry: helping to make sense of high-dimensional immunology data.
Saeys, Yvan; Gassen, Sofie Van; Lambrecht, Bart N
2016-07-01
Recent advances in flow cytometry allow scientists to measure an increasing number of parameters per cell, generating huge and high-dimensional datasets. To analyse, visualize and interpret these data, newly available computational techniques should be adopted, evaluated and improved upon by the immunological community. Computational flow cytometry is emerging as an important new field at the intersection of immunology and computational biology; it allows new biological knowledge to be extracted from high-throughput single-cell data. This Review provides non-experts with a broad and practical overview of the many recent developments in computational flow cytometry. PMID:27320317
Scale-Invariant Sparse PCA on High Dimensional Meta-elliptical Data
Han, Fang; Liu, Han
2014-01-01
We propose a semiparametric method for conducting scale-invariant sparse principal component analysis (PCA) on high dimensional non-Gaussian data. Compared with sparse PCA, our method has weaker modeling assumption and is more robust to possible data contamination. Theoretically, the proposed method achieves a parametric rate of convergence in estimating the parameter of interests under a flexible semiparametric distribution family; Computationally, the proposed method exploits a rank-based procedure and is as efficient as sparse PCA; Empirically, our method outperforms most competing methods on both synthetic and real-world datasets. PMID:24932056
CytoSPADE: high-performance analysis and visualization of high-dimensional cytometry data
Linderman, Michael D.; Simonds, Erin F.; Qiu, Peng; Bruggner, Robert V.; Sheode, Ketaki; Meng, Teresa H.; Plevritis, Sylvia K.; Nolan, Garry P.
2012-01-01
Motivation: Recent advances in flow cytometry enable simultaneous single-cell measurement of 30+ surface and intracellular proteins. CytoSPADE is a high-performance implementation of an interface for the Spanning-tree Progression Analysis of Density-normalized Events algorithm for tree-based analysis and visualization of this high-dimensional cytometry data. Availability: Source code and binaries are freely available at http://cytospade.org and via Bioconductor version 2.10 onwards for Linux, OSX and Windows. CytoSPADE is implemented in R, C++ and Java. Contact: michael.linderman@mssm.edu Supplementary Information: Additional documentation available at http://cytospade.org. PMID:22782546
Inferring biological tasks using Pareto analysis of high-dimensional data.
Hart, Yuval; Sheftel, Hila; Hausser, Jean; Szekely, Pablo; Ben-Moshe, Noa Bossel; Korem, Yael; Tendler, Avichai; Mayo, Avraham E; Alon, Uri
2015-03-01
We present the Pareto task inference method (ParTI; http://www.weizmann.ac.il/mcb/UriAlon/download/ParTI) for inferring biological tasks from high-dimensional biological data. Data are described as a polytope, and features maximally enriched closest to the vertices (or archetypes) allow identification of the tasks the vertices represent. We demonstrate that human breast tumors and mouse tissues are well described by tetrahedrons in gene expression space, with specific tumor types and biological functions enriched at each of the vertices, suggesting four key tasks. PMID:25622107
NASA Astrophysics Data System (ADS)
Venghaus, Florian; Eisfeld, Wolfgang
2016-03-01
Robust diabatization techniques are key for the development of high-dimensional coupled potential energy surfaces (PESs) to be used in multi-state quantum dynamics simulations. In the present study we demonstrate that, besides the actual diabatization technique, common problems with the underlying electronic structure calculations can be the reason why a diabatization fails. After giving a short review of the theoretical background of diabatization, we propose a method based on the block-diagonalization to analyse the electronic structure data. This analysis tool can be used in three different ways: First, it allows to detect issues with the ab initio reference data and is used to optimize the setup of the electronic structure calculations. Second, the data from the block-diagonalization are utilized for the development of optimal parametrized diabatic model matrices by identifying the most significant couplings. Third, the block-diagonalization data are used to fit the parameters of the diabatic model, which yields an optimal initial guess for the non-linear fitting required by standard or more advanced energy based diabatization methods. The new approach is demonstrated by the diabatization of 9 electronic states of the propargyl radical, yielding fully coupled full-dimensional (12D) PESs in closed form.
NASA Astrophysics Data System (ADS)
Laloy, Eric; Linde, Niklas; Jacques, Diederik; Mariethoz, Grégoire
2016-04-01
The sequential geostatistical resampling (SGR) algorithm is a Markov chain Monte Carlo (MCMC) scheme for sampling from possibly non-Gaussian, complex spatially-distributed prior models such as geologic facies or categorical fields. In this work, we highlight the limits of standard SGR for posterior inference of high-dimensional categorical fields with realistically complex likelihood landscapes and benchmark a parallel tempering implementation (PT-SGR). Our proposed PT-SGR approach is demonstrated using synthetic (error corrupted) data from steady-state flow and transport experiments in categorical 7575- and 10,000-dimensional 2D conductivity fields. In both case studies, every SGR trial gets trapped in a local optima while PT-SGR maintains an higher diversity in the sampled model states. The advantage of PT-SGR is most apparent in an inverse transport problem where the posterior distribution is made bimodal by construction. PT-SGR then converges towards the appropriate data misfit much faster than SGR and partly recovers the two modes. In contrast, for the same computational resources SGR does not fit the data to the appropriate error level and hardly produces a locally optimal solution that looks visually similar to one of the two reference modes. Although PT-SGR clearly surpasses SGR in performance, our results also indicate that using a small number (16-24) of temperatures (and thus parallel cores) may not permit complete sampling of the posterior distribution by PT-SGR within a reasonable computational time (less than 1-2 weeks).
2013-01-01
Background There is large body of knowledge to support the importance of early interventions to improve child health and development. Nonetheless, it is important to identify cost-effective blends of preventive interventions with adequate coverage and feasible delivery modes. The aim of the Children and Parents in Focus trial is to compare two levels of parenting programme intensity and rate of exposure, with a control condition to address impact and cost-effectiveness of a universally offered evidence-based parenting programme in the Swedish context. Methods/Design The trial has a cluster randomised controlled design comprising three arms: Universal arm (with access to participation in Triple P - Positive Parenting Program, level 2); Universal Plus arm (with access to participation in Triple P - Positive Parenting Program, level 2 as well as level 3, and level 4 group); and Services as Usual arm. The sampling frame is Uppsala municipality in Sweden. Child health centres consecutively recruit parents of children aged 3 to 5 years before their yearly check-ups (during the years 2013–2017). Outcomes will be measured annually. The primary outcome will be children’s behavioural and emotional problems as rated by three informants: fathers, mothers and preschool teachers. The other outcomes will be parents’ behaviour and parents’ general health. Health economic evaluations will analyse cost-effectiveness of the interventions versus care as usual by comparing the costs and consequences in terms of impact on children’s mental health, parent’s mental health and health-related quality of life. Discussion This study addresses the need for comprehensive evaluation of the long-term effects, costs and benefits of early parenting interventions embedded within existing systems. In addition, the study will generate population-based data on the mental health and well-being of preschool aged children in Sweden. Trial registration ISRCTN: ISRCTN16513449. PMID:24131587
NASA Astrophysics Data System (ADS)
Zhang, Lili
This work aims to improve the capability of accurate information extraction from high-dimensional data, with a specific neural learning paradigm, the Self-Organizing Map (SOM). The SOM is an unsupervised learning algorithm that can faithfully sense the manifold structure and support supervised learning of relevant information from the data. Yet open problems regarding SOM learning exist. We focus on the following two issues. (1) Evaluation of topology preservation. Topology preservation is essential for SOMs in faithful representation of manifold structure. However, in reality, topology violations are not unusual, especially when the data have complicated structure. Measures capable of accurately quantifying and informatively expressing topology violations are lacking. One contribution of this work is a new measure, the Weighted Differential Topographic Function (WDTF), which differentiates an existing measure, the Topographic Function (TF), and incorporates detailed data distribution as an importance weighting of violations to distinguish severe violations from insignificant ones. Another contribution is an interactive visual tool, TopoView, which facilitates the visual inspection of violations on the SOM lattice. We show the effectiveness of the combined use of the WDTF and TopoView through a simple two-dimensional data set and two hyperspectral images. (2) Learning multiple latent variables from high-dimensional data. We use an existing two-layer SOM-hybrid supervised architecture, which captures the manifold structure in its SOM hidden layer, and then, uses its output layer to perform the supervised learning of latent variables. In the customary way, the output layer only uses the strongest output of the SOM neurons. This severely limits the learning capability. We allow multiple, k, strongest responses of the SOM neurons for the supervised learning. Moreover, the fact that different latent variables can be best learned with different values of k motivates a
Steinwand, Daniel R.; Maddox, Brian; Beckmann, Tim; Hamer, George
2003-01-01
Beowulf clusters can provide a cost-effective way to compute numerical models and process large amounts of remote sensing image data. Usually a Beowulf cluster is designed to accomplish a specific set of processing goals, and processing is very efficient when the problem remains inside the constraints of the original design. There are cases, however, when one might wish to compute a problem that is beyond the capacity of the local Beowulf system. In these cases, spreading the problem to multiple clusters or to other machines on the network may provide a cost-effective solution.
An Efficient Initialization Method for K-Means Clustering of Hyperspectral Data
NASA Astrophysics Data System (ADS)
Alizade Naeini, A.; Jamshidzadeh, A.; Saadatseresht, M.; Homayouni, S.
2014-10-01
K-means is definitely the most frequently used partitional clustering algorithm in the remote sensing community. Unfortunately due to its gradient decent nature, this algorithm is highly sensitive to the initial placement of cluster centers. This problem deteriorates for the high-dimensional data such as hyperspectral remotely sensed imagery. To tackle this problem, in this paper, the spectral signatures of the endmembers in the image scene are extracted and used as the initial positions of the cluster centers. For this purpose, in the first step, A Neyman-Pearson detection theory based eigen-thresholding method (i.e., the HFC method) has been employed to estimate the number of endmembers in the image. Afterwards, the spectral signatures of the endmembers are obtained using the Minimum Volume Enclosing Simplex (MVES) algorithm. Eventually, these spectral signatures are used to initialize the k-means clustering algorithm. The proposed method is implemented on a hyperspectral dataset acquired by ROSIS sensor with 103 spectral bands over the Pavia University campus, Italy. For comparative evaluation, two other commonly used initialization methods (i.e., Bradley & Fayyad (BF) and Random methods) are implemented and compared. The confusion matrix, overall accuracy and Kappa coefficient are employed to assess the methods' performance. The evaluations demonstrate that the proposed solution outperforms the other initialization methods and can be applied for unsupervised classification of hyperspectral imagery for landcover mapping.
NASA Astrophysics Data System (ADS)
Sargsyan, K.; Safta, C.; Ray, J.; Debusschere, B.; Najm, H.; Ricciuto, D. M.; Thornton, P. E.
2012-12-01
Uncertainty quantification capabilities have been boosted considerably by recent advances in associated algorithms and software, as well as increased computational capabilities. As a result, it has become possible to address uncertainties in complex climate models more quantitatively. However, there still remain numerous challenges when dealing with complex climate models. In this work, we highlight and address some of these challenges, using the Community Land Model (CLM) as the main benchmark system for algorithm development. To begin with, climate models are computationally intensive. This necessarily disqualifies pure Monte-Carlo algorithms for uncertainty estimation, since naive Monte-Carlo approaches require too many sampled simulations for reasonable accuracy. In this work, we build computationally inexpensive surrogate model in order to accelerate both forward and inverse UQ methods. We apply Polynomial Chaos (PC) spectral expansions to build surrogate relationships between output quantities and model parameters using as few forward model simulations as possible. Next, climate models typically suffer from the curse of dimensionality. For example, the CLM depends on about 80 input parameters with somewhat uncertain values. Representation of the input-output dependence requires prohibitively many basis functions for spectral expansions. Moreover, to obtain such a representation, one needs to sample an 80-dimensional space, which can at best be sparsely covered. We apply Bayesian compressive sensing (BCS) techniques in order to infer the best basis set for the PC surrogate model. BCS performs particularly well in high-dimensional settings when model simulations are very sparse. Furthermore, many climate models incorporate dependent uncertain parameters. In this context, we apply the Rosenblatt transformation, mapping dependent parameters into a computationally convenient set of independent variables. This allows efficient parameter sampling even in presence of
The end of gating? An introduction to automated analysis of high dimensional cytometry data.
Mair, Florian; Hartmann, Felix J; Mrdjen, Dunja; Tosevski, Vinko; Krieg, Carsten; Becher, Burkhard
2016-01-01
Ever since its invention half a century ago, flow cytometry has been a major tool for single-cell analysis, fueling advances in our understanding of a variety of complex cellular systems, in particular the immune system. The last decade has witnessed significant technical improvements in available cytometry platforms, such that more than 20 parameters can be analyzed on a single-cell level by fluorescence-based flow cytometry. The advent of mass cytometry has pushed this limit up to, currently, 50 parameters. However, traditional analysis approaches for the resulting high-dimensional datasets, such as gating on bivariate dot plots, have proven to be inefficient. Although a variety of novel computational analysis approaches to interpret these datasets are already available, they have not yet made it into the mainstream and remain largely unknown to many immunologists. Therefore, this review aims at providing a practical overview of novel analysis techniques for high-dimensional cytometry data including SPADE, t-SNE, Wanderlust, Citrus, and PhenoGraph, and how these applications can be used advantageously not only for the most complex datasets, but also for standard 14-parameter cytometry datasets. PMID:26548301
Methods for discovery and characterization of cell subsets in high dimensional mass cytometry data.
Diggins, Kirsten E; Ferrell, P Brent; Irish, Jonathan M
2015-07-01
The flood of high-dimensional data resulting from mass cytometry experiments that measure more than 40 features of individual cells has stimulated creation of new single cell computational biology tools. These tools draw on advances in the field of machine learning to capture multi-parametric relationships and reveal cells that are easily overlooked in traditional analysis. Here, we introduce a workflow for high dimensional mass cytometry data that emphasizes unsupervised approaches and visualizes data in both single cell and population level views. This workflow includes three central components that are common across mass cytometry analysis approaches: (1) distinguishing initial populations, (2) revealing cell subsets, and (3) characterizing subset features. In the implementation described here, viSNE, SPADE, and heatmaps were used sequentially to comprehensively characterize and compare healthy and malignant human tissue samples. The use of multiple methods helps provide a comprehensive view of results, and the largely unsupervised workflow facilitates automation and helps researchers avoid missing cell populations with unusual or unexpected phenotypes. Together, these methods develop a framework for future machine learning of cell identity. PMID:25979346
Quantum secret sharing based on modulated high-dimensional time-bin entanglement
Takesue, Hiroki; Inoue, Kyo
2006-07-15
We propose a scheme for quantum secret sharing (QSS) that uses a modulated high-dimensional time-bin entanglement. By modulating the relative phase randomly by {l_brace}0,{pi}{r_brace}, a sender with the entanglement source can randomly change the sign of the correlation of the measurement outcomes obtained by two distant recipients. The two recipients must cooperate if they are to obtain the sign of the correlation, which is used as a secret key. We show that our scheme is secure against intercept-and-resend (IR) and beam splitting attacks by an outside eavesdropper thanks to the nonorthogonality of high-dimensional time-bin entangled states. We also show that a cheating attempt based on an IR attack by one of the recipients can be detected by changing the dimension of the time-bin entanglement randomly and inserting two 'vacant' slots between the packets. Then, cheating attempts can be detected by monitoring the count rate in the vacant slots. The proposed scheme has better experimental feasibility than previously proposed entanglement-based QSS schemes.
Methods for discovery and characterization of cell subsets in high dimensional mass cytometry data
Diggins, Kirsten E.; Ferrell, P. Brent; Irish, Jonathan M.
2015-01-01
The flood of high-dimensional data resulting from mass cytometry experiments that measure more than 40 features of individual cells has stimulated creation of new single cell computational biology tools. These tools draw on advances in the field of machine learning to capture multi-parametric relationships and reveal cells that are easily overlooked in traditional analysis. Here, we introduce a workflow for high dimensional mass cytometry data that emphasizes unsupervised approaches and visualizes data in both single cell and population level views. This workflow includes three central components that are common across mass cytometry analysis approaches: 1) distinguishing initial populations, 2) revealing cell subsets, and 3) characterizing subset features. In the implementation described here, viSNE, SPADE, and heatmaps were used sequentially to comprehensively characterize and compare healthy and malignant human tissue samples. The use of multiple methods helps provide a comprehensive view of results, and the largely unsupervised workflow facilitates automation and helps researchers avoid missing cell populations with unusual or unexpected phenotypes. Together, these methods develop a framework for future machine learning of cell identity. PMID:25979346
Incremental isometric embedding of high-dimensional data using connected neighborhood graphs.
Zhao, Dongfang; Yang, Li
2009-01-01
Most nonlinear data embedding methods use bottom-up approaches for capturing the underlying structure of data distributed on a manifold in high dimensional space. These methods often share the first step which defines neighbor points of every data point by building a connected neighborhood graph so that all data points can be embedded to a single coordinate system. These methods are required to work incrementally for dimensionality reduction in many applications. Because input data stream may be under-sampled or skewed from time to time, building connected neighborhood graph is crucial to the success of incremental data embedding using these methods. This paper presents algorithms for updating $k$-edge-connected and $k$-connected neighborhood graphs after a new data point is added or an old data point is deleted. It further utilizes a simple algorithm for updating all-pair shortest distances on the neighborhood graph. Together with incremental classical multidimensional scaling using iterative subspace approximation, this paper devises an incremental version of Isomap with enhancements to deal with under-sampled or unevenly distributed data. Experiments on both synthetic and real-world data sets show that the algorithm is efficient and maintains low dimensional configurations of high dimensional data under various data distributions. PMID:19029548
ZeitZeiger: supervised learning for high-dimensional data from an oscillatory system
Hughey, Jacob J.; Hastie, Trevor; Butte, Atul J.
2016-01-01
Numerous biological systems oscillate over time or space. Despite these oscillators’ importance, data from an oscillatory system is problematic for existing methods of regularized supervised learning. We present ZeitZeiger, a method to predict a periodic variable (e.g. time of day) from a high-dimensional observation. ZeitZeiger learns a sparse representation of the variation associated with the periodic variable in the training observations, then uses maximum-likelihood to make a prediction for a test observation. We applied ZeitZeiger to a comprehensive dataset of genome-wide gene expression from the mammalian circadian oscillator. Using the expression of 13 genes, ZeitZeiger predicted circadian time (internal time of day) in each of 12 mouse organs to within ∼1 h, resulting in a multi-organ predictor of circadian time. Compared to the state-of-the-art approach, ZeitZeiger was faster, more accurate and used fewer genes. We then validated the multi-organ predictor on 20 additional datasets comprising nearly 800 samples. Our results suggest that ZeitZeiger not only makes accurate predictions, but also gives insight into the behavior and structure of the oscillator from which the data originated. As our ability to collect high-dimensional data from various biological oscillators increases, ZeitZeiger should enhance efforts to convert these data to knowledge. PMID:26819407
Unbiased Prediction and Feature Selection in High-Dimensional Survival Regression
Laimighofer, Michael; Krumsiek, Jan; Theis, Fabian J.
2016-01-01
Abstract With widespread availability of omics profiling techniques, the analysis and interpretation of high-dimensional omics data, for example, for biomarkers, is becoming an increasingly important part of clinical medicine because such datasets constitute a promising resource for predicting survival outcomes. However, early experience has shown that biomarkers often generalize poorly. Thus, it is crucial that models are not overfitted and give accurate results with new data. In addition, reliable detection of multivariate biomarkers with high predictive power (feature selection) is of particular interest in clinical settings. We present an approach that addresses both aspects in high-dimensional survival models. Within a nested cross-validation (CV), we fit a survival model, evaluate a dataset in an unbiased fashion, and select features with the best predictive power by applying a weighted combination of CV runs. We evaluate our approach using simulated toy data, as well as three breast cancer datasets, to predict the survival of breast cancer patients after treatment. In all datasets, we achieve more reliable estimation of predictive power for unseen cases and better predictive performance compared to the standard CoxLasso model. Taken together, we present a comprehensive and flexible framework for survival models, including performance estimation, final feature selection, and final model construction. The proposed algorithm is implemented in an open source R package (SurvRank) available on CRAN. PMID:26894327
Zhang, Yu; Wu, Jianxin; Cai, Jianfei
2016-05-01
In large-scale visual recognition and image retrieval tasks, feature vectors, such as Fisher vector (FV) or the vector of locally aggregated descriptors (VLAD), have achieved state-of-the-art results. However, the combination of the large numbers of examples and high-dimensional vectors necessitates dimensionality reduction, in order to reduce its storage and CPU costs to a reasonable range. In spite of the popularity of various feature compression methods, this paper shows that the feature (dimension) selection is a better choice for high-dimensional FV/VLAD than the feature (dimension) compression methods, e.g., product quantization. We show that strong correlation among the feature dimensions in the FV and the VLAD may not exist, which renders feature selection a natural choice. We also show that, many dimensions in FV/VLAD are noise. Throwing them away using feature selection is better than compressing them and useful dimensions altogether using feature compression methods. To choose features, we propose an efficient importance sorting algorithm considering both the supervised and unsupervised cases, for visual recognition and image retrieval, respectively. Combining with the 1-bit quantization, feature selection has achieved both higher accuracy and less computational cost than feature compression methods, such as product quantization, on the FV and the VLAD image representations. PMID:27046897
A Robust Supervised Variable Selection for Noisy High-Dimensional Data
Kalina, Jan; Schlenker, Anna
2015-01-01
The Minimum Redundancy Maximum Relevance (MRMR) approach to supervised variable selection represents a successful methodology for dimensionality reduction, which is suitable for high-dimensional data observed in two or more different groups. Various available versions of the MRMR approach have been designed to search for variables with the largest relevance for a classification task while controlling for redundancy of the selected set of variables. However, usual relevance and redundancy criteria have the disadvantages of being too sensitive to the presence of outlying measurements and/or being inefficient. We propose a novel approach called Minimum Regularized Redundancy Maximum Robust Relevance (MRRMRR), suitable for noisy high-dimensional data observed in two groups. It combines principles of regularization and robust statistics. Particularly, redundancy is measured by a new regularized version of the coefficient of multiple correlation and relevance is measured by a highly robust correlation coefficient based on the least weighted squares regression with data-adaptive weights. We compare various dimensionality reduction methods on three real data sets. To investigate the influence of noise or outliers on the data, we perform the computations also for data artificially contaminated by severe noise of various forms. The experimental results confirm the robustness of the method with respect to outliers. PMID:26137474
Unbiased Prediction and Feature Selection in High-Dimensional Survival Regression.
Laimighofer, Michael; Krumsiek, Jan; Buettner, Florian; Theis, Fabian J
2016-04-01
With widespread availability of omics profiling techniques, the analysis and interpretation of high-dimensional omics data, for example, for biomarkers, is becoming an increasingly important part of clinical medicine because such datasets constitute a promising resource for predicting survival outcomes. However, early experience has shown that biomarkers often generalize poorly. Thus, it is crucial that models are not overfitted and give accurate results with new data. In addition, reliable detection of multivariate biomarkers with high predictive power (feature selection) is of particular interest in clinical settings. We present an approach that addresses both aspects in high-dimensional survival models. Within a nested cross-validation (CV), we fit a survival model, evaluate a dataset in an unbiased fashion, and select features with the best predictive power by applying a weighted combination of CV runs. We evaluate our approach using simulated toy data, as well as three breast cancer datasets, to predict the survival of breast cancer patients after treatment. In all datasets, we achieve more reliable estimation of predictive power for unseen cases and better predictive performance compared to the standard CoxLasso model. Taken together, we present a comprehensive and flexible framework for survival models, including performance estimation, final feature selection, and final model construction. The proposed algorithm is implemented in an open source R package (SurvRank) available on CRAN. PMID:26894327
Validi, AbdoulAhad
2014-03-01
This study introduces a non-intrusive approach in the context of low-rank separated representation to construct a surrogate of high-dimensional stochastic functions, e.g., PDEs/ODEs, in order to decrease the computational cost of Markov Chain Monte Carlo simulations in Bayesian inference. The surrogate model is constructed via a regularized alternative least-square regression with Tikhonov regularization using a roughening matrix computing the gradient of the solution, in conjunction with a perturbation-based error indicator to detect optimal model complexities. The model approximates a vector of a continuous solution at discrete values of a physical variable. The required number of random realizations to achieve a successful approximation linearly depends on the function dimensionality. The computational cost of the model construction is quadratic in the number of random inputs, which potentially tackles the curse of dimensionality in high-dimensional stochastic functions. Furthermore, this vector-valued separated representation-based model, in comparison to the available scalar-valued case, leads to a significant reduction in the cost of approximation by an order of magnitude equal to the vector size. The performance of the method is studied through its application to three numerical examples including a 41-dimensional elliptic PDE and a 21-dimensional cavity flow.
High-dimensional quantum state transfer in a noisy network environment
NASA Astrophysics Data System (ADS)
Qin, Wei; Li, Jun-Lin; Long, Gui-Lu
2015-04-01
We propose and analyze an efficient high-dimensional quantum state transfer protocol in an XX coupling spin network with a hypercube structure or chain structure. Under free spin wave approximation, unitary evolution results in a perfect high-dimensional quantum swap operation requiring neither external manipulation nor weak coupling. Evolution time is independent of either distance between registers or dimensions of sent states, which can improve the computational efficiency. In the low temperature regime and thermodynamic limit, the decoherence caused by a noisy environment is studied with a model of an antiferromagnetic spin bath coupled to quantum channels via an Ising-type interaction. It is found that while the decoherence reduces the fidelity of state transfer, increasing intra-channel coupling can strongly suppress such an effect. These observations demonstrate the robustness of the proposed scheme. Project supported by the National Natural Science Foundation of China (Grant Nos. 11175094 and 91221205) and the National Basic Research Program of China (Grant No. 2011CB9216002). Long Gui-Lu also thanks the support of Center of Atomic and Molecular Nanoscience of Tsinghua University, China.
ZeitZeiger: supervised learning for high-dimensional data from an oscillatory system.
Hughey, Jacob J; Hastie, Trevor; Butte, Atul J
2016-05-01
Numerous biological systems oscillate over time or space. Despite these oscillators' importance, data from an oscillatory system is problematic for existing methods of regularized supervised learning. We present ZeitZeiger, a method to predict a periodic variable (e.g. time of day) from a high-dimensional observation. ZeitZeiger learns a sparse representation of the variation associated with the periodic variable in the training observations, then uses maximum-likelihood to make a prediction for a test observation. We applied ZeitZeiger to a comprehensive dataset of genome-wide gene expression from the mammalian circadian oscillator. Using the expression of 13 genes, ZeitZeiger predicted circadian time (internal time of day) in each of 12 mouse organs to within ∼1 h, resulting in a multi-organ predictor of circadian time. Compared to the state-of-the-art approach, ZeitZeiger was faster, more accurate and used fewer genes. We then validated the multi-organ predictor on 20 additional datasets comprising nearly 800 samples. Our results suggest that ZeitZeiger not only makes accurate predictions, but also gives insight into the behavior and structure of the oscillator from which the data originated. As our ability to collect high-dimensional data from various biological oscillators increases, ZeitZeiger should enhance efforts to convert these data to knowledge. PMID:26819407
Relation chain based clustering analysis
NASA Astrophysics Data System (ADS)
Zhang, Cheng-ning; Zhao, Ming-yang; Luo, Hai-bo
2011-08-01
Clustering analysis is currently one of well-developed branches in data mining technology which is supposed to find the hidden structures in the multidimensional space called feature or pattern space. A datum in the space usually possesses a vector form and the elements in the vector represent several specifically selected features. These features are often of efficiency to the problem oriented. Generally, clustering analysis goes into two divisions: one is based on the agglomerative clustering method, and the other one is based on divisive clustering method. The former refers to a bottom-up process which regards each datum as a singleton cluster while the latter refers to a top-down process which regards entire data as a cluster. As the collected literatures, it is noted that the divisive clustering is currently overwhelming both in application and research. Although some famous divisive clustering methods are designed and well developed, clustering problems are still far from being solved. The k - means algorithm is the original divisive clustering method which initially assigns some important index values, such as the clustering number and the initial clustering prototype positions, and that could not be reasonable in some certain occasions. More than the initial problem, the k - means algorithm may also falls into local optimum, clusters in a rigid way and is not available for non-Gaussian distribution. One can see that seeking for a good or natural clustering result, in fact, originates from the one's understanding of the concept of clustering. Thus, the confusion or misunderstanding of the definition of clustering always derives some unsatisfied clustering results. One should consider the definition deeply and seriously. This paper demonstrates the nature of clustering, gives the way of understanding clustering, discusses the methodology of designing a clustering algorithm, and proposes a new clustering method based on relation chains among 2D patterns. In
High-Dimensional Circular Quantum Secret Sharing Using Orbital Angular Momentum
NASA Astrophysics Data System (ADS)
Tang, Dawei; Wang, Tie-jun; Mi, Sichen; Geng, Xiao-Meng; Wang, Chuan
2016-07-01
Quantum secret sharing is to distribute secret message securely between multi-parties. Here exploiting orbital angular momentum (OAM) state of single photons as the information carrier, we propose a high-dimensional circular quantum secret sharing protocol which increases the channel capacity largely. In the proposed protocol, the secret message is split into two parts, and each encoded on the OAM state of single photons. The security of the protocol is guaranteed by the laws of non-cloning theorem. And the secret messages could not be recovered except that the two receivers collaborated with each other. Moreover, the proposed protocol could be extended into high-level quantum systems, and the enhanced security could be achieved.
A two-state hysteresis model from high-dimensional friction.
Biswas, Saurabh; Chatterjee, Anindya
2015-07-01
In prior work (Biswas & Chatterjee 2014 Proc. R. Soc. A 470, 20130817 (doi:10.1098/rspa.2013.0817)), we developed a six-state hysteresis model from a high-dimensional frictional system. Here, we use a more intuitively appealing frictional system that resembles one studied earlier by Iwan. The basis functions now have simple analytical description. The number of states required decreases further, from six to the theoretical minimum of two. The number of fitted parameters is reduced by an order of magnitude, to just six. An explicit and faster numerical solution method is developed. Parameter fitting to match different specified hysteresis loops is demonstrated. In summary, a new two-state model of hysteresis is presented that is ready for practical implementation. Essential Matlab code is provided. PMID:26587279
Detection meeting control: Unstable steady states in high-dimensional nonlinear dynamical systems.
Ma, Huanfei; Ho, Daniel W C; Lai, Ying-Cheng; Lin, Wei
2015-10-01
We articulate an adaptive and reference-free framework based on the principle of random switching to detect and control unstable steady states in high-dimensional nonlinear dynamical systems, without requiring any a priori information about the system or about the target steady state. Starting from an arbitrary initial condition, a proper control signal finds the nearest unstable steady state adaptively and drives the system to it in finite time, regardless of the type of the steady state. We develop a mathematical analysis based on fast-slow manifold separation and Markov chain theory to validate the framework. Numerical demonstration of the control and detection principle using both classic chaotic systems and models of biological and physical significance is provided. PMID:26565299
Application of Edwards' statistical mechanics to high-dimensional jammed sphere packings.
Jin, Yuliang; Charbonneau, Patrick; Meyer, Sam; Song, Chaoming; Zamponi, Francesco
2010-11-01
The isostatic jamming limit of frictionless spherical particles from Edwards' statistical mechanics [Song et al., Nature (London) 453, 629 (2008)] is generalized to arbitrary dimension d using a liquid-state description. The asymptotic high-dimensional behavior of the self-consistent relation is obtained by saddle-point evaluation and checked numerically. The resulting random close packing density scaling ϕ∼d2(-d) is consistent with that of other approaches, such as replica theory and density-functional theory. The validity of various structural approximations is assessed by comparing with three- to six-dimensional isostatic packings obtained from simulations. These numerical results support a growing accuracy of the theoretical approach with dimension. The approach could thus serve as a starting point to obtain a geometrical understanding of the higher-order correlations present in jammed packings. PMID:21230456
Happ, Martin; Harrar, Solomon W; Bathke, Arne C
2016-07-01
We propose tests for main and simple treatment effects, time effects, as well as treatment by time interactions in possibly high-dimensional multigroup repeated measures designs. The proposed inference procedures extend the work by Brunner et al. (2012) from two to several treatment groups and remain valid for unbalanced data and under unequal covariance matrices. In addition to showing consistency when sample size and dimension tend to infinity at the same rate, we provide finite sample approximations and evaluate their performance in a simulation study, demonstrating better maintenance of the nominal α-level than the popular Box-Greenhouse-Geisser and Huynh-Feldt methods, and a gain in power for informatively increasing dimension. Application is illustrated using electroencephalography (EEG) data from a neurological study involving patients with Alzheimer's disease and other cognitive impairments. PMID:26700536
A two-state hysteresis model from high-dimensional friction
Biswas, Saurabh; Chatterjee, Anindya
2015-01-01
In prior work (Biswas & Chatterjee 2014 Proc. R. Soc. A 470, 20130817 (doi:10.1098/rspa.2013.0817)), we developed a six-state hysteresis model from a high-dimensional frictional system. Here, we use a more intuitively appealing frictional system that resembles one studied earlier by Iwan. The basis functions now have simple analytical description. The number of states required decreases further, from six to the theoretical minimum of two. The number of fitted parameters is reduced by an order of magnitude, to just six. An explicit and faster numerical solution method is developed. Parameter fitting to match different specified hysteresis loops is demonstrated. In summary, a new two-state model of hysteresis is presented that is ready for practical implementation. Essential Matlab code is provided. PMID:26587279
NASA Astrophysics Data System (ADS)
Tutukov, A. V.; Dremov, V. V.; Dremova, G. N.
2009-10-01
Numerical N-body studies of the dynamical evolution of a cluster of 1000 galaxies were carried out in order to investigate the role of dark matter in the formation of cD galaxies. Two models explicitly describing the darkmatter as a full-fledged component of the cluster having its own physical characteristics are constructed. These treat the dark matter as a continuous underlying substrate and as “grainy” matter. The ratio of the masses of the dark and luminous matter of the cluster is varied in the range 3-100. The observed logarithmic spectrum dN ˜ dM / M is used as an initial mass spectrum for the galaxies. A comparative numerical analysis of the evolution of the mass spectrum, the dynamics of mergers of the cluster galaxies, and the evolution of the growth of the central, supermassive cD galaxy suggests that dynamical friction associated with dark matter accelerates the formation of the cD galaxy via the absorption of galaxies colliding with it. Taking into account a dark-matter “substrate” removes the formation of multiple mass-accumulation centers, and makes it easier to form a cD galaxy that accumulates 1-2% of the cluster mass within the Hubble time scale (3-8 billion years), consistent with observations.
NASA Astrophysics Data System (ADS)
Gastegger, Michael; Kauffmann, Clemens; Behler, Jörg; Marquetand, Philipp
2016-05-01
Many approaches, which have been developed to express the potential energy of large systems, exploit the locality of the atomic interactions. A prominent example is the fragmentation methods in which the quantum chemical calculations are carried out for overlapping small fragments of a given molecule that are then combined in a second step to yield the system's total energy. Here we compare the accuracy of the systematic molecular fragmentation approach with the performance of high-dimensional neural network (HDNN) potentials introduced by Behler and Parrinello. HDNN potentials are similar in spirit to the fragmentation approach in that the total energy is constructed as a sum of environment-dependent atomic energies, which are derived indirectly from electronic structure calculations. As a benchmark set, we use all-trans alkanes containing up to eleven carbon atoms at the coupled cluster level of theory. These molecules have been chosen because they allow to extrapolate reliable reference energies for very long chains, enabling an assessment of the energies obtained by both methods for alkanes including up to 10 000 carbon atoms. We find that both methods predict high-quality energies with the HDNN potentials yielding smaller errors with respect to the coupled cluster reference.
Gastegger, Michael; Kauffmann, Clemens; Behler, Jörg; Marquetand, Philipp
2016-05-21
Many approaches, which have been developed to express the potential energy of large systems, exploit the locality of the atomic interactions. A prominent example is the fragmentation methods in which the quantum chemical calculations are carried out for overlapping small fragments of a given molecule that are then combined in a second step to yield the system's total energy. Here we compare the accuracy of the systematic molecular fragmentation approach with the performance of high-dimensional neural network (HDNN) potentials introduced by Behler and Parrinello. HDNN potentials are similar in spirit to the fragmentation approach in that the total energy is constructed as a sum of environment-dependent atomic energies, which are derived indirectly from electronic structure calculations. As a benchmark set, we use all-trans alkanes containing up to eleven carbon atoms at the coupled cluster level of theory. These molecules have been chosen because they allow to extrapolate reliable reference energies for very long chains, enabling an assessment of the energies obtained by both methods for alkanes including up to 10 000 carbon atoms. We find that both methods predict high-quality energies with the HDNN potentials yielding smaller errors with respect to the coupled cluster reference. PMID:27208939
NASA Astrophysics Data System (ADS)
Haussaire, Jean-Matthieu; Bocquet, Marc
2016-04-01
Atmospheric chemistry models are becoming increasingly complex, with multiphasic chemistry, size-resolved particulate matter, and possibly coupled to numerical weather prediction models. In the meantime, data assimilation methods have also become more sophisticated. Hence, it will become increasingly difficult to disentangle the merits of data assimilation schemes, of models, and of their numerical implementation in a successful high-dimensional data assimilation study. That is why we believe that the increasing variety of problems encountered in the field of atmospheric chemistry data assimilation puts forward the need for simple low-order models, albeit complex enough to capture the relevant dynamics, physics and chemistry that could impact the performance of data assimilation schemes. Following this analysis, we developped a low-order coupled chemistry meteorology model named L95-GRS [1]. The advective wind is simulated by the Lorenz-95 model, while the chemistry is made of 6 reactive species and simulates ozone concentrations. With this model, we carried out data assimilation experiments to estimate the state of the system as well as the forcing parameter of the wind and the emissions of chemical compounds. This model proved to be a powerful playground giving insights on the hardships of online and offline estimation of atmospheric pollution. Building on the results on this low-order model, we test advanced data assimilation methods on a state-of-the-art chemical transport model to check if the conclusions obtained with our low-order model still stand. References [1] Haussaire, J.-M. and Bocquet, M.: A low-order coupled chemistry meteorology model for testing online and offline data assimilation schemes, Geosci. Model Dev. Discuss., 8, 7347-7394, doi:10.5194/gmdd-8-7347-2015, 2015.
Slonim, Noam; Atwal, Gurinder Singh; Tkačik, Gašper; Bialek, William
2005-01-01
In an age of increasingly large data sets, investigators in many different disciplines have turned to clustering as a tool for data analysis and exploration. Existing clustering methods, however, typically depend on several nontrivial assumptions about the structure of data. Here, we reformulate the clustering problem from an information theoretic perspective that avoids many of these assumptions. In particular, our formulation obviates the need for defining a cluster “prototype,” does not require an a priori similarity metric, is invariant to changes in the representation of the data, and naturally captures nonlinear relations. We apply this approach to different domains and find that it consistently produces clusters that are more coherent than those extracted by existing algorithms. Finally, our approach provides a way of clustering based on collective notions of similarity rather than the traditional pairwise measures. PMID:16352721
Muetterties, Earl L.
1980-05-01
Metal cluster chemistry is one of the most rapidly developing areas of inorganic and organometallic chemistry. Prior to 1960 only a few metal clusters were well characterized. However, shortly after the early development of boron cluster chemistry, the field of metal cluster chemistry began to grow at a very rapid rate and a structural and a qualitative theoretical understanding of clusters came quickly. Analyzed here is the chemistry and the general significance of clusters with particular emphasis on the cluster research within my group. The importance of coordinately unsaturated, very reactive metal clusters is the major subject of discussion.
ERIC Educational Resources Information Center
Snellings, Patrick; van der Leij, Aryan; Blok, Henk; de Jong, Peter F.
2010-01-01
This study investigated the role of speech perception accuracy and speed in fluent word decoding of reading disabled (RD) children. A same-different phoneme discrimination task with natural speech tested the perception of single consonants and consonant clusters by young but persistent RD children. RD children were slower than chronological age…
Nonparametric Independence Screening in Sparse Ultra-High Dimensional Varying Coefficient Models
Fan, Jianqing; Ma, Yunbei; Dai, Wei
2014-01-01
The varying-coefficient model is an important class of nonparametric statistical model that allows us to examine how the effects of covariates vary with exposure variables. When the number of covariates is large, the issue of variable selection arises. In this paper, we propose and investigate marginal nonparametric screening methods to screen variables in sparse ultra-high dimensional varying-coefficient models. The proposed nonparametric independence screening (NIS) selects variables by ranking a measure of the nonparametric marginal contributions of each covariate given the exposure variable. The sure independent screening property is established under some mild technical conditions when the dimensionality is of nonpolynomial order, and the dimensionality reduction of NIS is quantified. To enhance the practical utility and finite sample performance, two data-driven iterative NIS methods are proposed for selecting thresholding parameters and variables: conditional permutation and greedy methods, resulting in Conditional-INIS and Greedy-INIS. The effectiveness and flexibility of the proposed methods are further illustrated by simulation studies and real data applications. PMID:25309009
NASA Astrophysics Data System (ADS)
Sun, Yifei; Kumar, Mrinal
2015-05-01
In this paper, a tensor decomposition approach combined with Chebyshev spectral differentiation is presented to solve the high dimensional transient Fokker-Planck equations (FPE) arising in the simulation of polymeric fluids via multi-bead-spring (MBS) model. Generalizing the authors' previous work on the stationary FPE, the transient solution is obtained in a single CANDECOMP/PARAFAC decomposition (CPD) form for all times via the alternating least squares algorithm. This is accomplished by treating the temporal dimension in the same manner as all other spatial dimensions, thereby decoupling it from them. As a result, the transient solution is obtained without resorting to expensive time stepping schemes. A new, relaxed approach for imposing the vanishing boundary conditions is proposed, improving the quality of the approximation. The asymptotic behavior of the temporal basis functions is studied. The proposed solver scales very well with the dimensionality of the MBS model. Numerical results for systems up to 14 dimensional state space are successfully obtained on a regular personal computer and compared with the corresponding matrix Riccati differential equation (for linear models) or Monte Carlo simulations (for nonlinear models).
Valdivia, Fernando
2014-01-01
Introduction. Medial temporal lobe atrophy assessment via magnetic resonance imaging (MRI) has been proposed in recent criteria as an in vivo diagnostic biomarker of Alzheimer's disease (AD). However, practical application of these criteria in a clinical setting will require automated MRI analysis techniques. To this end, we wished to validate our automated, high-dimensional morphometry technique to the hypothetical prediction of future clinical status from baseline data in a cohort of subjects in a large, multicentric setting, compared to currently known clinical status for these subjects. Materials and Methods. The study group consisted of 214 controls, 371 mild cognitive impairment (147 having progressed to probable AD and 224 stable), and 181 probable AD from the Alzheimer's Disease Neuroimaging Initiative, with data acquired on 58 different 1.5 T scanners. We measured the sensitivity and specificity of our technique in a hierarchical fashion, first testing the effect of intensity standardization, then between different volumes of interest, and finally its generalizability for a large, multicentric cohort. Results. We obtained 73.2% prediction accuracy with 79.5% sensitivity for the prediction of MCI progression to clinically probable AD. The positive predictive value was 81.6% for MCI progressing on average within 1.5 (0.3 s.d.) year. Conclusion. With high accuracy, the technique's ability to identify discriminant medial temporal lobe atrophy has been demonstrated in a large, multicentric environment. It is suitable as an aid for clinical diagnostic of AD. PMID:25254139
Viewpoints: A High-Performance High-Dimensional Exploratory Data Analysis Tool
NASA Astrophysics Data System (ADS)
Gazis, P. R.; Levit, C.; Way, M. J.
2010-12-01
Scientific data sets continue to increase in both size and complexity. In the past, dedicated graphics systems at supercomputing centers were required to visualize large data sets, but as the price of commodity graphics hardware has dropped and its capability has increased, it is now possible, in principle, to view large complex data sets on a single workstation. To do this in practice, an investigator will need software that is written to take advantage of the relevant graphics hardware. The Viewpoints visualization package described herein is an example of such software. Viewpoints is an interactive tool for exploratory visual analysis of large high-dimensional (multivariate) data. It leverages the capabilities of modern graphics boards (GPUs) to run on a single workstation or laptop. Viewpoints is minimalist: it attempts to do a small set of useful things very well (or at least very quickly) in comparison with similar packages today. Its basic feature set includes linked scatter plots with brushing, dynamic histograms, normalization, and outlier detection/removal. Viewpoints was originally designed for astrophysicists, but it has since been used in a variety of fields that range from astronomy, quantum chemistry, fluid dynamics, machine learning, bioinformatics, and finance to information technology server log mining. In this article, we describe the Viewpoints package and show examples of its usage.
Biomarkers for combat-related PTSD: focus on molecular networks from high-dimensional data
Neylan, Thomas C.; Schadt, Eric E.; Yehuda, Rachel
2014-01-01
Posttraumatic stress disorder (PTSD) and other deployment-related outcomes originate from a complex interplay between constellations of changes in DNA, environmental traumatic exposures, and other biological risk factors. These factors affect not only individual genes or bio-molecules but also the entire biological networks that in turn increase or decrease the risk of illness or affect illness severity. This review focuses on recent developments in the field of systems biology which use multidimensional data to discover biological networks affected by combat exposure and post-deployment disease states. By integrating large-scale, high-dimensional molecular, physiological, clinical, and behavioral data, the molecular networks that directly respond to perturbations that can lead to PTSD can be identified and causally associated with PTSD, providing a path to identify key drivers. Reprogrammed neural progenitor cells from fibroblasts from PTSD patients could be established as an in vitro assay for high throughput screening of approved drugs to determine which drugs reverse the abnormal expression of the pathogenic biomarkers or neuronal properties. PMID:25206954
Maljovec, D.; Wang, B.; Pascucci, V.; Bremer, P. T.; Pernice, M.; Mandelli, D.; Nourgaliev, R.
2013-07-01
The next generation of methodologies for nuclear reactor Probabilistic Risk Assessment (PRA) explicitly accounts for the time element in modeling the probabilistic system evolution and uses numerical simulation tools to account for possible dependencies between failure events. The Monte-Carlo (MC) and the Dynamic Event Tree (DET) approaches belong to this new class of dynamic PRA methodologies. A challenge of dynamic PRA algorithms is the large amount of data they produce which may be difficult to visualize and analyze in order to extract useful information. We present a software tool that is designed to address these goals. We model a large-scale nuclear simulation dataset as a high-dimensional scalar function defined over a discrete sample of the domain. First, we provide structural analysis of such a function at multiple scales and provide insight into the relationship between the input parameters and the output. Second, we enable exploratory analysis for users, where we help the users to differentiate features from noise through multi-scale analysis on an interactive platform, based on domain knowledge and data characterization. Our analysis is performed by exploiting the topological and geometric properties of the domain, building statistical models based on its topological segmentations and providing interactive visual interfaces to facilitate such explorations. We provide a user's guide to our software tool by highlighting its analysis and visualization capabilities, along with a use case involving data from a nuclear reactor safety simulation. (authors)
A sparse grid based method for generative dimensionality reduction of high-dimensional data
NASA Astrophysics Data System (ADS)
Bohn, Bastian; Garcke, Jochen; Griebel, Michael
2016-03-01
Generative dimensionality reduction methods play an important role in machine learning applications because they construct an explicit mapping from a low-dimensional space to the high-dimensional data space. We discuss a general framework to describe generative dimensionality reduction methods, where the main focus lies on a regularized principal manifold learning variant. Since most generative dimensionality reduction algorithms exploit the representer theorem for reproducing kernel Hilbert spaces, their computational costs grow at least quadratically in the number n of data. Instead, we introduce a grid-based discretization approach which automatically scales just linearly in n. To circumvent the curse of dimensionality of full tensor product grids, we use the concept of sparse grids. Furthermore, in real-world applications, some embedding directions are usually more important than others and it is reasonable to refine the underlying discretization space only in these directions. To this end, we employ a dimension-adaptive algorithm which is based on the ANOVA (analysis of variance) decomposition of a function. In particular, the reconstruction error is used to measure the quality of an embedding. As an application, the study of large simulation data from an engineering application in the automotive industry (car crash simulation) is performed.
A novel algorithm for simultaneous SNP selection in high-dimensional genome-wide association studies
2012-01-01
Background Identification of causal SNPs in most genome wide association studies relies on approaches that consider each SNP individually. However, there is a strong correlation structure among SNPs that needs to be taken into account. Hence, increasingly modern computationally expensive regression methods are employed for SNP selection that consider all markers simultaneously and thus incorporate dependencies among SNPs. Results We develop a novel multivariate algorithm for large scale SNP selection using CAR score regression, a promising new approach for prioritizing biomarkers. Specifically, we propose a computationally efficient procedure for shrinkage estimation of CAR scores from high-dimensional data. Subsequently, we conduct a comprehensive comparison study including five advanced regression approaches (boosting, lasso, NEG, MCP, and CAR score) and a univariate approach (marginal correlation) to determine the effectiveness in finding true causal SNPs. Conclusions Simultaneous SNP selection is a challenging task. We demonstrate that our CAR score-based algorithm consistently outperforms all competing approaches, both uni- and multivariate, in terms of correctly recovered causal SNPs and SNP ranking. An R package implementing the approach as well as R code to reproduce the complete study presented here is available from http://strimmerlab.org/software/care/. PMID:23113980
Spanning high-dimensional expression space using ribosome-binding site combinatorics
Zelcbuch, Lior; Antonovsky, Niv; Bar-Even, Arren; Levin-Karp, Ayelet; Barenholz, Uri; Dayagi, Michal; Liebermeister, Wolfram; Flamholz, Avi; Noor, Elad; Amram, Shira; Brandis, Alexander; Bareia, Tasneem; Yofe, Ido; Jubran, Halim; Milo, Ron
2013-01-01
Protein levels are a dominant factor shaping natural and synthetic biological systems. Although proper functioning of metabolic pathways relies on precise control of enzyme levels, the experimental ability to balance the levels of many genes in parallel is a major outstanding challenge. Here, we introduce a rapid and modular method to span the expression space of several proteins in parallel. By combinatorially pairing genes with a compact set of ribosome-binding sites, we modulate protein abundance by several orders of magnitude. We demonstrate our strategy by using a synthetic operon containing fluorescent proteins to span a 3D color space. Using the same approach, we modulate a recombinant carotenoid biosynthesis pathway in Escherichia coli to reveal a diversity of phenotypes, each characterized by a distinct carotenoid accumulation profile. In a single combinatorial assembly, we achieve a yield of the industrially valuable compound astaxanthin 4-fold higher than previously reported. The methodology presented here provides an efficient tool for exploring a high-dimensional expression space to locate desirable phenotypes. PMID:23470993
Some results on pseudo-collar structures on high-dimensional manifolds
NASA Astrophysics Data System (ADS)
Rolland, Jeffrey Joseph
In this dissertation we outline a partial reverse to Quilen's plus construction in the high-dimensional manifold categor. We show that for any orientable manifold N with fundamental group Q and any fintely presented superperfect group S, there is a 1-sided s-cobordism (W, N, N-) with the fundamental group G of N- a semi-direct product of Q by S, that is, with G satisying 1 → S → G → Q → 1 and actually a semi-direct product. We then use a free product of Thompson's group V with itself to form a superperfect group S and start with an orientable manifold N with fundamental group Z, the integers, and form semi-direct products of (S x S ... x S) with Z and cobordism ( W1, N, N-), (W 2, N-, N--), (W3, N--, N---) and so on and glue these 1-sided s-cobordisms together to form an uncoutable family of 1-ended pseudo-collarable manifolds V all with non-pro-isomorphic fundamental group systems at infinity. Finally, we generalize a result of Guilbault and Tinsley to show that in M is a manifold with hypo-Abelian fundamental group with an element of infinite order, then there is an absolutely inward tame manifold V with boundary M which fails to be pseudo-collaarable.
A common, high-dimensional model of the representational space in human ventral temporal cortex
Haxby, James V.; Guntupalli, J. Swaroop; Connolly, Andrew C.; Halchenko, Yaroslav O.; Conroy, Bryan R.; Gobbini, M. Ida; Hanke, Michael; Ramadge, Peter J.
2011-01-01
Summary We present a high-dimensional model of the representational space in human ventral temporal (VT) cortex in which dimensions are response-tuning functions that are common across individuals and patterns of response are modeled as weighted sums of basis patterns associated with these response-tunings. We map response pattern vectors, measured with fMRI, from individual subjects’ voxel spaces into this common model space using a new method, ‘hyperalignment’. Hyperalignment parameters based on responses during one experiment – movie-viewing – identified 35 common response-tuning functions that captured fine-grained distinctions among a wide range of stimuli in the movie and in two category perception experiments. Between-subject classification (BSC, multivariate pattern classification based on other subjects’ data) of response pattern vectors in common model space greatly exceeded BSC of anatomically-aligned responses and matched within-subject classification. Results indicate that population codes for complex visual stimuli in VT cortex are based on response-tuning functions that are common across individuals. PMID:22017997
Quantum tomography of near-unitary processes in high-dimensional quantum systems
NASA Astrophysics Data System (ADS)
Lysne, Nathan; Sosa Martinez, Hector; Jessen, Poul; Baldwin, Charles; Kalev, Amir; Deutsch, Ivan
2016-05-01
Quantum Tomography (QT) is often considered the ideal tool for experimental debugging of quantum devices, capable of delivering complete information about quantum states (QST) or processes (QPT). In practice, the protocols used for QT are resource intensive and scale poorly with system size. In this situation, a well behaved model system with access to large state spaces (qudits) can serve as a useful platform for examining the tradeoffs between resource cost and accuracy inherent in QT. In past years we have developed one such experimental testbed, consisting of the electron-nuclear spins in the electronic ground state of individual Cs atoms. Our available toolkit includes high fidelity state preparation, complete unitary control, arbitrary orthogonal measurements, and accurate and efficient QST in Hilbert space dimensions up to d = 16. Using these tools, we have recently completed a comprehensive study of QPT in 4, 7 and 16 dimensions. Our results show that QPT of near-unitary processes is quite feasible if one chooses optimal input states and efficient QST on the outputs. We further show that for unitary processes in high dimensional spaces, one can use informationally incomplete QPT to achieve high-fidelity process reconstruction (90% in d = 16) with greatly reduced resource requirements.
Nonparametric Independence Screening in Sparse Ultra-High Dimensional Additive Models.
Fan, Jianqing; Feng, Yang; Song, Rui
2011-06-01
A variable screening procedure via correlation learning was proposed in Fan and Lv (2008) to reduce dimensionality in sparse ultra-high dimensional models. Even when the true model is linear, the marginal regression can be highly nonlinear. To address this issue, we further extend the correlation learning to marginal nonparametric learning. Our nonparametric independence screening is called NIS, a specific member of the sure independence screening. Several closely related variable screening procedures are proposed. Under general nonparametric models, it is shown that under some mild technical conditions, the proposed independence screening methods enjoy a sure screening property. The extent to which the dimensionality can be reduced by independence screening is also explicitly quantified. As a methodological extension, a data-driven thresholding and an iterative nonparametric independence screening (INIS) are also proposed to enhance the finite sample performance for fitting sparse additive models. The simulation results and a real data analysis demonstrate that the proposed procedure works well with moderate sample size and large dimension and performs better than competing methods. PMID:22279246
Nonparametric Independence Screening in Sparse Ultra-High Dimensional Varying Coefficient Models.
Fan, Jianqing; Ma, Yunbei; Dai, Wei
2014-01-01
The varying-coefficient model is an important class of nonparametric statistical model that allows us to examine how the effects of covariates vary with exposure variables. When the number of covariates is large, the issue of variable selection arises. In this paper, we propose and investigate marginal nonparametric screening methods to screen variables in sparse ultra-high dimensional varying-coefficient models. The proposed nonparametric independence screening (NIS) selects variables by ranking a measure of the nonparametric marginal contributions of each covariate given the exposure variable. The sure independent screening property is established under some mild technical conditions when the dimensionality is of nonpolynomial order, and the dimensionality reduction of NIS is quantified. To enhance the practical utility and finite sample performance, two data-driven iterative NIS methods are proposed for selecting thresholding parameters and variables: conditional permutation and greedy methods, resulting in Conditional-INIS and Greedy-INIS. The effectiveness and flexibility of the proposed methods are further illustrated by simulation studies and real data applications. PMID:25309009
Zhang, Zheng; Yang, Xiu; Oseledets, Ivan; Karniadakis, George E.; Daniel, Luca
2015-01-31
Hierarchical uncertainty quantification can reduce the computational cost of stochastic circuit simulation by employing spectral methods at different levels. This paper presents an efficient framework to simulate hierarchically some challenging stochastic circuits/systems that include high-dimensional subsystems. Due to the high parameter dimensionality, it is challenging to both extract surrogate models at the low level of the design hierarchy and to handle them in the high-level simulation. In this paper, we develop an efficient analysis of variance-based stochastic circuit/microelectromechanical systems simulator to efficiently extract the surrogate models at the low level. In order to avoid the curse of dimensionality, we employ tensor-train decomposition at the high level to construct the basis functions and Gauss quadrature points. As a demonstration, we verify our algorithm on a stochastic oscillator with four MEMS capacitors and 184 random parameters. This challenging example is efficiently simulated by our simulator at the cost of only 10min in MATLAB on a regular personal computer.
Dan Maljovec; Bei Wang; Valerio Pascucci; Peer-Timo Bremer; Michael Pernice; Robert Nourgaliev
2013-05-01
The next generation of methodologies for nuclear reactor Probabilistic Risk Assessment (PRA) explicitly accounts for the time element in modeling the probabilistic system evolution and uses numerical simulation tools to account for possible dependencies between failure events. The Monte-Carlo (MC) and the Dynamic Event Tree (DET) approaches belong to this new class of dynamic PRA methodologies. A challenge of dynamic PRA algorithms is the large amount of data they produce which may be difficult to visualize and analyze in order to extract useful information. We present a software tool that is designed to address these goals. We model a large-scale nuclear simulation dataset as a high-dimensional scalar function defined over a discrete sample of the domain. First, we provide structural analysis of such a function at multiple scales and provide insight into the relationship between the input parameters and the output. Second, we enable exploratory analysis for users, where we help the users to differentiate features from noise through multi-scale analysis on an interactive platform, based on domain knowledge and data characterization. Our analysis is performed by exploiting the topological and geometric properties of the domain, building statistical models based on its topological segmentations and providing interactive visual interfaces to facilitate such explorations. We provide a user’s guide to our software tool by highlighting its analysis and visualization capabilities, along with a use case involving dataset from a nuclear reactor safety simulation.
High-dimensional statistical measure for region-of-interest tracking.
Boltz, Sylvain; Debreuve, Eric; Barlaud, Michel
2009-06-01
This paper deals with region-of-interest (ROI) tracking in video sequences. The goal is to determine in successive frames the region which best matches, in terms of a similarity measure, a ROI defined in a reference frame. Some tracking methods define similarity measures which efficiently combine several visual features into a probability density function (PDF) representation, thus building a discriminative model of the ROI. This approach implies dealing with PDFs with domains of definition of high dimension. To overcome this obstacle, a standard solution is to assume independence between the different features in order to bring out low-dimension marginal laws and/or to make some parametric assumptions on the PDFs at the cost of generality. We discard these assumptions by proposing to compute the Kullback-Leibler divergence between high-dimensional PDFs using the k th nearest neighbor framework. In consequence, the divergence is expressed directly from the samples, i.e., without explicit estimation of the underlying PDFs. As an application, we defined 5, 7, and 13-dimensional feature vectors containing color information (including pixel-based, gradient-based and patch-based) and spatial layout. The proposed procedure performs tracking allowing for translation and scaling of the ROI. Experiments show its efficiency on a movie excerpt and standard test sequences selected for the specific conditions they exhibit: partial occlusions, variations of luminance, noise, and complex motion. PMID:19369157
Feature Augmentation via Nonparametrics and Selection (FANS) in High-Dimensional Classification
Feng, Yang; Jiang, Jiancheng; Tong, Xin
2015-01-01
We propose a high dimensional classification method that involves nonparametric feature augmentation. Knowing that marginal density ratios are the most powerful univariate classifiers, we use the ratio estimates to transform the original feature measurements. Subsequently, penalized logistic regression is invoked, taking as input the newly transformed or augmented features. This procedure trains models equipped with local complexity and global simplicity, thereby avoiding the curse of dimensionality while creating a flexible nonlinear decision boundary. The resulting method is called Feature Augmentation via Nonparametrics and Selection (FANS). We motivate FANS by generalizing the Naive Bayes model, writing the log ratio of joint densities as a linear combination of those of marginal densities. It is related to generalized additive models, but has better interpretability and computability. Risk bounds are developed for FANS. In numerical analysis, FANS is compared with competing methods, so as to provide a guideline on its best application domain. Real data analysis demonstrates that FANS performs very competitively on benchmark email spam and gene expression data sets. Moreover, FANS is implemented by an extremely fast algorithm through parallel computing. PMID:27185970
Semi-implicit integration factor methods on sparse grids for high-dimensional systems
NASA Astrophysics Data System (ADS)
Wang, Dongyong; Chen, Weitao; Nie, Qing
2015-07-01
Numerical methods for partial differential equations in high-dimensional spaces are often limited by the curse of dimensionality. Though the sparse grid technique, based on a one-dimensional hierarchical basis through tensor products, is popular for handling challenges such as those associated with spatial discretization, the stability conditions on time step size due to temporal discretization, such as those associated with high-order derivatives in space and stiff reactions, remain. Here, we incorporate the sparse grids with the implicit integration factor method (IIF) that is advantageous in terms of stability conditions for systems containing stiff reactions and diffusions. We combine IIF, in which the reaction is treated implicitly and the diffusion is treated explicitly and exactly, with various sparse grid techniques based on the finite element and finite difference methods and a multi-level combination approach. The overall method is found to be efficient in terms of both storage and computational time for solving a wide range of PDEs in high dimensions. In particular, the IIF with the sparse grid combination technique is flexible and effective in solving systems that may include cross-derivatives and non-constant diffusion coefficients. Extensive numerical simulations in both linear and nonlinear systems in high dimensions, along with applications of diffusive logistic equations and Fokker-Planck equations, demonstrate the accuracy, efficiency, and robustness of the new methods, indicating potential broad applications of the sparse grid-based integration factor method.
Semi-implicit Integration Factor Methods on Sparse Grids for High-Dimensional Systems
Wang, Dongyong; Chen, Weitao; Nie, Qing
2015-01-01
Numerical methods for partial differential equations in high-dimensional spaces are often limited by the curse of dimensionality. Though the sparse grid technique, based on a one-dimensional hierarchical basis through tensor products, is popular for handling challenges such as those associated with spatial discretization, the stability conditions on time step size due to temporal discretization, such as those associated with high-order derivatives in space and stiff reactions, remain. Here, we incorporate the sparse grids with the implicit integration factor method (IIF) that is advantageous in terms of stability conditions for systems containing stiff reactions and diffusions. We combine IIF, in which the reaction is treated implicitly and the diffusion is treated explicitly and exactly, with various sparse grid techniques based on the finite element and finite difference methods and a multi-level combination approach. The overall method is found to be efficient in terms of both storage and computational time for solving a wide range of PDEs in high dimensions. In particular, the IIF with the sparse grid combination technique is flexible and effective in solving systems that may include cross-derivatives and non-constant diffusion coefficients. Extensive numerical simulations in both linear and nonlinear systems in high dimensions, along with applications of diffusive logistic equations and Fokker-Planck equations, demonstrate the accuracy, efficiency, and robustness of the new methods, indicating potential broad applications of the sparse grid-based integration factor method. PMID:25897178
NASA Astrophysics Data System (ADS)
Park, Youngyong; Do, Younghae; Altmeyer, Sebastian; Lai, Ying-Cheng; Lee, GyuWon
2015-02-01
We investigate high-dimensional nonlinear dynamical systems exhibiting multiple resonances under adiabatic parameter variations. Our motivations come from experimental considerations where time-dependent sweeping of parameters is a practical approach to probing and characterizing the bifurcations of the system. The question is whether bifurcations so detected are faithful representations of the bifurcations intrinsic to the original stationary system. Utilizing a harmonically forced, closed fluid flow system that possesses multiple resonances and solving the Navier-Stokes equation under proper boundary conditions, we uncover the phenomenon of the early effect. Specifically, as a control parameter, e.g., the driving frequency, is adiabatically increased from an initial value, resonances emerge at frequency values that are lower than those in the corresponding stationary system. The phenomenon is established by numerical characterization of physical quantities through the resonances, which include the kinetic energy and the vorticity field, and a heuristic analysis based on the concept of instantaneous frequency. A simple formula is obtained which relates the resonance points in the time-dependent and time-independent systems. Our findings suggest that, in general, any true bifurcation of a nonlinear dynamical system can be unequivocally uncovered through adiabatic parameter sweeping, in spite of a shift in the bifurcation point, which is of value to experimental studies of nonlinear dynamical systems.
Hou, Jiayi
2015-01-01
An ordinal scale is commonly used to measure health status and disease related outcomes in hospital settings as well as in translational medical research. In addition, repeated measurements are common in clinical practice for tracking and monitoring the progression of complex diseases. Classical methodology based on statistical inference, in particular, ordinal modeling has contributed to the analysis of data in which the response categories are ordered and the number of covariates (p) remains smaller than the sample size (n). With the emergence of genomic technologies being increasingly applied for more accurate diagnosis and prognosis, high-dimensional data where the number of covariates (p) is much larger than the number of samples (n), are generated. To meet the emerging needs, we introduce our proposed model which is a two-stage algorithm: Extend the Generalized Monotone Incremental Forward Stagewise (GMIFS) method to the cumulative logit ordinal model; and combine the GMIFS procedure with the classical mixed-effects model for classifying disease status in disease progression along with time. We demonstrate the efficiency and accuracy of the proposed models in classification using a time-course microarray dataset collected from the Inflammation and the Host Response to Injury study. PMID:25720102
Snyder, Abigail C.; Jiao, Yu
2010-10-01
Neutron experiments at the Spallation Neutron Source (SNS) at Oak Ridge National Laboratory (ORNL) frequently generate large amounts of data (on the order of 106-1012 data points). Hence, traditional data analysis tools run on a single CPU take too long to be practical and scientists are unable to efficiently analyze all data generated by experiments. Our goal is to develop a scalable algorithm to efficiently compute high-dimensional integrals of arbitrary functions. This algorithm can then be used to integrate the four-dimensional integrals that arise as part of modeling intensity from the experiments at the SNS. Here, three different one-dimensional numerical integration solvers from the GNU Scientific Library were modified and implemented to solve four-dimensional integrals. The results of these solvers on a final integrand provided by scientists at the SNS can be compared to the results of other methods, such as quasi-Monte Carlo methods, computing the same integral. A parallelized version of the most efficient method can allow scientists the opportunity to more effectively analyze all experimental data.
Spanning high-dimensional expression space using ribosome-binding site combinatorics.
Zelcbuch, Lior; Antonovsky, Niv; Bar-Even, Arren; Levin-Karp, Ayelet; Barenholz, Uri; Dayagi, Michal; Liebermeister, Wolfram; Flamholz, Avi; Noor, Elad; Amram, Shira; Brandis, Alexander; Bareia, Tasneem; Yofe, Ido; Jubran, Halim; Milo, Ron
2013-05-01
Protein levels are a dominant factor shaping natural and synthetic biological systems. Although proper functioning of metabolic pathways relies on precise control of enzyme levels, the experimental ability to balance the levels of many genes in parallel is a major outstanding challenge. Here, we introduce a rapid and modular method to span the expression space of several proteins in parallel. By combinatorially pairing genes with a compact set of ribosome-binding sites, we modulate protein abundance by several orders of magnitude. We demonstrate our strategy by using a synthetic operon containing fluorescent proteins to span a 3D color space. Using the same approach, we modulate a recombinant carotenoid biosynthesis pathway in Escherichia coli to reveal a diversity of phenotypes, each characterized by a distinct carotenoid accumulation profile. In a single combinatorial assembly, we achieve a yield of the industrially valuable compound astaxanthin 4-fold higher than previously reported. The methodology presented here provides an efficient tool for exploring a high-dimensional expression space to locate desirable phenotypes. PMID:23470993
NASA Astrophysics Data System (ADS)
Hallman, Eric J.; Burns, Jack O.; Motl, Patrick M.; Norman, Michael L.
2007-08-01
We have analyzed a large sample of numerically simulated clusters to demonstrate the adverse effects resulting from the use of X-ray-fitted β-model parameters with Sunyaev-Zeldovich effect (SZE) data. There is a fundamental incompatibility between β-model fits to X-ray surface brightness profiles and those done with SZE profiles. Since observational SZE radial profiles are in short supply, the X-ray parameters are often used in SZE analysis. We show that this leads to biased estimates of the integrated Compton y-parameter inside r500 calculated from clusters. We suggest a simple correction of the method, using a nonisothermal β-model modified by a universal temperature profile, which brings these calculated quantities into closer agreement with the true values.
Altermann, Susanne; Leavitt, Steven D.; Goward, Trevor; Nelsen, Matthew P.; Lumbsch, H. Thorsten
2014-01-01
The inclusion of molecular data is increasingly an integral part of studies assessing species boundaries. Analyses based on predefined groups may obscure patterns of differentiation, and population assignment tests provide an alternative for identifying population structure and barriers to gene flow. In this study, we apply population assignment tests implemented in the programs STRUCTURE and BAPS to single nucleotide polymorphisms from DNA sequence data generated for three previous studies of the lichenized fungal genus Letharia. Previous molecular work employing a gene genealogical approach circumscribed six species-level lineages within the genus, four putative lineages within the nominal taxon L. columbiana (Nutt.) J.W. Thomson and two sorediate lineages. We show that Bayesian clustering implemented in the program STRUCTURE was generally able to recover the same six putative Letharia lineages. Population assignments were largely consistent across a range of scenarios, including: extensive amounts of missing data, the exclusion of SNPs from variable markers, and inferences based on SNPs from as few as three gene regions. While our study provided additional evidence corroborating the six candidate Letharia species, the equivalence of these genetic clusters with species-level lineages is uncertain due, in part, to limited phylogenetic signal. Furthermore, both the BAPS analysis and the ad hoc ΔK statistic from results of the STRUCTURE analysis suggest that population structure can possibly be captured with fewer genetic groups. Our findings also suggest that uneven sampling across taxa may be responsible for the contrasting inferences of population substructure. Our results consistently supported two distinct sorediate groups, ‘L. lupina’ and L. vulpina, and subtle morphological differences support this distinction. Similarly, the putative apotheciate species ‘L. lucida’ was also consistently supported as a distinct genetic cluster. However, additional
A novel multi-manifold classification model via path-based clustering for image retrieval
NASA Astrophysics Data System (ADS)
Zhu, Rong; Yuan, Zhijun; Xuan, Junying
2011-12-01
Nowadays, with digital cameras and mass storage devices becoming increasingly affordable, each day thousands of pictures are taken and images on the Internet are emerged at an astonishing rate. Image retrieval is a process of searching valuable information that user demanded from huge images. However, it is hard to find satisfied results due to the well known "semantic gap". Image classification plays an essential role in retrieval process. But traditional methods will encounter problems when dealing with high-dimensional and large-scale image sets in applications. Here, we propose a novel multi-manifold classification model for image retrieval. Firstly, we simplify the classification of images from high-dimensional space into the one on low-dimensional manifolds, largely reducing the complexity of classification process. Secondly, considering that traditional distance measures often fail to find correct visual semantics of manifolds, especially when dealing with the images having complex data distribution, we also define two new distance measures based on path-based clustering, and further applied to the construction of a multi-class image manifold. One experiment was conducted on 2890 Web images. The comparison results between three methods show that the proposed method achieves the highest classification accuracy.
Bayesian Decision Theoretical Framework for Clustering
ERIC Educational Resources Information Center
Chen, Mo
2011-01-01
In this thesis, we establish a novel probabilistic framework for the data clustering problem from the perspective of Bayesian decision theory. The Bayesian decision theory view justifies the important questions: what is a cluster and what a clustering algorithm should optimize. We prove that the spectral clustering (to be specific, the…
Zhao, Lue Ping; Bolouri, Hamid
2016-04-01
Maturing omics technologies enable researchers to generate high dimension omics data (HDOD) routinely in translational clinical studies. In the field of oncology, The Cancer Genome Atlas (TCGA) provided funding support to researchers to generate different types of omics data on a common set of biospecimens with accompanying clinical data and has made the data available for the research community to mine. One important application, and the focus of this manuscript, is to build predictive models for prognostic outcomes based on HDOD. To complement prevailing regression-based approaches, we propose to use an object-oriented regression (OOR) methodology to identify exemplars specified by HDOD patterns and to assess their associations with prognostic outcome. Through computing patient's similarities to these exemplars, the OOR-based predictive model produces a risk estimate using a patient's HDOD. The primary advantages of OOR are twofold: reducing the penalty of high dimensionality and retaining the interpretability to clinical practitioners. To illustrate its utility, we apply OOR to gene expression data from non-small cell lung cancer patients in TCGA and build a predictive model for prognostic survivorship among stage I patients, i.e., we stratify these patients by their prognostic survival risks beyond histological classifications. Identification of these high-risk patients helps oncologists to develop effective treatment protocols and post-treatment disease management plans. Using the TCGA data, the total sample is divided into training and validation data sets. After building up a predictive model in the training set, we compute risk scores from the predictive model, and validate associations of risk scores with prognostic outcome in the validation data (P-value=0.015). PMID:26972839
Sill, Martin; Saadati, Maral; Benner, Axel
2015-01-01
Motivation: Principal component analysis (PCA) is a basic tool often used in bioinformatics for visualization and dimension reduction. However, it is known that PCA may not consistently estimate the true direction of maximal variability in high-dimensional, low sample size settings, which are typical for molecular data. Assuming that the underlying signal is sparse, i.e. that only a fraction of features contribute to a principal component (PC), this estimation consistency can be retained. Most existing sparse PCA methods use L1-penalization, i.e. the lasso, to perform feature selection. But, the lasso is known to lack variable selection consistency in high dimensions and therefore a subsequent interpretation of selected features can give misleading results. Results: We present S4VDPCA, a sparse PCA method that incorporates a subsampling approach, namely stability selection. S4VDPCA can consistently select the truly relevant variables contributing to a sparse PC while also consistently estimate the direction of maximal variability. The performance of the S4VDPCA is assessed in a simulation study and compared to other PCA approaches, as well as to a hypothetical oracle PCA that ‘knows’ the truly relevant features in advance and thus finds optimal, unbiased sparse PCs. S4VDPCA is computationally efficient and performs best in simulations regarding parameter estimation consistency and feature selection consistency. Furthermore, S4VDPCA is applied to a publicly available gene expression data set of medulloblastoma brain tumors. Features contributing to the first two estimated sparse PCs represent genes significantly over-represented in pathways typically deregulated between molecular subgroups of medulloblastoma. Availability and implementation: Software is available at https://github.com/mwsill/s4vdpca. Contact: m.sill@dkfz.de Supplementary information: Supplementary data are available at Bioinformatics online. PMID:25861969
From Ambiguities to Insights: Query-based Comparisons of High-Dimensional Data
NASA Astrophysics Data System (ADS)
Kowalski, Jeanne; Talbot, Conover; Tsai, Hua L.; Prasad, Nijaguna; Umbricht, Christopher; Zeiger, Martha A.
2007-11-01
Genomic technologies will revolutionize drag discovery and development; that much is universally agreed upon. The high dimension of data from such technologies has challenged available data analytic methods; that much is apparent. To date, large-scale data repositories have not been utilized in ways that permit their wealth of information to be efficiently processed for knowledge, presumably due in large part to inadequate analytical tools to address numerous comparisons of high-dimensional data. In candidate gene discovery, expression comparisons are often made between two features (e.g., cancerous versus normal), such that the enumeration of outcomes is manageable. With multiple features, the setting becomes more complex, in terms of comparing expression levels of tens of thousands transcripts across hundreds of features. In this case, the number of outcomes, while enumerable, become rapidly large and unmanageable, and scientific inquiries become more abstract, such as "which one of these (compounds, stimuli, etc.) is not like the others?" We develop analytical tools that promote more extensive, efficient, and rigorous utilization of the public data resources generated by the massive support of genomic studies. Our work innovates by enabling access to such metadata with logically formulated scientific inquires that define, compare and integrate query-comparison pair relations for analysis. We demonstrate our computational tool's potential to address an outstanding biomedical informatics issue of identifying reliable molecular markers in thyroid cancer. Our proposed query-based comparison (QBC) facilitates access to and efficient utilization of metadata through logically formed inquires expressed as query-based comparisons by organizing and comparing results from biotechnologies to address applications in biomedicine.
2014-01-01
Motivation It is common to get an optimal combination of markers for disease classification and prediction when multiple markers are available. Many approaches based on the area under the receiver operating characteristic curve (AUC) have been proposed. Existing works based on AUC in a high-dimensional context depend mainly on a non-parametric, smooth approximation of AUC, with no work using a parametric AUC-based approach, for high-dimensional data. Results We propose an AUC-based approach using penalized regression (AucPR), which is a parametric method used for obtaining a linear combination for maximizing the AUC. To obtain the AUC maximizer in a high-dimensional context, we transform a classical parametric AUC maximizer, which is used in a low-dimensional context, into a regression framework and thus, apply the penalization regression approach directly. Two kinds of penalization, lasso and elastic net, are considered. The parametric approach can avoid some of the difficulties of a conventional non-parametric AUC-based approach, such as the lack of an appropriate concave objective function and a prudent choice of the smoothing parameter. We apply the proposed AucPR for gene selection and classification using four real microarray and synthetic data. Through numerical studies, AucPR is shown to perform better than the penalized logistic regression and the nonparametric AUC-based method, in the sense of AUC and sensitivity for a given specificity, particularly when there are many correlated genes. Conclusion We propose a powerful parametric and easily-implementable linear classifier AucPR, for gene selection and disease prediction for high-dimensional data. AucPR is recommended for its good prediction performance. Beside gene expression microarray data, AucPR can be applied to other types of high-dimensional omics data, such as miRNA and protein data. PMID:25559769
Xie, Benhuai; Shen, Xiaotong
2009-01-01
Clustering analysis is one of the most widely used statistical tools in many emerging areas such as microarray data analysis. For microarray and other high-dimensional data, the presence of many noise variables may mask underlying clustering structures. Hence removing noise variables via variable selection is necessary. For simultaneous variable selection and parameter estimation, existing penalized likelihood approaches in model-based clustering analysis all assume a common diagonal covariance matrix across clusters, which however may not hold in practice. To analyze high-dimensional data, particularly those with relatively low sample sizes, this article introduces a novel approach that shrinks the variances together with means, in a more general situation with cluster-specific (diagonal) covariance matrices. Furthermore, selection of grouped variables via inclusion or exclusion of a group of variables altogether is permitted by a specific form of penalty, which facilitates incorporating subject-matter knowledge, such as gene functions in clustering microarray samples for disease subtype discovery. For implementation, EM algorithms are derived for parameter estimation, in which the M-steps clearly demonstrate the effects of shrinkage and thresholding. Numerical examples, including an application to acute leukemia subtype discovery with microarray gene expression data, are provided to demonstrate the utility and advantage of the proposed method. PMID:19920875
Value-balanced agglomerative connectivity clustering
NASA Astrophysics Data System (ADS)
Gupta, Gunjan K.; Ghosh, Joydeep
2001-03-01
In this paper we propose a new clustering framework for transactional data-sets involving large numbers of customers and products. Such transactional data pose particular issues such as very high dimensionality (greater than 10,000), and sparse categorical entries, that have been dealt with more effectively using a graph-based approach to clustering such as ROCK. But large transactional data raises certain other issues such as how to compare diverse products (e.g. milk vs. cars) cluster balancing and outlier removal, that need to be addressed. We first propose a new similarity measure that takes the value of the goods purchased into account, and form a value-based graph representation based on this similarity measure. A novel value-based balancing criterion that allows the user to control the balancing of clusters, is then defined. This balancing criterion is integrated with a value-based goodness measure for merging two clusters in an agglomerative clustering routine. Since graph-based clustering algorithms are very sensitive to outliers, we also propose a fast, effective and simple outlier detection and removal method based on under-clustering or over- partitioning. The performance of the proposed clustering framework is compared with leading graph-theoretic approaches such as ROCK and METIS.
Management of cluster headache.
Tfelt-Hansen, Peer C; Jensen, Rigmor H
2012-07-01
. For most cluster headache patients there are fairly good treatment options both for acute attacks and for prophylaxis. The big problem is the diagnosis of cluster headache as demonstrated by the diagnostic delay of 7 years. However, the relatively short-lasting attack of pain in one eye with typical associated symptoms should lead the family doctor to suspect cluster headache resulting in a referral to a neurologist or a headache centre with experience in the treatment of cluster headache. PMID:22650381
Matlab Cluster Ensemble Toolbox
2009-04-27
This is a Matlab toolbox for investigating the application of cluster ensembles to data classification, with the objective of improving the accuracy and/or speed of clustering. The toolbox divides the cluster ensemble problem into four areas, providing functionality for each. These include, (1) synthetic data generation, (2) clustering to generate individual data partitions and similarity matrices, (3) consensus function generation and final clustering to generate ensemble data partitioning, and (4) implementation of accuracy metrics. Withmore » regard to data generation, Gaussian data of arbitrary dimension can be generated. The kcenters algorithm can then be used to generate individual data partitions by either, (a) subsampling the data and clustering each subsample, or by (b) randomly initializing the algorithm and generating a clustering for each initialization. In either case an overall similarity matrix can be computed using a consensus function operating on the individual similarity matrices. A final clustering can be performed and performance metrics are provided for evaluation purposes.« less
Gomes, Tara; Wilton, Andrew S; Taylor, Valerie H; Ray, Joel G
2015-01-01
Objective To evaluate maternal medical and perinatal outcomes associated with antipsychotic drug use in pregnancy. Design High dimensional propensity score (HDPS) matched cohort study. Setting Multiple linked population health administrative databases in the entire province of Ontario, Canada. Participants Among women who delivered a singleton infant between 2003 and 2012, and who were eligible for provincially funded drug coverage, those with ≥2 consecutive prescriptions for an antipsychotic medication during pregnancy, at least one of which was filled in the first or second trimester, were selected. Of these antipsychotic drug users, 1021 were matched 1:1 with 1021 non-users by means of a HDPS algorithm. Main outcome measures The main maternal medical outcomes were gestational diabetes, hypertensive disorders of pregnancy, and venous thromboembolism. The main perinatal outcomes were preterm birth (<37 weeks), and a birth weight <3rd or >97th centile. Conditional Poisson regression analysis was used to generate rate ratios and 95% confidence intervals, adjusting for additionally prescribed non-antipsychotic psychotropic medications. Results Compared with non-users, women prescribed an antipsychotic medication in pregnancy did not seem to be at higher risk of gestational diabetes (rate ratio 1.10 (95% CI 0.77 to 1.57)), hypertensive disorders of pregnancy (1.12 (0.70 to 1.78)), or venous thromboembolism (0.95 (0.40 to 2.27)). The preterm birth rate, though high among antipsychotic users (14.5%) and matched non-users (14.3%), was not relatively different (rate ratio 0.99 (0.78 to 1.26)). Neither birth weight <3rd centile or >97th centile was associated with antipsychotic drug use in pregnancy (rate ratios 1.21 (0.81 to 1.82) and 1.26 (0.69 to 2.29) respectively). Conclusions Antipsychotic drug use in pregnancy had minimal evident impact on important maternal medical and short term perinatal outcomes. However, the rate of adverse outcomes is high enough to warrant
Sanfilippo, Antonio P.; Calapristi, Augustin J.; Crow, Vernon L.; Hetzler, Elizabeth G.; Turner, Alan E.
2004-05-26
We present an approach to the disambiguation of cluster labels that capitalizes on the notion of semantic similarity to assign WordNet senses to cluster labels. The approach provides interesting insights on how document clustering can provide the basis for developing a novel approach to word sense disambiguation.
ERIC Educational Resources Information Center
Ackerman, Brian P.; Schoff, Kristen; Levinson, Karen; Youngstrom, Eric; Izard, Carroll E.
1999-01-01
Examined relations between alternative representations of poverty cofactors and promotion processes, and problem behaviors of 6- and 7-year-olds from disadvantaged families. Found that single-index risk representations and promotion variables predicted aggression but not anxiety/depression. An additive model of individual risk indicators performed…
ERIC Educational Resources Information Center
Hou, Huei-Tse
2011-01-01
In some higher education courses that focus on case studies, teachers can provide situated scenarios (such as business bottlenecks and medical cases) and problem-solving discussion tasks for students to promote their cognitive skills. There is limited research on the content, performance, and behavioral patterns of teaching using online…
Cool Cluster Correctly Correlated
Sergey Aleksandrovich Varganov
2005-12-17
tens of atoms. Therefore, they are quantum objects. Some qualitative information about the geometries of such clusters can be obtained with classical empirical methods, for example geometry optimization using an empirical Lennard-Jones potential. However, to predict their accurate geometries and other physical and chemical properties it is necessary to solve a Schroedinger equation. If one is not interested in dynamics of clusters it is enough to solve the stationary (time-independent) Schroedinger equation (H{Phi}=E{Phi}). This equation represents a multidimensional eigenvalue problem. The solution of the Schroedinger equation is a set of eigenvectors (wave functions) and their eigenvalues (energies). The lowest energy solution (wave function) corresponds to the ground state of the cluster. The other solutions correspond to excited states. The wave function gives all information about the quantum state of the cluster and can be used to calculate different physical and chemical properties, such as photoelectron, X-ray, NMR, EPR spectra, dipole moment, polarizability etc. The dimensionality of the Schroedinger equation is determined by the number of particles (nuclei and electrons) in the cluster. The analytic solution is only known for a two particle problem. In order to solve the equation for clusters of interest it is necessary to make a number of approximations and use numerical methods.
Semi-Supervised Kernel Mean Shift Clustering.
Anand, Saket; Mittal, Sushil; Tuzel, Oncel; Meer, Peter
2014-06-01
Mean shift clustering is a powerful nonparametric technique that does not require prior knowledge of the number of clusters and does not constrain the shape of the clusters. However, being completely unsupervised, its performance suffers when the original distance metric fails to capture the underlying cluster structure. Despite recent advances in semi-supervised clustering methods, there has been little effort towards incorporating supervision into mean shift. We propose a semi-supervised framework for kernel mean shift clustering (SKMS) that uses only pairwise constraints to guide the clustering procedure. The points are first mapped to a high-dimensional kernel space where the constraints are imposed by a linear transformation of the mapped points. This is achieved by modifying the initial kernel matrix by minimizing a log det divergence-based objective function. We show the advantages of SKMS by evaluating its performance on various synthetic and real datasets while comparing with state-of-the-art semi-supervised clustering algorithms. PMID:26353281
NASA Astrophysics Data System (ADS)
Bambang Avip Priatna, M.; Lukman, Sumiaty, Encum
2016-02-01
This paper aims to determine the properties of Correspondence Analysis (CA) estimator to estimate latent variable models. The method used is the High-Dimensional AIC (HAIC) method with simulation of Bernoulli distribution data. Stages are: (1) determine the matrix CA; (2) create a model of the CA estimator to estimate the latent variables by using HAIC; (3) simulated the Bernoulli distribution data with repetition 1,000,748 times. The simulation results show the CA estimator models work well.
NASA Astrophysics Data System (ADS)
Zhan, You-Bang; Zhang, Qun-Yong; Wang, Yu-Wu; Ma, Peng-Cheng
2010-01-01
We propose a scheme to teleport an unknown single-qubit state by using a high-dimensional entangled state as the quantum channel. As a special case, a scheme for teleportation of an unknown single-qubit state via three-dimensional entangled state is investigated in detail. Also, this scheme can be directly generalized to an unknown f-dimensional state by using a d-dimensional entangled state (d > f) as the quantum channel.
Amir, El-ad David; Davis, Kara L; Tadmor, Michelle D; Simonds, Erin F; Levine, Jacob H; Bendall, Sean C; Shenfeld, Daniel K; Krishnaswamy, Smita; Nolan, Garry P; Pe’er, Dana
2014-01-01
High-dimensional single-cell technologies are revolutionizing the way we understand biological systems. Technologies such as mass cytometry measure dozens of parameters simultaneously in individual cells, making interpretation daunting. We developed viSNE, a tool to map high-dimensional cytometry data onto 2D while conserving high-dimensional structure. We integrated mass cytometry with viSNE to map healthy and cancerous bone marrow samples. Healthy bone marrow maps into a canonical shape that separates between immune subtypes. In leukemia, however, the shape is malformed: the maps of cancer samples are distinct from the healthy map and from each other. viSNE highlights structure in the heterogeneity of surface phenotype expression in cancer, traverses the progression from diagnosis to relapse, and identifies a rare leukemia population in minimal residual disease settings. As several new technologies raise the number of simultaneously measured parameters in each cell to the hundreds, viSNE will become a mainstay in analyzing and interpreting such experiments. PMID:23685480
McGraw, Elizabeth A; Ye, Yixin H; Foley, Brad; Chenoweth, Stephen F; Higgie, Megan; Hine, Emma; Blows, Mark W
2011-11-01
Although adaptive change is usually associated with complex changes in phenotype, few genetic investigations have been conducted on adaptations that involve sets of high-dimensional traits. Microarrays have supplied high-dimensional descriptions of gene expression, and phenotypic change resulting from adaptation often results in large-scale changes in gene expression. We demonstrate how genetic analysis of large-scale changes in gene expression generated during adaptation can be accomplished by determining high-dimensional variance partitioning within classical genetic experimental designs. A microarray experiment conducted on a panel of recombinant inbred lines (RILs) generated from two populations of Drosophila serrata that have diverged in response to natural selection, revealed genetic divergence in 10.6% of 3762 gene products examined. Over 97% of the genetic divergence in transcript abundance was explained by only 12 genetic modules. The two most important modules, explaining 50% of the genetic variance in transcript abundance, were genetically correlated with the morphological traits that are known to be under selection. The expression of three candidate genes from these two important genetic modules was assessed in an independent experiment using qRT-PCR on 430 individuals from the panel of RILs, and confirmed the genetic association between transcript abundance and morphological traits under selection. PMID:22023580
Bhadra, Anindya; Mallick, Bani K
2013-06-01
We describe a Bayesian technique to (a) perform a sparse joint selection of significant predictor variables and significant inverse covariance matrix elements of the response variables in a high-dimensional linear Gaussian sparse seemingly unrelated regression (SSUR) setting and (b) perform an association analysis between the high-dimensional sets of predictors and responses in such a setting. To search the high-dimensional model space, where both the number of predictors and the number of possibly correlated responses can be larger than the sample size, we demonstrate that a marginalization-based collapsed Gibbs sampler, in combination with spike and slab type of priors, offers a computationally feasible and efficient solution. As an example, we apply our method to an expression quantitative trait loci (eQTL) analysis on publicly available single nucleotide polymorphism (SNP) and gene expression data for humans where the primary interest lies in finding the significant associations between the sets of SNPs and possibly correlated genetic transcripts. Our method also allows for inference on the sparse interaction network of the transcripts (response variables) after accounting for the effect of the SNPs (predictor variables). We exploit properties of Gaussian graphical models to make statements concerning conditional independence of the responses. Our method compares favorably to existing Bayesian approaches developed for this purpose. PMID:23607608
NASA Technical Reports Server (NTRS)
Stothers, Richard B.; Chin, Chao-Wen
1992-01-01
New theoretical evolutionary sequences of models for stars with low metallicities, appropriate to the Small Magellanic Cloud, are derived with both standard Cox-Stewart opacities and the new Rogers-Iglesias opacities. Only those sequences with little or no convective core overshooting are found to be capable of reproducing the two most critical observations: the maximum effective temperature displayed by the hot evolved stars and the difference between the average bolometric magnitudes of the hot and cool evolved stars. An upper limit to the ratio of the mean overshoot distance beyond the classical Schwarzschild core boundary to the local pressure scale height is set at 0.2. It is inferred from the frequency of cool supergiants in NGC 330 that the Ledoux criterion, rather than the Schwarzschild criterion, for convection and semiconvection in the envelopes of massive stars is strongly favored. Residuals from the fitting for NGC 330 suggest the possibility of fast interior rotation in the stars of this cluster. NGC 330 and NGC 458 have ages of about 3 x 10 exp 7 and about 1 x 10 exp 8 yr, respectively.
NASA Astrophysics Data System (ADS)
Jazaeri, S.; Amiri-Simkooei, A. R.; Sharifi, M. A.
2012-02-01
GNSS ambiguity resolution is the key issue in the high-precision relative geodetic positioning and navigation applications. It is a problem of integer programming plus integer quality evaluation. Different integer search estimation methods have been proposed for the integer solution of ambiguity resolution. Slow rate of convergence is the main obstacle to the existing methods where tens of ambiguities are involved. Herein, integer search estimation for the GNSS ambiguity resolution based on the lattice theory is proposed. It is mathematically shown that the closest lattice point problem is the same as the integer least-squares (ILS) estimation problem and that the lattice reduction speeds up searching process. We have implemented three integer search strategies: Agrell, Eriksson, Vardy, Zeger (AEVZ), modification of Schnorr-Euchner enumeration (M-SE) and modification of Viterbo-Boutros enumeration (M-VB). The methods have been numerically implemented in several simulated examples under different scenarios and over 100 independent runs. The decorrelation process (or unimodular transformations) has been first used to transform the original ILS problem to a new one in all simulations. We have then applied different search algorithms to the transformed ILS problem. The numerical simulations have shown that AEVZ, M-SE, and M-VB are about 320, 120 and 50 times faster than LAMBDA, respectively, for a search space of dimension 40. This number could change to about 350, 160 and 60 for dimension 45. The AEVZ is shown to be faster than MLAMBDA by a factor of 5. Similar conclusions could be made using the application of the proposed algorithms to the real GPS data.
Pyne, Saumyadipta; Lee, Sharon X.; Wang, Kui; Irish, Jonathan; Tamayo, Pablo; Nazaire, Marc-Danie; Duong, Tarn; Ng, Shu-Kay; Hafler, David; Levy, Ronald; Nolan, Garry P.; Mesirov, Jill; McLachlan, Geoffrey J.
2014-01-01
In biomedical applications, an experimenter encounters different potential sources of variation in data such as individual samples, multiple experimental conditions, and multivariate responses of a panel of markers such as from a signaling network. In multiparametric cytometry, which is often used for analyzing patient samples, such issues are critical. While computational methods can identify cell populations in individual samples, without the ability to automatically match them across samples, it is difficult to compare and characterize the populations in typical experiments, such as those responding to various stimulations or distinctive of particular patients or time-points, especially when there are many samples. Joint Clustering and Matching (JCM) is a multi-level framework for simultaneous modeling and registration of populations across a cohort. JCM models every population with a robust multivariate probability distribution. Simultaneously, JCM fits a random-effects model to construct an overall batch template – used for registering populations across samples, and classifying new samples. By tackling systems-level variation, JCM supports practical biomedical applications involving large cohorts. Software for fitting the JCM models have been implemented in an R package EMMIX-JCM, available from http://www.maths.uq.edu.au/~gjm/mix_soft/EMMIX-JCM/. PMID:24983991
Local-Learning-Based Feature Selection for High-Dimensional Data Analysis
Sun, Yijun; Todorovic, Sinisa; Goodison, Steve
2012-01-01
This paper considers feature selection for data classification in the presence of a huge number of irrelevant features. We propose a new feature-selection algorithm that addresses several major issues with prior work, including problems with algorithm implementation, computational complexity, and solution accuracy. The key idea is to decompose an arbitrarily complex nonlinear problem into a set of locally linear ones through local learning, and then learn feature relevance globally within the large margin framework. The proposed algorithm is based on well-established machine learning and numerical analysis techniques, without making any assumptions about the underlying data distribution. It is capable of processing many thousands of features within minutes on a personal computer while maintaining a very high accuracy that is nearly insensitive to a growing number of irrelevant features. Theoretical analyses of the algorithm’s sample complexity suggest that the algorithm has a logarithmical sample complexity with respect to the number of features. Experiments on 11 synthetic and real-world data sets demonstrate the viability of our formulation of the feature-selection problem for supervised learning and the effectiveness of our algorithm. PMID:20634556
Automated Image Registration Using Geometrically Invariant Parameter Space Clustering (GIPSC)
Seedahmed, Gamal H.; Martucci, Louis M.
2002-09-01
Accurate, robust, and automatic image registration is a critical task in many typical applications, which employ multi-sensor and/or multi-date imagery information. In this paper we present a new approach to automatic image registration, which obviates the need for feature matching and solves for the registration parameters in a Hough-like approach. The basic idea underpinning, GIPSC methodology is to pair each data element belonging to two overlapping images, with all other data in each image, through a mathematical transformation. The results of pairing are encoded and exploited in histogram-like arrays as clusters of votes. Geometrically invariant features are adopted in this approach to reduce the computational complexity generated by the high dimensionality of the mathematical transformation. In this way, the problem of image registration is characterized, not by spatial or radiometric properties, but by the mathematical transformation that describes the geometrical relationship between the two images or more. While this approach does not require feature matching, it does permit recovery of matched features (e.g., points) as a useful by-product. The developed methodology incorporates uncertainty modeling using a least squares solution. Successful and promising experimental results of multi-date automatic image registration are reported in this paper.
The molecular matching problem
NASA Technical Reports Server (NTRS)
Kincaid, Rex K.
1993-01-01
Molecular chemistry contains many difficult optimization problems that have begun to attract the attention of optimizers in the Operations Research community. Problems including protein folding, molecular conformation, molecular similarity, and molecular matching have been addressed. Minimum energy conformations for simple molecular structures such as water clusters, Lennard-Jones microclusters, and short polypeptides have dominated the literature to date. However, a variety of interesting problems exist and we focus here on a molecular structure matching (MSM) problem.
A Nonparametric Bayesian Model for Nested Clustering.
Lee, Juhee; Müller, Peter; Zhu, Yitan; Ji, Yuan
2016-01-01
We propose a nonparametric Bayesian model for clustering where clusters of experimental units are determined by a shared pattern of clustering another set of experimental units. The proposed model is motivated by the analysis of protein activation data, where we cluster proteins such that all proteins in one cluster give rise to the same clustering of patients. That is, we define clusters of proteins by the way that patients group with respect to the corresponding protein activations. This is in contrast to (almost) all currently available models that use shared parameters in the sampling model to define clusters. This includes in particular model based clustering, Dirichlet process mixtures, product partition models, and more. We show results for two typical biostatistical inference problems that give rise to clustering. PMID:26519174
The control of high-dimensional chaos in time-delay systems to an arbitrary goal dynamics
NASA Astrophysics Data System (ADS)
Bünner, M. J.
1999-03-01
We present the control of high-dimensional chaos, with possibly a large number of positive Lyapunov exponents, of unknown time-delay systems to an arbitrary goal dynamics. We give an existence-and-uniqueness theorem for the control force. In the case of an unknown system, a formula to compute a model-based control force is derived. We give an example by demonstrating the control of the Mackey-Glass system toward a fixed point and a Rössler dynamics.
Okuno, Yuta; Small, Michael; Gotoda, Hiroshi
2015-04-01
We have examined the dynamics of self-excited thermoacoustic instability in a fundamentally and practically important gas-turbine model combustion system on the basis of complex network approaches. We have incorporated sophisticated complex networks consisting of cycle networks and phase space networks, neither of which has been considered in the areas of combustion physics and science. Pseudo-periodicity and high-dimensionality exist in the dynamics of thermoacoustic instability, including the possible presence of a clear power-law distribution and small-world-like nature. PMID:25933655
NASA Technical Reports Server (NTRS)
Benediktsson, J. A.; Swain, P. H.; Ersoy, O. K.
1993-01-01
Application of neural networks to classification of remote sensing data is discussed. Conventional two-layer backpropagation is found to give good results in classification of remote sensing data but is not efficient in training. A more efficient variant, based on conjugate-gradient optimization, is used for classification of multisource remote sensing and geographic data and very-high-dimensional data. The conjugate-gradient neural networks give excellent performance in classification of multisource data, but do not compare as well with statistical methods in classification of very-high-dimentional data.
Noe, F; Oswald, Marcus; Reinelt, Gerhard; Fischer, S.; Smith, Jeremy C
2006-01-01
The direct computation of rare transitions in high-dimensional dynamical systems such as biomolecules via numerical integration or Monte Carlo is limited by the sampling problem. Alternatively, the dynamics of these systems can be modeled by transition networks (TNs) which are weighted graphs whose edges represent transitions between stable states of the system. The computation of the globally best transition paths connecting two selected stable states is straightforward with available graph-theoretical methods. However, these methods require that the energy barriers of all TN edges be determined, which is often computationally infeasible for large systems. Here, we introduce energy-bounded TNs, in which the transition barriers are specified in terms of lower and upper bounds. We present algorithms permitting the determination of the globally best paths on these TNs while requiring the computation of only a small subset of the true transition barriers. Several variants of the algorithm are given which achieve improved performance, including a parallel version. The effectiveness of the approach is demonstrated by various benchmarks on random TNs and by computing the refolding pathways of a polypeptide: the best transition pathways between the alphaL helix, alphaR helix, and beta-hairpin conformations of the octaalanine (Ala8) molecule in aqueous solution.
Cost functions for pairwise data clustering
NASA Astrophysics Data System (ADS)
Angelini, L.; Nitti, L.; Pellicoro, M.; Stramaglia, S.
2001-07-01
Cost functions for non-hierarchical pairwise clustering are introduced, in the probabilistic autoencoder framework, by the request of maximal average similarity between input and the output of the autoencoder. Clustering is thus formulated as the problem of finding the ground state of Potts spins Hamiltonians. The partition, provided by this procedure, identifies clusters with dense connected regions in the data space.
NASA Technical Reports Server (NTRS)
1999-01-01
Penetrating 25,000 light-years of obscuring dust and myriad stars, NASA's Hubble Space Telescope has provided the clearest view yet of one of the largest young clusters of stars inside our Milky Way galaxy, located less than 100 light-years from the very center of the Galaxy. Having the equivalent mass greater than 10,000 stars like our sun, the monster cluster is ten times larger than typical young star clusters scattered throughout our Milky Way. It is destined to be ripped apart in just a few million years by gravitational tidal forces in the galaxy's core. But in its brief lifetime it shines more brightly than any other star cluster in the Galaxy. Quintuplet Cluster is 4 million years old. It has stars on the verge of blowing up as supernovae. It is the home of the brightest star seen in the galaxy, called the Pistol star. This image was taken in infrared light by Hubble's NICMOS camera in September 1997. The false colors correspond to infrared wavelengths. The galactic center stars are white, the red stars are enshrouded in dust or behind dust, and the blue stars are foreground stars between us and the Milky Way's center. The cluster is hidden from direct view behind black dust clouds in the constellation Sagittarius. If the cluster could be seen from earth it would appear to the naked eye as a 3rd magnitude star, 1/6th of a full moon's diameter apart.
NASA Astrophysics Data System (ADS)
Sjöstrand, Karl; Cardenas, Valerie A.; Larsen, Rasmus; Studholme, Colin
2008-03-01
Whole-brain morphometry denotes a group of methods with the aim of relating clinical and cognitive measurements to regions of the brain. Typically, such methods require the statistical analysis of a data set with many variables (voxels and exogenous variables) paired with few observations (subjects). A common approach to this ill-posed problem is to analyze each spatial variable separately, dividing the analysis into manageable subproblems. A disadvantage of this method is that the correlation structure of the spatial variables is not taken into account. This paper investigates the use of ridge regression to address this issue, allowing for a gradual introduction of correlation information into the model. We make the connections between ridge regression and voxel-wise procedures explicit and discuss relations to other statistical methods. Results are given on an in-vivo data set of deformation based morphometry from a study of cognitive decline in an elderly population.
Zawadzka-Kazimierczuk, Anna; Koźmiński, Wiktor; Billeter, Martin
2012-09-01
While NMR studies of proteins typically aim at structure, dynamics or interactions, resonance assignments represent in almost all cases the initial step of the analysis. With increasing complexity of the NMR spectra, for example due to decreasing extent of ordered structure, this task often becomes both difficult and time-consuming, and the recording of high-dimensional data with high-resolution may be essential. Random sampling of the evolution time space, combined with sparse multidimensional Fourier transform (SMFT), allows for efficient recording of very high dimensional spectra (≥4 dimensions) while maintaining high resolution. However, the nature of this data demands for automation of the assignment process. Here we present the program TSAR (Tool for SMFT-based Assignment of Resonances), which exploits all advantages of SMFT input. Moreover, its flexibility allows to process data from any type of experiments that provide sequential connectivities. The algorithm was tested on several protein samples, including a disordered 81-residue fragment of the δ subunit of RNA polymerase from Bacillus subtilis containing various repetitive sequences. For our test examples, TSAR achieves a high percentage of assigned residues without any erroneous assignments. PMID:22806130
NASA Astrophysics Data System (ADS)
McNamara, D. H.
2001-03-01
We examine the luminosity levels of the main-sequence turnoffs, MTOv, and horizontal branches, Mv(HB), in 16 globular clusters. An entirely new approach of inferring the luminosity levels by utilizing high-amplitude δ Scuti variables (HADS) is introduced. When the MTOv values are compared with theoretical values inferred from models, we find all 16 clusters (metal-strong to metal-poor) are coeval with an average age of ~11.3 Gyr. A considerable scatter of Mv(HB) values of clusters at similar [Fe/H] values is found. A trend for clusters with blue horizontal branches to have brighter Mv(HB) than clusters with blue-red horizontal branches is suggested by the data. The Mv(HB) values appear to depend on another or other parameters in addition to the [Fe/H] values. In spite of this problem, we derive an equation relating Mv(HB) values of globular clusters to their [Fe/H] values. We also derive an equation relating the MTOv values of clusters to their [Fe/H] values. Both of these equations can be utilized to find cluster distances. The distance modulus of the LMC is found to be 18.66 from the VTO values of three LMC globular clusters; RR Lyrae stars in seven globular clusters yield 18.61, and RR Lyrae stars in the LMC bar yield 18.64.
GPU-based Multilevel Clustering.
Chiosa, Iurie; Kolb, Andreas
2010-04-01
The processing power of parallel co-processors like the Graphics Processing Unit (GPU) are dramatically increasing. However, up until now only a few approaches have been presented to utilize this kind of hardware for mesh clustering purposes. In this paper we introduce a Multilevel clustering technique designed as a parallel algorithm and solely implemented on the GPU. Our formulation uses the spatial coherence present in the cluster optimization and hierarchical cluster merging to significantly reduce the number of comparisons in both parts . Our approach provides a fast, high quality and complete clustering analysis. Furthermore, based on the original concept we present a generalization of the method to data clustering. All advantages of the meshbased techniques smoothly carry over to the generalized clustering approach. Additionally, this approach solves the problem of the missing topological information inherent to general data clustering and leads to a Local Neighbors k-means algorithm. We evaluate both techniques by applying them to Centroidal Voronoi Diagram (CVD) based clustering. Compared to classical approaches, our techniques generate results with at least the same clustering quality. Our technique proves to scale very well, currently being limited only by the available amount of graphics memory. PMID:20421676
NASA Astrophysics Data System (ADS)
Miller, Christopher J. Miller
2012-03-01
There are many examples of clustering in astronomy. Stars in our own galaxy are often seen as being gravitationally bound into tight globular or open clusters. The Solar System's Trojan asteroids cluster at the gravitational Langrangian in front of Jupiter’s orbit. On the largest of scales, we find gravitationally bound clusters of galaxies, the Virgo cluster (in the constellation of Virgo at a distance of ˜50 million light years) being a prime nearby example. The Virgo cluster subtends an angle of nearly 8◦ on the sky and is known to contain over a thousand member galaxies. Galaxy clusters play an important role in our understanding of theUniverse. Clusters exist at peaks in the three-dimensional large-scale matter density field. Their sky (2D) locations are easy to detect in astronomical imaging data and their mean galaxy redshifts (redshift is related to the third spatial dimension: distance) are often better (spectroscopically) and cheaper (photometrically) when compared with the entire galaxy population in large sky surveys. Photometric redshift (z) [Photometric techniques use the broad band filter magnitudes of a galaxy to estimate the redshift. Spectroscopic techniques use the galaxy spectra and emission/absorption line features to measure the redshift] determinations of galaxies within clusters are accurate to better than delta_z = 0.05 [7] and when studied as a cluster population, the central galaxies form a line in color-magnitude space (called the the E/S0 ridgeline and visible in Figure 16.3) that contains galaxies with similar stellar populations [15]. The shape of this E/S0 ridgeline enables astronomers to measure the cluster redshift to within delta_z = 0.01 [23]. The most accurate cluster redshift determinations come from spectroscopy of the member galaxies, where only a fraction of the members need to be spectroscopically observed [25,42] to get an accurate redshift to the whole system. If light traces mass in the Universe, then the locations
ERIC Educational Resources Information Center
Pottawattamie County School System, Council Bluffs, IA.
The 15 occupational clusters (transportation, fine arts and humanities, communications and media, personal service occupations, construction, hospitality and recreation, health occupations, marine science occupations, consumer and homemaking-related occupations, agribusiness and natural resources, environment, public service, business and office…
NASA Astrophysics Data System (ADS)
Wagstaff, Kiri L.
2012-03-01
On obtaining a new data set, the researcher is immediately faced with the challenge of obtaining a high-level understanding from the observations. What does a typical item look like? What are the dominant trends? How many distinct groups are included in the data set, and how is each one characterized? Which observable values are common, and which rarely occur? Which items stand out as anomalies or outliers from the rest of the data? This challenge is exacerbated by the steady growth in data set size [11] as new instruments push into new frontiers of parameter space, via improvements in temporal, spatial, and spectral resolution, or by the desire to "fuse" observations from different modalities and instruments into a larger-picture understanding of the same underlying phenomenon. Data clustering algorithms provide a variety of solutions for this task. They can generate summaries, locate outliers, compress data, identify dense or sparse regions of feature space, and build data models. It is useful to note up front that "clusters" in this context refer to groups of items within some descriptive feature space, not (necessarily) to "galaxy clusters" which are dense regions in physical space. The goal of this chapter is to survey a variety of data clustering methods, with an eye toward their applicability to astronomical data analysis. In addition to improving the individual researcher’s understanding of a given data set, clustering has led directly to scientific advances, such as the discovery of new subclasses of stars [14] and gamma-ray bursts (GRBs) [38]. All clustering algorithms seek to identify groups within a data set that reflect some observed, quantifiable structure. Clustering is traditionally an unsupervised approach to data analysis, in the sense that it operates without any direct guidance about which items should be assigned to which clusters. There has been a recent trend in the clustering literature toward supporting semisupervised or constrained
Donchev, Todor I.; Petrov, Ivan G.
2011-05-31
Described herein is an apparatus and a method for producing atom clusters based on a gas discharge within a hollow cathode. The hollow cathode includes one or more walls. The one or more walls define a sputtering chamber within the hollow cathode and include a material to be sputtered. A hollow anode is positioned at an end of the sputtering chamber, and atom clusters are formed when a gas discharge is generated between the hollow anode and the hollow cathode.
Bustamante, Carlos D.; Valero-Cuevas, Francisco J.
2010-01-01
The field of complex biomechanical modeling has begun to rely on Monte Carlo techniques to investigate the effects of parameter variability and measurement uncertainty on model outputs, search for optimal parameter combinations, and define model limitations. However, advanced stochastic methods to perform data-driven explorations, such as Markov chain Monte Carlo (MCMC), become necessary as the number of model parameters increases. Here, we demonstrate the feasibility and, what to our knowledge is, the first use of an MCMC approach to improve the fitness of realistically large biomechanical models. We used a Metropolis–Hastings algorithm to search increasingly complex parameter landscapes (3, 8, 24, and 36 dimensions) to uncover underlying distributions of anatomical parameters of a “truth model” of the human thumb on the basis of simulated kinematic data (thumbnail location, orientation, and linear and angular velocities) polluted by zero-mean, uncorrelated multivariate Gaussian “measurement noise.” Driven by these data, ten Markov chains searched each model parameter space for the subspace that best fit the data (posterior distribution). As expected, the convergence time increased, more local minima were found, and marginal distributions broadened as the parameter space complexity increased. In the 36-D scenario, some chains found local minima but the majority of chains converged to the true posterior distribution (confirmed using a cross-validation dataset), thus demonstrating the feasibility and utility of these methods for realistically large biomechanical problems. PMID:19272906
He, Ling Yan; Wang, Tie-Jun; Wang, Chuan
2016-07-11
High-dimensional quantum system provides a higher capacity of quantum channel, which exhibits potential applications in quantum information processing. However, high-dimensional universal quantum logic gates is difficult to achieve directly with only high-dimensional interaction between two quantum systems and requires a large number of two-dimensional gates to build even a small high-dimensional quantum circuits. In this paper, we propose a scheme to implement a general controlled-flip (CF) gate where the high-dimensional single photon serve as the target qudit and stationary qubits work as the control logic qudit, by employing a three-level Λ-type system coupled with a whispering-gallery-mode microresonator. In our scheme, the required number of interaction times between the photon and solid state system reduce greatly compared with the traditional method which decomposes the high-dimensional Hilbert space into 2-dimensional quantum space, and it is on a shorter temporal scale for the experimental realization. Moreover, we discuss the performance and feasibility of our hybrid CF gate, concluding that it can be easily extended to a 2n-dimensional case and it is feasible with current technology. PMID:27410818
Orbit Clustering Based on Transfer Cost
NASA Technical Reports Server (NTRS)
Gustafson, Eric D.; Arrieta-Camacho, Juan J.; Petropoulos, Anastassios E.
2013-01-01
We propose using cluster analysis to perform quick screening for combinatorial global optimization problems. The key missing component currently preventing cluster analysis from use in this context is the lack of a useable metric function that defines the cost to transfer between two orbits. We study several proposed metrics and clustering algorithms, including k-means and the expectation maximization algorithm. We also show that proven heuristic methods such as the Q-law can be modified to work with cluster analysis.
Clustering of High Throughput Gene Expression Data
Pirim, Harun; Ekşioğlu, Burak; Perkins, Andy; Yüceer, Çetin
2012-01-01
High throughput biological data need to be processed, analyzed, and interpreted to address problems in life sciences. Bioinformatics, computational biology, and systems biology deal with biological problems using computational methods. Clustering is one of the methods used to gain insight into biological processes, particularly at the genomics level. Clearly, clustering can be used in many areas of biological data analysis. However, this paper presents a review of the current clustering algorithms designed especially for analyzing gene expression data. It is also intended to introduce one of the main problems in bioinformatics - clustering gene expression data - to the operations research community. PMID:23144527
A high-dimensional joint model for longitudinal outcomes of different nature.
Faes, Christel; Aerts, Marc; Molenberghs, Geert; Geys, Helena; Teuns, Greet; Bijnens, Luc
2008-09-30
In repeated dose-toxicity studies, many outcomes are repeatedly measured on the same animal to study the toxicity of a compound of interest. This is only one example in which one is confronted with the analysis of many outcomes, possibly of a different type. Probably the most common situation is that of an amalgamation of continuous and categorical outcomes. A possible approach towards the joint analysis of two longitudinal outcomes of a different nature is the use of random-effects models (Models for Discrete Longitudinal Data. Springer Series in Statistics. Springer: New York, 2005). Although a random-effects model can easily be extended to jointly model many outcomes of a different nature, computational problems arise as the number of outcomes increases. To avoid maximization of the full likelihood expression, Fieuws and Verbeke (Biometrics 2006; 62:424-431) proposed a pairwise modeling strategy in which all possible pairs are modeled separately, using a mixed model, yielding several different estimates for the same parameters. These latter estimates are then combined into a single set of estimates. Also inference, based on pseudo-likelihood principles, is indirectly derived from the separate analyses. In this paper, we extend the approach of Fieuws and Verbeke (Biometrics 2006; 62:424-431) in two ways: the method is applied to different types of outcomes and the full pseudo-likelihood expression is maximized at once, leading directly to unique estimates as well as direct application of pseudo-likelihood inference. This is very appealing when interested in hypothesis testing. The method is applied to data from a repeated dose-toxicity study designed for the evaluation of the neurofunctional effects of a psychotrophic drug. The relative merits of both methods are discussed. PMID:18551509
Detecting alternative graph clusterings.
Mandala, Supreet; Kumara, Soundar; Yao, Tao
2012-07-01
The problem of graph clustering or community detection has enjoyed a lot of attention in complex networks literature. A quality function, modularity, quantifies the strength of clustering and on maximization yields sensible partitions. However, in most real world networks, there are an exponentially large number of near-optimal partitions with some being very different from each other. Therefore, picking an optimal clustering among the alternatives does not provide complete information about network topology. To tackle this problem, we propose a graph perturbation scheme which can be used to identify an ensemble of near-optimal and diverse clusterings. We establish analytical properties of modularity function under the perturbation which ensures diversity. Our approach is algorithm independent and therefore can leverage any of the existing modularity maximizing algorithms. We numerically show that our methodology can systematically identify very different partitions on several existing data sets. The knowledge of diverse partitions sheds more light into the topological organization and helps gain a more complete understanding of the underlying complex network. PMID:23005495
Detecting alternative graph clusterings
NASA Astrophysics Data System (ADS)
Mandala, Supreet; Kumara, Soundar; Yao, Tao
2012-07-01
The problem of graph clustering or community detection has enjoyed a lot of attention in complex networks literature. A quality function, modularity, quantifies the strength of clustering and on maximization yields sensible partitions. However, in most real world networks, there are an exponentially large number of near-optimal partitions with some being very different from each other. Therefore, picking an optimal clustering among the alternatives does not provide complete information about network topology. To tackle this problem, we propose a graph perturbation scheme which can be used to identify an ensemble of near-optimal and diverse clusterings. We establish analytical properties of modularity function under the perturbation which ensures diversity. Our approach is algorithm independent and therefore can leverage any of the existing modularity maximizing algorithms. We numerically show that our methodology can systematically identify very different partitions on several existing data sets. The knowledge of diverse partitions sheds more light into the topological organization and helps gain a more complete understanding of the underlying complex network.
Pseudospectral sampling of Gaussian basis sets as a new avenue to high-dimensional quantum dynamics
NASA Astrophysics Data System (ADS)
Heaps, Charles
This thesis presents a novel approach to modeling quantum molecular dynamics (QMD). Theoretical approaches to QMD are essential to understanding and predicting chemical reactivity and spectroscopy. We implement a method based on a trajectory-guided basis set. In this case, the nuclei are propagated in time using classical mechanics. Each nuclear configuration corresponds to a basis function in the quantum mechanical expansion. Using the time-dependent configurations as a basis set, we are able to evolve in time using relatively little information at each time step. We use a basis set of moving frozen (time-independent width) Gaussian functions that are well-known to provide a simple and efficient basis set for nuclear dynamics. We introduce a new perspective to trajectory-guided Gaussian basis sets based on existing numerical methods. The distinction is based on the Galerkin and collocation methods. In the former, the basis set is tested using basis functions, projecting the solution onto the functional space of the problem and requiring integration over all space. In the collocation method, the Dirac delta function tests the basis set, projecting the solution onto discrete points in space. This effectively reduces the integral evaluation to function evaluation, a fundamental characteristic of pseudospectral methods. We adopt this idea for independent trajectory-guided Gaussian basis functions. We investigate a series of anharmonic vibrational models describing dynamics in up to six dimensions. The pseudospectral sampling is found to be as accurate as full integral evaluation, while the former method is fully general and integration is only possible on very particular model potential energy surfaces. Nonadiabatic dynamics are also investigated in models of photodissociation and collinear triatomic vibronic coupling. Using Ehrenfest trajectories to guide the basis set on multiple surfaces, we observe convergence to exact results using hundreds of basis functions
Li, Ke; Liu, Yi; Wang, Quanxin; Wu, Yalei; Song, Shimin; Sun, Yi; Liu, Tengchong; Wang, Jun; Li, Yang; Du, Shaoyi
2015-01-01
This paper proposes a novel multi-label classification method for resolving the spacecraft electrical characteristics problems which involve many unlabeled test data processing, high-dimensional features, long computing time and identification of slow rate. Firstly, both the fuzzy c-means (FCM) offline clustering and the principal component feature extraction algorithms are applied for the feature selection process. Secondly, the approximate weighted proximal support vector machine (WPSVM) online classification algorithms is used to reduce the feature dimension and further improve the rate of recognition for electrical characteristics spacecraft. Finally, the data capture contribution method by using thresholds is proposed to guarantee the validity and consistency of the data selection. The experimental results indicate that the method proposed can obtain better data features of the spacecraft electrical characteristics, improve the accuracy of identification and shorten the computing time effectively. PMID:26544549
Li, Ke; Liu, Yi; Wang, Quanxin; Wu, Yalei; Song, Shimin; Sun, Yi; Liu, Tengchong; Wang, Jun; Li, Yang; Du, Shaoyi
2015-01-01
This paper proposes a novel multi-label classification method for resolving the spacecraft electrical characteristics problems which involve many unlabeled test data processing, high-dimensional features, long computing time and identification of slow rate. Firstly, both the fuzzy c-means (FCM) offline clustering and the principal component feature extraction algorithms are applied for the feature selection process. Secondly, the approximate weighted proximal support vector machine (WPSVM) online classification algorithms is used to reduce the feature dimension and further improve the rate of recognition for electrical characteristics spacecraft. Finally, the data capture contribution method by using thresholds is proposed to guarantee the validity and consistency of the data selection. The experimental results indicate that the method proposed can obtain better data features of the spacecraft electrical characteristics, improve the accuracy of identification and shorten the computing time effectively. PMID:26544549
Ride, Jemimah; Rowe, Heather; Wynter, Karen; Fisher, Jane; Lorgelly, Paula
2014-01-01
Introduction Postnatal mental health problems, which are an international public health priority, are a suitable target for preventive approaches. The financial burden of these disorders is borne across sectors in society, including health, early childhood, education, justice and the workforce. This paper describes the planned economic evaluation of What Were We Thinking, a psychoeducational intervention for the prevention of postnatal mental health problems in first-time mothers. Methods and analysis The evaluation will be conducted alongside a cluster-randomised controlled trial of its clinical effectiveness. Cost-effectiveness and costs-utility analyses will be conducted, resulting in estimates of cost per percentage point reduction in combined 30-day prevalence of depression, anxiety and adjustment disorders and cost per quality-adjusted life year gained. Uncertainty surrounding these estimates will be addressed using non-parametric bootstrapping and represented using cost-effectiveness acceptability curves. Additional cost analyses relevant for implementation will also be conducted. Modelling will be employed to estimate longer term cost-effectiveness if the intervention is found to be clinically effective during the period of the trial. Ethics and dissemination Approval to conduct the study was granted by the Southern Health (now Monash Health) Human Research Ethics Committee (24 April 2013; 11388B). The study was registered with the Monash University Human Research Ethics Committee (30 April 2013; CF12/1022-2012000474). The Education and Policy Research Committee, Victorian Government Department of Education and Early Childhood Development approved the study (22 March 2012; 2012_001472). Use of the EuroQol was registered with the EuroQol Group; 16 August 2012. Trial registration number The trial was registered with the Australian New Zealand Clinical Trials Registry on 7 May 2012 (registration number ACTRN12613000506796). PMID:25280810
NASA Astrophysics Data System (ADS)
Miao, Yan-Gang; Xu, Zhen-Ming
2016-04-01
Considering non-Gaussian smeared matter distributions, we investigate the thermodynamic behaviors of the noncommutative high-dimensional Schwarzschild-Tangherlini anti-de Sitter black hole, and we obtain the condition for the existence of extreme black holes. We indicate that the Gaussian smeared matter distribution, which is a special case of non-Gaussian smeared matter distributions, is not applicable for the six- and higher-dimensional black holes due to the hoop conjecture. In particular, the phase transition is analyzed in detail. Moreover, we point out that the Maxwell equal area law holds for the noncommutative black hole whose Hawking temperature is within a specific range, but fails for one whose the Hawking temperature is beyond this range.
NASA Astrophysics Data System (ADS)
Song, Yunquan; Lin, Lu; Jian, Ling
2016-07-01
Single-index varying-coefficient model is an important mathematical modeling method to model nonlinear phenomena in science and engineering. In this paper, we develop a variable selection method for high-dimensional single-index varying-coefficient models using a shrinkage idea. The proposed procedure can simultaneously select significant nonparametric components and parametric components. Under defined regularity conditions, with appropriate selection of tuning parameters, the consistency of the variable selection procedure and the oracle property of the estimators are established. Moreover, due to the robustness of the check loss function to outliers in the finite samples, our proposed variable selection method is more robust than the ones based on the least squares criterion. Finally, the method is illustrated with numerical simulations.
NASA Astrophysics Data System (ADS)
Méndez Berhondo, Adolfo L.; Zlobec, Paolo; Díaz Rodríguez, Ana K.
2015-06-01
We examined the dynamic characteristics of the time series regarding a group of pulsations in broadband spectrum at metric waveband solar radio emission. The data were recorded with the radio polarimeter of the INAF-Trieste Astronomical Observatory at July 17, 2002. The aim is to determine if the underlying process of these pulsations can be describe as a periodic, deterministic chaos or stochastic. The pulsations under inquiry in present paper are rather rare, as we found only one example of similar ones reported in the literature. Unlike most of the previously works where the analyses was done to a broadband pulsating events at one single frequency, we examine the pulsation event as it evolves both in time and in frequency. We found that the dynamics underlying the generation of pulsations can be characterized by a deterministic chaotic process which increases the dimension of chaos with frequency showing a transition from low-dimensional to high-dimensional deterministic chaotic system.
Active matter clusters at interfaces.
NASA Astrophysics Data System (ADS)
Copenhagen, Katherine; Gopinathan, Ajay
2016-03-01
Collective and directed motility or swarming is an emergent phenomenon displayed by many self-organized assemblies of active biological matter such as clusters of embryonic cells during tissue development, cancerous cells during tumor formation and metastasis, colonies of bacteria in a biofilm, or even flocks of birds and schools of fish at the macro-scale. Such clusters typically encounter very heterogeneous environments. What happens when a cluster encounters an interface between two different environments has implications for its function and fate. Here we study this problem by using a mathematical model of a cluster that treats it as a single cohesive unit that moves in two dimensions by exerting a force/torque per unit area whose magnitude depends on the nature of the local environment. We find that low speed (overdamped) clusters encountering an interface with a moderate difference in properties can lead to refraction or even total internal reflection of the cluster. For large speeds (underdamped), where inertia dominates, the clusters show more complex behaviors crossing the interface multiple times and deviating from the predictable refraction and reflection for the low velocity clusters. We then present an extreme limit of the model in the absence of rotational damping where clusters can become stuck spiraling along the interface or move in large circular trajectories after leaving the interface. Our results show a wide range of behaviors that occur when collectively moving active biological matter moves across interfaces and these insights can be used to control motion by patterning environments.
Constrained Clustering With Imperfect Oracles.
Zhu, Xiatian; Loy, Chen Change; Gong, Shaogang
2016-06-01
While clustering is usually an unsupervised operation, there are circumstances where we have access to prior belief that pairs of samples should (or should not) be assigned with the same cluster. Constrained clustering aims to exploit this prior belief as constraint (or weak supervision) to influence the cluster formation so as to obtain a data structure more closely resembling human perception. Two important issues remain open: 1) how to exploit sparse constraints effectively and 2) how to handle ill-conditioned/noisy constraints generated by imperfect oracles. In this paper, we present a novel pairwise similarity measure framework to address the above issues. Specifically, in contrast to existing constrained clustering approaches that blindly rely on all features for constraint propagation, our approach searches for neighborhoods driven by discriminative feature selection for more effective constraint diffusion. Crucially, we formulate a novel approach to handling the noisy constraint problem, which has been unrealistically ignored in the constrained clustering literature. Extensive comparative results show that our method is superior to the state-of-the-art constrained clustering approaches and can generally benefit existing pairwise similarity-based data clustering algorithms, such as spectral clustering and affinity propagation. PMID:25622327
HCsnip: An R Package for Semi-supervised Snipping of the Hierarchical Clustering Tree.
Obulkasim, Askar; van de Wiel, Mark A
2015-01-01
Hierarchical clustering (HC) is one of the most frequently used methods in computational biology in the analysis of high-dimensional genomics data. Given a data set, HC outputs a binary tree leaves of which are the data points and internal nodes represent clusters of various sizes. Normally, a fixed-height cut on the HC tree is chosen, and each contiguous branch of data points below that height is considered as a separate cluster. However, the fixed-height branch cut may not be ideal in situations where one expects a complicated tree structure with nested clusters. Furthermore, due to lack of utilization of related background information in selecting the cutoff, induced clusters are often difficult to interpret. This paper describes a novel procedure that aims to automatically extract meaningful clusters from the HC tree in a semi-supervised way. The procedure is implemented in the R package HCsnip available from Bioconductor. Rather than cutting the HC tree at a fixed-height, HCsnip probes the various way of snipping, possibly at variable heights, to tease out hidden clusters ensconced deep down in the tree. The cluster extraction process utilizes, along with the data set from which the HC tree is derived, commonly available background information. Consequently, the extracted clusters are highly reproducible and robust against various sources of variations that "haunted" high-dimensional genomics data. Since the clustering process is guided by the background information, clusters are easy to interpret. Unlike existing packages, no constraint is placed on the data type on which clustering is desired. Particularly, the package accepts patient follow-up data for guiding the cluster extraction process. To our knowledge, HCsnip is the first package that is able to decomposes the HC tree into clusters with piecewise snipping under the guidance of patient time-to-event information. Our implementation of the semi-supervised HC tree snipping framework is generic, and can
Hinselmann, Georg; Rosenbaum, Lars; Jahn, Andreas; Fechner, Nikolas; Ostermann, Claude; Zell, Andreas
2011-02-28
The goal of this study was to adapt a recently proposed linear large-scale support vector machine to large-scale binary cheminformatics classification problems and to assess its performance on various benchmarks using virtual screening performance measures. We extended the large-scale linear support vector machine library LIBLINEAR with state-of-the-art virtual high-throughput screening metrics to train classifiers on whole large and unbalanced data sets. The formulation of this linear support machine has an excellent performance if applied to high-dimensional sparse feature vectors. An additional advantage is the average linear complexity in the number of non-zero features of a prediction. Nevertheless, the approach assumes that a problem is linearly separable. Therefore, we conducted an extensive benchmarking to evaluate the performance on large-scale problems up to a size of 175000 samples. To examine the virtual screening performance, we determined the chemotype clusters using Feature Trees and integrated this information to compute weighted AUC-based performance measures and a leave-cluster-out cross-validation. We also considered the BEDROC score, a metric that was suggested to tackle the early enrichment problem. The performance on each problem was evaluated by a nested cross-validation and a nested leave-cluster-out cross-validation. We compared LIBLINEAR against a Naïve Bayes classifier, a random decision forest classifier, and a maximum similarity ranking approach. These reference approaches were outperformed in a direct comparison by LIBLINEAR. A comparison to literature results showed that the LIBLINEAR performance is competitive but without achieving results as good as the top-ranked nonlinear machines on these benchmarks. However, considering the overall convincing performance and computation time of the large-scale support vector machine, the approach provides an excellent alternative to established large-scale classification approaches. PMID
Cluster beam analysis via photoionization
Grover, J.R. ); Herron, W.J.; Coolbaugh, M.T.; Peifer, W.R.; Garvey, J.F. )
1991-08-22
A photoionization method for quantitatively analyzing the neutral products of free jet expansions is described. The basic principle is to measure the yield of an ion characterization of each component cluster at a photon energy just below that at which production of the same ion from larger clusters can be detected. Since there is then no problem with fragmentation, the beam density of each neutral cluster can be measured in the presence of larger clusters. Although these measurements must be done in the test ions' onset regions where their yields are often quite small, the technique is made highly practicable by the large intensities of widely tunable vacuum-ultraviolet synchrotron light now available at electron storage rings. As an example, the method is applied to the analysis of cluster beams collimated from the free jet expansion of a 200:1 ammonia-chlorobenzene mixture.
Systolic architecture for heirarchical clustering
Ku, L.C.
1984-01-01
Several hierarchical clustering methods (including single-linkage complete-linkage, centroid, and absolute overlap methods) are reviewed. The absolute overlap clustering method is selected for the design of systolic architecture mainly due to its simplicity. Two versions of systolic architectures for the absolute overlap hierarchical clustering algorithm are proposed: one-dimensional version that leads to the development of a two dimensional version which fully takes advantage of the underlying data structure of the problems. The two dimensional systolic architecture can achieve a time complexity of O(m + n) in comparison with the conventional computer implementation of a time complexity of O(m/sup 2*/n).
Fuzzy and hard clustering analysis for thyroid disease.
Azar, Ahmad Taher; El-Said, Shaimaa Ahmed; Hassanien, Aboul Ella
2013-07-01
Thyroid hormones produced by the thyroid gland help regulation of the body's metabolism. A variety of methods have been proposed in the literature for thyroid disease classification. As far as we know, clustering techniques have not been used in thyroid diseases data set so far. This paper proposes a comparison between hard and fuzzy clustering algorithms for thyroid diseases data set in order to find the optimal number of clusters. Different scalar validity measures are used in comparing the performances of the proposed clustering systems. To demonstrate the performance of each algorithm, the feature values that represent thyroid disease are used as input for the system. Several runs are carried out and recorded with a different number of clusters being specified for each run (between 2 and 11), so as to establish the optimum number of clusters. To find the optimal number of clusters, the so-called elbow criterion is applied. The experimental results revealed that for all algorithms, the elbow was located at c=3. The clustering results for all algorithms are then visualized by the Sammon mapping method to find a low-dimensional (normally 2D or 3D) representation of a set of points distributed in a high dimensional pattern space. At the end of this study, some recommendations are formulated to improve determining the actual number of clusters present in the data set. PMID:23357404
Winters-Hilt, Stephen; Merat, Sam
2007-01-01
Background Support Vector Machines (SVMs) provide a powerful method for classification (supervised learning). Use of SVMs for clustering (unsupervised learning) is now being considered in a number of different ways. Results An SVM-based clustering algorithm is introduced that clusters data with no a priori knowledge of input classes. The algorithm initializes by first running a binary SVM classifier against a data set with each vector in the set randomly labelled, this is repeated until an initial convergence occurs. Once this initialization step is complete, the SVM confidence parameters for classification on each of the training instances can be accessed. The lowest confidence data (e.g., the worst of the mislabelled data) then has its' labels switched to the other class label. The SVM is then re-run on the data set (with partly re-labelled data) and is guaranteed to converge in this situation since it converged previously, and now it has fewer data points to carry with mislabelling penalties. This approach appears to limit exposure to the local minima traps that can occur with other approaches. Thus, the algorithm then improves on its weakly convergent result by SVM re-training after each re-labeling on the worst of the misclassified vectors – i.e., those feature vectors with confidence factor values beyond some threshold. The repetition of the above process improves the accuracy, here a measure of separability, until there are no misclassifications. Variations on this type of clustering approach are shown. Conclusion Non-parametric SVM-based clustering methods may allow for much improved performance over parametric approaches, particularly if they can be designed to inherit the strengths of their supervised SVM counterparts. PMID:18047717
NASA Technical Reports Server (NTRS)
Socolovsky, Eduardo A.; Bushnell, Dennis M. (Technical Monitor)
2002-01-01
The cosine or correlation measures of similarity used to cluster high dimensional data are interpreted as projections, and the orthogonal components are used to define a complementary dissimilarity measure to form a similarity-dissimilarity measure pair. Using a geometrical approach, a number of properties of this pair is established. This approach is also extended to general inner-product spaces of any dimension. These properties include the triangle inequality for the defined dissimilarity measure, error estimates for the triangle inequality and bounds on both measures that can be obtained with a few floating-point operations from previously computed values of the measures. The bounds and error estimates for the similarity and dissimilarity measures can be used to reduce the computational complexity of clustering algorithms and enhance their scalability, and the triangle inequality allows the design of clustering algorithms for high dimensional distributed data.
Two generalizations of Kohonen clustering
NASA Technical Reports Server (NTRS)
Bezdek, James C.; Pal, Nikhil R.; Tsao, Eric C. K.
1993-01-01
The relationship between the sequential hard c-means (SHCM), learning vector quantization (LVQ), and fuzzy c-means (FCM) clustering algorithms is discussed. LVQ and SHCM suffer from several major problems. For example, they depend heavily on initialization. If the initial values of the cluster centers are outside the convex hull of the input data, such algorithms, even if they terminate, may not produce meaningful results in terms of prototypes for cluster representation. This is due in part to the fact that they update only the winning prototype for every input vector. The impact and interaction of these two families with Kohonen's self-organizing feature mapping (SOFM), which is not a clustering method, but which often leads ideas to clustering algorithms is discussed. Then two generalizations of LVQ that are explicitly designed as clustering algorithms are presented; these algorithms are referred to as generalized LVQ = GLVQ; and fuzzy LVQ = FLVQ. Learning rules are derived to optimize an objective function whose goal is to produce 'good clusters'. GLVQ/FLVQ (may) update every node in the clustering net for each input vector. Neither GLVQ nor FLVQ depends upon a choice for the update neighborhood or learning rate distribution - these are taken care of automatically. Segmentation of a gray tone image is used as a typical application of these algorithms to illustrate the performance of GLVQ/FLVQ.
Active matter clusters at interfaces
NASA Astrophysics Data System (ADS)
Copenhagen, Katherine; Gopinathan, Ajay
Collective and directed motility or swarming is an emergent phenomenon displayed by many self-organized assemblies of active biological matter such as clusters of embryonic cells during tissue development and flocks of birds. Such clusters typically encounter very heterogeneous environments. What happens when a cluster encounters an interface between two different environments has implications for its function and fate. Here we study this problem by using a mathematical model of a cluster that treats it as a single cohesive unit whose movement depends on the nature of the local environment. We find that low speed clusters which exert forces but no active torques, encountering an interface with a moderate difference in properties can lead to refraction or even total internal reflection of the cluster. For large speeds and clusters with active torques, they show more complex behaviors crossing the interface multiple times, becoming trapped at the interface and deviating from the predictable refraction and reflection of the low velocity clusters. Our results show a wide range of behaviors that occur when collectively moving active biological matter moves across interfaces and these insights can be used to control motion by patterning environments.
Cluster tidal fields: Effects on disk galaxies
NASA Technical Reports Server (NTRS)
Valluri, Monica
1993-01-01
A variety of observations of galaxies in clusters indicate that the gas in these galaxies is strongly affected by the cluster environment. We present results of a study of the dynamical effects of the mean cluster tidal field on a disk galaxy as it falls into a cluster for the first time on a bound orbit with constant angular momentum (Valluri 1992). The problem is studied in the restricted 3-body framework. The cluster is modelled by a modified Hubble potential and the disk galaxy is modelled as a flattened spheroid.
Multitask spectral clustering by exploring intertask correlation.
Yang, Yang; Ma, Zhigang; Yang, Yi; Nie, Feiping; Shen, Heng Tao
2015-05-01
Clustering, as one of the most classical research problems in pattern recognition and data mining, has been widely explored and applied to various applications. Due to the rapid evolution of data on the Web, more emerging challenges have been posed on traditional clustering techniques: 1) correlations among related clustering tasks and/or within individual task are not well captured; 2) the problem of clustering out-of-sample data is seldom considered; and 3) the discriminative property of cluster label matrix is not well explored. In this paper, we propose a novel clustering model, namely multitask spectral clustering (MTSC), to cope with the above challenges. Specifically, two types of correlations are well considered: 1) intertask clustering correlation, which refers the relations among different clustering tasks and 2) intratask learning correlation, which enables the processes of learning cluster labels and learning mapping function to reinforce each other. We incorporate a novel l2,p -norm regularizer to control the coherence of all the tasks based on an assumption that related tasks should share a common low-dimensional representation. Moreover, for each individual task, an explicit mapping function is simultaneously learnt for predicting cluster labels by mapping features to the cluster label matrix. Meanwhile, we show that the learning process can naturally incorporate discriminative information to further improve clustering performance. We explore and discuss the relationships between our proposed model and several representative clustering techniques, including spectral clustering, k -means and discriminative k -means. Extensive experiments on various real-world datasets illustrate the advantage of the proposed MTSC model compared to state-of-the-art clustering approaches. PMID:25252288
Cluster synchronization induced by one-node clusters in networks with asymmetric negative couplings
Zhang, Jianbao; Ma, Zhongjun; Zhang, Gang
2013-12-15
This paper deals with the problem of cluster synchronization in networks with asymmetric negative couplings. By decomposing the coupling matrix into three matrices, and employing Lyapunov function method, sufficient conditions are derived for cluster synchronization. The conditions show that the couplings of multi-node clusters from one-node clusters have beneficial effects on cluster synchronization. Based on the effects of the one-node clusters, an effective and universal control scheme is put forward for the first time. The obtained results may help us better understand the relation between cluster synchronization and cluster structures of the networks. The validity of the control scheme is confirmed through two numerical simulations, in a network with no cluster structure and in a scale-free network.
Sharma, Ashok; Podolsky, Robert; Zhao, Jieping; McIndoe, Richard A.
2009-01-01
Motivation: As the number of publically available microarray experiments increases, the ability to analyze extremely large datasets across multiple experiments becomes critical. There is a requirement to develop algorithms which are fast and can cluster extremely large datasets without affecting the cluster quality. Clustering is an unsupervised exploratory technique applied to microarray data to find similar data structures or expression patterns. Because of the high input/output costs involved and large distance matrices calculated, most of the algomerative clustering algorithms fail on large datasets (30 000 + genes/200 + arrays). In this article, we propose a new two-stage algorithm which partitions the high-dimensional space associated with microarray data using hyperplanes. The first stage is based on the Balanced Iterative Reducing and Clustering using Hierarchies algorithm with the second stage being a conventional k-means clustering technique. This algorithm has been implemented in a software tool (HPCluster) designed to cluster gene expression data. We compared the clustering results using the two-stage hyperplane algorithm with the conventional k-means algorithm from other available programs. Because, the first stage traverses the data in a single scan, the performance and speed increases substantially. The data reduction accomplished in the first stage of the algorithm reduces the memory requirements allowing us to cluster 44 460 genes without failure and significantly decreases the time to complete when compared with popular k-means programs. The software was written in C# (.NET 1.1). Availability: The program is freely available and can be downloaded from http://www.amdcc.org/bioinformatics/bioinformatics.aspx. Contact: rmcindoe@mail.mcg.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:19261720
Alexandrov, Theodore; Kobarg, Jan Hendrik
2011-01-01
Motivation: Imaging mass spectrometry (IMS) is one of the few measurement technology s of biochemistry which, given a thin sample, is able to reveal its spatial chemical composition in the full molecular range. IMS produces a hyperspectral image, where for each pixel a high-dimensional mass spectrum is measured. Currently, the technology is mature enough and one of the major problems preventing its spreading is the under-development of computational methods for mining huge IMS datasets. This article proposes a novel approach for spatial segmentation of an IMS dataset, which is constructed considering the important issue of pixel-to-pixel variability. Methods: We segment pixels by clustering their mass spectra. Importantly, we incorporate spatial relations between pixels into clustering, so that pixels are clustered together with their neighbors. We propose two methods. One is non-adaptive, where pixel neighborhoods are selected in the same manner for all pixels. The second one respects the structure observable in the data. For a pixel, its neighborhood is defined taking into account similarity of its spectrum to the spectra of adjacent pixels. Both methods have the linear complexity and require linear memory space (in the number of spectra). Results: The proposed segmentation methods are evaluated on two IMS datasets: a rat brain section and a section of a neuroendocrine tumor. They discover anatomical structure, discriminate the tumor region and highlight functionally similar regions. Moreover, our methods provide segmentation maps of similar or better quality if compared to the other state-of-the-art methods, but outperform them in runtime and/or required memory. Contact: theodore@math.uni-bremen.de PMID:21685075
Nair, Nitya; Newell, Evan W.; Vollmers, Christopher; Quake, Stephen R.; Morton, John M.; Davis, Mark M.; He, Xiao-Song; Greenberg, Harry B.
2015-01-01
In-depth phenotyping of human intestinal antibody secreting cells (ASCs) and their precursors is important for developing improved mucosal vaccines. We used single-cell mass cytometry to simultaneously analyze 34 differentiation and trafficking markers on intestinal and circulating B cells. In addition, we labeled rotavirus double-layered particles with a metal isotope and characterized B cells specific to the rotavirus VP6 major structural protein. We describe the heterogeneity of the intestinal B cell compartment, dominated by ASCs with some phenotypic and transcriptional characteristics of long-lived plasma cells. Using principal component analysis, we visualized the phenotypic relationships between major B cell subsets in the intestine and blood, and revealed that IgM+ memory B cells (MBCs) and naïve B cells were phenotypically related as were CD27− MBCs and switched MBCs. ASCs in the intestine and blood were highly clonally related, but associated with distinct trajectories of phenotypic development. VP6-specific B cells were present among diverse B cell subsets in immune donors, including naïve B cells, with phenotypes representative of the overall B cell pool. These data provide a high dimensional view of intestinal B cells and the determinants regulating humoral memory to a ubiquitous, mucosal pathogen at steady-state. PMID:25899688
Ferrell, Paul Brent; Diggins, Kirsten Elizabeth; Polikowsky, Hannah Grace; Mohan, Sanjay Ram; Seegmiller, Adam C.
2016-01-01
The plasticity of AML drives poor clinical outcomes and confounds its longitudinal detection. However, the immediate impact of treatment on the leukemic and non-leukemic cells of the bone marrow and blood remains relatively understudied. Here, we conducted a pilot study of high dimensional longitudinal monitoring of immunophenotype in AML. To characterize changes in cell phenotype before, during, and immediately after induction treatment, we developed a 27-antibody panel for mass cytometry focused on surface diagnostic markers and applied it to 46 samples of blood or bone marrow tissue collected over time from 5 AML patients. Central goals were to determine whether changes in AML phenotype would be captured effectively by cytomic tools and to implement methods for describing the evolving phenotypes of AML cell subsets. Mass cytometry data were analyzed using established computational techniques. Within this pilot study, longitudinal immune monitoring with mass cytometry revealed fundamental changes in leukemia phenotypes that occurred over time during and after induction in the refractory disease setting. Persisting AML blasts became more phenotypically distinct from stem and progenitor cells due to expression of novel marker patterns that differed from pre-treatment AML cells and from all cell types observed in healthy bone marrow. This pilot study of single cell immune monitoring in AML represents a powerful tool for precision characterization and targeting of resistant disease. PMID:27074138
Ma, Yanyuan; Zhu, Liping
2013-01-01
Summary We study the heteroscedastic partially linear single-index model with an unspecified error variance function, which allows for high dimensional covariates in both the linear and the single-index components of the mean function. We propose a class of consistent estimators of the parameters by using a proper weighting strategy. An interesting finding is that the linearity condition which is widely assumed in the dimension reduction literature is not necessary for methodological or theoretical development: it contributes only to the simplification of non-optimal consistent estimation. We also find that the performance of the usual weighted least square type of estimators deteriorates when the non-parametric component is badly estimated. However, estimators in our family automatically provide protection against such deterioration, in that the consistency can be achieved even if the baseline non-parametric function is completely misspecified. We further show that the most efficient estimator is a member of this family and can be easily obtained by using non-parametric estimation. Properties of the estimators proposed are presented through theoretical illustration and numerical simulations. An example on gender discrimination is used to demonstrate and to compare the practical performance of the estimators. PMID:23970823
Cuny, Jérôme; Xie, Yu; Pickard, Chris J; Hassanali, Ali A
2016-02-01
Nuclear magnetic resonance (NMR) spectroscopy is one of the most powerful experimental tools to probe the local atomic order of a wide range of solid-state compounds. However, due to the complexity of the related spectra, in particular for amorphous materials, their interpretation in terms of structural information is often challenging. These difficulties can be overcome by combining molecular dynamics simulations to generate realistic structural models with an ab initio evaluation of the corresponding chemical shift and quadrupolar coupling tensors. However, due to computational constraints, this approach is limited to relatively small system sizes which, for amorphous materials, prevents an adequate statistical sampling of the distribution of the local environments that is required to quantitatively describe the system. In this work, we present an approach to efficiently and accurately predict the NMR parameters of very large systems. This is achieved by using a high-dimensional neural-network representation of NMR parameters that are calculated using an ab initio formalism. To illustrate the potential of this approach, we applied this neural-network NMR (NN-NMR) method on the (17)O and (29)Si quadrupolar coupling and chemical shift parameters of various crystalline silica polymorphs and silica glasses. This approach is, in principal, general and has the potential to be applied to predict the NMR properties of various materials. PMID:26730889
NASA Astrophysics Data System (ADS)
Cavaglieri, Daniele; Bewley, Thomas
2015-04-01
Implicit/explicit (IMEX) Runge-Kutta (RK) schemes are effective for time-marching ODE systems with both stiff and nonstiff terms on the RHS; such schemes implement an (often A-stable or better) implicit RK scheme for the stiff part of the ODE, which is often linear, and, simultaneously, a (more convenient) explicit RK scheme for the nonstiff part of the ODE, which is often nonlinear. Low-storage RK schemes are especially effective for time-marching high-dimensional ODE discretizations of PDE systems on modern (cache-based) computational hardware, in which memory management is often the most significant computational bottleneck. In this paper, we develop and characterize eight new low-storage implicit/explicit RK schemes which have higher accuracy and better stability properties than the only low-storage implicit/explicit RK scheme available previously, the venerable second-order Crank-Nicolson/Runge-Kutta-Wray (CN/RKW3) algorithm that has dominated the DNS/LES literature for the last 25 years, while requiring similar storage (two, three, or four registers of length N) and comparable floating-point operations per timestep.
Awale, Mahendra; Reymond, Jean-Louis
2015-08-24
An Internet portal accessible at www.gdb.unibe.ch has been set up to automatically generate color-coded similarity maps of the ChEMBL database in relation to up to two sets of active compounds taken from the enhanced Directory of Useful Decoys (eDUD), a random set of molecules, or up to two sets of user-defined reference molecules. These maps visualize the relationships between the selected compounds and ChEMBL in six different high dimensional chemical spaces, namely MQN (42-D molecular quantum numbers), SMIfp (34-D SMILES fingerprint), APfp (20-D shape fingerprint), Xfp (55-D pharmacophore fingerprint), Sfp (1024-bit substructure fingerprint), and ECfp4 (1024-bit extended connectivity fingerprint). The maps are supplied in form of Java based desktop applications called "similarity mapplets" allowing interactive content browsing and linked to a "Multifingerprint Browser for ChEMBL" (also accessible directly at www.gdb.unibe.ch ) to perform nearest neighbor searches. One can obtain six similarity mapplets of ChEMBL relative to random reference compounds, 606 similarity mapplets relative to single eDUD active sets, 30,300 similarity mapplets relative to pairs of eDUD active sets, and any number of similarity mapplets relative to user-defined reference sets to help visualize the structural diversity of compound series in drug optimization projects and their relationship to other known bioactive compounds. PMID:26207526
Du, Jing; Wang, Jian
2015-11-01
Bessel beams carrying orbital angular momentum (OAM) with helical phase fronts exp(ilφ)(l=0;±1;±2;…), where φ is the azimuthal angle and l corresponds to the topological number, are orthogonal with each other. This feature of Bessel beams provides a new dimension to code/decode data information on the OAM state of light, and the theoretical infinity of topological number enables possible high-dimensional structured light coding/decoding for free-space optical communications. Moreover, Bessel beams are nondiffracting beams having the ability to recover by themselves in the face of obstructions, which is important for free-space optical communications relying on line-of-sight operation. By utilizing the OAM and nondiffracting characteristics of Bessel beams, we experimentally demonstrate 12 m distance obstruction-free optical m-ary coding/decoding using visible Bessel beams in a free-space optical communication system. We also study the bit error rate (BER) performance of hexadecimal and 32-ary coding/decoding based on Bessel beams with different topological numbers. After receiving 500 symbols at the receiver side, a zero BER of hexadecimal coding/decoding is observed when the obstruction is placed along the propagation path of light. PMID:26512460
Clustering PPI data by combining FA and SHC method
2015-01-01
Clustering is one of main methods to identify functional modules from protein-protein interaction (PPI) data. Nevertheless traditional clustering methods may not be effective for clustering PPI data. In this paper, we proposed a novel method for clustering PPI data by combining firefly algorithm (FA) and synchronization-based hierarchical clustering (SHC) algorithm. Firstly, the PPI data are preprocessed via spectral clustering (SC) which transforms the high-dimensional similarity matrix into a low dimension matrix. Then the SHC algorithm is used to perform clustering. In SHC algorithm, hierarchical clustering is achieved by enlarging the neighborhood radius of synchronized objects continuously, while the hierarchical search is very difficult to find the optimal neighborhood radius of synchronization and the efficiency is not high. So we adopt the firefly algorithm to determine the optimal threshold of the neighborhood radius of synchronization automatically. The proposed algorithm is tested on the MIPS PPI dataset. The results show that our proposed algorithm is better than the traditional algorithms in precision, recall and f-measure value. PMID:25707632
Progeny Clustering: A Method to Identify Biological Phenotypes.
Hu, Chenyue W; Kornblau, Steven M; Slater, John H; Qutub, Amina A
2015-01-01
Estimating the optimal number of clusters is a major challenge in applying cluster analysis to any type of dataset, especially to biomedical datasets, which are high-dimensional and complex. Here, we introduce an improved method, Progeny Clustering, which is stability-based and exceptionally efficient in computing, to find the ideal number of clusters. The algorithm employs a novel Progeny Sampling method to reconstruct cluster identity, a co-occurrence probability matrix to assess the clustering stability, and a set of reference datasets to overcome inherent biases in the algorithm and data space. Our method was shown successful and robust when applied to two synthetic datasets (datasets of two-dimensions and ten-dimensions containing eight dimensions of pure noise), two standard biological datasets (the Iris dataset and Rat CNS dataset) and two biological datasets (a cell phenotype dataset and an acute myeloid leukemia (AML) reverse phase protein array (RPPA) dataset). Progeny Clustering outperformed some popular clustering evaluation methods in the ten-dimensional synthetic dataset as well as in the cell phenotype dataset, and it was the only method that successfully discovered clinically meaningful patient groupings in the AML RPPA dataset. PMID:26267476
On evaluating clustering procedures for use in classification
NASA Technical Reports Server (NTRS)
Pore, M. D.; Moritz, T. E.; Register, D. T.; Yao, S. S.; Eppler, W. G. (Principal Investigator)
1979-01-01
The problem of evaluating clustering algorithms and their respective computer programs for use in a preprocessing step for classification is addressed. In clustering for classification the probability of correct classification is suggested as the ultimate measure of accuracy on training data. A means of implementing this criterion and a measure of cluster purity are discussed. Examples are given. A procedure for cluster labeling that is based on cluster purity and sample size is presented.
Black, Kevin J.
2013-01-01
Background: Prior brain imaging and autopsy studies have suggested that structural abnormalities of the basal ganglia (BG) nuclei may be present in Tourette Syndrome (TS). These studies have focused mainly on the volume differences of the BG structures and not their anatomical shapes. Shape differences of various brain structures have been demonstrated in other neuropsychiatric disorders using large-deformation, high dimensional brain mapping (HDBM-LD). A previous study of a small sample of adult TS patients demonstrated the validity of the method, but did not find significant differences compared to controls. Since TS usually begins in childhood and adult studies may show structure differences due to adaptations, we hypothesized that differences in BG and thalamus structure geometry and volume due to etiological changes in TS might be better characterized in children. Objective: Pilot the HDBM-LD method in children and estimate effect sizes. Methods: In this pilot study, T1-weighted MRIs were collected in 13 children with TS and 16 healthy, tic-free, control children. The groups were well matched for age. The primary outcome measures were the first 10 eigenvectors which are derived using HDBM-LD methods and represent the majority of the geometric shape of each structure, and the volumes of each structure adjusted for whole brain volume. We also compared hemispheric right/left asymmetry and estimated effect sizes for both volume and shape differences between groups. Results: We found no statistically significant differences between the TS subjects and controls in volume, shape, or right/left asymmetry. Effect sizes were greater for shape analysis than for volume. Conclusion: This study represents one of the first efforts to study the shape as opposed to the volume of the BG in TS, but power was limited by sample size. Shape analysis by the HDBM-LD method may prove more sensitive to group differences. PMID:24715957
Lotz, Martin K.; Otsuki, Shuhei; Grogan, Shawn P.; Sah, Robert; Terkeltaub, Robert; D’Lima, Darryl
2010-01-01
The formation of new cell clusters is a histological hallmark of arthritic cartilage but the biology of clusters and their role in disease are poorly understood. This is the first comprehensive review of clinical and experimental conditions associated with cluster formation. Genes and proteins that are expressed in cluster cells, the cellular origin of the clusters, mechanisms that lead to cluster formation and the role of cluster cells in pathogenesis are discussed. PMID:20506158
Feature Clustering for Accelerating Parallel Coordinate Descent
Scherrer, Chad; Tewari, Ambuj; Halappanavar, Mahantesh; Haglin, David J.
2012-12-06
We demonstrate an approach for accelerating calculation of the regularization path for L1 sparse logistic regression problems. We show the benefit of feature clustering as a preconditioning step for parallel block-greedy coordinate descent algorithms.
NASA Astrophysics Data System (ADS)
Weinmann, Simone M.; Kauffmann, Guinevere; von der Linden, Anja; De Lucia, Gabriella
2010-08-01
We investigate how the specific star formation rates of galaxies of different masses depend on cluster-centric radius and on the central/satellite dichotomy in both field and cluster environments. Recent data from a variety of sources, including the cluster catalogue of von der Linden et al., are compared to the semi-analytic models of De Lucia & Blaizot. We find that these models predict too many passive satellite galaxies in clusters, too few passive central galaxies with low stellar masses and too many passive central galaxies with high masses. We then outline a series of modifications to the model necessary to solve these problems: (a) instead of instantaneous stripping of the external gas reservoir after a galaxy becomes a satellite, the gas supply is assumed to decrease at the same rate that the surrounding halo loses mass due to tidal stripping and (b) the active galactic nuclei (AGN) feedback efficiency is lowered to bring the fraction of massive passive centrals in better agreement with the data. We also allow for radio mode AGN feedback in satellite galaxies. (c) We assume that satellite galaxies residing in host haloes with masses below 1012h-1Msolar do not undergo any stripping. We highlight the fact that in low-mass galaxies, the external reservoir is composed primarily of gas that has been expelled from the galactic disc by supernovae-driven winds. This gas must remain available as a future reservoir for star formation, even in satellite galaxies. Finally, we present a simple recipe for the stripping of gas and dark matter in satellites that can be used in models where subhalo evolution is not followed in detail.
Some properties of ion and cluster plasma
Gudzenko, L.I.; Derzhiev, V.I.; Yakovlenko, S.I.
1982-11-01
The aggregate of problems connected with the physics of ion and cluster plasma is qualitatively considered. Such a plasma can exist when a dense gas is ionized by a hard ionizer. The conditions for the formation of an ion plasma and the difference between its characteristics and those of an ordinary electron plasma are discussed; a solvated-ion model and the distribution of the clusters with respect to the number of solvated molecules are considered. The recombination rate of the positively and negatively charged clusters is roughly estimated. The parameters of a ball-lightning plasma are estimated on the basis of the cluster model.
... the body released during an allergic response) or serotonin (chemical made by nerve cells). A problem in a small area at the base of the brain called the hypothalamus may be involved. More men than women are affected. The headaches can occur at any ...
Electrodynamic properties of fractal clusters
NASA Astrophysics Data System (ADS)
Maksimenko, V. V.; Zagaynov, V. A.; Agranovski, I. E.
2014-07-01
An influence of interference on a character of light interaction both with individual fractal cluster (FC) consisting of nanoparticles and with agglomerates of such clusters is investigated. Using methods of the multiple scattering theory, effective dielectric permeability of a micron-size FC composed of non-absorbing nanoparticles is calculated. The cluster could be characterized by a set of effective dielectric permeabilities. Their number coincides with the number of particles, where space arrangement in the cluster is correlated. If the fractal dimension is less than some critical value and frequency corresponds to the frequency of the visible spectrum, then the absolute value of effective dielectric permeability becomes very large. This results in strong renormalization (decrease) of the incident radiation wavelength inside the cluster. The renormalized photons are cycled or trapped inside the system of multi-scaled cavities inside the cluster. A lifetime of a photon localized inside an agglomerate of FCs is a macroscopic value allowing to observe the stimulated emission of the localized light. The latter opens up a possibility for creation of lasers without inverse population of energy levels. Moreover, this allows to reconsider problems of optical cloaking of macroscopic objects. One more feature of fractal structures is a possibility of unimpeded propagation of light when any resistance associated with scattering disappears.
Spectral clustering of protein sequences
Paccanaro, Alberto; Casbon, James A.; Saqi, Mansoor A. S.
2006-01-01
An important problem in genomics is automatically clustering homologous proteins when only sequence information is available. Most methods for clustering proteins are local, and are based on simply thresholding a measure related to sequence distance. We first show how locality limits the performance of such methods by analysing the distribution of distances between protein sequences. We then present a global method based on spectral clustering and provide theoretical justification of why it will have a remarkable improvement over local methods. We extensively tested our method and compared its performance with other local methods on several subsets of the SCOP (Structural Classification of Proteins) database, a gold standard for protein structure classification. We consistently observed that, the number of clusters that we obtain for a given set of proteins is close to the number of superfamilies in that set; there are fewer singletons; and the method correctly groups most remote homologs. In our experiments, the quality of the clusters as quantified by a measure that combines sensitivity and specificity was consistently better [on average, improvements were 84% over hierarchical clustering, 34% over Connected Component Analysis (CCA) (similar to GeneRAGE) and 72% over another global method, TribeMCL]. PMID:16547200
Self consistency grouping: a stringent clustering method
2012-01-01
Background Numerous types of clustering like single linkage and K-means have been widely studied and applied to a variety of scientific problems. However, the existing methods are not readily applicable for the problems that demand high stringency. Methods Our method, self consistency grouping, i.e. SCG, yields clusters whose members are closer in rank to each other than to any member outside the cluster. We do not define a distance metric; we use the best known distance metric and presume that it measures the correct distance. SCG does not impose any restriction on the size or the number of the clusters that it finds. The boundaries of clusters are determined by the inconsistencies in the ranks. In addition to the direct implementation that finds the complete structure of the (sub)clusters we implemented two faster versions. The fastest version is guaranteed to find only the clusters that are not subclusters of any other clusters and the other version yields the same output as the direct implementation but does so more efficiently. Results Our tests have demonstrated that SCG yields very few false positives. This was accomplished by introducing errors in the distance measurement. Clustering of protein domain representatives by structural similarity showed that SCG could recover homologous groups with high precision. Conclusions SCG has potential for finding biological relationships under stringent conditions. PMID:23320864
On the clustering of multidimensional pictorial data
NASA Technical Reports Server (NTRS)
Bryant, J. D. (Principal Investigator)
1979-01-01
Obvious approaches to reducing the cost (in computer resources) of applying current clustering techniques to the problem of remote sensing are discussed. The use of spatial information in finding fields and in classifying mixture pixels is examined, and the AMOEBA clustering program is described. Internally, a pattern recognition program, from without, AMOEBA appears to be an unsupervised clustering program. It is fast and automatic. No choices (such as arbitrary thresholds to set split/combine sequences) need be made. The problem of finding the number of clusters is solved automatically. At the conclusion of the program, all points in the scene are classified; however, a provision is included for a reject classification of some points which, within the theoretical framework, cannot rationally be assigned to any cluster.
Bipartite graph partitioning and data clustering
Zha, Hongyuan; He, Xiaofeng; Ding, Chris; Gu, Ming; Simon, Horst D.
2001-05-07
Many data types arising from data mining applications can be modeled as bipartite graphs, examples include terms and documents in a text corpus, customers and purchasing items in market basket analysis and reviewers and movies in a movie recommender system. In this paper, the authors propose a new data clustering method based on partitioning the underlying biopartite graph. The partition is constructed by minimizing a normalized sum of edge weights between unmatched pairs of vertices of the bipartite graph. They show that an approximate solution to the minimization problem can be obtained by computing a partial singular value decomposition (SVD) of the associated edge weight matrix of the bipartite graph. They point out the connection of their clustering algorithm to correspondence analysis used in multivariate analysis. They also briefly discuss the issue of assigning data objects to multiple clusters. In the experimental results, they apply their clustering algorithm to the problem of document clustering to illustrate its effectiveness and efficiency.
Swarm Intelligence in Text Document Clustering
Cui, Xiaohui; Potok, Thomas E
2008-01-01
Social animals or insects in nature often exhibit a form of emergent collective behavior. The research field that attempts to design algorithms or distributed problem-solving devices inspired by the collective behavior of social insect colonies is called Swarm Intelligence. Compared to the traditional algorithms, the swarm algorithms are usually flexible, robust, decentralized and self-organized. These characters make the swarm algorithms suitable for solving complex problems, such as document collection clustering. The major challenge of today's information society is being overwhelmed with information on any topic they are searching for. Fast and high-quality document clustering algorithms play an important role in helping users to effectively navigate, summarize, and organize the overwhelmed information. In this chapter, we introduce three nature inspired swarm intelligence clustering approaches for document clustering analysis. These clustering algorithms use stochastic and heuristic principles discovered from observing bird flocks, fish schools and ant food forage.
... gov/ Home Body Getting your period Problem periods Problem periods It’s common to have cramps or feel ... doctor Some common period problems Signs of period problems top One way to know if you may ...
... it could be a sign of a balance problem. Balance problems can make you feel unsteady or as if ... related injuries, such as hip fracture. Some balance problems are due to problems in the inner ear. ...
... often, it could be a sign of a balance problem. Balance problems can make you feel unsteady or as ... fall-related injuries, such as hip fracture. Some balance problems are due to problems in the inner ...
Reactions and properties of clusters
NASA Astrophysics Data System (ADS)
Castleman, A. W., Jr.
1992-09-01
The elucidation from a molecular point of view of the differences and similarities in the properties and reactivity of matter in the gaseous compared to the condensed state is a subject of considerable current interest. One of the promising approaches to this problem is to utilize mass spectrometry in conjunction with laser spectroscopy and fast-flow reaction devices to investigate the changing properties, structure and reactivity of clusters as a function of the degree of solvation under well-controlled conditions. In this regard, an investigation of molecular cluster ions has provided considerable new insight into the basic mechanisms of ion reactions within a cluster, and this paper reviews some of the recent advances in cluster production, the origin of magic numbers and relationship to cluster ion stabilities, and solvation effects on reactions. There have been some notable advances in the production of large cluster ions under thermal reaction conditions, enabling a systematic study of the influence of solvation on reactions to be carried out. These and other new studies of magic numbers have traced their origin to the thermochemical stability of cluster ions. There are several classes of reaction where solvation has a notable influence on reactivity. A particularly interesting example comes from recent studies of the reactions of the hydroxyl anion with CO2 and SO2, studied as a function of the degree of hydration of OH-. Both reactions are highly exothermic, yet the differences in reactivity are dramatic. In the case of SO2, the reaction occurs at near the collision rate. By contrast, CO2 reactivity plummets dramatically for clusters having more than four water molecules. The slow rate is in accord with observations in the liquid phase.
Large-scale metagenomic sequence clustering on map-reduce clusters.
Yang, Xiao; Zola, Jaroslaw; Aluru, Srinivas
2013-02-01
Taxonomic clustering of species from millions of DNA fragments sequenced from their genomes is an important and frequently arising problem in metagenomics. In this paper, we present a parallel algorithm for taxonomic clustering of large metagenomic samples with support for overlapping clusters. We develop sketching techniques, akin to those created for web document clustering, to deduce significant similarities between pairs of sequences without resorting to expensive all vs. all comparison. We formulate the metagenomic classification problem as that of maximal quasi-clique enumeration in the resulting similarity graph, at multiple levels of the hierarchy as prescribed by different similarity thresholds. We cast execution of the underlying algorithmic steps as applications of the map-reduce framework to achieve a cloud ready implementation. We show that the resulting framework can produce high quality clustering of metagenomic samples consisting of millions of reads, in reasonable time limits, when executed on a modest size cluster. PMID:23427983
A Cross Unequal Clustering Routing Algorithm for Sensor Network
NASA Astrophysics Data System (ADS)
Tong, Wang; Jiyi, Wu; He, Xu; Jinghua, Zhu; Munyabugingo, Charles
2013-08-01
In the routing protocol for wireless sensor network, the cluster size is generally fixed in clustering routing algorithm for wireless sensor network, which can easily lead to the "hot spot" problem. Furthermore, the majority of routing algorithms barely consider the problem of long distance communication between adjacent cluster heads that brings high energy consumption. Therefore, this paper proposes a new cross unequal clustering routing algorithm based on the EEUC algorithm. In order to solve the defects of EEUC algorithm, this algorithm calculating of competition radius takes the node's position and node's remaining energy into account to make the load of cluster heads more balanced. At the same time, cluster adjacent node is applied to transport data and reduce the energy-loss of cluster heads. Simulation experiments show that, compared with LEACH and EEUC, the proposed algorithm can effectively reduce the energy-loss of cluster heads and balance the energy consumption among all nodes in the network and improve the network lifetime
A reduced basis Landweber method for nonlinear inverse problems
NASA Astrophysics Data System (ADS)
Garmatter, Dominik; Haasdonk, Bernard; Harrach, Bastian
2016-03-01
We consider parameter identification problems in parametrized partial differential equations (PDEs). These lead to nonlinear ill-posed inverse problems. One way of solving them is using iterative regularization methods, which typically require numerous amounts of forward solutions during the solution process. In this article we consider the nonlinear Landweber method and couple it with the reduced basis method as a model order reduction technique in order to reduce the overall computational time. In particular, we consider PDEs with a high-dimensional parameter space, which are known to pose difficulties in the context of reduced basis methods. We present a new method that is able to handle such high-dimensional parameter spaces by combining the nonlinear Landweber method with adaptive online reduced basis updates. It is then applied to the inverse problem of reconstructing the conductivity in the stationary heat equation.
Performance Comparison Of Evolutionary Algorithms For Image Clustering
NASA Astrophysics Data System (ADS)
Civicioglu, P.; Atasever, U. H.; Ozkan, C.; Besdok, E.; Karkinli, A. E.; Kesikoglu, A.
2014-09-01
Evolutionary computation tools are able to process real valued numerical sets in order to extract suboptimal solution of designed problem. Data clustering algorithms have been intensively used for image segmentation in remote sensing applications. Despite of wide usage of evolutionary algorithms on data clustering, their clustering performances have been scarcely studied by using clustering validation indexes. In this paper, the recently proposed evolutionary algorithms (i.e., Artificial Bee Colony Algorithm (ABC), Gravitational Search Algorithm (GSA), Cuckoo Search Algorithm (CS), Adaptive Differential Evolution Algorithm (JADE), Differential Search Algorithm (DSA) and Backtracking Search Optimization Algorithm (BSA)) and some classical image clustering techniques (i.e., k-means, fcm, som networks) have been used to cluster images and their performances have been compared by using four clustering validation indexes. Experimental test results exposed that evolutionary algorithms give more reliable cluster-centers than classical clustering techniques, but their convergence time is quite long.
Foodservice Occupations Cluster Guide.
ERIC Educational Resources Information Center
Oregon State Dept. of Education, Salem.
Intended to assist vocational teachers in developing and implementing a cluster program in food service occupations, this guide contains sections on cluster organization and implementation and instructional emphasis areas. The cluster organization and implementation section covers goal-based planning and includes a proposed cluster curriculum, a…
Echenique, P.M.; Manson, J.R.; Ritchie, R.H. )
1990-03-19
We present a model for the cluster-impact-fusion experiments of Buehler, Friedlander, and Friedman, Calculated fusion rates as a function of bombarding energy for constant cluster size agree well with experiment. The dependence of the fusion rate on cluster size at fixed bombarding energy is explained qualitatively. The role of correlated, coherent collisions in enhanced energy loss by clusters is emphasized.
ON CLUSTERING TECHNIQUES OF CITATION GRAPHS.
ERIC Educational Resources Information Center
CHIEN, R.T.; PREPARATA, F.P.
ONE OF THE PROBLEMS ENCOUNTERED IN CLUSTERING TECHNIQUES AS APPLIED TO DOCUMENT RETRIEVAL SYSTEMS USING BIBLIOGRAPHIC COUPLING DEVICES IS THAT THE COMPUTATIONAL EFFORT REQUIRED GROWS ROUGHLY AS THE SQUARE OF THE COLLECTION SIZE. IN THIS STUDY GRAPH THEORY IS APPLIED TO THIS PROBLEM BY FIRST MAPPING THE CITATION GRAPH OF THE DOCUMENT COLLECTION…
... labor starts before 37 completed weeks of pregnancy Problems with the umbilical cord Problems with the position of the baby, such as ... feet first Birth injuries For some of these problems, the baby may need to be delivered surgically ...
... version of this page please turn Javascript on. Balance Problems About Balance Problems Have you ever felt dizzy, lightheaded, or ... dizziness problem during the past year. Why Good Balance is Important Having good balance means being able ...
Properties and Formation of Star Clusters
NASA Astrophysics Data System (ADS)
Sharina, M. E.
2016-03-01
Many key problems in astrophysics involve research on the properties of star clusters, for example: stellar evolution and nucleosynthesis, the history of star formation in galaxies, formation dynamics of galaxies and their subsystems, the calibration of the fundamental distance scale in the universe, and the luminosity functions of stars and star clusters. This review is intended to familiarize the reader with modern observational and theoretical data on the formation and evolution of star clusters in our galaxy and others. Unsolved problems in this area are formulated and research on ways to solve them is discussed. In particular, some of the most important current observational and theoretical problems include: (1) a more complete explanation of the physical processes in molecular clouds leading to the formation and evolution of massive star clusters; (2) observation of these objects in different stages of evolution, including protoclusters, at wavelengths where interstellar absorption is minimal; and, (3) comparison of the properties of massive star clusters in different galaxies and of galaxies during the most active star formation phase at different red shifts. The main goal in solving these problems is to explain the variations in the abundance of chemical elements and in the multiple populations of stars in clusters discovered at the end of the twentieth century.
Improved Ant Colony Clustering Algorithm and Its Performance Study.
Gao, Wei
2016-01-01
Clustering analysis is used in many disciplines and applications; it is an important tool that descriptively identifies homogeneous groups of objects based on attribute values. The ant colony clustering algorithm is a swarm-intelligent method used for clustering problems that is inspired by the behavior of ant colonies that cluster their corpses and sort their larvae. A new abstraction ant colony clustering algorithm using a data combination mechanism is proposed to improve the computational efficiency and accuracy of the ant colony clustering algorithm. The abstraction ant colony clustering algorithm is used to cluster benchmark problems, and its performance is compared with the ant colony clustering algorithm and other methods used in existing literature. Based on similar computational difficulties and complexities, the results show that the abstraction ant colony clustering algorithm produces results that are not only more accurate but also more efficiently determined than the ant colony clustering algorithm and the other methods. Thus, the abstraction ant colony clustering algorithm can be used for efficient multivariate data clustering. PMID:26839533
Improved Ant Colony Clustering Algorithm and Its Performance Study
Gao, Wei
2016-01-01
Clustering analysis is used in many disciplines and applications; it is an important tool that descriptively identifies homogeneous groups of objects based on attribute values. The ant colony clustering algorithm is a swarm-intelligent method used for clustering problems that is inspired by the behavior of ant colonies that cluster their corpses and sort their larvae. A new abstraction ant colony clustering algorithm using a data combination mechanism is proposed to improve the computational efficiency and accuracy of the ant colony clustering algorithm. The abstraction ant colony clustering algorithm is used to cluster benchmark problems, and its performance is compared with the ant colony clustering algorithm and other methods used in existing literature. Based on similar computational difficulties and complexities, the results show that the abstraction ant colony clustering algorithm produces results that are not only more accurate but also more efficiently determined than the ant colony clustering algorithm and the other methods. Thus, the abstraction ant colony clustering algorithm can be used for efficient multivariate data clustering. PMID:26839533
A Comparison Study on Similarity and Dissimilarity Measures in Clustering Continuous Data
Shirkhorshidi, Ali Seyed; Aghabozorgi, Saeed; Wah, Teh Ying
2015-01-01
Similarity or distance measures are core components used by distance-based clustering algorithms to cluster similar data points into the same clusters, while dissimilar or distant data points are placed into different clusters. The performance of similarity measures is mostly addressed in two or three-dimensional spaces, beyond which, to the best of our knowledge, there is no empirical study that has revealed the behavior of similarity measures when dealing with high-dimensional datasets. To fill this gap, a technical framework is proposed in this study to analyze, compare and benchmark the influence of different similarity measures on the results of distance-based clustering algorithms. For reproducibility purposes, fifteen publicly available datasets were used for this study, and consequently, future distance measures can be evaluated and compared with the results of the measures discussed in this work. These datasets were classified as low and high-dimensional categories to study the performance of each measure against each category. This research should help the research community to identify suitable distance measures for datasets and also to facilitate a comparison and evaluation of the newly proposed similarity or distance measures with traditional ones. PMID:26658987
Large-Scale Multi-Dimensional Document Clustering on GPU Clusters
Cui, Xiaohui; Mueller, Frank; Zhang, Yongpeng; Potok, Thomas E
2010-01-01
Document clustering plays an important role in data mining systems. Recently, a flocking-based document clustering algorithm has been proposed to solve the problem through simulation resembling the flocking behavior of birds in nature. This method is superior to other clustering algorithms, including k-means, in the sense that the outcome is not sensitive to the initial state. One limitation of this approach is that the algorithmic complexity is inherently quadratic in the number of documents. As a result, execution time becomes a bottleneck with large number of documents. In this paper, we assess the benefits of exploiting the computational power of Beowulf-like clusters equipped with contemporary Graphics Processing Units (GPUs) as a means to significantly reduce the runtime of flocking-based document clustering. Our framework scales up to over one million documents processed simultaneously in a sixteennode GPU cluster. Results are also compared to a four-node cluster with higher-end GPUs. On these clusters, we observe 30X-50X speedups, which demonstrates the potential of GPU clusters to efficiently solve massive data mining problems. Such speedups combined with the scalability potential and accelerator-based parallelization are unique in the domain of document-based data mining, to the best of our knowledge.
Dong, Bao-xia; Peng, Jun; Gómez-García, Carlos J; Benmansour, Samia; Jia, Heng-qing; Hu, Ning-hai
2007-07-23
By introducing the flexible 1,1'-(1,4-butanediyl)bis(imidazole) (bbi) ligand into the polyoxovanadate system, five novel polyoxoanion-templated architectures based on [As(8)V(14)O(42)](4-) and [V(16)O(38)Cl](6-) building blocks were obtained: [M(bbi)(2)](2)[As(8)V(14)O(42)(H(2)O)] [M = Co (1), Ni (2), and Zn (3)], [Cu(bbi)](4)[As(8)V(14)O(42)(H(2)O)] (4), and [Cu(bbi)](6)[V(16)O(38)Cl] (5). Compounds 1-3 are isostructural, and they exhibit a binodal (4,6)-connected 2D structure with Schläfli symbol (3(4) x 4(2))(3(4) x 4(4) x 5(4) x 6(3))(2), in which the polyoxoanion induces a closed four-membered circuit of M(4)(bbi)(4). Compound 4 exhibits an interesting 3D framework constructed from tetradentate [As(8)V(14)O(42)](4-) cluster anions and cationic ladderlike double chains. There exists a bigger M(8)(bbi)(6)O(2) circuit in 4. The 3D extended structure of 5 is composed of heptadentate [V(16)O(38)Cl](6-) anions and flexural cationic chains; the latter consists of six Cu(bbi) segments arranged alternately. It presents the largest 24-membered circuit of M(24)(bbi)(24) so far observed made of bbi molecules and transition-metal cations. Investigation of their structural relations shows the important template role of the polyoxoanions and the synergetic interactions among the polyoxoanions, transition-metal ions, and flexible ligand in the assembly process. The magnetic properties of compounds 1-3 were also studied. PMID:17592834
Web document clustering using hyperlink structures
He, Xiaofeng; Zha, Hongyuan; Ding, Chris H.Q; Simon, Horst D.
2001-05-07
With the exponential growth of information on the World Wide Web there is great demand for developing efficient and effective methods for organizing and retrieving the information available. Document clustering plays an important role in information retrieval and taxonomy management for the World Wide Web and remains an interesting and challenging problem in the field of web computing. In this paper we consider document clustering methods exploring textual information hyperlink structure and co-citation relations. In particular we apply the normalized cut clustering method developed in computer vision to the task of hyperdocument clustering. We also explore some theoretical connections of the normalized-cut method to K-means method. We then experiment with normalized-cut method in the context of clustering query result sets for web search engines.
Information bottleneck based incremental fuzzy clustering for large biomedical data.
Liu, Yongli; Wan, Xing
2016-08-01
Incremental fuzzy clustering combines advantages of fuzzy clustering and incremental clustering, and therefore is important in classifying large biomedical literature. Conventional algorithms, suffering from data sparsity and high-dimensionality, often fail to produce reasonable results and may even assign all the objects to a single cluster. In this paper, we propose two incremental algorithms based on information bottleneck, Single-Pass fuzzy c-means (spFCM-IB) and Online fuzzy c-means (oFCM-IB). These two algorithms modify conventional algorithms by considering different weights for each centroid and object and scoring mutual information loss to measure the distance between centroids and objects. spFCM-IB and oFCM-IB are used to group a collection of biomedical text abstracts from Medline database. Experimental results show that clustering performances of our approaches are better than such prominent counterparts as spFCM, spHFCM, oFCM and oHFCM, in terms of accuracy. PMID:27260783
[Autism Spectrum Disorder and DSM-5: Spectrum or Cluster?].
Kienle, Xaver; Freiberger, Verena; Greulich, Heide; Blank, Rainer
2015-01-01
Within the new DSM-5, the currently differentiated subgroups of "Autistic Disorder" (299.0), "Asperger's Disorder" (299.80) and "Pervasive Developmental Disorder" (299.80) are replaced by the more general "Autism Spectrum Disorder". With regard to a patient-oriented and expedient advising therapy planning, however, the issue of an empirically reproducible and clinically feasible differentiation into subgroups must still be raised. Based on two Autism-rating-scales (ASDS and FSK), an exploratory two-step cluster analysis was conducted with N=103 children (age: 5-18) seen in our social-pediatric health care centre to examine potentially autistic symptoms. In the two-cluster solution of both rating scales, mainly the problems in social communication grouped the children into a cluster "with communication problems" (51 % and 41 %), and a cluster "without communication problems". Within the three-cluster solution of the ASDS, sensory hypersensitivity, cleaving to routines and social-communicative problems generated an "autistic" subgroup (22%). The children of the second cluster ("communication problems", 35%) were only described by social-communicative problems, and the third group did not show any problems (38%). In the three-cluster solution of the FSK, the "autistic cluster" of the two-cluster solution differentiated in a subgroup with mainly social-communicative problems (cluster 1) and a second subgroup described by restrictive, repetitive behavior. The different cluster solutions will be discussed with a view to the new DSM-5 diagnostic criteria, for following studies a further specification of some of the ASDS and FSK items could be helpful. PMID:26289149
Collins, Anne Gabrielle Eva; Frank, Michael Joshua
2016-07-01
Often the world is structured such that distinct sensory contexts signify the same abstract rule set. Learning from feedback thus informs us not only about the value of stimulus-action associations but also about which rule set applies. Hierarchical clustering models suggest that learners discover structure in the environment, clustering distinct sensory events into a single latent rule set. Such structure enables a learner to transfer any newly acquired information to other contexts linked to the same rule set, and facilitates re-use of learned knowledge in novel contexts. Here, we show that humans exhibit this transfer, generalization and clustering during learning. Trial-by-trial model-based analysis of EEG signals revealed that subjects' reward expectations incorporated this hierarchical structure; these structured neural signals were predictive of behavioral transfer and clustering. These results further our understanding of how humans learn and generalize flexibly by building abstract, behaviorally relevant representations of the complex, high-dimensional sensory environment. PMID:27082659
Jacquez, Geoffrey M.
2009-01-01
Most disease clustering methods assume specific shapes and do not evaluate statistical power using the applicable geography, at-risk population, and covariates. Cluster Morphology Analysis (CMA) conducts power analyses of alternative techniques assuming clusters of different relative risks and shapes. Results are ranked by statistical power and false positives, under the rationale that surveillance should (1) find true clusters while (2) avoiding false clusters. CMA then synthesizes results of the most powerful methods. CMA was evaluated in simulation studies and applied to pancreatic cancer mortality in Michigan, and finds clusters of flexible shape while routinely evaluating statistical power. PMID:20234799
Dynamics of the breakdown of granular clusters
NASA Astrophysics Data System (ADS)
Coppex, François; Droz, Michel; Lipowski, Adam
2002-07-01
Recently van der Meer et al. studied the breakdown of a granular cluster [Phys. Rev. Lett. 88, 174302 (2002)]. We reexamine this problem using an urn model, which takes into account fluctuations and finite-size effects. General arguments are given for the absence of a continuous transition when the number of urns (compartments) is greater than two. Monte Carlo simulations show that the lifetime of a cluster τ diverges at the limits of stability as τ~N1/3, where N is the number of balls. After the breakdown, depending on the dynamical rules of our urn model, either normal or anomalous diffusion of the cluster takes place.
Cluster Stability Estimation Based on a Minimal Spanning Trees Approach
NASA Astrophysics Data System (ADS)
Volkovich, Zeev (Vladimir); Barzily, Zeev; Weber, Gerhard-Wilhelm; Toledano-Kitai, Dvora
2009-08-01
Among the areas of data and text mining which are employed today in science, economy and technology, clustering theory serves as a preprocessing step in the data analyzing. However, there are many open questions still waiting for a theoretical and practical treatment, e.g., the problem of determining the true number of clusters has not been satisfactorily solved. In the current paper, this problem is addressed by the cluster stability approach. For several possible numbers of clusters we estimate the stability of partitions obtained from clustering of samples. Partitions are considered consistent if their clusters are stable. Clusters validity is measured as the total number of edges, in the clusters' minimal spanning trees, connecting points from different samples. Actually, we use the Friedman and Rafsky two sample test statistic. The homogeneity hypothesis, of well mingled samples within the clusters, leads to asymptotic normal distribution of the considered statistic. Resting upon this fact, the standard score of the mentioned edges quantity is set, and the partition quality is represented by the worst cluster corresponding to the minimal standard score value. It is natural to expect that the true number of clusters can be characterized by the empirical distribution having the shortest left tail. The proposed methodology sequentially creates the described value distribution and estimates its left-asymmetry. Numerical experiments, presented in the paper, demonstrate the ability of the approach to detect the true number of clusters.
Personalized PageRank Clustering: A graph clustering algorithm based on random walks
NASA Astrophysics Data System (ADS)
A. Tabrizi, Shayan; Shakery, Azadeh; Asadpour, Masoud; Abbasi, Maziar; Tavallaie, Mohammad Ali
2013-11-01
Graph clustering has been an essential part in many methods and thus its accuracy has a significant effect on many applications. In addition, exponential growth of real-world graphs such as social networks, biological networks and electrical circuits demands clustering algorithms with nearly-linear time and space complexity. In this paper we propose Personalized PageRank Clustering (PPC) that employs the inherent cluster exploratory property of random walks to reveal the clusters of a given graph. We combine random walks and modularity to precisely and efficiently reveal the clusters of a graph. PPC is a top-down algorithm so it can reveal inherent clusters of a graph more accurately than other nearly-linear approaches that are mainly bottom-up. It also gives a hierarchy of clusters that is useful in many applications. PPC has a linear time and space complexity and has been superior to most of the available clustering algorithms on many datasets. Furthermore, its top-down approach makes it a flexible solution for clustering problems with different requirements.
Rhee, Diane; Yun, Sung-Cheol; Khang, Young-Ho
2007-02-01
Study findings showed problem behaviors can be observed in clusters in South Korean adolescents. Prevention programs targeting problem behavior clusters may have a greater impact on adolescents at risk for more than one problem behavior than programs targeting only a portion of the cluster. PMID:17259067
Webster, Clayton; Tempone, Raul; Nobile, Fabio
2007-12-01
This work describes the convergence analysis of a Smolyak-type sparse grid stochastic collocation method for the approximation of statistical quantities related to the solution of partial differential equations with random coefficients and forcing terms (input data of the model). To compute solution statistics, the sparse grid stochastic collocation method uses approximate solutions, produced here by finite elements, corresponding to a deterministic set of points in the random input space. This naturally requires solving uncoupled deterministic problems and, as such, the derived strong error estimates for the fully discrete solution are used to compare the computational efficiency of the proposed method with the Monte Carlo method. Numerical examples illustrate the theoretical results and are used to compare this approach with several others, including the standard Monte Carlo.
Universal Cluster Deposition System
NASA Astrophysics Data System (ADS)
Qiang, You; Sun, Zhiguang; Sellmyer, David J.
2001-03-01
We have developed a universal cluster deposition system (UCDS), which combines a new kind of sputtering-gas-aggregation (SGA) cluster beam source with two atom beams from magnetron sputtering. A highly intense, very stable beam of nanoclusters (like Co, Fe, Ni, Si, CoSm or CoPt) are produced. A quadrupole and/or a new high transmission infinite range mass selector have been designed for the cluster beam. The size distribution (Δd/d) is between 0.05+/-0.10, measured in situ by TOF. A range of mean cluster size is 2 to 10 nm. Usually the deposition rate is about 5 deg/s. The cluster concentration in the film is adjusted through the ratio of cluster and atomic beam deposition rates, as measured in situ with a rotatable quartz microbalance. The UCDS can be used to prepare coated clusters. After exiting from the cluster source, the clusters can be coated first with an atomic or molecular species in an evaporation chamber, and deposited alone or co-deposited with another material. This system is used to deposit simultaneously or alternately mesoscopic thin films or multilayers, and offers the possibility to control independently the incident cluster size and concentration, and thereby the interaction between clusters and cluster-matrix material which is of interest for fundamental research and industry applications. Magnetic properties of Co cluster-assembled materials will be discussed. * Research supported by NSF, DARPA through ARO, and CMRA
Matlab Cluster Ensemble Toolbox v. 1.0
2009-04-27
This is a Matlab toolbox for investigating the application of cluster ensembles to data classification, with the objective of improving the accuracy and/or speed of clustering. The toolbox divides the cluster ensemble problem into four areas, providing functionality for each. These include, (1) synthetic data generation, (2) clustering to generate individual data partitions and similarity matrices, (3) consensus function generation and final clustering to generate ensemble data partitioning, and (4) implementation of accuracy metrics. With regard to data generation, Gaussian data of arbitrary dimension can be generated. The kcenters algorithm can then be used to generate individual data partitions by either, (a) subsampling the data and clustering each subsample, or by (b) randomly initializing the algorithm and generating a clustering for each initialization. In either case an overall similarity matrix can be computed using a consensus function operating on the individual similarity matrices. A final clustering can be performed and performance metrics are provided for evaluation purposes.
NASA Astrophysics Data System (ADS)
Ulrich, Michaël
2015-08-01
It is well known that freeness appears in the high-dimensional limit of independence for matrices. Thus, for instance, the additive free Brownian motion can be seen as the limit of the Brownian motion on hermitian matrices. More generally, it is quite natural to try to build free Lévy processes as high-dimensional limits of classical matricial Lévy processes. We will focus here on one specific such construction, discussing and generalizing the work done previously by Biane in Ref.2, who has shown that the (classical) Brownian motion on the Unitary group U(d) converges to the free multiplicative Brownian motion when d goes to infinity. We shall first recall that result and give an alternative proof for it. We shall then see how this proof can be adapted in a more general context in order to get a free Lévy process on the dual group (in the sense of Voiculescu) U
EINSTEIN Cluster Alignments Revisited
NASA Astrophysics Data System (ADS)
Chambers, S. W.; Melott, A. L.; Miller, C. J.
2000-12-01
We have examined whether the major axes of rich galaxy clusters tend to point (in projection) toward their nearest neighboring cluster. We used the data of Ulmer, McMillan and Kowalski, who used x-ray morphology to define position angles. Our cluster samples, with well measured redshifts and updated positions, were taken from the MX Northern Abell Cluster Survey. The usual Kolmogorov-Smirnov test shows no significant alignment signal for nonrandom angles for all separations less than 100 Mpc/h. Refining the null hypothesis, however, with the Wilcoxon rank-sum test, reveals a high confidence signal for alignment. This confidence is highest when we restrict our sample to small nearest neighbor separations. We conclude that we have identified a more powerful tool for testing cluster-cluster alignments. Moreover, there is a strong signal in the data for alignment, consistent with a picture of hierarchical cluster formation in which matter falls into clusters along large scale filamentary structures.
Star clusters as simple stellar populations.
Bruzual A, Gustavo
2010-02-28
In this paper, I review to what extent we can understand the photometric properties of star clusters, and of low-mass, unresolved galaxies, in terms of population-synthesis models designed to describe 'simple stellar populations' (SSPs), i.e. groups of stars born at the same time, in the same volume of space and from a gas cloud of homogeneous chemical composition. The photometric properties predicted by these models do not readily match the observations of most star clusters, unless we properly take into account the expected variation in the number of stars occupying sparsely populated evolutionary stages, owing to stochastic fluctuations in the stellar initial mass function. In this case, population-synthesis models reproduce remarkably well the full ranges of observed integrated colours and absolute magnitudes of star clusters of various ages and metallicities. The disagreement between the model predictions and observations of cluster colours and magnitudes may indicate problems with or deficiencies in the modelling, and does not necessarily tell us that star clusters do not behave like SSPs. Matching the photometric properties of star clusters using SSP models is a necessary (but not sufficient) condition for clusters to be considered SSPs. Composite models, characterized by complex star-formation histories, also match the observed cluster colours. PMID:20083506
ERIC Educational Resources Information Center
Brokes, Joy Cunningham
2010-01-01
New Jersey's urban students traditionally don't do well on the high stakes NJ High School Proficiency Assessment. Most current remedial mathematics curricula provide students with a plethora of problems like those traditionally found on the state test. This approach is not working. Finding better ways to teach our urban students may help close…
Improving performance through concept formation and conceptual clustering
NASA Technical Reports Server (NTRS)
Fisher, Douglas H.
1992-01-01
Research from June 1989 through October 1992 focussed on concept formation, clustering, and supervised learning for purposes of improving the efficiency of problem-solving, planning, and diagnosis. These projects resulted in two dissertations on clustering, explanation-based learning, and means-ends planning, and publications in conferences and workshops, several book chapters, and journals; a complete Bibliography of NASA Ames supported publications is included. The following topics are studied: clustering of explanations and problem-solving experiences; clustering and means-end planning; and diagnosis of space shuttle and space station operating modes.
[Pathophysiology of cluster headache].
Donnet, Anne
2015-11-01
The aetiology of cluster headache is partially unknown. Three areas are involved in the pathogenesis of cluster headache: the trigeminal nociceptive pathways, the autonomic system and the hypothalamus. The cluster headache attack involves activation of the trigeminal autonomic reflex. A dysfunction located in posterior hypothalamic gray matter is probably pivotal in the process. There is a probable association between smoke exposure, a possible genetic predisposition and the development of cluster headache. PMID:26470883
... daily activities, get around, and exercise. Having a problem with walking can make daily life more difficult. ... walk is called your gait. A variety of problems can cause an abnormal gait and lead to ...
... re not getting enough air. Sometimes mild breathing problems are from a stuffy nose or hard exercise. ... emphysema or pneumonia cause breathing difficulties. So can problems with your trachea or bronchi, which are part ...
... cord injury In some cases, your emotions or relationship problems can lead to ED, such as: Poor ... you stressed, depressed, or anxious? Are you having relationship problems? You may have a number of different ...
... ankles and toes. Other types of arthritis include gout or pseudogout. Sometimes, there is a mechanical problem ... for more information on osteoarthritis, rheumatoid arthritis and gout. How Common are Joint Problems? Osteoarthritis, which affects ...
Nikooienejad, Amir; Wang, Wenyi; Johnson, Valen E.
2016-01-01
Motivation: The advent of new genomic technologies has resulted in the production of massive data sets. Analyses of these data require new statistical and computational methods. In this article, we propose one such method that is useful in selecting explanatory variables for prediction of a binary response. Although this problem has recently been addressed using penalized likelihood methods, we adopt a Bayesian approach that utilizes a mixture of non-local prior densities and point masses on the binary regression coefficient vectors. Results: The resulting method, which we call iMOMLogit, provides improved performance in identifying true models and reducing estimation and prediction error in a number of simulation studies. More importantly, its application to several genomic datasets produces predictions that have high accuracy using far fewer explanatory variables than competing methods. We also describe a novel approach for setting prior hyperparameters by examining the total variation distance between the prior distributions on the regression parameters and the distribution of the maximum likelihood estimator under the null distribution. Finally, we describe a computational algorithm that can be used to implement iMOMLogit in ultrahigh-dimensional settings (p>>n) and provide diagnostics to assess the probability that this algorithm has identified the highest posterior probability model. Availability and implementation: Software to implement this method can be downloaded at: http://www.stat.tamu.edu/∼amir/code.html. Contact: wwang7@mdanderson.org or vjohnson@stat.tamu.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:26740524
Collaborative Clustering for Sensor Networks
NASA Technical Reports Server (NTRS)
Wagstaff. Loro :/; Green Jillian; Lane, Terran
2011-01-01
Traditionally, nodes in a sensor network simply collect data and then pass it on to a centralized node that archives, distributes, and possibly analyzes the data. However, analysis at the individual nodes could enable faster detection of anomalies or other interesting events, as well as faster responses such as sending out alerts or increasing the data collection rate. There is an additional opportunity for increased performance if individual nodes can communicate directly with their neighbors. Previously, a method was developed by which machine learning classification algorithms could collaborate to achieve high performance autonomously (without requiring human intervention). This method worked for supervised learning algorithms, in which labeled data is used to train models. The learners collaborated by exchanging labels describing the data. The new advance enables clustering algorithms, which do not use labeled data, to also collaborate. This is achieved by defining a new language for collaboration that uses pair-wise constraints to encode useful information for other learners. These constraints specify that two items must, or cannot, be placed into the same cluster. Previous work has shown that clustering with these constraints (in isolation) already improves performance. In the problem formulation, each learner resides at a different node in the sensor network and makes observations (collects data) independently of the other learners. Each learner clusters its data and then selects a pair of items about which it is uncertain and uses them to query its neighbors. The resulting feedback (a must and cannot constraint from each neighbor) is combined by the learner into a consensus constraint, and it then reclusters its data while incorporating the new constraint. A strategy was also proposed for cleaning the resulting constraint sets, which may contain conflicting constraints; this improves performance significantly. This approach has been applied to collaborative
The clustering of cases of a rare disease is considered. The number of events observed for each unit is assumed to have a Poisson distribution, the mean of which depends upon the population size and the cluster membership of that unit. Here a cluster consists of those units that ...
Hierarchical modeling of cluster size in wildlife surveys
Royle, J. Andrew
2008-01-01
Clusters or groups of individuals are the fundamental unit of observation in many wildlife sampling problems, including aerial surveys of waterfowl, marine mammals, and ungulates. Explicit accounting of cluster size in models for estimating abundance is necessary because detection of individuals within clusters is not independent and detectability of clusters is likely to increase with cluster size. This induces a cluster size bias in which the average cluster size in the sample is larger than in the population at large. Thus, failure to account for the relationship between delectability and cluster size will tend to yield a positive bias in estimates of abundance or density. I describe a hierarchical modeling framework for accounting for cluster-size bias in animal sampling. The hierarchical model consists of models for the observation process conditional on the cluster size distribution and the cluster size distribution conditional on the total number of clusters. Optionally, a spatial model can be specified that describes variation in the total number of clusters per sample unit. Parameter estimation, model selection, and criticism may be carried out using conventional likelihood-based methods. An extension of the model is described for the situation where measurable covariates at the level of the sample unit are available. Several candidate models within the proposed class are evaluated for aerial survey data on mallard ducks (Anas platyrhynchos).
2013-01-01
Trial design A pragmatic cluster randomised controlled trial. Methods Participants: Clusters were primary health care clinics on the Ministry of Health list. Clients were eligible if they were aged 18 and over. Interventions: Two members of staff from each intervention clinic received the training programme. Clients in both intervention and control clinics subsequently received normal routine care from their health workers. Objective: To examine the impact of a mental health inservice training on routine detection of mental disorder in the clinics and on client outcomes. Outcomes: The primary outcome was the rate of accurate routine clinic detection of mental disorder and the secondary outcome was client recovery over a twelve week follow up period. Randomisation: clinics were randomised to intervention and control groups using a table of random numbers. Blinding: researchers and clients were blind to group assignment. Results Numbers randomised: 49 and 50 clinics were assigned to intervention and control groups respectively. 12 GHQ positive clients per clinic were identified for follow up. Numbers analysed: 468 and 478 clients were followed up for three months in intervention and control groups respectively. Outcome: At twelve weeks after training of the intervention group, the rate of accurate routine clinic detection of mental disorder was greater than 0 in 5% versus 0% of the intervention and control groups respectively, in both the intention to treat analysis (p = 0.50) and the per protocol analysis (p =0.50). Standardised effect sizes for client improvement were 0.34 (95% CI = (0.01,0.68)) for the General Health Questionnaire, 0.39 ((95% CI = (0.22, 0.61)) for the EQ and 0.49 (95% CI = (0.11,0.87)) for WHODAS (using ITT analysis); and 0.43 (95% CI = (0.09,0.76)) for the GHQ, 0.44 (95% CI = (0.22,0.65)) for the EQ and 0.58 (95% CI = (0.18,0.97)) for WHODAS (using per protocol analysis). Harms: None identified. Conclusion The
What's the Best Node for Your Cluster?
NASA Astrophysics Data System (ADS)
Stevens, Rick
2000-03-01
A well designed cluster requires a node containing the most appropriate balance of resources for the problems it will be solving. For many scientific problems, memory bandwidth or peripheral bandwidth can be a severe bottleneck, and spending extra money on a faster processor will not increase performance significantly. This talk will cover the options available for cluster nodes including processors, memory speeds and standards, and peripheral busses. There will also be a discussion of when SMP nodes should be used, and how many processors can be accomodated per node.
Image segmentation using fuzzy LVQ clustering networks
NASA Technical Reports Server (NTRS)
Tsao, Eric Chen-Kuo; Bezdek, James C.; Pal, Nikhil R.
1992-01-01
In this note we formulate image segmentation as a clustering problem. Feature vectors extracted from a raw image are clustered into subregions, thereby segmenting the image. A fuzzy generalization of a Kohonen learning vector quantization (LVQ) which integrates the Fuzzy c-Means (FCM) model with the learning rate and updating strategies of the LVQ is used for this task. This network, which segments images in an unsupervised manner, is thus related to the FCM optimization problem. Numerical examples on photographic and magnetic resonance images are given to illustrate this approach to image segmentation.
NASA Astrophysics Data System (ADS)
Feng, Jian-xin; Tang, Jia-fu; Wang, Guang-xing
2007-04-01
On the basis of the analysis of clustering algorithm that had been proposed for MANET, a novel clustering strategy was proposed in this paper. With the trust defined by statistical hypothesis in probability theory and the cluster head selected by node trust and node mobility, this strategy can realize the function of the malicious nodes detection which was neglected by other clustering algorithms and overcome the deficiency of being incapable of implementing the relative mobility metric of corresponding nodes in the MOBIC algorithm caused by the fact that the receiving power of two consecutive HELLO packet cannot be measured. It's an effective solution to cluster MANET securely.
Clustering Implies Geometry in Networks
NASA Astrophysics Data System (ADS)
Krioukov, Dmitri
2016-05-01
Network models with latent geometry have been used successfully in many applications in network science and other disciplines, yet it is usually impossible to tell if a given real network is geometric, meaning if it is a typical element in an ensemble of random geometric graphs. Here we identify structural properties of networks that guarantee that random graphs having these properties are geometric. Specifically we show that random graphs in which expected degree and clustering of every node are fixed to some constants are equivalent to random geometric graphs on the real line, if clustering is sufficiently strong. Large numbers of triangles, homogeneously distributed across all nodes as in real networks, are thus a consequence of network geometricity. The methods we use to prove this are quite general and applicable to other network ensembles, geometric or not, and to certain problems in quantum gravity.
Clustering Implies Geometry in Networks.
Krioukov, Dmitri
2016-05-20
Network models with latent geometry have been used successfully in many applications in network science and other disciplines, yet it is usually impossible to tell if a given real network is geometric, meaning if it is a typical element in an ensemble of random geometric graphs. Here we identify structural properties of networks that guarantee that random graphs having these properties are geometric. Specifically we show that random graphs in which expected degree and clustering of every node are fixed to some constants are equivalent to random geometric graphs on the real line, if clustering is sufficiently strong. Large numbers of triangles, homogeneously distributed across all nodes as in real networks, are thus a consequence of network geometricity. The methods we use to prove this are quite general and applicable to other network ensembles, geometric or not, and to certain problems in quantum gravity. PMID:27258887
NASA Astrophysics Data System (ADS)
Lee, J. H.; Yoon, H.; Kitanidis, P. K.; Werth, C. J.; Valocchi, A. J.
2015-12-01
Characterizing subsurface properties, particularly hydraulic conductivity, is crucial for reliable and cost-effective groundwater supply management, contaminant remediation, and emerging deep subsurface activities such as geologic carbon storage and unconventional resources recovery. With recent advances in sensor technology, a large volume of hydro-geophysical and chemical data can be obtained to achieve high-resolution images of subsurface properties, which can be used for accurate subsurface flow and reactive transport predictions. However, subsurface characterization with a plethora of information requires high, often prohibitive, computational costs associated with "big data" processing and large-scale numerical simulations. As a result, traditional inversion techniques are not well-suited for problems that require coupled multi-physics simulation models with massive data. In this work, we apply a scalable inversion method called Principal Component Geostatistical Approach (PCGA) for characterizing heterogeneous hydraulic conductivity (K) distribution in a 3-D sand box. The PCGA is a Jacobian-free geostatistical inversion approach that uses the leading principal components of the prior information to reduce computational costs, sometimes dramatically, and can be easily linked with any simulation software. Sequential images of transient tracer concentrations in the sand box were obtained using magnetic resonance imaging (MRI) technique, resulting in 6 million tracer-concentration data [Yoon et. al., 2008]. Since each individual tracer observation has little information on the K distribution, the dimension of the data was reduced using temporal moments and discrete cosine transform (DCT). Consequently, 100,000 unknown K values consistent with the scale of MRI data (at a scale of 0.25^3 cm^3) were estimated by matching temporal moments and DCT coefficients of the original tracer data. Estimated K fields are close to the true K field, and even small
Unconventional methods for clustering
NASA Astrophysics Data System (ADS)
Kotyrba, Martin
2016-06-01
Cluster analysis or clustering is a task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is the main task of exploratory data mining and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. The topic of this paper is one of the modern methods of clustering namely SOM (Self Organising Map). The paper describes the theory needed to understand the principle of clustering and descriptions of algorithm used with clustering in our experiments.
Robust Multi-Network Clustering via Joint Cross-Domain Cluster Alignment
Liu, Rui; Cheng, Wei; Tong, Hanghang; Wang, Wei; Zhang, Xiang
2016-01-01
Network clustering is an important problem that has recently drawn a lot of attentions. Most existing work focuses on clustering nodes within a single network. In many applications, however, there exist multiple related networks, in which each network may be constructed from a different domain and instances in one domain may be related to instances in other domains. In this paper, we propose a robust algorithm, MCA, for multi-network clustering that takes into account cross-domain relationships between instances. MCA has several advantages over the existing single network clustering methods. First, it is able to detect associations between clusters from different domains, which, however, is not addressed by any existing methods. Second, it achieves more consistent clustering results on multiple networks by leveraging the duality between clustering individual networks and inferring cross-network cluster alignment. Finally, it provides a multi-network clustering solution that is more robust to noise and errors. We perform extensive experiments on a variety of real and synthetic networks to demonstrate the effectiveness and efficiency of MCA.
THE STELLAR MASS GROWTH OF BRIGHTEST CLUSTER GALAXIES IN THE IRAC SHALLOW CLUSTER SURVEY
Lin, Yen-Ting; Brodwin, Mark; Gonzalez, Anthony H.; Bode, Paul; Eisenhardt, Peter R. M.; Stanford, S. A.; Vikhlinin, Alexey
2013-07-01
The details of the stellar mass assembly of brightest cluster galaxies (BCGs) remain an unresolved problem in galaxy formation. We have developed a novel approach that allows us to construct a sample of clusters that form an evolutionary sequence, and have applied it to the Spitzer IRAC Shallow Cluster Survey (ISCS) to examine the evolution of BCGs in progenitors of present-day clusters with mass of (2.5-4.5) Multiplication-Sign 10{sup 14} M{sub Sun }. We follow the cluster mass growth history extracted from a high resolution cosmological simulation, and then use an empirical method that infers the cluster mass based on the ranking of cluster luminosity to select high-z clusters of appropriate mass from ISCS to be progenitors of the given set of z = 0 clusters. We find that, between z = 1.5 and 0.5, the BCGs have grown in stellar mass by a factor of 2.3, which is well-matched by the predictions from a state-of-the-art semi-analytic model. Below z = 0.5 we see hints of differences in behavior between the model and observation.
The cluster graphical lasso for improved estimation of Gaussian graphical models
Tan, Kean Ming; Witten, Daniela; Shojaie, Ali
2015-01-01
The task of estimating a Gaussian graphical model in the high-dimensional setting is considered. The graphical lasso, which involves maximizing the Gaussian log likelihood subject to a lasso penalty, is a well-studied approach for this task. A surprising connection between the graphical lasso and hierarchical clustering is introduced: the graphical lasso in effect performs a two-step procedure, in which (1) single linkage hierarchical clustering is performed on the variables in order to identify connected components, and then (2) a penalized log likelihood is maximized on the subset of variables within each connected component. Thus, the graphical lasso determines the connected components of the estimated network via single linkage clustering. The single linkage clustering is known to perform poorly in certain finite-sample settings. Therefore, the cluster graphical lasso, which involves clustering the features using an alternative to single linkage clustering, and then performing the graphical lasso on the subset of variables within each cluster, is proposed. Model selection consistency for this technique is established, and its improved performance relative to the graphical lasso is demonstrated in a simulation study, as well as in applications to a university webpage and a gene expression data sets. PMID:25642008
Full Text Clustering and Relationship Network Analysis of Biomedical Publications
Guan, Renchu; Yang, Chen; Marchese, Maurizio; Liang, Yanchun; Shi, Xiaohu
2014-01-01
Rapid developments in the biomedical sciences have increased the demand for automatic clustering of biomedical publications. In contrast to current approaches to text clustering, which focus exclusively on the contents of abstracts, a novel method is proposed for clustering and analysis of complete biomedical article texts. To reduce dimensionality, Cosine Coefficient is used on a sub-space of only two vectors, instead of computing the Euclidean distance within the space of all vectors. Then a strategy and algorithm is introduced for Semi-supervised Affinity Propagation (SSAP) to improve analysis efficiency, using biomedical journal names as an evaluation background. Experimental results show that by avoiding high-dimensional sparse matrix computations, SSAP outperforms conventional k-means methods and improves upon the standard Affinity Propagation algorithm. In constructing a directed relationship network and distribution matrix for the clustering results, it can be noted that overlaps in scope and interests among BioMed publications can be easily identified, providing a valuable analytical tool for editors, authors and readers. PMID:25250864
Full text clustering and relationship network analysis of biomedical publications.
Guan, Renchu; Yang, Chen; Marchese, Maurizio; Liang, Yanchun; Shi, Xiaohu
2014-01-01
Rapid developments in the biomedical sciences have increased the demand for automatic clustering of biomedical publications. In contrast to current approaches to text clustering, which focus exclusively on the contents of abstracts, a novel method is proposed for clustering and analysis of complete biomedical article texts. To reduce dimensionality, Cosine Coefficient is used on a sub-space of only two vectors, instead of computing the Euclidean distance within the space of all vectors. Then a strategy and algorithm is introduced for Semi-supervised Affinity Propagation (SSAP) to improve analysis efficiency, using biomedical journal names as an evaluation background. Experimental results show that by avoiding high-dimensional sparse matrix computations, SSAP outperforms conventional k-means methods and improves upon the standard Affinity Propagation algorithm. In constructing a directed relationship network and distribution matrix for the clustering results, it can be noted that overlaps in scope and interests among BioMed publications can be easily identified, providing a valuable analytical tool for editors, authors and readers. PMID:25250864
Fusion and clustering algorithms for spatial data
NASA Astrophysics Data System (ADS)
Kuntala, Pavani
Spatial clustering is an approach for discovering groups of related data points in spatial data. Spatial clustering has attracted a lot of research attention due to various applications where it is needed. It holds practical importance in application domains such as geographic knowledge discovery, sensors, rare disease discovery, astronomy, remote sensing, and so on. The motivation for this work stems from the limitations of the existing spatial clustering methods. In most conventional spatial clustering algorithms, the similarity measurement mainly considers the geometric attributes. However, in many real applications, users are concerned about both the spatial and the non-spatial attributes. In conventional spatial clustering, the input data set is partitioned into several compact regions and data points that are similar to one another in their non-spatial attributes may be scattered over different regions, thus making the corresponding objective difficult to achieve. In this dissertation, a novel clustering methodology is proposed to explore the clustering problem within both spatial and non-spatial domains by employing a fusion-based approach. The goal is to optimize a given objective function in the spatial domain, while satisfying the constraint specified in the non- spatial attribute domain. Several experiments are conducted to provide insights into the proposed methodology. The algorithm first captures the spatial cores having the highest structure and then employs an iterative, heuristic mechanism to find the optimal number of spatial cores and non-spatial clusters that exist in the data. Such a fusion-based framework allows for the handling of data streams and provides a framework for comparing spatial clusters. The correctness and efficiency of the proposed clustering model is demonstrated on real world and synthetic data sets.
Fast and accurate estimation for astrophysical problems in large databases
NASA Astrophysics Data System (ADS)
Richards, Joseph W.
2010-10-01
A recent flood of astronomical data has created much demand for sophisticated statistical and machine learning tools that can rapidly draw accurate inferences from large databases of high-dimensional data. In this Ph.D. thesis, methods for statistical inference in such databases will be proposed, studied, and applied to real data. I use methods for low-dimensional parametrization of complex, high-dimensional data that are based on the notion of preserving the connectivity of data points in the context of a Markov random walk over the data set. I show how this simple parameterization of data can be exploited to: define appropriate prototypes for use in complex mixture models, determine data-driven eigenfunctions for accurate nonparametric regression, and find a set of suitable features to use in a statistical classifier. In this thesis, methods for each of these tasks are built up from simple principles, compared to existing methods in the literature, and applied to data from astronomical all-sky surveys. I examine several important problems in astrophysics, such as estimation of star formation history parameters for galaxies, prediction of redshifts of galaxies using photometric data, and classification of different types of supernovae based on their photometric light curves. Fast methods for high-dimensional data analysis are crucial in each of these problems because they all involve the analysis of complicated high-dimensional data in large, all-sky surveys. Specifically, I estimate the star formation history parameters for the nearly 800,000 galaxies in the Sloan Digital Sky Survey (SDSS) Data Release 7 spectroscopic catalog, determine redshifts for over 300,000 galaxies in the SDSS photometric catalog, and estimate the types of 20,000 supernovae as part of the Supernova Photometric Classification Challenge. Accurate predictions and classifications are imperative in each of these examples because these estimates are utilized in broader inference problems
Modeling Clustered Data with Very Few Clusters.
McNeish, Daniel; Stapleton, Laura M
2016-01-01
Small-sample inference with clustered data has received increased attention recently in the methodological literature, with several simulation studies being presented on the small-sample behavior of many methods. However, nearly all previous studies focus on a single class of methods (e.g., only multilevel models, only corrections to sandwich estimators), and the differential performance of various methods that can be implemented to accommodate clustered data with very few clusters is largely unknown, potentially due to the rigid disciplinary preferences. Furthermore, a majority of these studies focus on scenarios with 15 or more clusters and feature unrealistically simple data-generation models with very few predictors. This article, motivated by an applied educational psychology cluster randomized trial, presents a simulation study that simultaneously addresses the extreme small sample and differential performance (estimation bias, Type I error rates, and relative power) of 12 methods to account for clustered data with a model that features a more realistic number of predictors. The motivating data are then modeled with each method, and results are compared. Results show that generalized estimating equations perform poorly; the choice of Bayesian prior distributions affects performance; and fixed effect models perform quite well. Limitations and implications for applications are also discussed. PMID:27269278
Clusters of polyhedra in spherical confinement
Teich, Erin G.; van Anders, Greg; Klotsa, Daphne; Dshemuchadse, Julia; Glotzer, Sharon C.
2016-01-01
Dense particle packing in a confining volume remains a rich, largely unexplored problem, despite applications in blood clotting, plasmonics, industrial packaging and transport, colloidal molecule design, and information storage. Here, we report densest found clusters of the Platonic solids in spherical confinement, for up to N=60 constituent polyhedral particles. We examine the interplay between anisotropic particle shape and isotropic 3D confinement. Densest clusters exhibit a wide variety of symmetry point groups and form in up to three layers at higher N. For many N values, icosahedra and dodecahedra form clusters that resemble sphere clusters. These common structures are layers of optimal spherical codes in most cases, a surprising fact given the significant faceting of the icosahedron and dodecahedron. We also investigate cluster density as a function of N for each particle shape. We find that, in contrast to what happens in bulk, polyhedra often pack less densely than spheres. We also find especially dense clusters at so-called magic numbers of constituent particles. Our results showcase the structural diversity and experimental utility of families of solutions to the packing in confinement problem. PMID:26811458
Clusters of polyhedra in spherical confinement.
Teich, Erin G; van Anders, Greg; Klotsa, Daphne; Dshemuchadse, Julia; Glotzer, Sharon C
2016-02-01
Dense particle packing in a confining volume remains a rich, largely unexplored problem, despite applications in blood clotting, plasmonics, industrial packaging and transport, colloidal molecule design, and information storage. Here, we report densest found clusters of the Platonic solids in spherical confinement, for up to [Formula: see text] constituent polyhedral particles. We examine the interplay between anisotropic particle shape and isotropic 3D confinement. Densest clusters exhibit a wide variety of symmetry point groups and form in up to three layers at higher N. For many N values, icosahedra and dodecahedra form clusters that resemble sphere clusters. These common structures are layers of optimal spherical codes in most cases, a surprising fact given the significant faceting of the icosahedron and dodecahedron. We also investigate cluster density as a function of N for each particle shape. We find that, in contrast to what happens in bulk, polyhedra often pack less densely than spheres. We also find especially dense clusters at so-called magic numbers of constituent particles. Our results showcase the structural diversity and experimental utility of families of solutions to the packing in confinement problem. PMID:26811458
NASA Astrophysics Data System (ADS)
Speegle, Darrin; Steward, Robert
2015-08-01
We propose a semiparametric approach to infer the existence of and estimate the location of a statistical change-point to a nonlinear high dimensional time series contaminated with an additive noise component. In particular, we consider a p―dimensional stochastic process of independent multivariate normal observations where the mean function varies smoothly except at a single change-point. Our approach first involves a dimension reduction of the original time series through a random matrix multiplication. Next, we conduct a Bayesian analysis on the empirical detail coefficients of this dimensionally reduced time series after a wavelet transform. We also present a means to associate confidence bounds to the conclusions of our results. Aside from being computationally efficient and straight forward to implement, the primary advantage of our methods is seen in how these methods apply to a much larger class of time series whose mean functions are subject to only general smoothness conditions.
Cosgun, Erdal; Limdi, Nita A.; Duarte, Christine W.
2011-01-01
Motivation: With complex traits and diseases having potential genetic contributions of thousands of genetic factors, and with current genotyping arrays consisting of millions of single nucleotide polymorphisms (SNPs), powerful high-dimensional statistical techniques are needed to comprehensively model the genetic variance. Machine learning techniques have many advantages including lack of parametric assumptions, and high power and flexibility. Results: We have applied three machine learning approaches: Random Forest Regression (RFR), Boosted Regression Tree (BRT) and Support Vector Regression (SVR) to the prediction of warfarin maintenance dose in a cohort of African Americans. We have developed a multi-step approach that selects SNPs, builds prediction models with different subsets of selected SNPs along with known associated genetic and environmental variables and tests the discovered models in a cross-validation framework. Preliminary results indicate that our modeling approach gives much higher accuracy than previous models for warfarin dose prediction. A model size of 200 SNPs (in addition to the known genetic and environmental variables) gives the best accuracy. The R2 between the predicted and actual square root of warfarin dose in this model was on average 66.4% for RFR, 57.8% for SVR and 56.9% for BRT. Thus RFR had the best accuracy, but all three techniques achieved better performance than the current published R2 of 43% in a sample of mixed ethnicity, and 27% in an African American sample. In summary, machine learning approaches for high-dimensional pharmacogenetic prediction, and for prediction of clinical continuous traits of interest, hold great promise and warrant further research. Contact: cduarte@uab.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:21450715
NASA Astrophysics Data System (ADS)
Finley, A. O.; Banerjee, S.; Cook, B. D.
2010-12-01
Recent advances in remote sensing, specifically waveform Light Detection and Ranging (LiDAR) sensors, provide the data needed to quantify forest variables at a fine spatial resolution over large domains. Of particular interest is LiDAR data from NASA's Laser Vegetation Imaging Sensor (LVIS), upcoming Deformation, Ecosystem Structure, and Dynamics of Ice (DESDynI) missions, and NSF's National Ecological Observatory Network planned Airborne Observation Platform. A central challenge to using these data is to couple field measurements of forest variables (e.g., species, indices of structural complexity, light competition, or drought stress) with the high-dimensional LiDAR signal through a model, which allows prediction of the tree-level variables at locations where only the remotely sensed data area are available. It is common to model the high-dimensional signal vector as a mixture of a relatively small number of Gaussian distributions. The parameters from these Gaussian distributions, or indices derived from the parameters, can then be used as regressors in a regression model. These approaches retain only a small amount of information contained in the signal. Further, it is not known a priori which features of the signal explain the most variability in the response variables. It is possible to fully exploit the information in the signal by treating it as an object, thus, we define a framework to couple a spatial latent factor model with forest variables using a fully Bayesian functional spatial data analysis. Our proposed modeling framework explicitly: 1) reduces the dimensionality of signals in an optimal way (i.e., preserves the information that describes the maximum variability in response variable); 2) propagates uncertainty in data and parameters through to prediction, and; 3) acknowledges and leverages spatial dependence among the regressors and model residuals to meet statistical assumptions and improve prediction. The proposed modeling framework is
Formation and Assembly of Massive Star Clusters
NASA Astrophysics Data System (ADS)
McMillan, Stephen
The formation of stars and star clusters is a major unresolved problem in astrophysics. It is central to modeling stellar populations and understanding galaxy luminosity distributions in cosmological models. Young massive clusters are major components of starburst galaxies, while globular clusters are cornerstones of the cosmic distance scale and represent vital laboratories for studies of stellar dynamics and stellar evolution. Yet how these clusters form and how rapidly and efficiently they expel their natal gas remain unclear, as do the consequences of this gas expulsion for cluster structure and survival. Also unclear is how the properties of low-mass clusters, which form from small-scale instabilities in galactic disks and inform much of our understanding of cluster formation and star-formation efficiency, differ from those of more massive clusters, which probably formed in starburst events driven by fast accretion at high redshift, or colliding gas flows in merging galaxies. Modeling cluster formation requires simulating many simultaneous physical processes, placing stringent demands on both software and hardware. Simulations of galaxies evolving in cosmological contexts usually lack the numerical resolution to simulate star formation in detail. They do not include detailed treatments of important physical effects such as magnetic fields, radiation pressure, ionization, and supernova feedback. Simulations of smaller clusters include these effects, but fall far short of the mass of even single young globular clusters. With major advances in computing power and software, we can now directly address this problem. We propose to model the formation of massive star clusters by integrating the FLASH adaptive mesh refinement magnetohydrodynamics (MHD) code into the Astrophysical Multi-purpose Software Environment (AMUSE) framework, to work with existing stellar-dynamical and stellar evolution modules in AMUSE. All software will be freely distributed on-line, allowing
Measures of between-cluster variability in cluster randomized trials with binary outcomes.
Thomson, Andrew; Hayes, Richard; Cousens, Simon
2009-05-30
Cluster randomized trials (CRTs) are increasingly used to evaluate the effectiveness of health-care interventions. A key feature of CRTs is that the observations on individuals within clusters are correlated as a result of between-cluster variability. Sample size formulae exist which account for such correlations, but they make different assumptions regarding the between-cluster variability in the intervention arm of a trial, resulting in different sample size estimates. We explore the relationship for binary outcome data between two common measures of between-cluster variability: k, the coefficient of variation and rho, the intracluster correlation coefficient. We then assess how the assumptions of constant k or rho across treatment arms correspond to different assumptions about intervention effects. We assess implications for sample size estimation and present a simple solution to the problems outlined. PMID:19378266
A network-assisted co-clustering algorithm to discover cancer subtypes based on gene expression
2014-01-01
Background Cancer subtype information is critically important for understanding tumor heterogeneity. Existing methods to identify cancer subtypes have primarily focused on utilizing generic clustering algorithms (such as hierarchical clustering) to identify subtypes based on gene expression data. The network-level interaction among genes, which is key to understanding the molecular perturbations in cancer, has been rarely considered during the clustering process. The motivation of our work is to develop a method that effectively incorporates molecular interaction networks into the clustering process to improve cancer subtype identification. Results We have developed a new clustering algorithm for cancer subtype identification, called “network-assisted co-clustering for the identification of cancer subtypes” (NCIS). NCIS combines gene network information to simultaneously group samples and genes into biologically meaningful clusters. Prior to clustering, we assign weights to genes based on their impact in the network. Then a new weighted co-clustering algorithm based on a semi-nonnegative matrix tri-factorization is applied. We evaluated the effectiveness of NCIS on simulated datasets as well as large-scale Breast Cancer and Glioblastoma Multiforme patient samples from The Cancer Genome Atlas (TCGA) project. NCIS was shown to better separate the patient samples into clinically distinct subtypes and achieve higher accuracy on the simulated datasets to tolerate noise, as compared to consensus hierarchical clustering. Conclusions The weighted co-clustering approach in NCIS provides a unique solution to incorporate gene network information into the clustering process. Our tool will be useful to comprehensively identify cancer subtypes that would otherwise be obscured by cancer heterogeneity, using high-throughput and high-dimensional gene expression data. PMID:24491042
Topic modeling for cluster analysis of large biological and medical datasets
2014-01-01
Background The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyper dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. Results In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: Salmonella pulsed-field gel electrophoresis (PFGE) dataset, lung cancer dataset, and breast cancer dataset, which represent various types of large biological or medical datasets. All three various methods are shown to improve the efficacy/effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Conclusion Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than
Clustering of complex shaped data sets via Kohonen maps and mathematical morphology
NASA Astrophysics Data System (ADS)
Ferreira Costa, Jose A.; de Andrade Netto, Marcio L.
2001-03-01
Clustering is the process of discovering groups within the data, based on similarities, with a minimal, if any, knowledge of their structure. The self-organizing (or Kohonen) map (SOM) is one of the best known neural network algorithms. It has been widely studied as a software tool for visualization of high-dimensional data. Important features include information compression while preserving topological and metric relationship of the primary data items. Although Kohonen maps had been applied for clustering data, usually the researcher sets the number of neurons equal to the expected number of clusters, or manually segments a two-dimensional map using some a-priori knowledge of the data. This paper proposes techniques for automatic partitioning and labeling SOM networks in clusters of neurons that may be used to represent the data clusters. Mathematical morphology operations, such as watershed, are performed on the U-matrix, which is a neuron-distance image. The direct application of watershed leads to an oversegmented image. It is used markers to identify significant clusters and homotopy modification to suppress the others. Markers are automatically found by performing a multilevel scan of connected regions of the U-matrix. Each cluster of neurons is a sub-graph that defines, in the input space, complex and non-parametric geometries which approximately describes the shape of the clusters. The process of map partitioning is extended recursively. Each cluster of neurons gives rise to a new map, which are trained with the subset of data that were classified to it. The algorithm produces dynamically a hierarchical tree of maps, which explains the cluster's structure in levels of granularity. The distributed and multiple prototypes cluster representation enables the discoveries of clusters even in the case when we have two or more non-separable pattern classes.