Clustering high dimensional data using RIA
NASA Astrophysics Data System (ADS)
Aziz, Nazrina
2015-05-01
Clustering may simply represent a convenient method for organizing a large data set so that it can easily be understood and information can efficiently be retrieved. However, identifying clusters in high-dimensional data sets is a difficult task because of the curse of dimensionality. Another challenge in clustering is that some traditional functions cannot capture the pattern dissimilarity among objects. In this article, we use an alternative dissimilarity measurement called Robust Influence Angle (RIA) in the partitioning method. RIA is developed using the eigenstructure of the covariance matrix and robust principal component scores. We find that it can obtain clusters easily and hence avoid the curse of dimensionality. It also manages to cluster large data sets with mixed numeric and categorical values.
NASA Technical Reports Server (NTRS)
Srivastava, Ashok, N.; Akella, Ram; Diev, Vesselin; Kumaresan, Sakthi Preethi; McIntosh, Dawn M.; Pontikakis, Emmanuel D.; Xu, Zuobing; Zhang, Yi
2006-01-01
This paper describes the results of a significant research and development effort conducted at NASA Ames Research Center to develop new text mining techniques to discover anomalies in free-text reports regarding system health and safety of two aerospace systems. We discuss two problems of significant importance in the aviation industry. The first problem is that of automatic anomaly discovery about an aerospace system through the analysis of tens of thousands of free-text problem reports that are written about the system. The second problem that we address is that of automatic discovery of recurring anomalies, i.e., anomalies that may be described in different ways by different authors, at varying times and under varying conditions, but that are truly about the same part of the system. The intent of recurring anomaly identification is to determine project or system weakness or high-risk issues. The discovery of recurring anomalies is a key goal in building safe, reliable, and cost-effective aerospace systems. We address the anomaly discovery problem on thousands of free-text reports using two strategies: (1) as an unsupervised learning problem where an algorithm takes free-text reports as input and automatically groups them into different bins, where each bin corresponds to a different unknown anomaly category; and (2) as a supervised learning problem where the algorithm classifies the free-text reports into one of a number of known anomaly categories. We then discuss the application of these methods to the problem of discovering recurring anomalies. In fact, the special nature of recurring anomalies (very small cluster sizes) requires incorporating new methods and measures to enhance the original approach for anomaly detection.
Adaptive dimension reduction for clustering high dimensional data
Ding, Chris; He, Xiaofeng; Zha, Hongyuan; Simon, Horst
2002-10-01
It is well known that for high dimensional data clustering, standard algorithms such as EM and K-means are often trapped in local minima. Many initialization methods were proposed to tackle this problem, but with only limited success. In this paper they propose a new approach to resolve this problem by repeated dimension reductions such that K-means or EM are performed only in very low dimensions. Cluster membership is utilized as a bridge between the reduced dimensional sub-space and the original space, providing flexibility and ease of implementation. Clustering analysis performed on highly overlapped Gaussians, DNA gene expression profiles and internet newsgroups demonstrates the effectiveness of the proposed algorithm.
Dimensionality Reduction Particle Swarm Algorithm for High Dimensional Clustering
Cui, Xiaohui; ST Charles, Jesse Lee; Potok, Thomas E; Beaver, Justin M
2008-01-01
The Particle Swarm Optimization (PSO) clustering algorithm can generate more compact clustering results than the traditional K-means clustering algorithm. However, when clustering high dimensional datasets, the PSO clustering algorithm is notoriously slow because its computation cost increases exponentially with the size of the dataset dimension. Dimensionality reduction techniques offer solutions that both significantly improve the computation time, and yield reasonably accurate clustering results in high dimensional data analysis. In this paper, we introduce research that combines different dimensionality reduction techniques with the PSO clustering algorithm in order to reduce the complexity of high dimensional datasets and speed up the PSO clustering process. We report significant improvements in total runtime. Moreover, the clustering accuracy of the dimensionality reduction PSO clustering algorithm is comparable to the one that uses full dimension space.
Accelerating high-dimensional clustering with lossless data reduction.
Qaqish, Bahjat F; O'Brien, Jonathon J; Hibbard, Jonathan C; Clowers, Katie J
2017-09-15
For cluster analysis, high-dimensional data are associated with instability, decreased classification accuracy and high computational burden. The latter challenge can be eliminated as a serious concern. For applications where dimension reduction techniques are not implemented, we propose a temporary transformation which accelerates computations with no loss of information. The algorithm can be applied for any statistical procedure depending only on Euclidean distances and can be implemented sequentially to enable analyses of data that would otherwise exceed memory limitations. The method is easily implemented in common statistical software as a standard pre-processing step. The benefit of our algorithm grows with the dimensionality of the problem and the complexity of the analysis. Consequently, our simple algorithm not only decreases the computation time for routine analyses, it opens the door to performing calculations that may have otherwise been too burdensome to attempt. R, Matlab and SAS/IML code for implementing lossless data reduction is freely available in the Appendix. Contact: obrienj@hms.harvard.edu.
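The idea of a distance-preserving reduction can be illustrated with a small sketch. This is not the authors' R, Matlab or SAS/IML code; it is one possible construction, under the assumption that there are far fewer observations than variables. It expresses each row in an orthonormal basis of the row space, found by modified Gram-Schmidt, so that all pairwise Euclidean distances are unchanged. The names `reduce_rows`, `dot` and `dist` are hypothetical helpers introduced here.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def dist(u, v):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def reduce_rows(X, tol=1e-12):
    """Map n rows of length p to coordinates of length at most n.

    An orthonormal basis of the row space is built by modified
    Gram-Schmidt. Every row lies in that space, so the returned
    coordinates preserve pairwise Euclidean distances up to rounding.
    """
    basis = []
    for row in X:
        r = list(row)
        for q in basis:
            c = dot(r, q)
            r = [a - c * b for a, b in zip(r, q)]
        norm = math.sqrt(dot(r, r))
        scale = max(1.0, math.sqrt(dot(row, row)))
        if norm > tol * scale:  # skip rows already spanned by the basis
            basis.append([a / norm for a in r])
    return [[dot(row, q) for q in basis] for row in X]


# Three observations in six dimensions reduce to three coordinates each.
X = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [2.0, 0.0, 1.0, 0.0, 3.0, 1.0],
    [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
]
Z = reduce_rows(X)
```

Any procedure that depends only on Euclidean distances, such as K-means or hierarchical clustering, could then be run on `Z` instead of `X`.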
Semi-supervised high-dimensional clustering by tight wavelet frames
NASA Astrophysics Data System (ADS)
Dong, Bin; Hao, Ning
2015-08-01
High-dimensional clustering arises frequently in many areas of the natural sciences, technical disciplines and social media. In this paper, we consider the problem of binary clustering of high-dimensional data, i.e. classification of a data set into 2 classes. We assume that the correct (or mostly correct) classification of a small portion of the given data is known. Based on such partial classification, we design optimization models that complete the clustering of the entire data set using the recently introduced tight wavelet frames on graphs. Numerical experiments of the proposed models applied to some real data sets are conducted. In particular, the performance of the models on some very high-dimensional data sets is examined, and combinations of the models with some existing dimension reduction techniques are also considered.
Modification of DIRECT for high-dimensional design problems
NASA Astrophysics Data System (ADS)
Tavassoli, Arash; Haji Hajikolaei, Kambiz; Sadeqi, Soheil; Wang, G. Gary; Kjeang, Erik
2014-06-01
DIviding RECTangles (DIRECT), as a well-known derivative-free global optimization method, has been found to be effective and efficient for low-dimensional problems. When facing high-dimensional black-box problems, however, DIRECT's performance deteriorates. This work proposes a series of modifications to DIRECT for high-dimensional problems (dimensionality d>10). The principal idea is to increase the convergence speed by breaking its single initialization-to-convergence approach into several more intricate steps. Specifically, starting with the entire feasible area, the search domain will shrink gradually and adaptively to the region enclosing the potential optimum. Several stopping criteria have been introduced to avoid premature convergence. A diversification subroutine has also been developed to prevent the algorithm from being trapped in local minima. The proposed approach is benchmarked using nine standard high-dimensional test functions and one black-box engineering problem. All these tests show a significant efficiency improvement over the original DIRECT for high-dimensional design problems.
High dimensional model representation (HDMR) with clustering for image retrieval
NASA Astrophysics Data System (ADS)
Karcılı, Ayşegül; Tunga, Burcu
2017-01-01
Image retrieval continues to hold an important place in today's extremely fast growing technology. In this field, accurate image retrieval at high speed is critical. In this study, to address this issue, we developed a novel method based on the High Dimensional Model Representation (HDMR) philosophy. HDMR is a decomposition method used to solve different scientific problems. To test the performance of the new method we used the Columbia Object Image Library (COIL100) and obtained encouraging results. These results are given in the findings section.
Model-based Clustering of High-Dimensional Data in Astrophysics
NASA Astrophysics Data System (ADS)
Bouveyron, C.
2016-05-01
The nature of data in Astrophysics has changed, as in other scientific fields, in the past decades due to the increase of measurement capabilities. As a consequence, data are nowadays frequently of high dimensionality and available in mass or stream. Model-based techniques for clustering are popular tools which are renowned for their probabilistic foundations and their flexibility. However, classical model-based techniques show a disappointing behavior in high-dimensional spaces, which is mainly due to their dramatic over-parametrization. The recent developments in model-based classification overcome these drawbacks and allow efficient classification of high-dimensional data, even in the "small n / large p" situation. This work presents a comprehensive review of these recent approaches, including regularization-based techniques, parsimonious modeling, subspace classification methods and classification methods based on variable selection. The use of these model-based methods is also illustrated on real-world classification problems in Astrophysics using R packages.
Visualization of high-dimensional clusters using nonlinear magnification
Keahey, T.A.
1998-12-31
This paper describes a cluster visualization system used for data-mining fraud detection. The system can simultaneously show 6 dimensions of data, and a unique technique of 3D nonlinear magnification allows individual clusters of data points to be magnified while still maintaining a view of the global context. The author first describes the fraud detection problem, along with the data which is to be visualized. Then he describes general characteristics of the visualization system, and shows how nonlinear magnification can be used in this system. Finally he concludes and describes options for further work.
A multistage mathematical approach to automated clustering of high-dimensional noisy data
Friedman, Alexander; Keselman, Michael D.; Gibb, Leif G.; Graybiel, Ann M.
2015-01-01
A critical problem faced in many scientific fields is the adequate separation of data derived from individual sources. Often, such datasets require analysis of multiple features in a highly multidimensional space, with overlap of features and sources. The datasets generated by simultaneous recording from hundreds of neurons emitting phasic action potentials have produced the challenge of separating the recorded signals into independent data subsets (clusters) corresponding to individual signal-generating neurons. Mathematical methods have been developed over the past three decades to achieve such spike clustering, but a complete solution with fully automated cluster identification has not been achieved. We propose here a fully automated mathematical approach that identifies clusters in multidimensional space through recursion, which combats the multidimensionality of the data. Recursion is paired with an approach to dimensional evaluation, in which each dimension of a dataset is examined for its informational importance for clustering. The dimensions offering greater informational importance are given added weight during recursive clustering. To combat strong background activity, our algorithm takes an iterative approach of data filtering according to a signal-to-noise ratio metric. The algorithm finds cluster cores, which are thereafter expanded to include complete clusters. This mathematical approach can be extended from its prototype context of spike sorting to other datasets that suffer from high dimensionality and background activity. PMID:25831512
NASA Astrophysics Data System (ADS)
Manukyan, N.; Eppstein, M. J.; Rizzo, D. M.
2011-12-01
data to demonstrate how the proposed methods facilitate automatic identification and visualization of clusters in real-world, high-dimensional biogeochemical data with complex relationships. The proposed methods are quite general and are applicable to a wide range of geophysical problems. [1] Pearce, A., Rizzo, D., and Mouser, P., "Subsurface characterization of groundwater contaminated by landfill leachate using microbial community profile data and a nonparametric decision-making process", Water Resources Research, 47:W06511, 11 pp, 2011. [2] Mouser, P., Rizzo, D., Druschel, G., Morales, S, O'Grady, P., Hayden, N., Stevens, L., "Enhanced detection of groundwater contamination from a leaking waste disposal site by microbial community profiles", Water Resources Research, 46:W12506, 12 pp., 2010.
Banerjee, Arindam; Ghosh, Joydeep
2004-05-01
Competitive learning mechanisms for clustering, in general, suffer from poor performance for very high-dimensional (>1000) data because of "curse of dimensionality" effects. In applications such as document clustering, it is customary to normalize the high-dimensional input vectors to unit length, and it is sometimes also desirable to obtain balanced clusters, i.e., clusters of comparable sizes. The spherical kmeans (spkmeans) algorithm, which normalizes the cluster centers as well as the inputs, has been successfully used to cluster normalized text documents in 2000+ dimensional space. Unfortunately, like regular kmeans and its soft expectation-maximization-based version, spkmeans tends to generate extremely imbalanced clusters in high-dimensional spaces when the desired number of clusters is large (tens or more). This paper first shows that the spkmeans algorithm can be derived from a certain maximum likelihood formulation using a mixture of von Mises-Fisher distributions as the generative model, and in fact, it can be considered as a batch-mode version of (normalized) competitive learning. The proposed generative model is then adapted in a principled way to yield three frequency-sensitive competitive learning variants that are applicable to static data and produce high-quality and well-balanced clusters for high-dimensional data. Like kmeans, each iteration is linear in the number of data points and in the number of clusters for all three algorithms. A frequency-sensitive algorithm to cluster streaming data is also proposed. Experimental results on clustering of high-dimensional text data sets are provided to show the effectiveness and applicability of the proposed techniques. Index Terms: Balanced clustering, expectation maximization (EM), frequency-sensitive competitive learning (FSCL), high-dimensional clustering, kmeans, normalized data, scalable clustering, streaming data, text clustering.
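As a rough illustration of the baseline described above, the following is a minimal sketch of plain spherical kmeans. It normalizes inputs and centers to unit length and assigns each point to the center with the largest cosine similarity. It does not implement the paper's frequency-sensitive variants or any balancing. The function names, the fixed iteration count and the caller-supplied initial centers are assumptions made for this sketch.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(a * a for a in v))
    return [a / n for a in v] if n > 0 else list(v)


def spkmeans(points, centers, iters=20):
    """Plain spherical kmeans: cosine-similarity assignment, unit-norm centers."""
    X = [normalize(p) for p in points]
    C = [normalize(c) for c in centers]
    labels = []
    for _ in range(iters):
        # Assignment step: pick the center with the largest dot product.
        labels = [
            max(range(len(C)), key=lambda k: sum(a * b for a, b in zip(x, C[k])))
            for x in X
        ]
        # Update step: each center becomes the normalized sum of its members.
        for k in range(len(C)):
            members = [x for x, lab in zip(X, labels) if lab == k]
            if members:
                C[k] = normalize([sum(col) for col in zip(*members)])
    return labels, C


labels, centers = spkmeans(
    [[1.0, 0.1], [2.0, 0.1], [0.1, 1.0], [0.2, 3.0]],
    [[1.0, 0.0], [0.0, 1.0]],
)
```

Because only directions matter after normalization, the points `[1.0, 0.1]` and `[2.0, 0.1]` end up in the same cluster despite their different lengths.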
Effects of dependence in high-dimensional multiple testing problems
Kim, Kyung In; van de Wiel, Mark A
2008-01-01
Background: We consider effects of dependence among variables of high-dimensional data in multiple hypothesis testing problems, in particular the False Discovery Rate (FDR) control procedures. Recent simulation studies consider only simple correlation structures among variables, which is hardly inspired by real data features. Our aim is to systematically study effects of several network features like sparsity and correlation strength by imposing dependence structures among variables using random correlation matrices. Results: We study the robustness against dependence of several FDR procedures that are popular in microarray studies, such as Benjamini-Hochberg FDR, Storey's q-value, SAM and resampling based FDR procedures. False Non-discovery Rates and estimates of the number of null hypotheses are computed from those methods and compared. Our simulation study shows that methods such as SAM and the q-value do not adequately control the FDR to the level claimed under dependence conditions. On the other hand, the adaptive Benjamini-Hochberg procedure seems to be most robust while remaining conservative. Finally, the estimates of the number of true null hypotheses under various dependence conditions are variable. Conclusion: We discuss a new method for efficient guided simulation of dependent data, which satisfy imposed network constraints as conditional independence structures. Our simulation set-up allows for a structural study of the effect of dependencies on multiple testing criteria and is useful for testing a potentially new method on π0 or FDR estimation in a dependency context. PMID:18298808
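For readers unfamiliar with the procedures compared above, the following is a minimal sketch of the basic Benjamini-Hochberg step-up procedure. It is the standard, non-adaptive form, not the adaptive variant or the simulation set-up studied in the paper. The function name and the default level `alpha=0.05` are choices made for this sketch.

```python
def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Step-up rule: sort the m p-values, find the largest rank k with
    p_(k) <= alpha * k / m, and reject the k smallest p-values.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


# Thresholds for m = 4 are 0.0125, 0.025, 0.0375 and 0.05.
decisions = benjamini_hochberg([0.039, 0.001, 0.216, 0.008])
```

In this example the two smallest p-values (0.001 and 0.008) fall below their thresholds, while 0.039 exceeds 0.0375, so only two hypotheses are rejected.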
Clustering multiply imputed multivariate high-dimensional longitudinal profiles.
Bruckers, Liesbeth; Molenberghs, Geert; Dendale, Paul
2017-09-01
In this paper, we propose a method to cluster multivariate functional data with missing observations. Analysis of functional data often encompasses dimension reduction techniques such as principal component analysis (PCA). These techniques require complete data matrices. In this paper, the data are completed by means of multiple imputation, and subsequently each imputed data set is submitted to a cluster procedure. The final partition of the data, summarizing the partitions obtained for the imputed data sets, is obtained by means of ensemble clustering. The uncertainty in cluster membership, due to missing data, is characterized by means of the agreement between the members of the ensemble and fuzziness of the consensus clustering. The potential of the method was demonstrated on the heart failure (HF) data. Daily measurements of four biomarkers (heart rate, diastolic and systolic blood pressure, weight) were used to cluster the patients. To normalize the distributions of the longitudinal outcomes, the data were transformed with a natural logarithm function. A cubic spline base with 69 basis functions was employed to smooth the profiles. The proposed algorithm indicates the existence of a latent structure and divides the HF patients into two clusters, showing a different evolution in blood pressure values and weight. In general, cluster results are sensitive to choices made. Likewise for the proposed approach, alternative choices for the distance measure, the procedure to optimize the objective function, the scree-test threshold, or the number of principal components to be used in the approximation of the surrogate density could all influence the final partition. For the HF data set, the final partition depends on the number of principal components used in the procedure. © 2017 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Visualization of high-dimensional clusters using nonlinear magnification
NASA Astrophysics Data System (ADS)
Keahey, T. A.
1999-03-01
This paper describes a visualization system which has been used as part of a data-mining effort to detect fraud and abuse within state Medicare programs. The data-mining process generates a set of N attributes for each Medicare provider and beneficiary in the state; these attributes can be numeric, categorical, or derived from the scoring process of the data-mining routines. The attribute list can be considered as an N-dimensional space, which is subsequently partitioned into some fixed number of cluster partitions. The sparse nature of the clustered space provides room for the simultaneous visualization of more than 3 dimensions; examples in the paper will show 6-dimensional visualization. This ability to view higher dimensional data allows the data-mining researcher to compare the clustering effectiveness of the different attributes. Transparency-based rendering is also used in conjunction with filtering techniques to provide selective rendering of only those data which are of greatest interest. Nonlinear magnification techniques are used to stretch the N-dimensional space to allow focus on one or more regions of interest while still allowing a view of the global context. The magnification can either be applied globally, or in a constrained fashion to expand individual clusters within the space.
High dimensional data clustering by partitioning the hypergraphs using dense subgraph partition
NASA Astrophysics Data System (ADS)
Sun, Xili; Tian, Shoucai; Lu, Yonggang
2015-12-01
Due to the curse of dimensionality, traditional clustering methods usually fail to produce meaningful results for high dimensional data. Hypergraph partition is believed to be a promising method for dealing with this challenge. In this paper, we first construct a graph G from the data by defining an adjacency relationship between the data points using Shared Reverse k Nearest Neighbors (SRNN). Then a hypergraph is created from the graph G by defining the hyperedges to be all the maximal cliques in the graph G. After the hypergraph is produced, a powerful hypergraph partitioning method called dense subgraph partition (DSP), combined with the k-medoids method, is used to produce the final clustering results. The proposed method is evaluated on several real high-dimensional datasets, and the experimental results show that the proposed method can improve the clustering results of high dimensional data compared with applying the k-medoids method directly on the original data.
Städler, Nicolas; Dondelinger, Frank; Hill, Steven M; Akbani, Rehan; Lu, Yiling; Mills, Gordon B; Mukherjee, Sach
2017-09-15
Molecular pathways and networks play a key role in basic and disease biology. An emerging notion is that networks encoding patterns of molecular interplay may themselves differ between contexts, such as cell type, tissue or disease (sub)type. However, while statistical testing of differences in mean expression levels has been extensively studied, testing of network differences remains challenging. Furthermore, since network differences could provide important and biologically interpretable information to identify molecular subgroups, there is a need to consider the unsupervised task of learning subgroups and networks that define them. This is a nontrivial clustering problem, with neither subgroups nor subgroup-specific networks known at the outset. We leverage recent ideas from high-dimensional statistics for testing and clustering in the network biology setting. The methods we describe can be applied directly to most continuous molecular measurements and networks do not need to be specified beforehand. We illustrate the ideas and methods in a case study using protein data from The Cancer Genome Atlas (TCGA). This provides evidence that patterns of interplay between signalling proteins differ significantly between cancer types. Furthermore, we show how the proposed approaches can be used to learn subtypes and the molecular networks that define them. Availability: as the Bioconductor package nethet. Contact: staedler.n@gmail.com or sach.mukherjee@dzne.de. Supplementary data are available at Bioinformatics online.
Comparison of clustering methods for high-dimensional single-cell flow and mass cytometry data.
Weber, Lukas M; Robinson, Mark D
2016-12-01
Recent technological developments in high-dimensional flow cytometry and mass cytometry (CyTOF) have made it possible to detect expression levels of dozens of protein markers in thousands of cells per second, allowing cell populations to be characterized in unprecedented detail. Traditional data analysis by "manual gating" can be inefficient and unreliable in these high-dimensional settings, which has led to the development of a large number of automated analysis methods. Methods designed for unsupervised analysis use specialized clustering algorithms to detect and define cell populations for further downstream analysis. Here, we have performed an up-to-date, extensible performance comparison of clustering methods for high-dimensional flow and mass cytometry data. We evaluated methods using several publicly available data sets from experiments in immunology, containing both major and rare cell populations, with cell population identities from expert manual gating as the reference standard. Several methods performed well, including FlowSOM, X-shift, PhenoGraph, Rclusterpp, and flowMeans. Among these, FlowSOM had extremely fast runtimes, making this method well-suited for interactive, exploratory analysis of large, high-dimensional data sets on a standard laptop or desktop computer. These results extend previously published comparisons by focusing on high-dimensional data and including new methods developed for CyTOF data. R scripts to reproduce all analyses are available from GitHub (https://github.com/lmweber/cytometry-clustering-comparison), and pre-processed data files are available from FlowRepository (FR-FCM-ZZPH), allowing our comparisons to be extended to include new clustering methods and reference data sets. © 2016 The Authors. Cytometry Part A published by Wiley Periodicals, Inc. on behalf of ISAC.
Variational Bayesian strategies for high-dimensional, stochastic design problems
Koutsourelakis, P.S.
2016-03-01
This paper is concerned with a lesser-studied problem in the context of model-based, uncertainty quantification (UQ), that of optimization/design/control under uncertainty. The solution of such problems is hindered not only by the usual difficulties encountered in UQ tasks (e.g. the high computational cost of each forward simulation, the large number of random variables) but also by the need to solve a nonlinear optimization problem involving large numbers of design variables and potentially constraints. We propose a framework that is suitable for a class of such problems and is based on the idea of recasting them as probabilistic inference tasks. To that end, we propose a Variational Bayesian (VB) formulation and an iterative VB–Expectation-Maximization scheme that is capable of identifying a local maximum as well as a low-dimensional set of directions in the design space, along which, the objective exhibits the largest sensitivity. We demonstrate the validity of the proposed approach in the context of two numerical examples involving thousands of random and design variables. In all cases considered the cost of the computations in terms of calls to the forward model was of the order of 100 or less. The accuracy of the approximations provided is assessed by information-theoretic metrics.
Mwangi, Benson; Soares, Jair C; Hasan, Khader M
2014-10-30
Neuroimaging machine learning studies have largely utilized supervised algorithms, meaning they require both neuroimaging scan data and corresponding target variables (e.g. healthy vs. diseased) to be successfully 'trained' for a prediction task. Noticeably, this approach may not be optimal or possible when the global structure of the data is not well known and the researcher does not have an a priori model to fit the data. We set out to investigate the utility of an unsupervised machine learning technique, t-distributed stochastic neighbour embedding (t-SNE), in identifying 'unseen' sample population patterns that may exist in high-dimensional neuroimaging data. Multimodal neuroimaging scans from 92 healthy subjects were pre-processed using atlas-based methods, integrated and input into the t-SNE algorithm. Patterns and clusters discovered by the algorithm were visualized using a 2D scatter plot and further analyzed using the K-means clustering algorithm. t-SNE was evaluated against classical principal component analysis. Remarkably, based on unlabelled multimodal scan data, t-SNE separated study subjects into two very distinct clusters which corresponded to subjects' gender labels (cluster silhouette index value=0.79). The resulting clusters were used to develop an unsupervised minimum distance clustering model which correctly identified the gender of 93.5% of subjects. Notably, from a neuropsychiatric perspective this method may allow discovery of data-driven disease phenotypes or sub-types of treatment responders. Copyright © 2014 Elsevier B.V. All rights reserved.
Nuclear Potential Clustering As a New Tool to Detect Patterns in High Dimensional Datasets
NASA Astrophysics Data System (ADS)
Tonkova, V.; Paulus, D.; Neeb, H.
2013-02-01
We present a new approach for the clustering of high dimensional data without prior assumptions about the structure of the underlying distribution. The proposed algorithm is based on a concept adapted from nuclear physics. To partition the data, we model the dynamic behaviour of nucleons interacting in an N-dimensional space. An adaptive nuclear potential, comprised of a short-range attractive (strong interaction) and a long-range repulsive term (Coulomb force) is assigned to each data point. By modelling the dynamics, nucleons that are densely distributed in space fuse to build nuclei (clusters) whereas single point clusters repel each other. The formation of clusters is completed when the system reaches the state of minimal potential energy. The data are then grouped according to the particles' final effective potential energy level. The performance of the algorithm is tested with several synthetic datasets showing that the proposed method can robustly identify clusters even when complex configurations are present. Furthermore, quantitative MRI data from 43 multiple sclerosis patients were analyzed, showing a reasonable splitting into subgroups according to the individual patients' disease grade. The good performance of the algorithm on such highly correlated non-spherical datasets, which are typical for MRI derived image features, shows that Nuclear Potential Clustering is a valuable tool for automated data analysis, not only in the MRI domain.
Clustering High-Dimensional Landmark-based Two-dimensional Shape Data
Huang, Chao; Styner, Martin; Zhu, Hongtu
2015-01-01
An important goal in image analysis is to cluster and recognize objects of interest according to the shapes of their boundaries. Clustering such objects faces at least four major challenges including a curved shape space, a high-dimensional feature space, a complex spatial correlation structure, and shape variation associated with some covariates (e.g., age or gender). The aim of this paper is to develop a penalized model-based clustering framework to cluster landmark-based planar shape data, while explicitly addressing these challenges. Specifically, a mixture of offset-normal shape factor analyzers (MOSFA) is proposed with mixing proportions defined through a regression model (e.g., logistic) and an offset-normal shape distribution in each component for data in the curved shape space. A latent factor analysis model is introduced to explicitly model the complex spatial correlation. A penalized likelihood approach with both adaptive pairwise fusion Lasso penalty function and L2 penalty function is used to automatically realize variable selection via thresholding and deliver a sparse solution. Our real data analysis has confirmed the excellent finite-sample performance of MOSFA in revealing meaningful clusters in the corpus callosum shape data obtained from the Attention Deficit Hyperactivity Disorder-200 (ADHD-200) study. PMID:26604425
NASA Astrophysics Data System (ADS)
Mohammad khaninezhad, M.; Jafarpour, B.
2012-12-01
Data limitation and heterogeneity of the geologic formations introduce significant uncertainty in predicting the related flow and transport processes in these environments. Fluid flow and displacement behavior in subsurface systems is mainly controlled by the structural connectivity models that create preferential flow pathways (or barriers). The connectivity of extreme geologic features strongly constrains the evolution of the related flow and transport processes in subsurface formations. Therefore, characterization of the geologic continuity and facies connectivity is critical for reliable prediction of the flow and transport behavior. The goal of this study is to develop a robust and geologically consistent framework for solving large-scale nonlinear subsurface characterization inverse problems under uncertainty about geologic continuity and structural connectivity. We formulate a novel inverse modeling approach by adopting a sparse reconstruction perspective, which involves two major components: 1) sparse description of hydraulic property distribution under significant uncertainty in structural connectivity and 2) formulation of an effective sparsity-promoting inversion method that is robust against prior model uncertainty. To account for the significant variability in the structural connectivity, we use, as prior, multiple distinct connectivity models. For sparse/compact representation of high-dimensional hydraulic property maps, we investigate two methods. In one approach, we apply principal component analysis (PCA) to each prior connectivity model individually and combine the resulting leading components from each model to form a diverse geologic dictionary. Alternatively, we combine many realizations of the hydraulic properties from different prior connectivity models and use them to generate a diverse training dataset. We use the training dataset with a sparsifying transform, such as K-SVD, to construct a sparse geologic dictionary that is robust to
NASA Astrophysics Data System (ADS)
Franck, I. M.; Koutsourelakis, P. S.
2017-01-01
This paper is concerned with the numerical solution of model-based, Bayesian inverse problems. We are particularly interested in cases where the cost of each likelihood evaluation (forward-model call) is expensive and the number of unknown (latent) variables is high. This is the setting in many problems in computational physics where forward models with nonlinear PDEs are used and the parameters to be calibrated involve spatio-temporally varying coefficients, which upon discretization give rise to a high-dimensional vector of unknowns. One of the consequences of the well-documented ill-posedness of inverse problems is the possibility of multiple solutions. While such information is contained in the posterior density in Bayesian formulations, the discovery of a single mode, let alone multiple, poses a formidable computational task. The goal of the present paper is two-fold. On one hand, we propose approximate, adaptive inference strategies using mixture densities to capture multi-modal posteriors. On the other, we extend our work in [1] with regard to effective dimensionality reduction techniques that reveal low-dimensional subspaces where the posterior variance is mostly concentrated. We validate the proposed model by employing Importance Sampling which confirms that the bias introduced is small and can be efficiently corrected if the analyst wishes to do so. We demonstrate the performance of the proposed strategy in nonlinear elastography where the identification of the mechanical properties of biological materials can inform non-invasive, medical diagnosis. The discovery of multiple modes (solutions) in such problems is critical in achieving the diagnostic objectives.
NASA Astrophysics Data System (ADS)
Choo, Jaegul; Lee, Hanseung; Liu, Zhicheng; Stasko, John; Park, Haesun
2013-01-01
Many modern data sets, such as text and image data, can be represented in high-dimensional vector spaces and have benefited from advanced computational methods. Visual analytics approaches have contributed greatly to data understanding and analysis due to their capability of leveraging humans' ability for quick visual perception. However, visual analytics targeting large-scale data such as text and image data has been challenging due to the limited screen space in terms of both the numbers of data points and features to represent. Among various computational methods supporting visual analytics, dimension reduction and clustering have played essential roles by reducing these numbers in an intelligent way to visually manageable sizes. Given numerous dimension reduction and clustering methods available, however, the decision on the choice of algorithms and their parameters becomes difficult. In this paper, we present an interactive visual testbed system for dimension reduction and clustering in a large-scale high-dimensional data analysis. The testbed system enables users to apply various dimension reduction and clustering methods with different settings, visually compare the results from different algorithmic methods to obtain rich knowledge for the data and tasks at hand, and eventually choose the most appropriate path for a collection of algorithms and parameters. Using various data sets such as documents, images, and others that are already encoded in vectors, we demonstrate how the testbed system can support these tasks.
The feature selection bias problem in relation to high-dimensional gene data.
Krawczuk, Jerzy; Łukaszuk, Tomasz
2016-01-01
Feature selection is a technique widely used in data mining. The aim is to select the best subset of features relevant to the problem being considered. In this paper, we consider feature selection for the classification of gene datasets. Gene data is usually composed of just a few dozen objects described by thousands of features. For this kind of data, it is easy to find a model that fits the learning data. However, it is not easy to find one that performs equally well on new data as on the learning data. This overfitting issue is well known as regards classification and regression, but it also applies to feature selection. We address this problem and investigate its importance in an empirical study of four feature selection methods applied to seven high-dimensional gene datasets. We chose datasets that are well studied in the literature: colon cancer, leukemia and breast cancer. All the datasets are characterized by a significant number of features and the presence of exactly two decision classes. The feature selection methods used are ReliefF, minimum redundancy maximum relevance, support vector machine-recursive feature elimination and relaxed linear separability. Our main result reveals the existence of positive feature selection bias in all 28 experiments (7 datasets and 4 feature selection methods). Bias was calculated as the difference between validation and test accuracies and ranges from 2.6% to as much as 41.67%. The validation accuracy (biased accuracy) was calculated on the same dataset on which the feature selection was performed. The test accuracy was calculated for data that was not used for feature selection (by so-called external cross-validation). This work provides evidence that using the same dataset for feature selection and learning is not appropriate. We recommend using cross-validation for feature selection in order to reduce selection bias. Copyright © 2015 Elsevier B.V. All rights reserved.
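The difference between the biased validation accuracy and the external cross-validation accuracy described above can be illustrated with a small, self-contained sketch. It does not implement any of the four methods from the abstract: the mean-difference feature ranking, the nearest-centroid classifier, the leave-one-out scheme and all function names are assumptions chosen only to keep the example short. The data are pure noise, so any accuracy well above chance comes from selection bias.

```python
import random


def noise_dataset(n_samples, n_features, seed=0):
    """Pure-noise features with balanced labels: there is no real signal."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_samples)]
    y = [i % 2 for i in range(n_samples)]
    return X, y


def top_features(X, y, n_keep):
    """Rank features by absolute difference of class means (illustrative filter)."""
    scores = []
    for j in range(len(X[0])):
        pos = [x[j] for x, lab in zip(X, y) if lab == 1]
        neg = [x[j] for x, lab in zip(X, y) if lab == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:n_keep]


def centroid_predict(X, y, feats, x_new):
    """Nearest-centroid prediction using only the selected features."""
    dists = {}
    for label in (0, 1):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroid = [sum(r[j] for r in rows) / len(rows) for j in feats]
        dists[label] = sum((x_new[j] - c) ** 2 for j, c in zip(feats, centroid))
    return 0 if dists[0] <= dists[1] else 1


def loo_accuracy(X, y, n_keep, select_inside_fold):
    """Leave-one-out accuracy; selection is either repeated inside each fold
    (external cross-validation) or done once on all data (biased)."""
    global_feats = top_features(X, y, n_keep)
    hits = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        feats = top_features(X_tr, y_tr, n_keep) if select_inside_fold else global_feats
        hits += centroid_predict(X_tr, y_tr, feats, X[i]) == y[i]
    return hits / len(X)


X, y = noise_dataset(40, 500)
print("biased accuracy:  ", loo_accuracy(X, y, 10, select_inside_fold=False))
print("external accuracy:", loo_accuracy(X, y, 10, select_inside_fold=True))
```

Because the held-out sample takes part in the one-off selection, the first figure tends to be optimistic, while repeating the selection inside every fold keeps the held-out sample unseen.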
NASA Astrophysics Data System (ADS)
Schütze, Niels; Wöhling, Thomas; de Play, Michael
2010-05-01
Some real-world optimization problems in water resources have a high-dimensional space of decision variables and more than one objective function. In this work, we compare three general-purpose, multi-objective simulation optimization algorithms, namely NSGA-II, AMALGAM, and CMA-ES-MO when solving three real case Multi-objective Optimization Problems (MOPs): (i) a high-dimensional soil hydraulic parameter estimation problem; (ii) a multipurpose multi-reservoir operation problem; and (iii) a scheduling problem in deficit irrigation. We analyze the behaviour of the three algorithms on these test problems considering their formulations ranging from 40 up to 120 decision variables and 2 to 4 objectives. The computational effort required by each algorithm in order to reach the true Pareto front is also analyzed.
Semi-Supervised Clustering for High-Dimensional and Sparse Features
ERIC Educational Resources Information Center
Yan, Su
2010-01-01
Clustering is one of the most common data mining tasks, used frequently for data organization and analysis in various application domains. Traditional machine learning approaches to clustering are fully automated and unsupervised where class labels are unknown a priori. In real application domains, however, some "weak" form of side…
Mosmann, Tim R; Naim, Iftekhar; Rebhahn, Jonathan; Datta, Suprakash; Cavenaugh, James S; Weaver, Jason M; Sharma, Gaurav
2014-05-01
A multistage clustering and data processing method, SWIFT (detailed in a companion manuscript), has been developed to detect rare subpopulations in large, high-dimensional flow cytometry datasets. An iterative sampling procedure initially fits the data to multidimensional Gaussian distributions, then splitting and merging stages use a criterion of unimodality to optimize the detection of rare subpopulations, to converge on a consistent cluster number, and to describe non-Gaussian distributions. Probabilistic assignment of cells to clusters, visualization, and manipulation of clusters by their cluster medians, facilitate application of expert knowledge using standard flow cytometry programs. The dual problems of rigorously comparing similar complex samples, and enumerating absent or very rare cell subpopulations in negative controls, were solved by assigning cells in multiple samples to a cluster template derived from a single or combined sample. Comparison of antigen-stimulated and control human peripheral blood cell samples demonstrated that SWIFT could identify biologically significant subpopulations, such as rare cytokine-producing influenza-specific T cells. A sensitivity of better than one part per million was attained in very large samples. Results were highly consistent on biological replicates, yet the analysis was sensitive enough to show that multiple samples from the same subject were more similar than samples from different subjects. A companion manuscript (Part 1) details the algorithmic development of SWIFT.
McParland, D; Phillips, C M; Brennan, L; Roche, H M; Gormley, I C
2017-06-30
The LIPGENE-SU.VI.MAX study, like many others, recorded high-dimensional continuous phenotypic data and categorical genotypic data. LIPGENE-SU.VI.MAX focuses on the need to account for both phenotypic and genetic factors when studying the metabolic syndrome (MetS), a complex disorder that can lead to higher risk of type 2 diabetes and cardiovascular disease. Interest lies in clustering the LIPGENE-SU.VI.MAX participants into homogeneous groups or sub-phenotypes, by jointly considering their phenotypic and genotypic data, and in determining which variables are discriminatory. A novel latent variable model that elegantly accommodates high dimensional, mixed data is developed to cluster LIPGENE-SU.VI.MAX participants using a Bayesian finite mixture model. A computationally efficient variable selection algorithm is incorporated, estimation is via a Gibbs sampling algorithm and an approximate BIC-MCMC criterion is developed to select the optimal model. Two clusters or sub-phenotypes ('healthy' and 'at risk') are uncovered. A small subset of variables is deemed discriminatory, which notably includes phenotypic and genotypic variables, highlighting the need to jointly consider both factors. Further, 7 years after the LIPGENE-SU.VI.MAX data were collected, participants underwent further analysis to diagnose presence or absence of the MetS. The two uncovered sub-phenotypes strongly correspond to the 7-year follow-up disease classification, highlighting the role of phenotypic and genotypic factors in the MetS and emphasising the potential utility of the clustering approach in early screening. Additionally, the ability of the proposed approach to define the uncertainty in sub-phenotype membership at the participant level is synonymous with the concepts of precision medicine and nutrition. Copyright © 2017 John Wiley & Sons, Ltd.
Wang, Xueyi
2011-01-01
The k-nearest neighbors (k-NN) algorithm is a widely used machine learning method that finds nearest neighbors of a test object in a feature space. We present a new exact k-NN algorithm called kMkNN (k-Means for k-Nearest Neighbors) that uses the k-means clustering and the triangle inequality to accelerate the searching for nearest neighbors in a high dimensional space. The kMkNN algorithm has two stages. In the buildup stage, instead of using complex tree structures such as metric trees, kd-trees, or ball-trees, kMkNN uses a simple k-means clustering method to preprocess the training dataset. In the searching stage, given a query object, kMkNN finds nearest training objects starting from the nearest cluster to the query object and uses the triangle inequality to reduce the distance calculations. Experiments show that the performance of kMkNN is surprisingly good compared to the traditional k-NN algorithm and tree-based k-NN algorithms such as kd-trees and ball-trees. On a collection of 20 datasets with up to 10^6 records and 10^4 dimensions, kMkNN shows a 2- to 80-fold reduction of distance calculations and a 2- to 60-fold speedup over the traditional k-NN algorithm for 16 datasets. Furthermore, kMkNN performs significantly better than a kd-tree based k-NN algorithm for all datasets and performs better than a ball-tree based k-NN algorithm for most datasets. The results show that kMkNN is effective for searching nearest neighbors in high dimensional spaces. PMID:22247818
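The two stages described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the published kMkNN code: the function names, the plain Lloyd-style k-means, and the use of the bound |d(q, c) - d(p, c)| <= d(q, p) to skip candidates are choices made here for brevity. The search remains exact because a point is skipped only when its lower bound already reaches the current k-th best distance.

```python
import heapq
import math
import random


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, n_clusters, n_iter=10, seed=0):
    """Plain Lloyd iterations; returns the cluster centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, n_clusters)
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            groups[j].append(p)
        # An empty group keeps its previous center.
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers


def build_index(points, n_clusters):
    """Buildup stage: store each point with its distance to its nearest center."""
    centers = kmeans(points, n_clusters)
    members = [[] for _ in centers]
    for p in points:
        d_pc = [dist(p, c) for c in centers]
        j = d_pc.index(min(d_pc))
        members[j].append((d_pc[j], p))
    return centers, members


def knn_search(query, index, k):
    """Searching stage: visit clusters from nearest to farthest and prune
    points whose triangle-inequality lower bound cannot beat the k-th best."""
    centers, members = index
    d_qc = [dist(query, c) for c in centers]
    best = []  # max-heap emulated with negated distances
    n_computed = 0
    for j in sorted(range(len(centers)), key=lambda i: d_qc[i]):
        for d_pc, p in members[j]:
            if len(best) == k and abs(d_qc[j] - d_pc) >= -best[0][0]:
                continue  # d(query, p) is at least this bound, so skip it
            d = dist(query, p)
            n_computed += 1
            if len(best) < k:
                heapq.heappush(best, (-d, p))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, p))
    return sorted((-nd, p) for nd, p in best), n_computed
```

The second return value counts the full distance evaluations against training points, which is the quantity the abstract reports as being reduced.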
Wang, Xueyi
2012-02-08
The k-nearest neighbors (k-NN) algorithm is a widely used machine learning method that finds nearest neighbors of a test object in a feature space. We present a new exact k-NN algorithm called kMkNN (k-Means for k-Nearest Neighbors) that uses the k-means clustering and the triangle inequality to accelerate the searching for nearest neighbors in a high dimensional space. The kMkNN algorithm has two stages. In the buildup stage, instead of using complex tree structures such as metric trees, kd-trees, or ball-trees, kMkNN uses a simple k-means clustering method to preprocess the training dataset. In the searching stage, given a query object, kMkNN finds nearest training objects starting from the nearest cluster to the query object and uses the triangle inequality to reduce the distance calculations. Experiments show that the performance of kMkNN is surprisingly good compared to the traditional k-NN algorithm and tree-based k-NN algorithms such as kd-trees and ball-trees. On a collection of 20 datasets with up to 10^6 records and 10^4 dimensions, kMkNN shows a 2- to 80-fold reduction of distance calculations and a 2- to 60-fold speedup over the traditional k-NN algorithm for 16 datasets. Furthermore, kMkNN performs significantly better than a kd-tree based k-NN algorithm for all datasets and performs better than a ball-tree based k-NN algorithm for most datasets. The results show that kMkNN is effective for searching nearest neighbors in high dimensional spaces.
Chen, Sui-Pi; Huang, Guan-Hua
2014-06-01
This paper uses a Bayesian formulation of a clustering procedure to identify gene-gene interactions under case-control studies, called the Algorithm via Bayesian Clustering to Detect Epistasis (ABCDE). The ABCDE uses Dirichlet process mixtures to model SNP marker partitions, and uses the Gibbs weighted Chinese restaurant sampling to simulate posterior distributions of these partitions. Unlike the representative Bayesian epistasis detection algorithm BEAM, which partitions markers into three groups, the ABCDE can be evaluated at any given partition, regardless of the number of groups. This study also develops permutation tests to validate the disease association for SNP subsets identified by the ABCDE, which can yield results that are more robust to model specification and prior assumptions. This study examines the performance of the ABCDE and compares it with the BEAM using various simulated data and a schizophrenia SNP dataset.
Haplotyping Problem, A Clustering Approach
Eslahchi, Changiz; Sadeghi, Mehdi; Pezeshk, Hamid; Kargar, Mehdi; Poormohammadi, Hadi
2007-09-06
Construction of two haplotypes from a set of Single Nucleotide Polymorphism (SNP) fragments is called the haplotype reconstruction problem. One of the most popular computational models for this problem is Minimum Error Correction (MEC). Since MEC is an NP-hard problem, here we propose a novel heuristic algorithm based on clustering analysis in data mining for the haplotype reconstruction problem. Based on Hamming distance and similarity between two fragments, our iterative algorithm produces two clusters of fragments; then, in each iteration, the algorithm assigns a fragment to one of the clusters. Our results suggest that the algorithm has a lower reconstruction error rate than other algorithms.
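A rough sketch of clustering fragments into two groups by Hamming distance is given below. It is not the authors' heuristic: the seeding by the two most distant fragments, the majority-vote consensus, the '-' symbol for uncovered sites and all function names are assumptions made for illustration. The returned MEC value counts the fragment entries that disagree with the nearer of the two reconstructed haplotypes.

```python
def hamming(f, g):
    """Count sites where both fragments are known ('0' or '1') and differ."""
    return sum(1 for a, b in zip(f, g) if a != '-' and b != '-' and a != b)


def consensus(fragments, length):
    """Majority vote per SNP site; '-' where no fragment covers the site."""
    out = []
    for i in range(length):
        ones = sum(1 for f in fragments if f[i] == '1')
        zeros = sum(1 for f in fragments if f[i] == '0')
        out.append('-' if ones + zeros == 0 else ('1' if ones > zeros else '0'))
    return ''.join(out)


def two_cluster_haplotypes(fragments, max_iter=20):
    """Alternate between assigning fragments to the closer haplotype and
    recomputing each haplotype as the consensus of its cluster."""
    length = len(fragments[0])
    # Seed the two haplotypes with the most distant pair of fragments.
    h1, h2 = max(
        ((f, g) for i, f in enumerate(fragments) for g in fragments[i + 1:]),
        key=lambda pair: hamming(*pair),
    )
    for _ in range(max_iter):
        c1 = [f for f in fragments if hamming(f, h1) <= hamming(f, h2)]
        c2 = [f for f in fragments if hamming(f, h1) > hamming(f, h2)]
        new1, new2 = consensus(c1, length), consensus(c2, length)
        if (new1, new2) == (h1, h2):
            break
        h1, h2 = new1, new2
    mec = sum(min(hamming(f, h1), hamming(f, h2)) for f in fragments)
    return h1, h2, mec


fragments = ["0101--", "1010--", "--0101", "--1010", "0-01-1", "10-0-0"]
print(two_cluster_haplotypes(fragments))
```

On this error-free toy input the two consensus strings are complementary and no corrections are needed; with noisy fragments the loop is only a local heuristic and gives no optimality guarantee for MEC.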
Haplotyping Problem, A Clustering Approach
NASA Astrophysics Data System (ADS)
Eslahchi, Changiz; Sadeghi, Mehdi; Pezeshk, Hamid; Kargar, Mehdi; Poormohammadi, Hadi
2007-09-01
Construction of two haplotypes from a set of Single Nucleotide Polymorphism (SNP) fragments is called the haplotype reconstruction problem. One of the most popular computational models for this problem is Minimum Error Correction (MEC). Since MEC is an NP-hard problem, here we propose a novel heuristic algorithm based on clustering analysis in data mining for the haplotype reconstruction problem. Based on Hamming distance and similarity between two fragments, our iterative algorithm produces two clusters of fragments; then, in each iteration, the algorithm assigns a fragment to one of the clusters. Our results suggest that the algorithm has a lower reconstruction error rate than other algorithms.
Cluster expression in fission and fusion in high-dimensional macroscopic-microscopic calculations
Iwamoto, A.; Ichikawa, T.; Moller, P.; Sierk, A. J.
2004-01-01
We discuss the relation between the fission-fusion potential-energy surfaces of very heavy nuclei and the formation process of these nuclei in cold-fusion reactions. In the potential-energy surfaces, we find a pronounced valley structure, with one valley corresponding to the cold-fusion reaction, the other to fission. As the touching point is approached in the cold-fusion entrance channel, an instability towards dynamical deformation of the projectile occurs, which enhances the fusion cross section. These two 'cluster effects' enhance the production of superheavy nuclei in cold-fusion reactions, in addition to the effect of the low compound-system excitation energy in these reactions. Heavy-ion fusion reactions have been used extensively to synthesize heavy elements beyond actinide nuclei. In order to proceed further in this direction, we need to understand the formation process more precisely, not just the decay process. The dynamics of the formation process are considerably more complex than the dynamics necessary to interpret the spontaneous-fission decay of heavy elements. However, before implementing a full dynamical description it is useful to understand the basic properties of the potential-energy landscape encountered in the initial stages of the collision. The collision process and entrance-channel landscape can conveniently be separated into two parts, namely the early-stage separated system before touching and the late-stage composite system after touching. The transition between these two stages is particularly important, but not very well understood until now. To understand better the transition between the two stages we analyze here in detail the potential energy landscape or 'collision surface' of the system both outside and inside the touching configuration of the target and projectile. In Sec. 2, we discuss calculated five-dimensional potential-energy landscapes inside touching and identify major features. In Sec. 3, we present calculated
Naim, Iftekhar; Datta, Suprakash; Rebhahn, Jonathan; Cavenaugh, James S; Mosmann, Tim R; Sharma, Gaurav
2014-01-01
We present a model-based clustering method, SWIFT (Scalable Weighted Iterative Flow-clustering Technique), for digesting high-dimensional large-sized datasets obtained via modern flow cytometry into more compact representations that are well-suited for further automated or manual analysis. Key attributes of the method include the following: (a) the analysis is conducted in the multidimensional space retaining the semantics of the data, (b) an iterative weighted sampling procedure is utilized to maintain modest computational complexity and to retain discrimination of extremely small subpopulations (hundreds of cells from datasets containing tens of millions), and (c) a splitting and merging procedure is incorporated in the algorithm to preserve distinguishability between biologically distinct populations, while still providing a significant compaction relative to the original data. This article presents a detailed algorithmic description of SWIFT, outlining the application-driven motivations for the different design choices, a discussion of computational complexity of the different steps, and results obtained with SWIFT for synthetic data and relatively simple experimental data that allow validation of the desirable attributes. A companion paper (Part 2) highlights the use of SWIFT, in combination with additional computational tools, for more challenging biological problems. © 2014 The Authors. Published by Wiley Periodicals Inc. PMID:24677621
Naim, Iftekhar; Datta, Suprakash; Rebhahn, Jonathan; Cavenaugh, James S; Mosmann, Tim R; Sharma, Gaurav
2014-05-01
We present a model-based clustering method, SWIFT (Scalable Weighted Iterative Flow-clustering Technique), for digesting high-dimensional large-sized datasets obtained via modern flow cytometry into more compact representations that are well-suited for further automated or manual analysis. Key attributes of the method include the following: (a) the analysis is conducted in the multidimensional space retaining the semantics of the data, (b) an iterative weighted sampling procedure is utilized to maintain modest computational complexity and to retain discrimination of extremely small subpopulations (hundreds of cells from datasets containing tens of millions), and (c) a splitting and merging procedure is incorporated in the algorithm to preserve distinguishability between biologically distinct populations, while still providing a significant compaction relative to the original data. This article presents a detailed algorithmic description of SWIFT, outlining the application-driven motivations for the different design choices, a discussion of computational complexity of the different steps, and results obtained with SWIFT for synthetic data and relatively simple experimental data that allow validation of the desirable attributes. A companion paper (Part 2) highlights the use of SWIFT, in combination with additional computational tools, for more challenging biological problems. © 2014 The Authors. Published by Wiley Periodicals Inc. on behalf of the International Society for Advancement of Cytometry.
A numerical algorithm for optimal feedback gains in high dimensional LQR problems
NASA Technical Reports Server (NTRS)
Banks, H. T.; Ito, K.
1986-01-01
A hybrid method for computing the feedback gains in linear quadratic regulator problems is proposed. The method, which combines the use of a Chandrasekhar type system with an iteration of the Newton-Kleinman form with variable acceleration parameter Smith schemes, is formulated so as to efficiently compute directly the feedback gains rather than solutions of an associated Riccati equation. The hybrid method is particularly appropriate when used with large dimensional systems such as those arising in approximating infinite dimensional (distributed parameter) control systems (e.g., those governed by delay-differential and partial differential equations). Computational advantages of the proposed algorithm over the standard eigenvector (Potter, Laub-Schur) based techniques are discussed and numerical evidence of the efficacy of our ideas is presented.
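To make the Newton-Kleinman part of the description concrete, here is a sketch for the scalar case only. It covers neither the Chandrasekhar system nor the Smith acceleration mentioned in the abstract, and the problem setup (dynamics dx/dt = a*x + b*u with cost integrand q*x**2 + r*u**2) and function names are assumptions introduced for illustration. Each step solves a Lyapunov equation for the current gain and then updates the gain.

```python
import math


def newton_kleinman_gain(a, b, q, r, k0, tol=1e-12, max_iter=50):
    """Scalar Newton-Kleinman iteration; k0 must be a stabilizing gain,
    that is, a - b*k0 < 0."""
    k = k0
    for _ in range(max_iter):
        closed_loop = a - b * k
        if closed_loop >= 0:
            raise ValueError("gain is not stabilizing")
        # Lyapunov equation for the current gain:
        #   2*(a - b*k)*p + q + r*k**2 = 0
        p = (q + r * k * k) / (-2.0 * closed_loop)
        k_next = b * p / r
        if abs(k_next - k) < tol:
            return k_next
        k = k_next
    return k


def exact_gain(a, b, q, r):
    """Gain from the positive root of the scalar Riccati equation
    2*a*p - (b**2 / r)*p**2 + q = 0."""
    p = r * (a + math.sqrt(a * a + b * b * q / r)) / (b * b)
    return b * p / r


print(newton_kleinman_gain(1.0, 1.0, 1.0, 1.0, k0=5.0))
print(exact_gain(1.0, 1.0, 1.0, 1.0))
```

In the scalar case the Riccati equation has a closed-form root, so the iteration is redundant; it is shown only because the same two-step pattern (Lyapunov solve, gain update) carries over to matrix-valued problems.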
Fournier, René; Orel, Slava
2013-12-21
We present a method for fitting high-dimensional potential energy surfaces that is almost fully automated, can be applied to systems with various chemical compositions, and involves no particular choice of function form. We tested it on four systems: Ag₂₀, Sn₆Pb₆, Si₁₀, and Li₈. The cost for energy evaluation is smaller than the cost of a density functional theory (DFT) energy evaluation by a factor of 1500 for Li₈, and 60 000 for Ag₂₀. We achieved intermediate accuracy (errors of 0.4 to 0.8 eV on atomization energies, or 1% to 3% on cohesive energies) with rather small datasets (between 240 and 1400 configurations). We demonstrate that this accuracy is sufficient to correctly screen the configurations with lowest DFT energy, making this function potentially very useful in a hybrid global optimization strategy. We show that, as expected, the accuracy of the function improves with an increase in the size of the fitting dataset.
NASA Astrophysics Data System (ADS)
Yao, Bing; Yang, Hui
2016-12-01
This paper presents a novel physics-driven spatiotemporal regularization (STRE) method for high-dimensional predictive modeling in complex healthcare systems. This model not only captures the physics-based interrelationship between time-varying explanatory and response variables that are distributed in the space, but also addresses the spatial and temporal regularizations to improve the prediction performance. The STRE model is implemented to predict the time-varying distribution of electric potentials on the heart surface based on the electrocardiogram (ECG) data from the distributed sensor network placed on the body surface. The model performance is evaluated and validated in both a simulated two-sphere geometry and a realistic torso-heart geometry. Experimental results show that the STRE model significantly outperforms other regularization models that are widely used in current practice such as Tikhonov zero-order, Tikhonov first-order and L1 first-order regularization methods.
Yao, Bing; Yang, Hui
2016-12-14
This paper presents a novel physics-driven spatiotemporal regularization (STRE) method for high-dimensional predictive modeling in complex healthcare systems. This model not only captures the physics-based interrelationship between time-varying explanatory and response variables that are distributed in the space, but also addresses the spatial and temporal regularizations to improve the prediction performance. The STRE model is implemented to predict the time-varying distribution of electric potentials on the heart surface based on the electrocardiogram (ECG) data from the distributed sensor network placed on the body surface. The model performance is evaluated and validated in both a simulated two-sphere geometry and a realistic torso-heart geometry. Experimental results show that the STRE model significantly outperforms other regularization models that are widely used in current practice such as Tikhonov zero-order, Tikhonov first-order and L1 first-order regularization methods.
Yao, Bing; Yang, Hui
2016-01-01
This paper presents a novel physics-driven spatiotemporal regularization (STRE) method for high-dimensional predictive modeling in complex healthcare systems. This model not only captures the physics-based interrelationship between time-varying explanatory and response variables that are distributed in the space, but also addresses the spatial and temporal regularizations to improve the prediction performance. The STRE model is implemented to predict the time-varying distribution of electric potentials on the heart surface based on the electrocardiogram (ECG) data from the distributed sensor network placed on the body surface. The model performance is evaluated and validated in both a simulated two-sphere geometry and a realistic torso-heart geometry. Experimental results show that the STRE model significantly outperforms other regularization models that are widely used in current practice such as Tikhonov zero-order, Tikhonov first-order and L1 first-order regularization methods. PMID:27966576
Tuo, Shouheng; Yong, Longquan; Deng, Fang’an; Li, Yanhai; Lin, Yong; Lu, Qiuju
2017-01-01
Harmony Search (HS) and Teaching-Learning-Based Optimization (TLBO), as new swarm intelligence optimization algorithms, have received much attention in recent years. Both have shown outstanding performance in solving NP-hard optimization problems. However, they also suffer dramatic performance degradation on some complex high-dimensional optimization problems. Through extensive experiments, we find that HS and TLBO strongly complement each other. HS has strong global exploration power but low convergence speed. Conversely, TLBO converges much faster but is easily trapped in local search. In this work, we propose a hybrid search algorithm named HSTLBO that merges the two algorithms to solve complex optimization problems synergistically using a self-adaptive selection strategy. In HSTLBO, both HS and TLBO are modified with the aim of balancing the global exploration and exploitation abilities, where HS aims mainly to explore the unknown regions and TLBO aims to rapidly exploit high-precision solutions in the known regions. Our experimental results demonstrate better performance and faster speed than five state-of-the-art HS variants and show better exploration power than five good TLBO variants with similar run time, which illustrates that our method is promising for solving complex high-dimensional optimization problems. The experiment on portfolio optimization problems also demonstrates that HSTLBO is effective in solving complex real-world applications. PMID:28403224
Tuo, Shouheng; Yong, Longquan; Deng, Fang'an; Li, Yanhai; Lin, Yong; Lu, Qiuju
2017-01-01
Harmony Search (HS) and Teaching-Learning-Based Optimization (TLBO), as new swarm intelligence optimization algorithms, have received much attention in recent years. Both have shown outstanding performance in solving NP-hard optimization problems. However, they also suffer dramatic performance degradation on some complex high-dimensional optimization problems. Through extensive experiments, we find that HS and TLBO strongly complement each other. HS has strong global exploration power but low convergence speed. Conversely, TLBO converges much faster but is easily trapped in local search. In this work, we propose a hybrid search algorithm named HSTLBO that merges the two algorithms to solve complex optimization problems synergistically using a self-adaptive selection strategy. In HSTLBO, both HS and TLBO are modified with the aim of balancing the global exploration and exploitation abilities, where HS aims mainly to explore the unknown regions and TLBO aims to rapidly exploit high-precision solutions in the known regions. Our experimental results demonstrate better performance and faster speed than five state-of-the-art HS variants and show better exploration power than five good TLBO variants with similar run time, which illustrates that our method is promising for solving complex high-dimensional optimization problems. The experiment on portfolio optimization problems also demonstrates that HSTLBO is effective in solving complex real-world applications.
Clustering of solutions in hard satisfiability problems
NASA Astrophysics Data System (ADS)
Ardelius, John; Aurell, Erik; Krishnamurthy, Supriya
2007-10-01
We study numerically the solution space structure of random 3-SAT problems close to the SAT/UNSAT transition. This is done by considering chains of satisfiability problems, where clauses are added sequentially to a problem instance. Using the overlap measure of similarity between different solutions found on the same problem instance, we examine geometrical changes as a function of α. In each chain, the overlap distribution is first smooth, but then develops a tiered structure, indicating that the solutions are found in well separated clusters. On chains of not too large instances, all remaining solutions are eventually observed to be found in only one small cluster before vanishing. This condensation transition point is estimated by finite size scaling to be αc = 4.26 with an apparent critical exponent of about 1.7. The average overlap value is also observed to increase with α up to the transition, indicating a reduction in solution space size, in accordance with theoretical predictions. The solutions are generated by a local heuristic, ASAT, and compared to those found by the Survey Propagation algorithm up to αc.
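The abstract does not define its overlap measure. As a minimal sketch, assuming the common spin-style definition (map each Boolean value to ±1 and average the products over all variables), the overlap between two solutions could be computed as follows. The function name `overlap` and the sample assignments are illustrative only.

```python
def overlap(a, b):
    """Overlap q in [-1, 1] between two Boolean assignments of equal length.

    Assumed definition: map True/False to +1/-1 and average the products.
    """
    if len(a) != len(b):
        raise ValueError("assignments must have the same length")
    return sum((1 if x else -1) * (1 if y else -1) for x, y in zip(a, b)) / len(a)


sol_a = [True, False, True, True]
sol_b = [True, False, False, True]
print(overlap(sol_a, sol_a))  # identical assignments: 1.0
print(overlap(sol_a, sol_b))  # three variables agree, one differs: 0.5
```

Under this definition, a histogram of pairwise overlaps that splits into separate tiers would correspond to the clustered solution structure the abstract describes.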
A local search for a graph clustering problem
NASA Astrophysics Data System (ADS)
Navrotskaya, Anna; Il'ev, Victor
2016-10-01
In clustering problems one has to partition a given set of objects (a data set) into subsets (called clusters), taking into consideration only the similarity of the objects. One of the most visual formalizations of clustering is graph clustering, that is, grouping the vertices of a graph into clusters according to its edge structure, where the vertices are the objects and the edges represent similarities between them. In the graph k-clustering problem the number of clusters does not exceed k and the goal is to minimize the number of edges between clusters plus the number of missing edges within clusters. This problem is NP-hard for any k ≥ 2. We propose a polynomial time (2k-1)-approximation algorithm for graph k-clustering. We then apply a local search procedure to the feasible solution found by this algorithm and study the resulting heuristics experimentally.
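The objective stated in this abstract (edges between clusters plus missing edges within clusters) can be sketched as below. This only evaluates the cost of a given partition; it is not the authors' approximation algorithm or local search. The function name `clustering_cost` and the toy graph are assumptions made for illustration.

```python
from itertools import combinations


def clustering_cost(n, edges, labels):
    """Count edges between clusters plus missing edges within clusters.

    n      -- number of vertices, numbered 0..n-1
    edges  -- iterable of undirected edges (u, v)
    labels -- labels[v] is the cluster of vertex v
    """
    edge_set = {frozenset(e) for e in edges}
    cost = 0
    for u, v in combinations(range(n), 2):
        same_cluster = labels[u] == labels[v]
        has_edge = frozenset((u, v)) in edge_set
        # A pair is penalised when clustering and edge structure disagree.
        if same_cluster != has_edge:
            cost += 1
    return cost


# Toy graph: a path 0-1-2 and a separate edge 3-4.
edges = [(0, 1), (1, 2), (3, 4)]
print(clustering_cost(5, edges, [0, 0, 0, 1, 1]))  # 1: edge 0-2 is missing inside a cluster
print(clustering_cost(5, edges, [0, 0, 1, 2, 2]))  # 1: edge 1-2 runs between clusters
```

A local search of the kind mentioned in the abstract would move vertices between clusters whenever doing so lowers this cost.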
A facility for using cluster research to study environmental problems
Not Available
1991-11-01
This report begins by describing the general application of cluster based research to environmental chemistry and the development of a Cluster Structure and Dynamics Research Facility (CSDRF). Next, four important areas of cluster research are described in more detail, including how they can impact environmental problems. These are: surface-supported clusters, water and contaminant interactions, time-resolved dynamic studies in clusters, and cluster structures and reactions. The facilities and equipment required for each area of research are then presented. The appendices contain the workshop agenda and a listing of the researchers who participated in the workshop discussions that led to this report.
Lefèvre, T; Chauvin, P
2015-02-01
In an epidemiologist's toolbox, three main types of statistical tools can be found: comparisons of means and proportions, linear or logistic regression models and Cox-type regression models. All these techniques have their own multivariate formulations, so that biases can be accounted for. Nonetheless, there is an entire set of natively massively multivariate techniques, which are based on weaker assumptions than classical statistical techniques, and which seem to be underestimated by or unknown to most epidemiologists. These techniques are used for pattern recognition or clustering – that is, for retrieving homogeneous groups in data without any a priori knowledge of these groups. They are widely used in related domains such as genetics or biomolecular studies. Most clustering techniques require tuning specific parameters so that groups can be identified in data. A critical parameter to set is the number of groups the technique needs to discover. Different approaches to find the optimal number of groups are available, such as the silhouette approach and the robustness approach. This article presents the key aspects of clustering techniques (how proximity between observations is defined and how to find the number of groups), two archetypal techniques (namely the k-means and PAM algorithms) and how they relate to more classical statistical approaches. Through a simple theoretical example and a real data application, we provide a complete framework within which classical epidemiological concerns can be reconsidered. We show how to (i) identify whether distinct groups exist in data, (ii) identify the optimal number of groups in data, (iii) label each observation according to its own group and (iv) analyze the groups identified according to separate and explicative data. In addition, how to achieve consistent results while removing sensitivity to initial conditions is explained. Clustering techniques, in conjunction with methods for parameter tuning, provide the
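The silhouette approach to choosing the number of groups, mentioned in this abstract, can be illustrated with a small sketch. It assumes Euclidean distance and candidate labelings that have already been produced (for example by k-means or PAM, which are not implemented here); the function name `silhouette` and the toy points are hypothetical.

```python
from math import dist


def silhouette(points, labels):
    """Mean silhouette width of a labelled point set (Euclidean distance)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # by convention a singleton contributes a width of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
two_groups = [0, 0, 0, 1, 1, 1]
three_groups = [0, 0, 1, 2, 2, 2]
# The labeling with the higher mean silhouette width is preferred.
print(silhouette(points, two_groups) > silhouette(points, three_groups))
```

In practice one would compute this score for each candidate number of groups and keep the value that maximizes it.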
ICANP2: Isoenergetic cluster algorithm for NP-complete Problems
NASA Astrophysics Data System (ADS)
Zhu, Zheng; Fang, Chao; Katzgraber, Helmut G.
NP-complete optimization problems with Boolean variables are of fundamental importance in computer science, mathematics and physics. Most notably, the minimization of general spin-glass-like Hamiltonians remains a difficult numerical task. There has been a great interest in designing efficient heuristics to solve these computationally difficult problems. Inspired by the rejection-free isoenergetic cluster algorithm developed for Ising spin glasses, we present a generalized cluster update that can be applied to different NP-complete optimization problems with Boolean variables. The cluster updates allow for a wide-spread sampling of phase space, thus speeding up optimization. By carefully tuning the pseudo-temperature (needed to randomize the configurations) of the problem, we show that the method can efficiently tackle problems on topologies with a large site-percolation threshold. We illustrate the ICANP2 heuristic on paradigmatic optimization problems, such as the satisfiability problem and the vertex cover problem.
ERIC Educational Resources Information Center
Herring, Richard D.
Literature on mathematical problem solving suggests that learners store information in memory which helps them solve stereotyped algebra word problems. Cluster analysis has been used as an exploratory tool to infer the types of problems which have common representations in memory. This study compares the results of a hierarchical cluster analysis of…
Clustered Self Organising Migrating Algorithm for the Quadratic Assignment Problem
NASA Astrophysics Data System (ADS)
Davendra, Donald; Zelinka, Ivan; Senkerik, Roman
2009-08-01
An approach of population dynamics and clustering for permutative problems is presented in this paper. Diversity indicators are created from solution ordering and its mapping is shown as an advantage for population control in metaheuristics. Self Organising Migrating Algorithm (SOMA) is modified using this approach and vetted with the Quadratic Assignment Problem (QAP). Extensive experimentation is conducted on benchmark problems in this area.
Solving global optimization problems on GPU cluster
Barkalov, Konstantin; Gergel, Victor; Lebedev, Ilya
2016-06-08
The paper presents the results of an investigation of a parallel global optimization algorithm combined with a dimension reduction scheme. This allows multidimensional problems to be solved by reducing them to data-independent subproblems of smaller dimension that are solved in parallel. The new element of this research is the use of several graphics accelerators at different computing nodes. The paper also includes results of solving problems from the well-known multiextremal test class GKLS on the Lobachevsky supercomputer using tens of thousands of GPU cores.
A solution to the problem of clustered objects compact partitioning
NASA Astrophysics Data System (ADS)
Pogrebnoy, D. V.; Pogrebnoy, Al V.; Deeva, O. V.; Petrukhina, I. A.
2017-01-01
The study is motivated by the fact that the arrangement topology of objects in a distributed system is often nonuniform. Objects can be placed at different distances from each other, thus forming clusters. Compact partitioning of sets containing thousands of objects should therefore make effective use of the natural structuring of objects into clusters. The aim of the study is to develop methods for compact partitioning of sets of objects presented as clusters. The research methods are based on applied set theory, the theory of compact sets and compact partitions, and linear programming methods with Boolean variables. As a result, the paper offers a method for analyzing the composition and content of clusters. It also evaluates cluster compactness, which informs the decision to include clusters in the sets of a partition. It addresses the problem of optimizing the rearrangement of objects between compact sets that form clusters, based on the criterion of maximizing the total compactness of the sets. The problem is formulated as a linear programming problem with Boolean variables. An example of object rearrangement is given.
An agglomerative hierarchical approach to visualization in Bayesian clustering problems.
Dawson, K J; Belkhir, K
2009-07-01
Clustering problems (including the clustering of individuals into outcrossing populations, hybrid generations, full-sib families and selfing lines) have recently received much attention in population genetics. In these clustering problems, the parameter of interest is a partition of the set of sampled individuals--the sample partition. In a fully Bayesian approach to clustering problems of this type, our knowledge about the sample partition is represented by a probability distribution on the space of possible sample partitions. As the number of possible partitions grows very rapidly with the sample size, we cannot visualize this probability distribution in its entirety, unless the sample is very small. As a solution to this visualization problem, we recommend using an agglomerative hierarchical clustering algorithm, which we call the exact linkage algorithm. This algorithm is a special case of the maximin clustering algorithm that we introduced previously. The exact linkage algorithm is now implemented in our software package PartitionView. The exact linkage algorithm takes the posterior co-assignment probabilities as input and yields as output a rooted binary tree, or more generally, a forest of such trees. Each node of this forest defines a set of individuals, and the node height is the posterior co-assignment probability of this set. This provides a useful visual representation of the uncertainty associated with the assignment of individuals to categories. It is also a useful starting point for a more detailed exploration of the posterior distribution in terms of the co-assignment probabilities.
Multicluster solutions to a multinucleon problem and clustering phenomena
Gnilozub, I. A.; Kurgalin, S. D.; Tchuvil'sky, Yu. M.
2008-07-15
Various concepts of clustering phenomena are discussed. Precise multicluster solutions constructed by the present authors for an A-nucleon problem whose dynamical properties are described by a generalized Elliott Hamiltonian are used as a mathematical formalism of the theory of clustering phenomena in nuclei. It is shown that qualitative features of various clustering phenomena, such as the very fact of the existence of cluster states, their classification, and selectivity of reactions that populate them, are explained within the concept being discussed. The 2α + bineutron three-cluster states of the ¹⁰Be nucleus are classified, and their spectrum is calculated. It is demonstrated that the results of these calculations are in good agreement with experimental data.
Automated High-Dimensional Flow Cytometric Data Analysis
NASA Astrophysics Data System (ADS)
Pyne, Saumyadipta; Hu, Xinli; Wang, Kui; Rossin, Elizabeth; Lin, Tsung-I.; Maier, Lisa; Baecher-Allan, Clare; McLachlan, Geoffrey; Tamayo, Pablo; Hafler, David; de Jager, Philip; Mesirov, Jill
Flow cytometry is widely used for single cell interrogation of surface and intracellular protein expression by measuring fluorescence intensity of fluorophore-conjugated reagents. We focus on the recently developed procedure of Pyne et al. (2009, Proceedings of the National Academy of Sciences USA 106, 8519-8524) for automated high-dimensional flow cytometric analysis called FLAME (FLow analysis with Automated Multivariate Estimation). It introduced novel finite mixture models of heavy-tailed and asymmetric distributions to identify and model cell populations in a flow cytometric sample. This approach robustly addresses the complexities of flow data without the need for transformation or projection to lower dimensions. It also addresses the critical task of matching cell populations across samples, which enables downstream analysis. It thus facilitates application of flow cytometry to new biological and clinical problems. To facilitate pipelining with standard bioinformatic applications such as high-dimensional visualization, subject classification or outcome prediction, FLAME has been incorporated into the GenePattern package of the Broad Institute. Analysis of flow data can thereby be approached in the same way as data from other genomic platforms. We also consider some new work that proposes a rigorous and robust solution to the registration problem by a multi-level approach that allows us to model and register cell populations simultaneously across a cohort of high-dimensional flow samples. This new approach is called JCM (Joint Clustering and Matching). It enables direct and rigorous comparisons across different time points or phenotypes in a complex biological study as well as classification of new patient samples in a more clinical setting.
Estimation and testing problems in auditory neuroscience via clustering.
Hwang, Youngdeok; Wright, Samantha; Hanlon, Bret M
2017-09-01
The processing of auditory information in neurons is an important area in neuroscience. We consider statistical analysis for an electrophysiological experiment related to this area. The recorded synaptic current responses from the experiment are observed as clusters, where the number of clusters is related to an important characteristic of the auditory system. This number is difficult to estimate visually because the clusters are blurred by biological variability. Using singular value decomposition and a Gaussian mixture model, we develop an estimator for the number of clusters. Additionally, we provide a method for hypothesis testing and sample size determination in the two-sample problem. We illustrate our approach with both simulated and experimental data. © 2017, The International Biometric Society.
The Heterogeneous P-Median Problem for Categorization Based Clustering
ERIC Educational Resources Information Center
Blanchard, Simon J.; Aloise, Daniel; DeSarbo, Wayne S.
2012-01-01
The p-median offers an alternative to centroid-based clustering algorithms for identifying unobserved categories. However, existing p-median formulations typically require data aggregation into a single proximity matrix, resulting in masked respondent heterogeneity. A proposed three-way formulation of the p-median problem explicitly considers…
Scalable Nearest Neighbor Algorithms for High Dimensional Data.
Muja, Marius; Lowe, David G
2014-11-01
For many computer vision and machine learning problems, large training sets are key for good performance. However, the most computationally expensive part of many computer vision and machine learning algorithms consists of finding nearest neighbor matches to high dimensional vectors that represent the training data. We propose new algorithms for approximate nearest neighbor matching and evaluate and compare them with previous algorithms. For matching high dimensional features, we find two algorithms to be the most efficient: the randomized k-d forest and a new algorithm proposed in this paper, the priority search k-means tree. We also propose a new algorithm for matching binary features by searching multiple hierarchical clustering trees and show it outperforms methods typically used in the literature. We show that the optimal nearest neighbor algorithm and its parameters depend on the data set characteristics and describe an automated configuration procedure for finding the best algorithm to search a particular data set. In order to scale to very large data sets that would otherwise not fit in the memory of a single machine, we propose a distributed nearest neighbor matching framework that can be used with any of the algorithms described in the paper. All this research has been released as an open source library called fast library for approximate nearest neighbors (FLANN), which has been incorporated into OpenCV and is now one of the most popular libraries for nearest neighbor matching.
The Shapes of Galaxy Clusters and Related Problems
NASA Astrophysics Data System (ADS)
Yang, Abel Jiahui
2012-05-01
The cosmological many body problem describes the gravitational clustering of galaxies in the universe. These galaxies cluster about each other to produce some of the largest structures in the universe. These structures are modeled by the gravitational quasi-equilibrium distribution (GQED). The GQED is a fairly robust and simple theory based on thermodynamics and statistical mechanics that, in its simplest form, describes galaxies as point masses of equal mass. We show that more realistic descriptions of the universe only introduce higher order corrections to the theory, and the simple description is sufficient to model most cases of galaxy clustering. To demonstrate this, we use data from the Sloan Digital Sky Survey (SDSS) to show that the observed counts-in-cells distribution of galaxies in the universe follows the GQED. Using the GQED, we develop a theory to study the structure of clusters and groups of galaxies, relating the internal structure of a cluster to the large scale structure of the universe. This theory describes the probability that the galaxies in a region of space have a given kinetic and correlation potential energy. These energies are closely related to the 6-dimensional phase space configuration and thus the shape and structure of a cluster of galaxies. This theory suggests that clusters of galaxies with more than 10 members are very likely to be bound and virialized on average, but may also contain substructure in the form of smaller subclusters that cluster about each other. These subclusters may be the cores of smaller clusters that have merged, which means that the merger history of a cluster may be an important factor that determines its internal structure. Because the full 6-dimensional phase space configuration of a cluster of galaxies cannot be observed, direct comparisons with observations are not possible. Instead, we attempt to model the unobservable dimensions and show that on a statistical basis, the kinetic energies of clusters in
Distributed Computation of the knn Graph for Large High-Dimensional Point Sets
Plaku, Erion; Kavraki, Lydia E.
2009-01-01
High-dimensional problems arising from robot motion planning, biology, data mining, and geographic information systems often require the computation of k nearest neighbor (knn) graphs. The knn graph of a data set is obtained by connecting each point to its k closest points. As the research in the above-mentioned fields progressively addresses problems of unprecedented complexity, the demand for computing knn graphs based on arbitrary distance metrics and large high-dimensional data sets increases, exceeding resources available to a single machine. In this work we efficiently distribute the computation of knn graphs for clusters of processors with message passing. Extensions to our distributed framework include the computation of graphs based on other proximity queries, such as approximate knn or range queries. Our experiments show nearly linear speedup with over one hundred processors and indicate that similar speedup can be obtained with several hundred processors. PMID:19847318
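The definition of a knn graph given in this abstract (connect each point to its k closest points) can be sketched with a single-machine brute-force baseline. This is not the authors' distributed message-passing method; it assumes Euclidean distance, and the function name `knn_graph` is illustrative.

```python
from math import dist


def knn_graph(points, k):
    """Map each point index to the indices of its k nearest other points."""
    graph = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        # Brute force: sort all other points by distance to p.
        others.sort(key=lambda j: dist(p, points[j]))
        graph[i] = others[:k]
    return graph


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (5.0, 5.0)]
print(knn_graph(points, 2))
# {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [2, 1]}
```

The quadratic number of distance evaluations in this baseline is the cost that motivates distributing the computation across many processors.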
High dimensional feature reduction via projection pursuit
NASA Technical Reports Server (NTRS)
Jimenez, Luis; Landgrebe, David
1994-01-01
The recent development of more sophisticated remote sensing systems enables the measurement of radiation in many more spectral intervals than previously possible. An example of that technology is the AVIRIS system, which collects image data in 220 bands. As a result of this, new algorithms must be developed in order to analyze the more complex data effectively. Data in a high dimensional space presents a substantial challenge, since intuitive concepts valid in a 2-3 dimensional space do not necessarily apply in higher dimensional spaces. For example, high dimensional space is mostly empty. This results from the concentration of data in the corners of hypercubes. Other examples may be cited. Such observations suggest the need to project data to a subspace of a much lower dimension on a problem specific basis in such a manner that information is not lost. Projection Pursuit is a technique that will accomplish such a goal. Since it processes data in lower dimensions, it should avoid many of the difficulties of high dimensional spaces. In this paper, we begin the investigation of some of the properties of Projection Pursuit for this purpose.
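The remark that volume concentrates in the corners of hypercubes can be checked numerically. The sketch below uses the standard formula for the volume of a d-dimensional ball to compute the fraction of the unit hypercube occupied by its inscribed ball of radius 1/2. The function name is illustrative, and the choice of dimensions (including 220, the AVIRIS band count) is only for demonstration.

```python
from math import gamma, pi


def inscribed_ball_fraction(d):
    """Fraction of the unit d-cube's volume inside its inscribed ball (radius 1/2)."""
    # Volume of a d-ball of radius r: pi^(d/2) / Gamma(d/2 + 1) * r^d
    return pi ** (d / 2) / gamma(d / 2 + 1) * 0.5 ** d


for d in (2, 3, 10, 220):
    print(d, inscribed_ball_fraction(d))
```

The fraction is about 0.785 in two dimensions and falls below 0.003 by ten dimensions, so almost all of the cube's volume lies outside the central ball, that is, toward the corners.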
Statistical Physics of High Dimensional Inference
NASA Astrophysics Data System (ADS)
Advani, Madhu; Ganguli, Surya
To model modern large-scale datasets, we need efficient algorithms to infer a set of P unknown model parameters from N noisy measurements. What are fundamental limits on the accuracy of parameter inference, given limited measurements, signal-to-noise ratios, prior information, and computational tractability requirements? How can we combine prior information with measurements to achieve these limits? Classical statistics gives incisive answers to these questions as the measurement density α = N/P → ∞. However, modern high-dimensional inference problems, in fields ranging from bio-informatics to economics, occur at finite α. We formulate and analyze high-dimensional inference analytically by applying the replica and cavity methods of statistical physics where data serves as quenched disorder and inferred parameters play the role of thermal degrees of freedom. Our analysis reveals that widely cherished Bayesian inference algorithms such as maximum likelihood and maximum a posteriori are suboptimal in the modern setting, and yields new tractable, optimal algorithms to replace them as well as novel bounds on the achievable accuracy of a large class of high-dimensional inference algorithms. Thanks to Stanford Graduate Fellowship and Mind Brain Computation IGERT grant for support.
Analysis of data separation and recovery problems using clustered sparsity
NASA Astrophysics Data System (ADS)
King, Emily J.; Kutyniok, Gitta; Zhuang, Xiaosheng
2011-09-01
Data often have two or more fundamental components, like cartoon-like and textured elements in images; point, filament, and sheet clusters in astronomical data; and tonal and transient layers in audio signals. For many applications, separating these components is of interest. Another issue in data analysis is that of incomplete data, for example a photograph with scratches or seismic data collected with fewer sensors than necessary. There exists a unified approach to solving these problems, namely minimizing the ℓ1 norm of the analysis coefficients with respect to particular frame(s). This approach, using the concept of clustered sparsity, leads to similar theoretical bounds and results, which are presented here. Furthermore, necessary conditions for the frames to lead to sufficiently good solutions are also shown.
Application of clustering global optimization to thin film design problems.
Lemarchand, Fabien
2014-03-10
Refinement techniques usually calculate an optimized local solution, which is strongly dependent on the initial formula used for the thin film design. In the present study, a clustering global optimization method is used which can iteratively change this initial formula, thereby progressing further than in the case of local optimization techniques. A wide panel of local solutions is found using this procedure, resulting in a large range of optical thicknesses. The efficiency of this technique is illustrated by two thin film design problems, in particular an infrared antireflection coating, and a solar-selective absorber coating.
Visual Exploration of High Dimensional Scalar Functions
Gerber, Samuel; Bremer, Peer-Timo; Pascucci, Valerio; Whitaker, Ross
2011-01-01
An important goal of scientific data analysis is to understand the behavior of a system or process based on a sample of the system. In many instances it is possible to observe both input parameters and system outputs, and characterize the system as a high-dimensional function. Such data sets arise, for instance, in large numerical simulations, as energy landscapes in optimization problems, or in the analysis of image data relating to biological or medical parameters. This paper proposes an approach to analyzing and visualizing such data sets. The proposed method combines topological and geometric techniques to provide interactive visualizations of discretely sampled high-dimensional scalar fields. The method relies on a segmentation of the parameter space using an approximate Morse-Smale complex on the cloud of point samples. For each crystal of the Morse-Smale complex, a regression of the system parameters with respect to the output yields a curve in the parameter space. The result is a simplified geometric representation of the Morse-Smale complex in the high dimensional input domain. Finally, the geometric representation is embedded in 2D, using dimension reduction, to provide a visualization platform. The geometric properties of the regression curves enable the visualization of additional information about each crystal such as local and global shape, width, length, and sampling densities. The method is illustrated on several synthetic examples of two dimensional functions. Two use cases, using data sets from the UCI machine learning repository, demonstrate the utility of the proposed approach on real data. Finally, in collaboration with domain experts, the proposed method is applied to two scientific challenges: the analysis of parameters of climate simulations and their relationship to predicted global energy flux, and the concentrations of chemical species in a combustion simulation and their integration with temperature. PMID:20975167
Effects of Cluster Location on Human Performance on the Traveling Salesperson Problem
ERIC Educational Resources Information Center
MacGregor, James N.
2013-01-01
Most models of human performance on the traveling salesperson problem involve clustering of nodes, but few empirical studies have examined effects of clustering in the stimulus array. A recent exception varied degree of clustering and concluded that the more clustered a stimulus array, the easier a TSP is to solve (Dry, Preiss, & Wagemans,…
SMOTE for high-dimensional class-imbalanced data
2013-01-01
Background: Classification using class-imbalanced data is biased in favor of the majority class. The bias is even larger for high-dimensional data, where the number of variables greatly exceeds the number of samples. The problem can be attenuated by undersampling or oversampling, which produce class-balanced data. Generally undersampling is helpful, while random oversampling is not. Synthetic Minority Oversampling TEchnique (SMOTE) is a very popular oversampling method that was proposed to improve random oversampling but its behavior on high-dimensional data has not been thoroughly investigated. In this paper we investigate the properties of SMOTE from a theoretical and empirical point of view, using simulated and real high-dimensional data. Results: While in most cases SMOTE seems beneficial with low-dimensional data, it does not attenuate the bias towards classification in the majority class for most classifiers when data are high-dimensional, and it is less effective than random undersampling. SMOTE is beneficial for k-NN classifiers for high-dimensional data if the number of variables is reduced by performing some type of variable selection; we explain why, otherwise, the k-NN classification is biased towards the minority class. Furthermore, we show that on high-dimensional data SMOTE does not change the class-specific mean values while it decreases the data variability and introduces correlation between samples. We explain how our findings impact class prediction for high-dimensional data. Conclusions: In practice, in the high-dimensional setting only k-NN classifiers based on the Euclidean distance seem to benefit substantially from the use of SMOTE, provided that variable selection is performed before using SMOTE; the benefit is larger if more neighbors are used. SMOTE for k-NN without variable selection should not be used, because it strongly biases the classification towards the minority class. PMID:23522326
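The core SMOTE step, creating a synthetic minority sample by interpolating between a minority sample and one of its nearest minority-class neighbors, can be sketched as follows. This is a simplified variant that picks base samples at random rather than looping over every minority sample; the function name `smote`, its parameters, and the toy data are assumptions.

```python
import random
from math import dist


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority-class tuples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest minority neighbours of x (Euclidean distance), excluding x itself
        neighbours = sorted((p for j, p in enumerate(minority) if j != i),
                            key=lambda p: dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=5)
print(len(new_points))  # 5
```

Because every synthetic point lies on a segment between two existing minority samples, the class mean is preserved in expectation while variability shrinks and samples become correlated, which is consistent with the behavior the abstract reports.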
A Link-Based Approach to the Cluster Ensemble Problem.
Iam-On, Natthakan; Boongoen, Tossapon; Garrett, Simon; Price, Chris
2011-12-01
Cluster ensembles have recently emerged as a powerful alternative to standard cluster analysis, aggregating several input data clusterings to generate a single output clustering, with improved robustness and stability. From the early work, these techniques held great promise; however, most of them generate the final solution based on incomplete information of a cluster ensemble. The underlying ensemble-information matrix reflects only cluster-data point relations, while those among clusters are generally overlooked. This paper presents a new link-based approach to improve the conventional matrix. It achieves this using the similarity between clusters, estimated from a link network model of the ensemble. In particular, three new link-based algorithms are proposed for the underlying similarity assessment. The final clustering result is generated from the refined matrix using two different consensus functions of feature-based and graph-based partitioning. This approach is the first to address and explicitly employ the relationship between input partitions, which has not been emphasized by recent studies of matrix refinement. The effectiveness of the link-based approach is empirically demonstrated over 10 data sets (synthetic and real) and three benchmark evaluation measures. The results suggest the new approach is able to efficiently extract information embedded in the input clusterings, and regularly achieves higher clustering quality in comparison to several state-of-the-art techniques.
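The conventional ensemble-information matrix described in this abstract, which records only cluster-data point relations, can be sketched as a binary association matrix. The link-based refinement proposed in the paper is not implemented here; the function name `binary_association` and the toy partitions are hypothetical.

```python
def binary_association(partitions):
    """Build the binary data-point-by-cluster matrix of a cluster ensemble.

    partitions -- list of base clusterings, each a list of labels per data point
    Returns (columns, matrix), where columns[j] = (clustering index, cluster label)
    and matrix[i][j] is 1 if data point i belongs to that cluster, else 0.
    """
    n = len(partitions[0])
    columns = [(m, c) for m, labels in enumerate(partitions)
               for c in sorted(set(labels))]
    matrix = [[1 if partitions[m][i] == c else 0 for m, c in columns]
              for i in range(n)]
    return columns, matrix


partitions = [[0, 0, 1, 1], [0, 1, 1, 1]]  # two base clusterings of four points
columns, matrix = binary_association(partitions)
print(columns)  # [(0, 0), (0, 1), (1, 0), (1, 1)]
for row in matrix:
    print(row)
```

Each row contains exactly one 1 per base clustering and 0 elsewhere; as the abstract describes it, the link-based approach improves this matrix using similarity between clusters.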
A facility for using cluster research to study environmental problems. Workshop proceedings
Not Available
1991-11-01
This report begins by describing the general application of cluster based research to environmental chemistry and the development of a Cluster Structure and Dynamics Research Facility (CSDRF). Next, four important areas of cluster research are described in more detail, including how they can impact environmental problems. These are: surface-supported clusters, water and contaminant interactions, time-resolved dynamic studies in clusters, and cluster structures and reactions. The facilities and equipment required for each area of research are then presented. The appendices contain the workshop agenda and a listing of the researchers who participated in the workshop discussions that led to this report.
Quantifying Photonic High-Dimensional Entanglement
NASA Astrophysics Data System (ADS)
Martin, Anthony; Guerreiro, Thiago; Tiranov, Alexey; Designolle, Sébastien; Fröwis, Florian; Brunner, Nicolas; Huber, Marcus; Gisin, Nicolas
2017-03-01
High-dimensional entanglement offers promising perspectives in quantum information science. In practice, however, the main challenge is to devise efficient methods to characterize high-dimensional entanglement, based on the available experimental data which is usually rather limited. Here we report the characterization and certification of high-dimensional entanglement in photon pairs, encoded in temporal modes. Building upon recently developed theoretical methods, we certify an entanglement of formation of 2.09(7) ebits in a time-bin implementation, and 4.1(1) ebits in an energy-time implementation. These results are based on very limited sets of local measurements, which illustrates the practical relevance of these methods.
Sparse High Dimensional Models in Economics
Fan, Jianqing; Lv, Jinchi; Qi, Lei
2010-01-01
This paper reviews the literature on sparse high dimensional models and discusses some applications in economics and finance. Recent developments of theory, methods, and implementations in penalized least squares and penalized likelihood methods are highlighted. These variable selection methods are proved to be effective in high dimensional sparse modeling. The limits of dimensionality that regularization methods can handle, the role of penalty functions, and their statistical properties are detailed. Some recent advances in ultra-high dimensional sparse modeling are also briefly discussed. PMID:22022635
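A minimal sketch of penalized least squares with an L1 penalty (the lasso), solved by cyclic coordinate descent, may help make the idea of sparse variable selection concrete. This is a textbook-style illustration under assumed settings (simulated Gaussian design, 50 samples, 200 variables, penalty level 0.1); it is not code from the reviewed paper, and the function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z towards zero by t; values within [-t, t] become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def objective(X, y, b, lam):
    """Penalized least squares: (1/(2n))||y - Xb||^2 + lam * ||b||_1."""
    return ((y - X @ b) ** 2).sum() / (2 * len(y)) + lam * np.abs(b).sum()

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise the objective above by cyclic coordinate descent."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y - X @ b                            # current residual
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]              # remove coordinate j from the fit
            rho = X[:, j] @ r / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
            r -= X[:, j] * b[j]              # put the updated coordinate back
    return b

# Simulated sparse model: only the first three coefficients are nonzero.
rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]
y = X @ beta + 0.1 * rng.normal(size=n)
b_hat = lasso_cd(X, y, lam=0.1)
print("nonzero coefficients:", np.count_nonzero(b_hat), "of", p)
```

The soft-thresholding step is what sets most coefficients exactly to zero, which is how the penalty performs variable selection even when the number of variables exceeds the sample size.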
Pattern of clustering of menopausal problems: A study with a Bengali Hindu ethnic group.
Dasgupta, Doyel; Pal, Baidyanath; Ray, Subha
2016-01-01
We attempted to find out how menopausal problems cluster with each other. The study was conducted among a group of women belonging to a Bengali-speaking Hindu ethnic group of West Bengal, a state located in Eastern India. We recruited 1,400 participants for the study. Information on sociodemographic aspects and menopausal problems was collected from these participants with the help of a pretested questionnaire. Results of cluster analysis showed that vasomotor, vaginal, and urinary problems cluster together, separately from physical and psychosomatic problems.
High-Dimensional Profiling for Computational Diagnosis.
Lottaz, Claudio; Gronwald, Wolfram; Spang, Rainer; Engelmann, Julia C
2017-01-01
New technologies allow for high-dimensional profiling of patients. For instance, genome-wide gene expression analysis in tumors or in blood is feasible with microarrays, if all transcripts are known, or even without this restriction using high-throughput RNA sequencing. Other technologies like NMR fingerprinting allow for high-dimensional profiling of metabolites in blood or urine. Such technologies for high-dimensional patient profiling represent novel possibilities for molecular diagnostics. In clinical profiling studies, researchers aim to predict disease type, survival, or treatment response for new patients using high-dimensional profiles. In this process, they encounter a series of obstacles and pitfalls. We review fundamental issues from machine learning and recommend a procedure for the computational aspects of a clinical profiling study.
Bayesian Methods for High Dimensional Linear Models
Mallick, Himel; Yi, Nengjun
2013-01-01
In this article, we present a selective overview of some recent developments in Bayesian model and variable selection methods for high dimensional linear models. While most of the reviews in the literature are based on conventional methods, we focus on recently developed methods, which have proven to be successful in dealing with high dimensional variable selection. First, we give a brief overview of the traditional model selection methods (viz. Mallows' Cp, AIC, BIC, DIC), followed by a discussion on some recently developed methods (viz. EBIC, regularization), which have occupied the minds of many statisticians. Then, we review high dimensional Bayesian methods with a particular emphasis on Bayesian regularization methods, which have been used extensively in recent years. We conclude by briefly addressing the asymptotic behaviors of Bayesian variable selection methods for high dimensional linear models under different regularity conditions. PMID:24511433
Analyzing High-Dimensional Multispectral Data
NASA Technical Reports Server (NTRS)
Lee, Chulhee; Landgrebe, David A.
1993-01-01
In this paper, through a series of specific examples, we illustrate some characteristics encountered in analyzing high-dimensional multispectral data. The increased importance of the second-order statistics in analyzing high-dimensional data is illustrated, as is the shortcoming of classifiers such as the minimum distance classifier which rely on first-order variations alone. We also illustrate how inaccurate estimation of first- and second-order statistics, e.g., from use of training sets which are too small, affects the performance of a classifier. Recognizing the importance of second-order statistics on the one hand, but the increased difficulty in perceiving and comprehending information present in statistics derived from high-dimensional data on the other, we propose a method to aid visualization of high-dimensional statistics using a color coding scheme.
The Subspace Voyager: Exploring High-Dimensional Data along a Continuum of Salient 3D Subspace.
Wang, Bing; Mueller, Klaus
2017-02-23
Analyzing high-dimensional data and finding hidden patterns is a difficult problem and has attracted numerous research efforts. Automated methods can be useful to some extent but bringing the data analyst into the loop via interactive visual tools can help the discovery process tremendously. An inherent problem in this effort is that humans lack the mental capacity to truly understand spaces exceeding three spatial dimensions. To keep within this limitation, we describe a framework that decomposes a high-dimensional data space into a continuum of generalized 3D subspaces. Analysts can then explore these 3D subspaces individually via the familiar trackball interface while using additional facilities to smoothly transition to adjacent subspaces for expanded space comprehension. Since the number of such subspaces suffers from combinatorial explosion, we provide a set of data-driven subspace selection and navigation tools which can guide users to interesting subspaces and views. A subspace trail map allows users to manage the explored subspaces, keep their bearings, and return to interesting subspaces and views. Both trackball and trail map are each embedded into a word cloud of attribute labels which aid in navigation. We demonstrate our system via several use cases in a diverse set of application areas - cluster analysis and refinement, information discovery, and supervised training of classifiers. We also report on a user study that evaluates the usability of the various interactions our system provides.
Numerical methods for high-dimensional probability density function equations
Cho, H.; Venturi, D.; Karniadakis, G.E.
2016-01-15
In this paper we address the problem of computing the numerical solution to kinetic partial differential equations involving many phase variables. These types of equations arise naturally in many different areas of mathematical physics, e.g., in particle systems (Liouville and Boltzmann equations), stochastic dynamical systems (Fokker–Planck and Dostupov–Pugachev equations), random wave theory (Malakhov–Saichev equations) and coarse-grained stochastic systems (Mori–Zwanzig equations). We propose three different classes of new algorithms addressing high-dimensionality: The first one is based on separated series expansions resulting in a sequence of low-dimensional problems that can be solved recursively and in parallel by using alternating direction methods. The second class of algorithms relies on truncation of interaction in low-orders that resembles the Bogoliubov–Born–Green–Kirkwood–Yvon (BBGKY) framework of kinetic gas theory and it yields a hierarchy of coupled probability density function equations. The third class of algorithms is based on high-dimensional model representations, e.g., the ANOVA method and probabilistic collocation methods. A common feature of all these approaches is that they are reducible to the problem of computing the solution to high-dimensional equations via a sequence of low-dimensional problems. The effectiveness of the new algorithms is demonstrated in numerical examples involving nonlinear stochastic dynamical systems and partial differential equations, with up to 120 variables.
Feature extraction and classification algorithms for high dimensional data
NASA Technical Reports Server (NTRS)
Lee, Chulhee; Landgrebe, David
1993-01-01
Feature extraction and classification algorithms for high dimensional data are investigated. Developments with regard to sensors for Earth observation are moving in the direction of providing much higher dimensional multispectral imagery than is now possible. In analyzing such high dimensional data, processing time becomes an important factor. With large increases in dimensionality and the number of classes, processing time will increase significantly. To address this problem, a multistage classification scheme is proposed which reduces the processing time substantially by eliminating unlikely classes from further consideration at each stage. Several truncation criteria are developed and the relationship between thresholds and the error caused by the truncation is investigated. Next an approach to feature extraction for classification is proposed based directly on the decision boundaries. It is shown that all the features needed for classification can be extracted from decision boundaries. A characteristic of the proposed method arises by noting that only a portion of the decision boundary is effective in discriminating between classes, and the concept of the effective decision boundary is introduced. The proposed feature extraction algorithm has several desirable properties: it predicts the minimum number of features necessary to achieve the same classification accuracy as in the original space for a given pattern recognition problem; and it finds the necessary feature vectors. The proposed algorithm does not deteriorate under the circumstances of equal means or equal covariances as some previous algorithms do. In addition, the decision boundary feature extraction algorithm can be used both for parametric and non-parametric classifiers. Finally, some problems encountered in analyzing high dimensional data are studied and possible solutions are proposed. First, the increased importance of the second order statistics in analyzing high dimensional data is recognized
Fast Nonparametric Machine Learning Algorithms for High-Dimensional Massive Data and Applications
2006-03-01
Ting Liu. CMU-CS-06-124, March 2006. School of...
Clusters of primordial black holes and reionization problem
Belotsky, K. M.; Kirillov, A. A.; Rubin, S. G.
2015-05-15
Clusters of primordial black holes may cause the formation of quasars in the early Universe. In turn, radiation from these quasars may lead to the reionization of the Universe. However, the evaporation of primordial black holes via Hawking’s mechanism may also contribute to the ionization of matter. The possibility of matter ionization via the evaporation of primordial black holes with allowance for existing constraints on their density is discussed. The contribution to ionization from the evaporation of primordial black holes characterized by their preset mass spectrum can roughly be estimated at about 10^{-3}.
Sparse representation approaches for the classification of high-dimensional biological data
2013-01-01
Background High-throughput genomic and proteomic data have important applications in medicine including prevention, diagnosis, treatment, and prognosis of diseases, and in molecular biology, for example pathway identification. Many such applications can be formulated as classification and dimension reduction problems in machine learning. Accurately classifying such data is computationally challenging because of dimensionality, noise, and redundancy, to name a few issues. The principle of sparse representation has been applied to analyzing high-dimensional biological data within the frameworks of clustering, classification, and dimension reduction approaches. However, the existing sparse representation methods are inefficient. The kernel extensions are not well addressed either. Moreover, the sparse representation techniques have not been comprehensively studied yet in bioinformatics. Results In this paper, a Bayesian treatment is presented on sparse representations. Various sparse coding and dictionary learning models are discussed. We propose a fast parallel active-set optimization algorithm for each model. Kernel versions are devised based on their dimension-free property. These models are applied for classifying high-dimensional biological data. Conclusions In our experiment, we compared our models with other methods on both accuracy and computing time. It is shown that our models can achieve satisfactory accuracy, and that they are very efficient. PMID:24565287
Approximation algorithm for the problem of partitioning a sequence into clusters
NASA Astrophysics Data System (ADS)
Kel'manov, A. V.; Mikhailova, L. V.; Khamidullin, S. A.; Khandeev, V. I.
2017-08-01
We consider the problem of partitioning a finite sequence of Euclidean points into a given number of clusters (subsequences) using the criterion of the minimal sum (over all clusters) of intracluster sums of squared distances from the elements of the clusters to their centers. It is assumed that the center of one of the desired clusters is at the origin, while the center of each of the other clusters is unknown and determined as the mean value over all elements in this cluster. Additionally, the partition obeys two structural constraints on the indices of sequence elements contained in the clusters with unknown centers: (1) the concatenation of the indices of elements in these clusters is an increasing sequence, and (2) the difference between an index and the preceding one is bounded above and below by prescribed constants. It is shown that this problem is strongly NP-hard. A 2-approximation algorithm is constructed that is polynomial-time for a fixed number of clusters.
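To make the criterion and the two index constraints concrete, the sketch below evaluates them for a given labelling of a short sequence. It only scores a candidate partition; it does not implement the 2-approximation algorithm from the paper. The convention that label 0 marks the cluster centred at the origin, the reading of constraint (1) as "cluster labels never decrease along the sequence", and all function names are assumptions made for illustration.

```python
import numpy as np

def partition_cost(points, labels):
    """Sum over clusters of squared distances to the cluster centre.
    Label 0 is the cluster centred at the origin; other clusters use their mean."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    cost = 0.0
    for c in np.unique(labels):
        members = points[labels == c]
        centre = np.zeros(points.shape[1]) if c == 0 else members.mean(axis=0)
        cost += ((members - centre) ** 2).sum()
    return cost

def constraints_hold(labels, t_min, t_max):
    """Check (1) ordered clusters and (2) bounded gaps between consecutive
    indices of elements assigned to clusters with unknown centres."""
    labels = np.asarray(labels)
    idx = np.flatnonzero(labels != 0)        # elements outside the origin cluster
    ordered = bool(np.all(np.diff(labels[idx]) >= 0))
    gaps = np.diff(idx)
    bounded = bool(np.all((gaps >= t_min) & (gaps <= t_max)))
    return ordered and bounded

# Toy one-dimensional sequence (assumed data).
points = [[0.1], [2.0], [0.0], [4.0], [-0.1], [6.0], [8.0]]
labels = [0, 1, 0, 1, 0, 2, 2]
print(partition_cost(points, labels))
print(constraints_hold(labels, t_min=1, t_max=2))
```

For this toy input the cost is 0.02 from the origin cluster plus 2 from each of the two mean-centred clusters.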
HIGH DIMENSIONAL COVARIANCE MATRIX ESTIMATION IN APPROXIMATE FACTOR MODELS
Fan, Jianqing; Liao, Yuan; Mincheva, Martina
2012-01-01
The variance covariance matrix plays a central role in the inferential theories of high dimensional factor models in finance and economics. Popular regularization methods of directly exploiting sparsity are not directly applicable to many financial problems. Classical methods of estimating the covariance matrices are based on the strict factor models, assuming independent idiosyncratic components. This assumption, however, is restrictive in practical applications. By assuming sparse error covariance matrix, we allow the presence of the cross-sectional correlation even after taking out common factors, and it enables us to combine the merits of both methods. We estimate the sparse covariance using the adaptive thresholding technique as in Cai and Liu (2011), taking into account the fact that direct observations of the idiosyncratic components are unavailable. The impact of high dimensionality on the covariance matrix estimation based on the factor structure is then studied. PMID:22661790
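The general idea of combining a factor structure with a sparse error covariance can be sketched as follows: estimate the common part from the leading principal components of the sample covariance, then threshold the off-diagonal entries of the remaining residual covariance. This sketch uses a single constant soft threshold `tau` for simplicity, which is an assumption; the paper uses the entry-adaptive thresholding of Cai and Liu (2011), which is not reproduced here. The function name and simulated data are hypothetical.

```python
import numpy as np

def factor_threshold_cov(X, n_factors, tau):
    """Low-rank (principal components) part plus soft-thresholded residual."""
    X = X - X.mean(axis=0)
    n = X.shape[0]
    S = X.T @ X / n                           # sample covariance
    vals, vecs = np.linalg.eigh(S)
    top = np.argsort(vals)[::-1][:n_factors]  # leading eigenpairs
    low_rank = (vecs[:, top] * vals[top]) @ vecs[:, top].T
    R = S - low_rank                          # residual (idiosyncratic) covariance
    R_thr = np.sign(R) * np.maximum(np.abs(R) - tau, 0.0)
    np.fill_diagonal(R_thr, np.diag(R))       # variances are never thresholded
    return low_rank + R_thr

# Simulated one-factor data: 100 observations of 30 variables (assumed).
rng = np.random.default_rng(0)
f = rng.normal(size=(100, 1))
loadings = rng.normal(size=(1, 30))
X = f @ loadings + 0.5 * rng.normal(size=(100, 30))
S = np.cov(X, rowvar=False, bias=True)
Sigma_hat = factor_threshold_cov(X, n_factors=1, tau=0.1)
```

With `tau = 0` the estimator returns the sample covariance unchanged, and larger values of `tau` remove more of the small residual cross-correlations while keeping the factor-driven part intact.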
An Extended Membrane System with Active Membranes to Solve Automatic Fuzzy Clustering Problems.
Peng, Hong; Wang, Jun; Shi, Peng; Pérez-Jiménez, Mario J; Riscos-Núñez, Agustín
2016-05-01
This paper focuses on the automatic fuzzy clustering problem and proposes a novel automatic fuzzy clustering method that employs an extended membrane system with active membranes as its computing framework. The extended membrane system has a dynamic membrane structure; since membranes can evolve, it is particularly suitable for processing the automatic fuzzy clustering problem. A modification of a differential evolution (DE) mechanism was developed as evolution rules for objects according to membrane structure and object communication mechanisms. Under the control of both the object's evolution-communication mechanism and the membrane evolution mechanism, the extended membrane system can effectively determine the most appropriate number of clusters as well as the corresponding optimal cluster centers. The proposed method was evaluated over 13 benchmark problems and was compared with four state-of-the-art automatic clustering methods, two recently developed clustering methods and six classification techniques. The comparison results demonstrate the superiority of the proposed method in terms of effectiveness and robustness.
Problem-Solving Environments (PSEs) to Support Innovation Clustering
NASA Technical Reports Server (NTRS)
Gill, Zann
1999-01-01
This paper argues that there is need for high level concepts to inform the development of Problem-Solving Environment (PSE) capability. A traditional approach to PSE implementation is to: (1) assemble a collection of tools; (2) integrate the tools; and (3) assume that collaborative work begins after the PSE is assembled. I argue for the need to start from the opposite premise, that promoting human collaboration and observing that process comes first, followed by the development of supporting tools, and finally evolution of PSE capability through input from collaborating project teams.
Device-independent certification of high-dimensional quantum systems.
D'Ambrosio, Vincenzo; Bisesto, Fabrizio; Sciarrino, Fabio; Barra, Johanna F; Lima, Gustavo; Cabello, Adán
2014-04-11
An important problem in quantum information processing is the certification of the dimension of quantum systems without making assumptions about the devices used to prepare and measure them, that is, in a device-independent manner. A crucial question is whether such certification is experimentally feasible for high-dimensional quantum systems. Here we experimentally witness in a device-independent manner the generation of six-dimensional quantum systems encoded in the orbital angular momentum of single photons and show that the same method can be scaled, at least, up to dimension 13.
Fast Gibbs sampling for high-dimensional Bayesian inversion
NASA Astrophysics Data System (ADS)
Lucka, Felix
2016-11-01
Solving ill-posed inverse problems by Bayesian inference has recently attracted considerable attention. Compared to deterministic approaches, the probabilistic representation of the solution by the posterior distribution can be exploited to explore and quantify its uncertainties. In applications where the inverse solution is subject to further analysis procedures, this can be a significant advantage. Alongside theoretical progress, various new computational techniques allow us to sample very high dimensional posterior distributions: in (Lucka 2012 Inverse Problems 28 125012), a Markov chain Monte Carlo posterior sampler was developed for linear inverse problems with $\ell_1$-type priors. In this article, we extend this single component (SC) Gibbs-type sampler to a wide range of priors used in Bayesian inversion, such as general $\ell_p^q$ priors with additional hard constraints. In addition to a fast computation of the conditional SC densities in an explicit, parameterized form, a fast, robust and exact sampling from these one-dimensional densities is key to obtaining an efficient algorithm. We demonstrate that a generalization of slice sampling can utilize their specific structure for this task and illustrate the performance of the resulting slice-within-Gibbs samplers by different computed examples. These new samplers allow us to perform sample-based Bayesian inference in high-dimensional scenarios with certain priors for the first time, including the inversion of computed tomography data with the popular isotropic total variation prior.
NASA Astrophysics Data System (ADS)
Masood, Tabasum
2016-07-01
The distribution of galaxies in the universe can be well understood by correlation function analysis. The lowest order two point auto correlation function has remained a successful tool for understanding the galaxy clustering phenomena. The two point correlation function is a probability of finding two galaxies in a given volume separated by some particular distance. Given a random galaxy in a location, the correlation function describes the probability that another galaxy will be found within a given distance. The correlation function tool is important for theoretical models of physical cosmology because it provides a means of testing models which assume different things about the contents of the universe. The correlation function is one of the ways to characterize the distribution of galaxies in space. This can be done by observations and can be extracted from numerical N-body experiments. The correlation function is a natural quantity in the theoretical dynamical description of gravitating systems. These correlations can answer many interesting questions about the evolution and the distribution of galaxies.
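One common way to estimate a two-point correlation function from a point catalogue is the so-called natural estimator, which compares normalized pair counts in the data (DD) with those in a random, unclustered catalogue (RR) in bins of separation: xi(r) = DD(r)/RR(r) - 1. The sketch below applies it to toy two-dimensional positions; the abstract does not say which estimator the author uses, so the estimator, the toy catalogues, and the function names are assumptions for illustration.

```python
import numpy as np

def pair_counts(points, edges):
    """Histogram of separations over all distinct pairs of points."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    iu = np.triu_indices(len(points), k=1)    # count each pair once
    counts, _ = np.histogram(dist[iu], bins=edges)
    return counts

def xi_natural(data, randoms, edges):
    """Natural estimator: normalized DD / normalized RR - 1, per bin."""
    dd = pair_counts(data, edges) / (len(data) * (len(data) - 1) / 2)
    rr = pair_counts(randoms, edges) / (len(randoms) * (len(randoms) - 1) / 2)
    return dd / rr - 1.0

rng = np.random.default_rng(0)
# Toy clustered catalogue: 20 clumps of 15 points each in the unit square.
centres = rng.random((20, 2))
data = np.repeat(centres, 15, axis=0) + 0.01 * rng.normal(size=(300, 2))
# Unclustered comparison catalogue.
randoms = rng.random((600, 2))
edges = np.array([0.0, 0.05, 0.1, 0.2, 0.4])
xi = xi_natural(data, randoms, edges)
print(np.round(xi, 2))
```

A positive value in a bin indicates an excess of pairs at that separation relative to a random distribution, which for the clumpy toy catalogue shows up most strongly in the smallest-separation bin.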
Locating landmarks on high-dimensional free energy surfaces.
Chen, Ming; Yu, Tang-Qing; Tuckerman, Mark E
2015-03-17
Coarse graining of complex systems possessing many degrees of freedom can often be a useful approach for analyzing and understanding key features of these systems in terms of just a few variables. The relevant energy landscape in a coarse-grained description is the free energy surface as a function of the coarse-grained variables, which, despite the dimensional reduction, can still be an object of high dimension. Consequently, navigating and exploring this high-dimensional free energy surface is a nontrivial task. In this paper, we use techniques from multiscale modeling, stochastic optimization, and machine learning to devise a strategy for locating minima and saddle points (termed "landmarks") on a high-dimensional free energy surface "on the fly" and without requiring prior knowledge of or an explicit form for the surface. In addition, we propose a compact graph representation of the landmarks and connections between them, and we show that the graph nodes can be subsequently analyzed and clustered based on key attributes that elucidate important properties of the system. Finally, we show that knowledge of landmark locations allows for the efficient determination of their relative free energies via enhanced sampling techniques.
Symptom Clusters in Adults With Chronic Health Problems and Cancer as a Comorbidity
Bender, Catherine M; Engberg, Sandra J; Donovan, Heidi S; Cohen, Susan M; Houze, Martin P; Rosenzweig, Margaret Q; Mallory, Gail A; Dunbar-Jacob, Jacqueline; Sereika, Susan M
2010-01-01
Purpose/Objectives To identify and compare symptom clusters in individuals with chronic health problems with cancer as a comorbidity versus individuals with chronic health problems who do not have cancer as a comorbidity and to explore the effect of symptoms on their quality of life. Design Secondary analysis of data from two studies. Study 1 was an investigation of the efficacy of an intervention to improve medication adherence in patients with rheumatoid arthritis (RA). Study 2 was an investigation of the efficacy of an intervention for urinary incontinence (UI) in older adults. Setting School of Nursing at the University of Pittsburgh. Sample The sample for study 1 was comprised of 639 adults with RA. The sample for study 2 was comprised of 407 adults with UI. A total of 154 (15%) subjects had a history of cancer, 56 (9%) of the subjects with RA and 98 (25%) of the subjects with UI. Methods Analysis of existing comorbidity and symptom data collected from both studies. Main Research Variables Symptom clusters, chronic disease, and cancer as a comorbidity. Findings Individuals with chronic health problems who have cancer may not have unique symptom clusters compared to individuals with chronic health problems who do not have cancer. Conclusions The symptom clusters experienced by the study participants may be more related to their primary chronic health problems and comorbidities. Implications for Nursing Additional studies are needed to examine symptom clusters in cancer survivors. As individuals are living longer with the disease, a comprehensive understanding of the symptom clusters that may be unique to cancer survivors with comorbidities is critical. PMID:18192145
Joint Adaptive Mean-Variance Regularization and Variance Stabilization of High Dimensional Data
Dazard, Jean-Eudes; Rao, J. Sunil
2012-01-01
The paper addresses a common problem in the analysis of high-dimensional high-throughput “omics” data, which is parameter estimation across multiple variables in a set of data where the number of variables is much larger than the sample size. Among the problems posed by this type of data are that variable-specific estimators of variances are not reliable and variable-wise test statistics have low power, both due to a lack of degrees of freedom. In addition, it has been observed in this type of data that the variance increases as a function of the mean. We introduce a non-parametric adaptive regularization procedure that is innovative in that: (i) it employs a novel “similarity statistic”-based clustering technique to generate local-pooled or regularized shrinkage estimators of population parameters, (ii) the regularization is done jointly on population moments, benefiting from C. Stein's result on inadmissibility, which implies that the usual sample variance estimator is improved by a shrinkage estimator using information contained in the sample mean. From these joint regularized shrinkage estimators, we derive regularized t-like statistics and show in simulation studies that they offer more statistical power in hypothesis testing than their standard sample counterparts, or regular common value-shrinkage estimators, or when the information contained in the sample mean is simply ignored. Finally, we show that these estimators feature interesting properties of variance stabilization and normalization that can be used for preprocessing high-dimensional multivariate data. The method is available as an R package, called ‘MVR’ (‘Mean-Variance Regularization’), downloadable from the CRAN website. PMID:22711950
Modifiable temporal unit problem (MTUP) and its effect on space-time cluster detection.
Cheng, Tao; Adepeju, Monsuru
2014-01-01
When analytical techniques are used to understand and analyse geographical events, adjustments to the datasets (e.g. aggregation, zoning, segmentation etc.) in both the spatial and temporal dimensions are often carried out for various reasons. The 'Modifiable Areal Unit Problem' (MAUP), which is a consequence of adjustments in the spatial dimension, has been widely researched. However, its temporal counterpart is generally ignored, especially in space-time analysis. In analogy to MAUP, the Modifiable Temporal Unit Problem (MTUP) is defined as consisting of three temporal effects (aggregation, segmentation and boundary). The effects of MTUP on the detection of space-time clusters of crime datasets of Central London are examined using Space-Time Scan Statistics (STSS). The case study reveals that MTUP has significant effects on the space-time clusters detected. The attributes of the clusters, i.e. temporal duration, spatial extent (size) and significance value (p-value), vary as the aggregation, segmentation and boundaries of the datasets change. Aggregation could be used to find the significant clusters much more quickly than at lower scales; segmentation could be used to understand the cyclic patterns of crime types. The consistencies of the clusters appearing at different temporal scales could help in identifying strong or 'true' clusters.
A parallel solution to the cutting stock problem for a cluster of workstations
Nicklas, L.D.; Atkins, R.W.; Setia, S.V.; Wang, P.Y.
1996-12-31
This paper describes the design and implementation of a solution to the constrained 2-D cutting stock problem on a cluster of workstations. The constrained 2-D cutting stock problem is an irregular problem with a dynamically modified global data set and irregular amounts and patterns of communication. A replicated data structure is used for the parallel solution since the ratio of reads to writes is known to be large. Mutual exclusion and consistency are maintained using a token-based lazy consistency mechanism, and a randomized protocol for dynamically balancing the distributed work queue is employed. Speedups are reported for three benchmark problems executed on a cluster of workstations interconnected by a 10 Mbps Ethernet.
ANISOTROPIC THERMAL CONDUCTION AND THE COOLING FLOW PROBLEM IN GALAXY CLUSTERS
Parrish, Ian J.; Sharma, Prateek; Quataert, Eliot
2009-09-20
We examine the long-standing cooling flow problem in galaxy clusters with three-dimensional magnetohydrodynamics simulations of isolated clusters including radiative cooling and anisotropic thermal conduction along magnetic field lines. The central regions of the intracluster medium (ICM) can have cooling timescales of ~200 Myr or shorter; in order to prevent a cooling catastrophe the ICM must be heated by some mechanism such as active galactic nucleus feedback or thermal conduction from the thermal reservoir at large radii. The cores of galaxy clusters are linearly unstable to the heat-flux-driven buoyancy instability (HBI), which significantly changes the thermodynamics of the cluster core. The HBI is a convective, buoyancy-driven instability that rearranges the magnetic field to be preferentially perpendicular to the temperature gradient. For a wide range of parameters, our simulations demonstrate that in the presence of the HBI, the effective radial thermal conductivity is reduced to ≲10% of the full Spitzer conductivity. With this suppression of conductive heating, the cooling catastrophe occurs on a timescale comparable to the central cooling time of the cluster. Thermal conduction alone is thus unlikely to stabilize clusters with low central entropies and short central cooling timescales. High central entropy clusters have sufficiently long cooling times that conduction can help stave off the cooling catastrophe for cosmologically interesting timescales.
Exhaustive enumeration unveils clustering and freezing in the random 3-satisfiability problem
NASA Astrophysics Data System (ADS)
Ardelius, John; Zdeborová, Lenka
2008-10-01
We study geometrical properties of the complete set of solutions of the random 3-satisfiability problem. We show that even for moderate system sizes the number of clusters corresponds surprisingly well with the theoretic asymptotic prediction. We locate the freezing transition in the space of solutions, which has been conjectured to be relevant in explaining the onset of computational hardness in random constraint satisfaction problems.
Technical Report: Scalable Parallel Algorithms for High Dimensional Numerical Integration
Masalma, Yahya; Jiao, Yu
2010-10-01
We implemented a scalable parallel quasi-Monte Carlo numerical high-dimensional integration for tera-scale data points. The implemented algorithm uses Sobol's quasi-sequences to generate random samples. Sobol's sequence was used to avoid clustering effects in the generated random samples and to produce low-discrepancy random samples which cover the entire integration domain. The performance of the algorithm was tested. Obtained results prove the scalability and accuracy of the implemented algorithms. The implemented algorithm could be used in different applications where a huge data volume is generated and numerical integration is required. We suggest using the hybrid MPI and OpenMP programming model to improve the performance of the algorithms. If the mixed model is used, attention should be paid to the scalability and accuracy.
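As a rough illustration of quasi-Monte Carlo integration with a low-discrepancy sequence, the sketch below averages an integrand over deterministic points in the unit cube. It deliberately swaps in the simpler Halton sequence for the Sobol sequence used in the report, runs serially rather than with MPI or OpenMP, and uses a test integrand chosen here only because its exact integral is known; none of the function names come from the report.

```python
import math

def radical_inverse(index, base):
    """Van der Corput radical inverse of a positive integer in the given base."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result

def halton_point(index, bases):
    """One Halton point: a radical inverse per dimension, using coprime bases."""
    return [radical_inverse(index, base) for base in bases]

def qmc_integrate(func, bases, n_points):
    """Average func over the first n_points Halton points in the unit cube."""
    total = 0.0
    for i in range(1, n_points + 1):
        total += func(halton_point(i, bases))
    return total / n_points

def integrand(x):
    # Product of 2*x_k over the dimensions; its exact integral over the unit cube is 1.
    return math.prod(2.0 * xk for xk in x)

estimate = qmc_integrate(integrand, bases=(2, 3, 5), n_points=4096)
print(estimate)  # should be close to the exact value 1
```

Because each point depends only on its index, the loop over `i` could in principle be split across workers without communication, which is one reason such sequences suit parallel integration.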
ERIC Educational Resources Information Center
Brusco, Michael J.; Kohn, Hans-Friedrich
2009-01-01
The clique partitioning problem (CPP) requires the establishment of an equivalence relation for the vertices of a graph such that the sum of the edge costs associated with the relation is minimized. The CPP has important applications for the social sciences because it provides a framework for clustering objects measured on a collection of nominal…
ERIC Educational Resources Information Center
Raver, C. Cybele; Jones, Stephanie M.; Li-Grining, Christine; Zhai, Fuhua; Metzger, Molly W.; Solomon, Bonnie
2009-01-01
The present study evaluated the efficacy of a multicomponent, classroom-based intervention in reducing preschoolers' behavior problems. The Chicago School Readiness Project model was implemented in 35 Head Start classrooms using a clustered-randomized controlled trial design. Results indicate significant treatment effects (ds = 0.53-0.89) for…
Modifiable Temporal Unit Problem (MTUP) and Its Effect on Space-Time Cluster Detection
Cheng, Tao; Adepeju, Monsuru
2014-01-01
Background When analytical techniques are used to understand and analyse geographical events, adjustments to the datasets (e.g. aggregation, zoning, segmentation etc.) in both the spatial and temporal dimensions are often carried out for various reasons. The ‘Modifiable Areal Unit Problem’ (MAUP), which is a consequence of adjustments in the spatial dimension, has been widely researched. However, its temporal counterpart is generally ignored, especially in space-time analysis. Methods In analogy to MAUP, the Modifiable Temporal Unit Problem (MTUP) is defined as consisting of three temporal effects (aggregation, segmentation and boundary). The effects of MTUP on the detection of space-time clusters of crime datasets of Central London are examined using Space-Time Scan Statistics (STSS). Results and Conclusion The case study reveals that MTUP has significant effects on the space-time clusters detected. The attributes of the clusters, i.e. temporal duration, spatial extent (size) and significance value (p-value), vary as the aggregation, segmentation and boundaries of the datasets change. Aggregation could be used to find the significant clusters much more quickly than at lower scales; segmentation could be used to understand the cyclic patterns of crime types. The consistencies of the clusters appearing at different temporal scales could help in identifying strong or ‘true’ clusters. PMID:24971885
Asymptotics of empirical eigenstructure for high dimensional spiked covariance
Wang, Weichen
2017-01-01
We derive the asymptotic distributions of the spiked eigenvalues and eigenvectors under a generalized and unified asymptotic regime, which takes into account the magnitude of spiked eigenvalues, sample size, and dimensionality. This regime allows high dimensionality and diverging eigenvalues and provides new insights into the roles that the leading eigenvalues, sample size, and dimensionality play in principal component analysis. Our results are a natural extension of those in Paul (2007) to a more general setting and solve the rates of convergence problems in Shen et al. (2013). They also reveal the biases of estimating leading eigenvalues and eigenvectors by using principal component analysis, and lead to a new covariance estimator for the approximate factor model, called shrinkage principal orthogonal complement thresholding (S-POET), that corrects the biases. Our results are successfully applied to outstanding problems in estimation of risks of large portfolios and false discovery proportions for dependent test statistics and are illustrated by simulation studies. PMID:28835726
A novel approach to the problem of non-uniqueness of the solution in hierarchical clustering.
Cattinelli, Isabella; Valentini, Giorgio; Paulesu, Eraldo; Borghese, Nunzio Alberto
2013-07-01
The existence of multiple solutions in clustering, and in hierarchical clustering in particular, is often ignored in practical applications. However, this is a non-trivial problem, as different data orderings can result in different cluster sets that, in turn, may lead to different interpretations of the same data. The method presented here offers a solution to this issue. It is based on the definition of an equivalence relation over dendrograms that allows developing all and only the significantly different dendrograms for the same dataset, thus reducing the computational complexity to polynomial from the exponential obtained when all possible dendrograms are considered. Experimental results in the neuroimaging and bioinformatics domains show the effectiveness of the proposed method.
Spatially Weighted Principal Component Regression for High-dimensional Prediction
Shen, Dan; Zhu, Hongtu
2015-01-01
We consider the problem of using high dimensional data residing on graphs to predict a low-dimensional outcome variable, such as disease status. Examples of data include time series and genetic data measured on linear graphs and imaging data measured on triangulated graphs (or lattices), among many others. Many of these data have two key features including spatial smoothness and intrinsically low dimensional structure. We propose a simple solution based on a general statistical framework, called spatially weighted principal component regression (SWPCR). In SWPCR, we introduce two sets of weights including importance score weights for the selection of individual features at each node and spatial weights for the incorporation of the neighboring pattern on the graph. We integrate the importance score weights with the spatial weights in order to recover the low dimensional structure of high dimensional data. We demonstrate the utility of our methods through extensive simulations and a real data analysis based on Alzheimer’s disease neuroimaging initiative data. PMID:26213452
On Class Visualisation for High Dimensional Data: Exploring Scientific Data Sets
NASA Astrophysics Data System (ADS)
Kaban, Ata; Sun, Jianyong; Raychaudhury, Somak; Nolan, Louisa
2006-10-01
Parametric Embedding (PE) has recently been proposed as a general-purpose algorithm for class visualisation. It takes class posteriors produced by a mixture-based clustering algorithm and projects them in 2D for visualisation. However, although this fully modularised combination of objectives (clustering and projection) is attractive for its conceptual simplicity, in the case of high dimensional data, we show that a more optimal combination of these objectives can be achieved by integrating them both into a consistent probabilistic model. In this way, the projection step will fulfil a role of regularisation, guarding against the curse of dimensionality. As a result, the tradeoff between clustering and visualisation turns out to enhance the predictive abilities of the overall model. We present results on both synthetic data and two real-world high-dimensional data sets: observed spectra of early-type galaxies and gene expression arrays.
A Selective Overview of Variable Selection in High Dimensional Feature Space
Fan, Jianqing
2010-01-01
High dimensional statistical problems arise from diverse fields of scientific research and technological development. Variable selection plays a pivotal role in contemporary statistical learning and scientific discoveries. The traditional idea of best subset selection methods, which can be regarded as a specific form of penalized likelihood, is computationally too expensive for many modern statistical applications. Other forms of penalized likelihood methods have been successfully developed over the last decade to cope with high dimensionality. They have been widely applied for simultaneously selecting important variables and estimating their effects in high dimensional statistical inference. In this article, we present a brief account of the recent developments of theory, methods, and implementations for high dimensional variable selection. What limits of the dimensionality such methods can handle, what the role of penalty functions is, and what the statistical properties are rapidly drive the advances of the field. The properties of non-concave penalized likelihood and its roles in high dimensional statistical modeling are emphasized. We also review some recent advances in ultra-high dimensional variable selection, with emphasis on independence screening and two-scale methods. PMID:21572976
NASA Astrophysics Data System (ADS)
Konno, Yohko; Suzuki, Keiji
This paper describes an approach to developing a general-purpose solution algorithm for large-scale problems, using “Local Clustering Organization (LCO)” as a new solution method for the job-shop scheduling problem (JSP). Building on the effective performance of the usual LCO in large-scale scheduling, we examine how to solve the JSP so that better solutions are obtained stably. To improve solution performance for the JSP, the optimization process of LCO is examined, and the scheduling solution structure is extended to a new solution structure based on machine division. A solving method that introduces effective local clustering for this solution structure is proposed as an extended LCO. The extended LCO has an algorithm that efficiently improves the schedule evaluation by clustering a parallel search that extends over multiple machines. Applying the extended LCO to problems of various scales verified that it contributes to minimizing the makespan and improving the stability of performance.
NASA Astrophysics Data System (ADS)
Chen, L. X.; Wu, Q. P.
2012-10-01
Recently, Dada et al. reported on the experimental entanglement concentration and violation of generalized Bell inequalities with orbital angular momentum (OAM) [Nat. Phys. 7, 677 (2011)]. Here we demonstrate that the high-dimensional entanglement concentration can be performed in arbitrary OAM subspaces with selectivity. Instead of violating the generalized Bell inequalities, the working principle of present entanglement concentration is visualized by the biphoton OAM Klyshko picture, and its good performance is confirmed and quantified through the experimental Shannon dimensionalities after concentration.
Classification of high dimensional multispectral image data
NASA Technical Reports Server (NTRS)
Hoffbeck, Joseph P.; Landgrebe, David A.
1993-01-01
A method for classifying high dimensional remote sensing data is described. The technique uses a radiometric adjustment to allow a human operator to identify and label training pixels by visually comparing the remotely sensed spectra to laboratory reflectance spectra. Training pixels for material without obvious spectral features are identified by traditional means. Features which are effective for discriminating between the classes are then derived from the original radiance data and used to classify the scene. This technique is applied to Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) data taken over Cuprite, Nevada in 1992, and the results are compared to an existing geologic map. This technique performed well even with noisy data and the fact that some of the materials in the scene lack absorption features. No adjustment for the atmosphere or other scene variables was made to the data classified. While the experimental results compare favorably with an existing geologic map, the primary purpose of this research was to demonstrate the classification method, as compared to the geology of the Cuprite scene.
Graphics Processing Units and High-Dimensional Optimization.
Zhou, Hua; Lange, Kenneth; Suchard, Marc A
2010-08-01
This paper discusses the potential of graphics processing units (GPUs) in high-dimensional optimization problems. A single GPU card with hundreds of arithmetic cores can be inserted in a personal computer and dramatically accelerates many statistical algorithms. To exploit these devices fully, optimization algorithms should reduce to multiple parallel tasks, each accessing a limited amount of data. These criteria favor EM and MM algorithms that separate parameters and data. To a lesser extent block relaxation and coordinate descent and ascent also qualify. We demonstrate the utility of GPUs in nonnegative matrix factorization, PET image reconstruction, and multidimensional scaling. Speedups of 100 fold can easily be attained. Over the next decade, GPUs will fundamentally alter the landscape of computational statistics. It is time for more statisticians to get on-board.
Graphics Processing Units and High-Dimensional Optimization
Zhou, Hua; Lange, Kenneth; Suchard, Marc A.
2011-01-01
This paper discusses the potential of graphics processing units (GPUs) in high-dimensional optimization problems. A single GPU card with hundreds of arithmetic cores can be inserted in a personal computer and dramatically accelerates many statistical algorithms. To exploit these devices fully, optimization algorithms should reduce to multiple parallel tasks, each accessing a limited amount of data. These criteria favor EM and MM algorithms that separate parameters and data. To a lesser extent block relaxation and coordinate descent and ascent also qualify. We demonstrate the utility of GPUs in nonnegative matrix factorization, PET image reconstruction, and multidimensional scaling. Speedups of 100 fold can easily be attained. Over the next decade, GPUs will fundamentally alter the landscape of computational statistics. It is time for more statisticians to get on-board. PMID:21847315
Class prediction for high-dimensional class-imbalanced data
2010-01-01
Background The goal of class prediction studies is to develop rules to accurately predict the class membership of new samples. The rules are derived using the values of the variables available for each subject: the main characteristic of high-dimensional data is that the number of variables greatly exceeds the number of samples. Frequently the classifiers are developed using class-imbalanced data, i.e., data sets where the number of samples in each class is not equal. Standard classification methods used on class-imbalanced data often produce classifiers that do not accurately predict the minority class; the prediction is biased towards the majority class. In this paper we investigate if the high-dimensionality poses additional challenges when dealing with class-imbalanced prediction. We evaluate the performance of six types of classifiers on class-imbalanced data, using simulated data and a publicly available data set from a breast cancer gene-expression microarray study. We also investigate the effectiveness of some strategies that are available to overcome the effect of class imbalance. Results Our results show that the evaluated classifiers are highly sensitive to class imbalance and that variable selection introduces an additional bias towards classification into the majority class. Most new samples are assigned to the majority class from the training set, unless the difference between the classes is very large. As a consequence, the class-specific predictive accuracies differ considerably. When the class imbalance is not too severe, down-sizing and asymmetric bagging embedding variable selection work well, while over-sampling does not. Variable normalization can further worsen the performance of the classifiers. Conclusions Our results show that matching the prevalence of the classes in training and test set does not guarantee good performance of classifiers and that the problems related to classification with class-imbalanced data are exacerbated when
Hoffart Lunding, Synve; Hoffart, Asle
2016-01-01
The objective of this paper was to examine the relationships between perceived parental bonding, Early Maladaptive Schemas (Young et al., 2003), and outcome of schema therapy of Cluster C personality problems and whether the perceptions of parental bonding could be influenced by schema therapy. The sample consisted of 45 patients with panic disorder and/or agoraphobia and Diagnostic and Statistical Manual of Mental Disorders, fourth edition, Cluster C personality traits who participated in an 11-week inpatient programme consisting of two phases; the first was a 5-week panic/agoraphobia-focused cognitive therapy, whereas the second phase was a personality-focused schema therapy. The patients were assessed at pre-treatment, mid-treatment and post-treatment and at 1-year follow-up. Opposite to our hypothesis, lower paternal care at pre-treatment was related to more reduction in Cluster C personality traits from pre-treatment to 1-year follow-up. Maternal protection was related to the schema domains of impaired autonomy and exaggerated standards. Overall schema severity and the schema emotional inhibition at pre-treatment were associated with less change in Cluster C traits. Perceived maternal care was reduced from pre-treatment to 1-year follow-up, and more reduction in maternal care was related to less reduction in Cluster C traits. Parental bonding failed to predict treatment outcome in the expected direction, but maternal protection was related to two of the schema domains. Overall schema severity and the particular schema emotional inhibition predicted outcome. Furthermore, perceived maternal care was reduced from before to after treatment. Future studies should examine these questions in larger samples of Cluster C patients receiving schema therapy of a longer duration. Most schemas within the impaired autonomy domain and the schema self-sacrifice seem to be related to low perceived maternal protection. Overall schema severity and the schema emotional inhibition
A model-based cluster analysis approach to adolescent problem behaviors and young adult outcomes.
Mun, Eun Young; Windle, Michael; Schainker, Lisa M
2008-01-01
Data from a community-based sample of 1,126 10th- and 11th-grade adolescents were analyzed using a model-based cluster analysis approach to empirically identify heterogeneous adolescent subpopulations from the person-oriented and pattern-oriented perspectives. The model-based cluster analysis is a new clustering procedure to investigate population heterogeneity utilizing finite mixture multivariate normal densities and accordingly to classify subpopulations using more rigorous statistical procedures for the comparison of alternative models. Four cluster groups were identified and labeled multiproblem high-risk, smoking high-risk, normative, and low-risk groups. The multiproblem high-risk group exhibited a constellation of high levels of problem behaviors, including delinquent and sexual behaviors, multiple illicit substance use, and depressive symptoms at age 16. They had risky temperamental attributes and lower academic functioning and educational expectations at age 15.5 and, subsequently, at age 24 completed fewer years of education, and reported lower levels of physical health and higher levels of continued involvement in substance use and abuse. The smoking high-risk group was also found to be at risk for poorer functioning in young adulthood, compared to the low-risk group. The normative and the low-risk groups were, by and large, similar in their adolescent and young adult functioning. The continuity and comorbidity path from middle adolescence to young adulthood may be aided and abetted by chronic as well as episodic substance use by adolescents.
High dimensional decision dilemmas in climate models
NASA Astrophysics Data System (ADS)
Bracco, A.; Neelin, J. D.; Luo, H.; McWilliams, J. C.; Meyerson, J. E.
2013-10-01
An important source of uncertainty in climate models is linked to the calibration of model parameters. Interest in systematic and automated parameter optimization procedures stems from the desire to improve the model climatology and to quantify the average sensitivity associated with potential changes in the climate system. Building on the smoothness of the response of an atmospheric circulation model (AGCM) to changes of four adjustable parameters, Neelin et al. (2010) used a quadratic metamodel to objectively calibrate the AGCM. The metamodel accurately estimates global spatial averages of common fields of climatic interest, from precipitation, to low and high level winds, from temperature at various levels to sea level pressure and geopotential height, while providing a computationally cheap strategy to explore the influence of parameter settings. Here, guided by the metamodel, the ambiguities or dilemmas related to the decision making process in relation to model sensitivity and optimization are examined. Simulations of current climate are subject to considerable regional-scale biases. Those biases may vary substantially depending on the climate variable considered, and/or on the performance metric adopted. Common dilemmas are associated with model revisions yielding improvement in one field or regional pattern or season, but degradation in another, or improvement in the model climatology but degradation in the interannual variability representation. Challenges are posed to the modeler by the high dimensionality of the model output fields and by the large number of adjustable parameters. The use of the metamodel in the optimization strategy helps visualize trade-offs at a regional level, e.g., how mismatches between sensitivity and error spatial fields yield regional errors under minimization of global objective functions.
Statistical Machine Learning for Structured and High Dimensional Data
2014-09-17
AFRL-OSR-VA-TR-2014-0234. Statistical Machine Learning for Structured and High Dimensional Data. Larry Wasserman, Carnegie Mellon University. Final report, Dec 2009 - Aug 2014 (dated 14-06-2014). ...area of resource-constrained statistical estimation. Subject terms: machine learning, high-dimensional statistics. Responsible person: John Lafferty.
Anomaly Detection in Large Sets of High-Dimensional Symbol Sequences
NASA Technical Reports Server (NTRS)
Budalakoti, Suratna; Srivastava, Ashok N.; Akella, Ram; Turkov, Eugene
2006-01-01
This paper addresses the problem of detecting and describing anomalies in large sets of high-dimensional symbol sequences. The approach taken uses unsupervised clustering of sequences using the normalized longest common subsequence (LCS) as a similarity measure, followed by detailed analysis of outliers to detect anomalies. As the LCS measure is expensive to compute, the first part of the paper discusses existing algorithms, such as the Hunt-Szymanski algorithm, that have low time-complexity. We then discuss why these algorithms often do not work well in practice and present a new hybrid algorithm for computing the LCS that, in our tests, outperforms the Hunt-Szymanski algorithm by a factor of five. The second part of the paper presents new algorithms for outlier analysis that provide comprehensible indicators as to why a particular sequence was deemed to be an outlier. The algorithms provide a coherent description to an analyst of the anomalies in the sequence, compared to more normal sequences. The algorithms we present are general and domain-independent, so we discuss applications in related areas such as anomaly detection.
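A minimal sketch of an LCS-based similarity between two symbol sequences follows. It uses the textbook dynamic-programming recurrence, not the Hunt-Szymanski algorithm or the hybrid algorithm described in the paper, and the normalization by the geometric mean of the two lengths is an assumption made here for illustration; the abstract does not state the exact formula.

```python
import math

def lcs_length(a, b):
    """Length of the longest common subsequence, via the classic O(len(a)*len(b)) DP."""
    previous = [0] * (len(b) + 1)
    for symbol_a in a:
        current = [0]
        for j, symbol_b in enumerate(b, start=1):
            if symbol_a == symbol_b:
                # Extend the best subsequence that ended before both symbols.
                current.append(previous[j - 1] + 1)
            else:
                # Otherwise carry over the better of skipping either symbol.
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]

def normalized_lcs(a, b):
    """LCS length divided by sqrt(len(a) * len(b)); assumed normalization, in [0, 1]."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / math.sqrt(len(a) * len(b))

print(lcs_length("ABCBDAB", "BDCABA"))   # classic textbook pair
print(normalized_lcs("abc", "abc"))      # identical sequences
```

Only two rows of the table are kept at a time, so memory grows with the shorter dimension of the loop rather than with the product of the lengths; the quadratic running time is what motivates the faster algorithms the paper discusses.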
Clustering analysis of the ground-state structure of the vertex-cover problem
NASA Astrophysics Data System (ADS)
Barthel, Wolfgang; Hartmann, Alexander K.
2004-12-01
Vertex cover is one of the classical NP-complete problems in theoretical computer science. A vertex cover of a graph is a subset of vertices such that for each edge at least one of the two endpoints is contained in the subset. When studied on Erdős-Rényi random graphs (with connectivity c) one observes a threshold behavior: In the thermodynamic limit the size of the minimal vertex cover is independent of the specific graph. Recent analytical studies show that on the phase boundary, for small connectivities c
NASA Astrophysics Data System (ADS)
Heggie, D.; Hut, P.
2003-10-01
focus on N = 10^6 for two main reasons: first, direct numerical integrations of N-body systems are beginning to approach this threshold, and second, globular star clusters provide remarkably accurate physical instantiations of the idealized N-body problem with N = 10^5 - 10^6. The authors are distinguished contributors to the study of star-cluster dynamics and the gravitational N-body problem. The book contains lucid and concise descriptions of most of the important tools in the subject, with only a modest bias towards the authors' own interests. These tools include the two-body relaxation approximation, the Vlasov and Fokker-Planck equations, regularization of close encounters, conducting fluid models, Hill's approximation, Heggie's law for binary star evolution, symplectic integration algorithms, Liapunov exponents, and so on. The book also provides an up-to-date description of the principal processes that drive the evolution of idealized N-body systems - two-body relaxation, mass segregation, escape, core collapse and core bounce, binary star hardening, gravothermal oscillations - as well as additional processes such as stellar collisions and tidal shocks that affect real star clusters but not idealized N-body systems. In a relatively short (300 pages plus appendices) book such as this, many topics have to be omitted. The reader who is hoping to learn about the phenomenology of star clusters will be disappointed, as the description of their properties is limited to only a page of text; there is also almost no discussion of other, equally interesting N-body systems such as galaxies (N ≈ 10^6 - 10^12), open clusters (N ≃ 10^2 - 10^4), planetary systems, or the star clusters surrounding black holes that are found in the centres of most galaxies. All of these omissions are defensible decisions.
Less defensible is the uneven set of references in the text; for example, nowhere is the reader informed that the classic predecessor to this work was Spitzer's 1987 monograph
Visualization of High-Dimensionality Data Using Virtual Reality
NASA Astrophysics Data System (ADS)
Djorgovski, S. G.; Donalek, C.; Davidoff, S.; Lombeyda, S.
2015-12-01
An effective visualization of complex and high-dimensionality data sets is now a critical bottleneck on the path from data to discovery in all fields. Visual pattern recognition is the bridge between human intuition and understanding, and the quantitative content of the data and the relationships present there (correlations, outliers, clustering, etc.). We are developing a novel platform for visualization of complex, multi-dimensional data, using immersive virtual reality (VR), that leverages the recent rapid developments in the availability of commodity hardware and development software. VR immersion has been shown to significantly increase the effective visual perception and intuition, compared to the traditional flat-screen tools. This makes it easier to perceive higher dimensional spaces, with an advantage for a visual exploration of complex data compared to the traditional visualization methods. Immersive VR also offers a natural way for a collaborative visual exploration of data, with multiple users interacting with each other and with their data in the same perceptive data space.
Matton, Annelies; Goossens, Lien; Braet, Caroline; Vervaet, Myriam
2013-05-01
Little is known about the role of sensitivity to punishment (SP) and reward (SR) in eating problems during adolescence. Therefore, the aim of the present study was to examine the naturally occurring clusters of high and low SP and SR among nonclinical adolescents and the between-cluster differences in various eating problems and weight. A total of 579 adolescents (14-19 years, 39.8% boys) completed the Sensitivity to Punishment and Sensitivity to Reward Questionnaire (SPSRQ), the Behavioural Inhibition System and Behavioural Activation System scales (BIS/BAS scales), the Dutch Eating Behaviour Questionnaire and the Child Eating Disorder Examination Questionnaire and were weighed and measured. On the basis of the SPSRQ, four clusters were established, interpreted as lowSP × lowSR, lowSP × highSR, highSP × highSR and highSP × lowSR. These were associated with eating problems but not with adjusted body mass index. It seemed that specifically the highSP × highSR cluster outscored the other clusters on eating problems. These results were partly replicated with the BIS/BAS scales, although less significant relations between the clusters and eating problems were found. The implications of the findings in terms of possible risk and protective clusters are discussed.
Solution of relativistic quantum optics problems using clusters of graphical processing units
Gordon, D.F.; Hafizi, B.; Helle, M.H.
2014-06-15
Numerical solution of relativistic quantum optics problems requires high performance computing due to the rapid oscillations in a relativistic wavefunction. Clusters of graphical processing units are used to accelerate the computation of a time dependent relativistic wavefunction in an arbitrary external potential. The stationary states in a Coulomb potential and uniform magnetic field are determined analytically and numerically, so that they can be used as initial conditions in fully time dependent calculations. Relativistic energy levels in extreme magnetic fields are recovered as a means of validation. The relativistic ionization rate is computed for an ion illuminated by a laser field near the usual barrier suppression threshold, and the ionizing wavefunction is displayed.
Bias-Corrected Diagonal Discriminant Rules for High-Dimensional Classification
Huang, Song; Tong, Tiejun; Zhao, Hongyu
2011-01-01
Diagonal discriminant rules have been successfully used for high-dimensional classification problems, but suffer from the serious drawback of biased discriminant scores. In this paper, we propose improved diagonal discriminant rules with bias-corrected discriminant scores for high-dimensional classification. We show that the proposed discriminant scores dominate the standard ones under the quadratic loss function. Analytical results on why the bias-corrected rules can potentially improve the prediction accuracy are also provided. Finally, we demonstrate the improvement of the proposed rules over the original ones through extensive simulation studies and real case studies. PMID:20222939
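For orientation, the standard (uncorrected) diagonal discriminant rule that the abstract takes as its starting point can be sketched as below. This is a minimal illustration only, assuming a diagonal linear discriminant with pooled per-feature variances and equal class priors; the function names and the toy data are hypothetical, and the bias correction proposed in the paper is not reproduced here.

```python
from statistics import mean


def fit_diagonal_lda(samples_by_class):
    """Estimate per-class feature means and pooled per-feature variances.

    samples_by_class: dict mapping class label -> list of feature vectors.
    """
    n_features = len(next(iter(samples_by_class.values()))[0])
    means = {
        label: [mean(row[j] for row in rows) for j in range(n_features)]
        for label, rows in samples_by_class.items()
    }
    n_total = sum(len(rows) for rows in samples_by_class.values())
    n_classes = len(samples_by_class)
    pooled_var = []
    for j in range(n_features):
        # Within-class sum of squares for feature j, pooled over classes.
        ss = sum(
            (row[j] - means[label][j]) ** 2
            for label, rows in samples_by_class.items()
            for row in rows
        )
        pooled_var.append(ss / (n_total - n_classes))
    return means, pooled_var


def discriminant_score(x, class_mean, pooled_var):
    """Variance-scaled squared distance to the class mean (off-diagonal covariances ignored)."""
    return sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, class_mean, pooled_var))


def classify(x, means, pooled_var):
    """Assign x to the class with the smallest score (equal priors assumed)."""
    return min(means, key=lambda label: discriminant_score(x, means[label], pooled_var))


# Hypothetical toy data: two classes, two features.
train = {
    "a": [[0.0, 1.0], [0.2, 0.8], [-0.1, 1.1]],
    "b": [[2.0, 3.0], [2.1, 2.9], [1.8, 3.2]],
}
means, pooled_var = fit_diagonal_lda(train)
label_near_a = classify([0.1, 0.9], means, pooled_var)
label_near_b = classify([1.9, 3.1], means, pooled_var)
```

Because the means and variances are plugged in from small samples, the scores computed this way are biased estimates of their population counterparts, which is the drawback the abstract refers to.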
Bias-corrected diagonal discriminant rules for high-dimensional classification.
Huang, Song; Tong, Tiejun; Zhao, Hongyu
2010-12-01
Diagonal discriminant rules have been successfully used for high-dimensional classification problems, but suffer from the serious drawback of biased discriminant scores. In this article, we propose improved diagonal discriminant rules with bias-corrected discriminant scores for high-dimensional classification. We show that the proposed discriminant scores dominate the standard ones under the quadratic loss function. Analytical results on why the bias-corrected rules can potentially improve the prediction accuracy are also provided. Finally, we demonstrate the improvement of the proposed rules over the original ones through extensive simulation studies and real case studies. © 2010, The International Biometric Society.
DD-HDS: A method for visualization and exploration of high-dimensional data.
Lespinats, Sylvain; Verleysen, Michel; Giron, Alain; Fertil, Bernard
2007-09-01
Mapping high-dimensional data in a low-dimensional space, for example, for visualization, is a problem of increasingly major concern in data analysis. This paper presents data-driven high-dimensional scaling (DD-HDS), a nonlinear mapping method that follows the line of multidimensional scaling (MDS) approach, based on the preservation of distances between pairs of data. It improves the performance of existing competitors with respect to the representation of high-dimensional data, in two ways. It introduces (1) a specific weighting of distances between data taking into account the concentration of measure phenomenon and (2) a symmetric handling of short distances in the original and output spaces, avoiding false neighbor representations while still allowing some necessary tears in the original distribution. More precisely, the weighting is set according to the effective distribution of distances in the data set, with the exception of a single user-defined parameter setting the tradeoff between local neighborhood preservation and global mapping. The optimization of the stress criterion designed for the mapping is realized by "force-directed placement" (FDP). The mappings of low- and high-dimensional data sets are presented as illustrations of the features and advantages of the proposed algorithm. The weighting function specific to high-dimensional data and the symmetric handling of short distances can be easily incorporated in most distance preservation-based nonlinear dimensionality reduction methods.
Engineering two-photon high-dimensional states through quantum interference
Zhang, Yingwen; Roux, Filippus S.; Konrad, Thomas; Agnew, Megan; Leach, Jonathan; Forbes, Andrew
2016-01-01
Many protocols in quantum science, for example, linear optical quantum computing, require access to large-scale entangled quantum states. Such systems can be realized through many-particle qubits, but this approach often suffers from scalability problems. An alternative strategy is to consider a lesser number of particles that exist in high-dimensional states. The spatial modes of light are one such candidate that provides access to high-dimensional quantum states, and thus they increase the storage and processing potential of quantum information systems. We demonstrate the controlled engineering of two-photon high-dimensional states entangled in their orbital angular momentum through Hong-Ou-Mandel interference. We prepare a large range of high-dimensional entangled states and implement precise quantum state filtering. We characterize the full quantum state before and after the filter, and are thus able to determine that only the antisymmetric component of the initial state remains. This work paves the way for high-dimensional processing and communication of multiphoton quantum states, for example, in teleportation beyond qubits. PMID:26933685
Engineering two-photon high-dimensional states through quantum interference.
Zhang, Yingwen; Roux, Filippus S; Konrad, Thomas; Agnew, Megan; Leach, Jonathan; Forbes, Andrew
2016-02-01
Many protocols in quantum science, for example, linear optical quantum computing, require access to large-scale entangled quantum states. Such systems can be realized through many-particle qubits, but this approach often suffers from scalability problems. An alternative strategy is to consider a lesser number of particles that exist in high-dimensional states. The spatial modes of light are one such candidate that provides access to high-dimensional quantum states, and thus they increase the storage and processing potential of quantum information systems. We demonstrate the controlled engineering of two-photon high-dimensional states entangled in their orbital angular momentum through Hong-Ou-Mandel interference. We prepare a large range of high-dimensional entangled states and implement precise quantum state filtering. We characterize the full quantum state before and after the filter, and are thus able to determine that only the antisymmetric component of the initial state remains. This work paves the way for high-dimensional processing and communication of multiphoton quantum states, for example, in teleportation beyond qubits.
Detection of Subtle Context-Dependent Model Inaccuracies in High-Dimensional Robot Domains.
Mendoza, Juan Pablo; Simmons, Reid; Veloso, Manuela
2016-12-01
Autonomous robots often rely on models of their sensing and actions for intelligent decision making. However, when operating in unconstrained environments, the complexity of the world makes it infeasible to create models that are accurate in every situation. This article addresses the problem of using potentially large and high-dimensional sets of robot execution data to detect situations in which a robot model is inaccurate, that is, to detect context-dependent model inaccuracies in a high-dimensional context space. To find inaccuracies tractably, the robot conducts an informed search through low-dimensional projections of execution data to find parametric Regions of Inaccurate Modeling (RIMs). Empirical evidence from two robot domains shows that this approach significantly enhances the detection power of existing RIM-detection algorithms in high-dimensional spaces.
Parallel computations using a cluster of workstations to simulate elasticity problems
NASA Astrophysics Data System (ADS)
Darmawan, J. B. B.; Mungkasi, S.
2016-11-01
Computational physics has played important roles in real world problems. This paper is within the applied computational physics area. The aim of this study is to observe the performance of parallel computations using a cluster of workstations (COW) to simulate elasticity problems. Parallel computations with the COW configuration are conducted using the Message Passing Interface (MPI) standard. In parallel computations with COW, we consider five scenarios with twenty simulations. In addition to the execution time, efficiency is used to evaluate programming algorithm scenarios. Sequential and parallel programming performances are evaluated based on their execution time and efficiency. Results show that the one-dimensional elasticity equations are not well suited to parallel solution using the MPI_Send and MPI_Recv technique of the MPI standard, because the total time spent exchanging data dominates the total time spent on the basic elasticity computation.
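The abstract does not define its efficiency measure. A minimal sketch, assuming the usual definitions (speedup as sequential time over parallel time, efficiency as speedup per process), shows how a dominant communication cost drives efficiency down. All timings and the simple cost model are hypothetical, not values from the paper.

```python
def speedup(t_sequential, t_parallel):
    """Speedup: how many times faster the parallel run is than the sequential one."""
    return t_sequential / t_parallel


def efficiency(t_sequential, t_parallel, n_processes):
    """Efficiency: speedup divided by the number of processes (1.0 is ideal)."""
    return speedup(t_sequential, t_parallel) / n_processes


def modeled_parallel_time(t_compute, t_communication, n_processes):
    """Hypothetical cost model: perfectly divided compute plus a fixed exchange cost."""
    return t_compute / n_processes + t_communication


# Hypothetical timings in seconds.
t_seq = 12.0
n_proc = 4

# Case 1: negligible communication, here 1.0 s of overhead.
t_par_light = modeled_parallel_time(t_seq, 1.0, n_proc)   # 3.0 + 1.0 = 4.0
eff_light = efficiency(t_seq, t_par_light, n_proc)        # 3.0 / 4 = 0.75

# Case 2: communication dominates, here 9.0 s spent exchanging data.
t_par_heavy = modeled_parallel_time(t_seq, 9.0, n_proc)   # 3.0 + 9.0 = 12.0
eff_heavy = efficiency(t_seq, t_par_heavy, n_proc)        # 1.0 / 4 = 0.25
```

In the second case the parallel run is no faster than the sequential one, which is the kind of outcome the abstract reports for the one-dimensional problem.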
Optimal control problem for the three-sector economic model of a cluster
NASA Astrophysics Data System (ADS)
Murzabekov, Zainel; Aipanov, Shamshi; Usubalieva, Saltanat
2016-08-01
The problem of optimal control for the three-sector economic model of a cluster is considered. The task is to determine the optimal distribution of investment and manpower in moving the system from a given initial state to a desired final state. To solve the optimal control problem with finite-horizon planning, in the case of fixed ends of trajectories and box constraints, the method of Lagrange multipliers of a special type is used. This approach makes it possible to represent the desired control in the form of a synthesis control that depends on the state of the system and the current time. The results of numerical calculations for an instance of the three-sector model of the economy show the effectiveness of the proposed method.
NASA Astrophysics Data System (ADS)
Li, Weixuan; Lin, Guang; Li, Bing
2016-09-01
Many uncertainty quantification (UQ) approaches suffer from the curse of dimensionality, that is, their computational costs become intractable for problems involving a large number of uncertainty parameters. In these situations, the classic Monte Carlo (MC) method often remains the preferred method of choice because its convergence rate $O(n^{-1/2})$, where n is the required number of model simulations, does not depend on the dimension of the problem. However, many high-dimensional UQ problems are intrinsically low-dimensional, because the variation of the quantity of interest (QoI) is often caused by only a few latent parameters varying within a low-dimensional subspace, known as the sufficient dimension reduction (SDR) subspace in the statistics literature. Motivated by this observation, we propose two inverse regression-based UQ algorithms (IRUQ) for high-dimensional problems. Both algorithms use inverse regression to convert the original high-dimensional problem to a low-dimensional one, which is then efficiently solved by building a response surface for the reduced model, for example via the polynomial chaos expansion. The first algorithm, which is for the situations where an exact SDR subspace exists, is proved to converge at rate $O(n^{-1})$, hence much faster than MC. The second algorithm, which doesn't require an exact SDR, employs the reduced model as a control variate to reduce the error of the MC estimate. The accuracy gain could still be significant, depending on how well the reduced model approximates the original high-dimensional one. IRUQ also provides several additional practical advantages: it is non-intrusive; it does not require computing the high-dimensional gradient of the QoI; and it reports an error bar so the user knows how reliable the result is.
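The control-variate idea used by the second algorithm can be illustrated in one dimension. The sketch below is a generic control-variate Monte Carlo estimator, not the IRUQ algorithm: the "full model" f(x) = exp(x), the "reduced model" g(x) = 1 + x with known mean 1.5 under a uniform input on [0, 1], and all names are assumptions chosen so that the exact answer (e - 1) is available for comparison.

```python
import math
import random


def control_variate_estimate(f, g, g_mean, samples):
    """Return (plain MC estimate, control-variate estimate) of E[f(X)].

    g is a cheap surrogate whose exact mean g_mean is known; the better g
    tracks f, the more of the sampling error is cancelled.
    """
    fs = [f(x) for x in samples]
    gs = [g(x) for x in samples]
    n = len(samples)
    f_bar = sum(fs) / n
    g_bar = sum(gs) / n
    cov_fg = sum((a - f_bar) * (b - g_bar) for a, b in zip(fs, gs)) / (n - 1)
    var_g = sum((b - g_bar) ** 2 for b in gs) / (n - 1)
    c = cov_fg / var_g                      # estimated optimal coefficient
    corrected = f_bar - c * (g_bar - g_mean)
    return f_bar, corrected


rng = random.Random(0)                      # fixed seed for repeatability
xs = [rng.random() for _ in range(2000)]    # X ~ Uniform(0, 1)

plain, corrected = control_variate_estimate(
    f=math.exp,                 # stand-in for the expensive full model
    g=lambda x: 1.0 + x,        # stand-in for the reduced model (hypothetical)
    g_mean=1.5,                 # exact mean of g under Uniform(0, 1)
    samples=xs,
)
exact = math.e - 1.0            # exact value of E[exp(X)]
```

Comparing `plain` and `corrected` against `exact` over repeated seeds shows the variance reduction; how large it is depends entirely on how closely the surrogate follows the full model, as the abstract notes.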
A ROAD to Classification in High Dimensional Space
Fan, Jianqing; Feng, Yang; Tong, Xin
2011-01-01
For high-dimensional classification, it is well known that naively performing the Fisher discriminant rule leads to poor results due to diverging spectra and noise accumulation. Therefore, researchers proposed independence rules to circumvent the diverging spectra, and sparse independence rules to mitigate the issue of noise accumulation. However, in biological applications, there is often a group of correlated genes responsible for clinical outcomes, and the use of the covariance information can significantly reduce misclassification rates. In theory the extent of such error rate reductions is unveiled by comparing the misclassification rates of the Fisher discriminant rule and the independence rule. To materialize the gain based on finite samples, a Regularized Optimal Affine Discriminant (ROAD) is proposed. ROAD selects an increasing number of features as the regularization relaxes. Further benefits can be achieved when a screening method is employed to narrow the feature pool before hitting the ROAD. An efficient Constrained Coordinate Descent algorithm (CCD) is also developed to solve the associated optimization problems. Sampling properties of oracle type are established. Simulation studies and real data analysis support our theoretical results and demonstrate the advantages of the new classification procedure under a variety of correlation structures. A delicate result on continuous piecewise linear solution path for the ROAD optimization problem at the population level justifies the linear interpolation of the CCD algorithm. PMID:23074363
Unsupervised universal steganalyzer for high-dimensional steganalytic features
NASA Astrophysics Data System (ADS)
Hou, Xiaodan; Zhang, Tao
2016-11-01
The research in developing steganalytic features has been highly successful. These features are extremely powerful when applied to supervised binary classification problems. However, they are incompatible with unsupervised universal steganalysis because the unsupervised method cannot distinguish embedding distortion from varying levels of noises caused by cover variation. This study attempts to alleviate the problem by introducing similarity retrieval of image statistical properties (SRISP), with the specific aim of mitigating the effect of cover variation on the existing steganalytic features. First, cover images with some statistical properties similar to those of a given test image are searched from a retrieval cover database to establish an aided sample set. Then, unsupervised outlier detection is performed on a test set composed of the given test image and its aided sample set to determine the type (cover or stego) of the given test image. Our proposed framework, called SRISP-aided unsupervised outlier detection, requires no training. Thus, it does not suffer from model mismatch mess. Compared with prior unsupervised outlier detectors that do not consider SRISP, the proposed framework not only retains the universality but also exhibits superior performance when applied to high-dimensional steganalytic features.
Hyper-spectral image segmentation using spectral clustering with covariance descriptors
NASA Astrophysics Data System (ADS)
Kursun, Olcay; Karabiber, Fethullah; Koc, Cemalettin; Bal, Abdullah
2009-02-01
Image segmentation is an important and difficult computer vision problem. Hyper-spectral images pose even more difficulty due to their high dimensionality. Spectral clustering (SC) is a recently popular clustering/segmentation algorithm. In general, SC lifts the data to a high-dimensional space (the kernel trick), derives eigenvectors in this new space, and finally uses these new dimensions to partition the data into clusters. We demonstrate that SC works efficiently when combined with covariance descriptors, which can be used to assess pixelwise similarities instead of distances in the high-dimensional Euclidean space. We present the formulations and some preliminary results of the proposed hybrid image segmentation method for hyper-spectral images.
Li, Weixuan; Lin, Guang; Li, Bing
2016-09-01
A well-known challenge in uncertainty quantification (UQ) is the "curse of dimensionality". However, many high-dimensional UQ problems are essentially low-dimensional, because the randomness of the quantity of interest (QoI) is caused only by uncertain parameters varying within a low-dimensional subspace, known as the sufficient dimension reduction (SDR) subspace. Motivated by this observation, we propose and demonstrate in this paper an inverse regression-based UQ approach (IRUQ) for high-dimensional problems. Specifically, we use an inverse regression procedure to estimate the SDR subspace and then convert the original problem to a low-dimensional one, which can be efficiently solved by building a response surface model such as a polynomial chaos expansion. The novelty and advantages of the proposed approach lie in its computational efficiency and practicality. Compared with Monte Carlo, the traditionally preferred approach for high-dimensional UQ, IRUQ at a comparable cost generally gives much more accurate solutions even for high-dimensional problems, and even when the dimension reduction is not exactly sufficient. Theoretically, IRUQ is proved to converge twice as fast as the approach it uses to seek the SDR subspace. For example, while a sliced inverse regression method converges to the SDR subspace at the rate of $O(n^{-1/2})$, the corresponding IRUQ converges at $O(n^{-1})$. IRUQ also provides several desired conveniences in practice. It is non-intrusive, requiring only a simulator to generate realizations of the QoI, and there is no need to compute the high-dimensional gradient of the QoI. Finally, error bars can be derived for the estimation results reported by IRUQ.
Lee, Jenny Hyunjung; McDonnell, Kevin T; Zelenyuk, Alla; Imre, Dan; Mueller, Klaus
2014-03-01
Although the Euclidean distance does well in measuring data distances within high-dimensional clusters, it does poorly when it comes to gauging intercluster distances. This significantly impacts the quality of global, low-dimensional space embedding procedures such as the popular multidimensional scaling (MDS), where one can often observe nonintuitive layouts. We were inspired by the perceptual processes evoked in the method of parallel coordinates, which enables users to visually aggregate the data by the patterns the polylines exhibit across the dimension axes. We call the path of such a polyline its structure and suggest a metric that captures this structure directly in high-dimensional space. This allows us to better gauge the distances of spatially distant data constellations and so achieve data aggregations in MDS plots that are more cognizant of existing high-dimensional structure similarities. Our biscale framework distinguishes far-distances from near-distances. The coarser scale uses the structural similarity metric to separate data aggregates obtained by prior classification or clustering, while the finer scale employs the appropriate Euclidean distance.
Statistical mechanics of complex neural systems and high dimensional data
NASA Astrophysics Data System (ADS)
Advani, Madhu; Lahiri, Subhaneil; Ganguli, Surya
2013-03-01
Recent experimental advances in neuroscience have opened new vistas into the immense complexity of neuronal networks. This proliferation of data challenges us on two parallel fronts. First, how can we form adequate theoretical frameworks for understanding how dynamical network processes cooperate across widely disparate spatiotemporal scales to solve important computational problems? Second, how can we extract meaningful models of neuronal systems from high dimensional datasets? To aid in these challenges, we give a pedagogical review of a collection of ideas and theoretical methods arising at the intersection of statistical physics, computer science and neurobiology. We introduce the interrelated replica and cavity methods, which originated in statistical physics as powerful ways to quantitatively analyze large highly heterogeneous systems of many interacting degrees of freedom. We also introduce the closely related notion of message passing in graphical models, which originated in computer science as a distributed algorithm capable of solving large inference and optimization problems involving many coupled variables. We then show how both the statistical physics and computer science perspectives can be applied in a wide diversity of contexts to problems arising in theoretical neuroscience and data analysis. Along the way we discuss spin glasses, learning theory, illusions of structure in noise, random matrices, dimensionality reduction and compressed sensing, all within the unified formalism of the replica method. Moreover, we review recent conceptual connections between message passing in graphical models, and neural computation and learning. Overall, these ideas illustrate how statistical physics and computer science might provide a lens through which we can uncover emergent computational functions buried deep within the dynamical complexities of neuronal networks.
Mulder, Samuel A; Wunsch, Donald C
2003-01-01
The Traveling Salesman Problem (TSP) is a very hard optimization problem in the field of operations research. It has been shown to be NP-complete, and is an often-used benchmark for new optimization techniques. One of the main challenges with this problem is that standard, non-AI heuristic approaches such as the Lin-Kernighan algorithm (LK) and the chained LK variant are currently very effective and in wide use for the common fully connected, Euclidean variant that is considered here. This paper presents an algorithm that uses adaptive resonance theory (ART) in combination with a variation of the Lin-Kernighan local optimization algorithm to solve very large instances of the TSP. The primary advantage of this algorithm over traditional LK and chained-LK approaches is the increased scalability and parallelism allowed by the divide-and-conquer clustering paradigm. Tours obtained by the algorithm are of lower quality, but scaling is much better and there is a high potential for increasing performance using parallel hardware.
Hyperspherical Sparse Approximation Techniques for High-Dimensional Discontinuity Detection
Zhang, Guannan; Webster, Clayton G.; Gunzburger, Max; Burkardt, John
2016-08-04
This work proposes a hyperspherical sparse approximation framework for detecting jump discontinuities in functions in high-dimensional spaces. The need for a novel approach results from the theoretical and computational inefficiencies of well-known approaches, such as adaptive sparse grids, for discontinuity detection. Our approach constructs the hyperspherical coordinate representation of the discontinuity surface of a function. Then sparse approximations of the transformed function are built in the hyperspherical coordinate system, with values at each point estimated by solving a one-dimensional discontinuity detection problem. Due to the smoothness of the hypersurface, the new technique can identify jump discontinuities with significantly reduced computational cost, compared to existing methods. Several approaches are used to approximate the transformed discontinuity surface in the hyperspherical system, including adaptive sparse grid and radial basis function interpolation, discrete least squares projection, and compressed sensing approximation. Moreover, hierarchical acceleration techniques are also incorporated to further reduce the overall complexity. In conclusion, rigorous complexity analyses of the new methods are provided, as are several numerical examples that illustrate the effectiveness of our approach.
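The reduction to a one-dimensional problem can be illustrated as follows. This is a minimal sketch, not the authors' method: it assumes a piecewise-constant test function, a discontinuity surface that every ray from the origin crosses exactly once within a known radius, and a plain bisection search along the ray. The function names and the test function are hypothetical.

```python
import math


def hyperspherical_to_cartesian(r, angles):
    """Map radius r and n-1 angles to a point in n-dimensional Cartesian space.

    Uses the common convention x1 = r cos(t1), x2 = r sin(t1) cos(t2), ...,
    xn = r sin(t1) ... sin(t_{n-1}).
    """
    coords = []
    sin_prod = 1.0
    for theta in angles:
        coords.append(r * sin_prod * math.cos(theta))
        sin_prod *= math.sin(theta)
    coords.append(r * sin_prod)
    return coords


def locate_jump_along_ray(f, angles, r_max, tol=1e-8):
    """Bisect on the radius to find where f jumps along the ray with fixed angles.

    Assumes f is piecewise constant and changes value exactly once on (0, r_max).
    """
    lo, hi = 0.0, r_max
    f_origin = f(hyperspherical_to_cartesian(lo, angles))
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(hyperspherical_to_cartesian(mid, angles)) == f_origin:
            lo = mid          # still on the origin's side of the jump
        else:
            hi = mid          # already past the jump
    return 0.5 * (lo + hi)


def indicator(x):
    """Hypothetical test function: 1 inside the ball of radius 0.7, 0 outside."""
    return 1.0 if sum(xi * xi for xi in x) < 0.49 else 0.0


# One point on the discontinuity surface of a function of four variables.
angles = [0.3, 1.1, 2.0]
radius = locate_jump_along_ray(indicator, angles, r_max=2.0)
point = hyperspherical_to_cartesian(radius, angles)
```

Repeating the search over a set of angle samples yields values of the surface radius as a function of the angles, which is the quantity the abstract proposes to approximate with sparse techniques.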
HYPOTHESIS TESTING FOR HIGH-DIMENSIONAL SPARSE BINARY REGRESSION
Mukherjee, Rajarshi; Pillai, Natesh S.; Lin, Xihong
2015-01-01
In this paper, we study the detection boundary for minimax hypothesis testing in the context of high-dimensional, sparse binary regression models. Motivated by genetic sequencing association studies for rare variant effects, we investigate the complexity of the hypothesis testing problem when the design matrix is sparse. We observe a new phenomenon in the behavior of detection boundary which does not occur in the case of Gaussian linear regression. We derive the detection boundary as a function of two components: a design matrix sparsity index and signal strength, each of which is a function of the sparsity of the alternative. For any alternative, if the design matrix sparsity index is too high, any test is asymptotically powerless irrespective of the magnitude of signal strength. For binary design matrices with the sparsity index that is not too high, our results are parallel to those in the Gaussian case. In this context, we derive detection boundaries for both dense and sparse regimes. For the dense regime, we show that the generalized likelihood ratio is rate optimal; for the sparse regime, we propose an extended Higher Criticism Test and show it is rate optimal and sharp. We illustrate the finite sample properties of the theoretical results using simulation studies. PMID:26246645
Hyperspherical Sparse Approximation Techniques for High-Dimensional Discontinuity Detection
Zhang, Guannan; Webster, Clayton G.; Gunzburger, Max; ...
2016-08-04
This work proposes a hyperspherical sparse approximation framework for detecting jump discontinuities in functions in high-dimensional spaces. The need for a novel approach results from the theoretical and computational inefficiencies of well-known approaches, such as adaptive sparse grids, for discontinuity detection. Our approach constructs the hyperspherical coordinate representation of the discontinuity surface of a function. Then sparse approximations of the transformed function are built in the hyperspherical coordinate system, with values at each point estimated by solving a one-dimensional discontinuity detection problem. Due to the smoothness of the hypersurface, the new technique can identify jump discontinuities with significantly reduced computational cost, compared to existing methods. Several approaches are used to approximate the transformed discontinuity surface in the hyperspherical system, including adaptive sparse grid and radial basis function interpolation, discrete least squares projection, and compressed sensing approximation. Moreover, hierarchical acceleration techniques are also incorporated to further reduce the overall complexity. In conclusion, rigorous complexity analyses of the new methods are provided, as are several numerical examples that illustrate the effectiveness of our approach.
Inference for High-dimensional Differential Correlation Matrices
Cai, T. Tony; Zhang, Anru
2015-01-01
Motivated by differential co-expression analysis in genomics, we consider in this paper estimation and testing of high-dimensional differential correlation matrices. An adaptive thresholding procedure is introduced and theoretical guarantees are given. Minimax rate of convergence is established and the proposed estimator is shown to be adaptively rate-optimal over collections of paired correlation matrices with approximately sparse differences. Simulation results show that the procedure significantly outperforms two other natural methods that are based on separate estimation of the individual correlation matrices. The procedure is also illustrated through an analysis of a breast cancer dataset, which provides evidence at the gene co-expression level that several genes, of which a subset has been previously verified, are associated with the breast cancer. Hypothesis testing on the differential correlation matrices is also considered. A test, which is particularly well suited for testing against sparse alternatives, is introduced. In addition, other related problems, including estimation of a single sparse correlation matrix, estimation of the differential covariance matrices, and estimation of the differential cross-correlation matrices, are also discussed. PMID:26500380
Inference for High-dimensional Differential Correlation Matrices.
Cai, T Tony; Zhang, Anru
2016-01-01
Motivated by differential co-expression analysis in genomics, we consider in this paper estimation and testing of high-dimensional differential correlation matrices. An adaptive thresholding procedure is introduced and theoretical guarantees are given. Minimax rate of convergence is established and the proposed estimator is shown to be adaptively rate-optimal over collections of paired correlation matrices with approximately sparse differences. Simulation results show that the procedure significantly outperforms two other natural methods that are based on separate estimation of the individual correlation matrices. The procedure is also illustrated through an analysis of a breast cancer dataset, which provides evidence at the gene co-expression level that several genes, of which a subset has been previously verified, are associated with the breast cancer. Hypothesis testing on the differential correlation matrices is also considered. A test, which is particularly well suited for testing against sparse alternatives, is introduced. In addition, other related problems, including estimation of a single sparse correlation matrix, estimation of the differential covariance matrices, and estimation of the differential cross-correlation matrices, are also discussed.
Classification of sparse high-dimensional vectors.
Ingster, Yuri I; Pouet, Christophe; Tsybakov, Alexandre B
2009-11-13
We study the problem of classification of d-dimensional vectors into two classes (one of which is 'pure noise') based on a training sample of size m. The main specific feature is that the dimension d can be very large. We suppose that the difference between the distribution of the population and that of the noise is only in a shift, which is a sparse vector. For Gaussian noise, fixed sample size m, and dimension d that tends to infinity, we obtain the sharp classification boundary, i.e. the necessary and sufficient conditions for the possibility of successful classification. We propose classifiers attaining this boundary. We also give extensions of the result to the case where the sample size m depends on d and satisfies the condition (log m)/log d --> gamma, 0
Blöchliger, Nicolas; Caflisch, Amedeo; Vitalis, Andreas
2015-11-10
Data mining techniques depend strongly on how the data are represented and how distance between samples is measured. High-dimensional data often contain a large number of irrelevant dimensions (features) for a given query. These features act as noise and obfuscate relevant information. Unsupervised approaches to mine such data require distance measures that can account for feature relevance. Molecular dynamics simulations produce high-dimensional data sets describing molecules observed in time. Here, we propose to globally or locally weight simulation features based on effective rates. This emphasizes, in a data-driven manner, slow degrees of freedom that often report on the metastable states sampled by the molecular system. We couple this idea to several unsupervised learning protocols. Our approach unmasks slow side chain dynamics within the native state of a miniprotein and reveals additional metastable conformations of a protein. The approach can be combined with most algorithms for clustering or dimensionality reduction.
Visual exploration of high-dimensional data through subspace analysis and dynamic projections
Liu, S.; Wang, B.; Thiagarajan, J. J.; Bremer, P. -T.; Pascucci, V.
2015-06-01
Here, we introduce a novel interactive framework for visualizing and exploring high-dimensional datasets based on subspace analysis and dynamic projections. We assume the high-dimensional dataset can be represented by a mixture of low-dimensional linear subspaces with mixed dimensions, and provide a method to reliably estimate the intrinsic dimension and linear basis of each subspace extracted from the subspace clustering. Subsequently, we use these bases to define unique 2D linear projections as viewpoints from which to visualize the data. To understand the relationships among the different projections and to discover hidden patterns, we connect these projections through dynamic projections that create smooth animated transitions between pairs of projections. We introduce the view transition graph, which provides flexible navigation among these projections to facilitate an intuitive exploration. Finally, we provide detailed comparisons with related systems, and use real-world examples to demonstrate the novelty and usability of our proposed framework.
NASA Astrophysics Data System (ADS)
Hill, C.
2008-12-01
Low-cost graphics cards today use many relatively simple compute cores to deliver memory bandwidth of more than 100 GB/s and theoretical floating-point performance of more than 500 GFlop/s. Right now this performance is, however, only accessible to highly parallel algorithm implementations that (i) can use a hundred or more concurrently executing 32-bit floating-point cores, (ii) can work with graphics memory that resides on the graphics card side of the graphics bus and (iii) can be partially expressed in a language that can be compiled by a graphics programming tool. In this talk we describe our experiences implementing a complete, but relatively simple, time-dependent shallow-water equations simulation targeting a cluster of 30 computers each hosting one graphics card. The implementation takes into account the considerations (i), (ii) and (iii) listed previously. We code our algorithm as a series of numerical kernels. Each kernel is designed to be executed by multiple threads of a single process. Kernels are passed memory blocks to compute over, which can be persistent blocks of memory on a graphics card. Each kernel is individually implemented using the NVidia CUDA language but driven from a higher-level supervisory code that is almost identical to a standard model driver. The supervisory code controls the overall simulation timestepping, but is written to minimize data transfer between main memory and graphics memory (a massive performance bottleneck on current systems). Using the recipe outlined we can boost the performance of our cluster by nearly an order of magnitude, relative to the same algorithm executing only on the cluster CPUs. Achieving this performance boost requires that many threads are available to each graphics processor for execution within each numerical kernel and that the simulation's working set of data can fit into the graphics card memory. As we describe, this puts interesting upper and lower bounds on the problem sizes
Duke Workshop on High-Dimensional Data Sensing and Analysis
2015-05-06
...analysis and acquisition of high-dimensional data, including compressive sensing (CS). The meeting focused on new theory, algorithms and application. In addition to having talks from many of the leading researchers from academia, there were talks from the members of...
Towards robust particle filters for high-dimensional systems
NASA Astrophysics Data System (ADS)
van Leeuwen, Peter Jan
2015-04-01
In recent years particle filters have matured and several variants are now available that are not degenerate for high-dimensional systems. Often they are based on ad-hoc combinations with Ensemble Kalman Filters. Unfortunately it is unclear what approximations are made when these hybrids are used. The proper way to derive particle filters for high-dimensional systems is to explore the freedom in the proposal density. It is well known that using an Ensemble Kalman Filter as proposal density (the so-called Weighted Ensemble Kalman Filter) does not work for high-dimensional systems. However, much better results are obtained when weak-constraint 4Dvar is used as proposal, leading to the implicit particle filter. Still, this filter is degenerate when the number of independent observations is large. The Equivalent-Weights Particle Filter is a filter that works well in systems of arbitrary dimensions, but it contains a few tuning parameters that have to be chosen well to avoid biases. In this paper we discuss ways to derive more robust particle filters for high-dimensional systems. Using ideas from large-deviation theory and optimal transportation, particle filters will be generated that are robust and work well in these systems. It will be shown that all successful filters can be derived from one general framework. Also, the performance of the filters will be tested on simple but high-dimensional systems, and, if time permits, on a high-dimensional highly nonlinear barotropic vorticity equation model.
On The Missing Dwarf Problem In Clusters And Around The Nearby Galaxy M33
NASA Astrophysics Data System (ADS)
Keenan, Olivia Charlotte
2017-08-01
This thesis explores possible solutions to the dwarf galaxy problem. This is a discrepancy between the number of dwarf galaxies we observe, and the number predicted from cosmological computer simulations. Simulations predict around ten times more dwarf galaxy satellites than are currently observed. I have investigated two possible solutions: dark galaxies and the low surface brightness universe. Dark galaxies are dark matter halos which contain gas, but few or no stars, hence are optically dark. As part of the Arecibo Galaxy Environment Survey I surveyed the neutral hydrogen gas around the nearby galaxy M33. I found 32 gas clouds, 11 of which are new detections. Amongst these there was one particularly interesting cloud. AGESM33-32 is ring shaped and larger than M33 itself, if at the same distance. It has a velocity width which is similar to the velocity dispersion of gas in a disk galaxy, as well as having a clear velocity gradient across it which may be due to rotation. The fact that it also currently has no observed associated stars means it is a dark galaxy candidate. Optically, dwarf galaxies may be out there, but too faint for us to detect. This means that with newer, deeper, images we may be able to unveil a large, low surface brightness, population of dwarf galaxies. However, the question remains as to how these can be distinguished from background galaxies. I have used Next Generation Virgo Survey (NGVS) data to carry out photometry on 852 Virgo galaxies in four bands. I also measured the photometric properties of galaxies on a background (non-cluster) NGVS frame. I discovered that a combination of colour, magnitude and surface brightness information could be used to identify cluster dwarf galaxies from background field galaxies. The most effective method is to use the surface brightness-magnitude relation.
Lee, Hyun Jung; McDonnell, Kevin T.; Zelenyuk, Alla; Imre, D.; Mueller, Klaus
2014-03-01
Although the Euclidean distance does well in measuring data distances within high-dimensional clusters, it does poorly when it comes to gauging inter-cluster distances. This significantly impacts the quality of global, low-dimensional space embedding procedures such as the popular multi-dimensional scaling (MDS) where one can often observe non-intuitive layouts. We were inspired by the perceptual processes evoked in the method of parallel coordinates which enables users to visually aggregate the data by the patterns the polylines exhibit across the dimension axes. We call the path of such a polyline its structure and suggest a metric that captures this structure directly in high-dimensional space. This allows us to better gauge the distances of spatially distant data constellations and so achieve data aggregations in MDS plots that are more cognizant of existing high-dimensional structure similarities. Our MDS plots also exhibit similar visual relationships as the method of parallel coordinates which is often used alongside to visualize the high-dimensional data in raw form. We then cast our metric into a bi-scale framework which distinguishes far-distances from near-distances. The coarser scale uses the structural similarity metric to separate data aggregates obtained by prior classification or clustering, while the finer scale employs the appropriate Euclidean distance.
Choosing ℓp norms in high-dimensional spaces based on hub analysis
Flexer, Arthur; Schnitzer, Dominik
2015-01-01
The hubness phenomenon is a recently discovered aspect of the curse of dimensionality. Hub objects have a small distance to an exceptionally large number of data points while anti-hubs lie far from all other data points. A closely related problem is the concentration of distances in high-dimensional spaces. Previous work has already advocated the use of fractional ℓp norms instead of the ubiquitous Euclidean norm to avoid the negative effects of distance concentration. However, which exact fractional norm to use is a largely unsolved problem. The contribution of this work is an empirical analysis of the relation of different ℓp norms and hubness. We propose an unsupervised approach for choosing an ℓp norm which minimizes hubs while simultaneously maximizing nearest neighbor classification. Our approach is evaluated on seven high-dimensional data sets and compared to three approaches that re-scale distances to avoid hubness. PMID:26640321
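The relation between ℓp norms and hubness described above can be illustrated with a short sketch. It is not the authors' procedure: it assumes synthetic Gaussian data, and it measures hubness by the skewness of the k-occurrence distribution (how often each point appears among the k nearest neighbours of other points), a commonly used measure that the abstract does not name. The function names `lp_distances`, `k_occurrence` and `hubness_skewness` are hypothetical.

```python
import numpy as np

def lp_distances(X, p):
    """Pairwise l_p dissimilarities; p < 1 gives a fractional 'norm'."""
    diff = np.abs(X[:, None, :] - X[None, :, :])
    return (diff ** p).sum(axis=2) ** (1.0 / p)

def k_occurrence(D, k):
    """Count how often each point is among the k nearest neighbours of the others."""
    D = D.copy()
    np.fill_diagonal(D, np.inf)          # a point is not its own neighbour
    nn = np.argsort(D, axis=1)[:, :k]    # k nearest neighbours of every point
    return np.bincount(nn.ravel(), minlength=D.shape[0])

def hubness_skewness(D, k=5):
    """Skewness of the k-occurrence counts; larger values suggest stronger hubness."""
    n_k = k_occurrence(D, k).astype(float)
    centred = n_k - n_k.mean()
    return (centred ** 3).mean() / (centred ** 2).mean() ** 1.5

# Assumed toy data: 200 points in 50 dimensions, standard normal.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
for p in (0.5, 1.0, 2.0):
    print(f"p = {p}: hubness skewness = {hubness_skewness(lp_distances(X, p)):.3f}")
```

Scanning a grid of p values and keeping the one with the lowest skewness would be one simple unsupervised selection rule under these assumptions; the paper's own criterion also takes nearest neighbor classification into account.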
The huge Package for High-dimensional Undirected Graph Estimation in R
Zhao, Tuo; Liu, Han; Roeder, Kathryn; Lafferty, John; Wasserman, Larry
2015-01-01
We describe an R package named huge which provides easy-to-use functions for estimating high dimensional undirected graphs from data. This package implements recent results in the literature, including Friedman et al. (2007), Liu et al. (2009, 2012) and Liu et al. (2010). Compared with the existing graph estimation package glasso, the huge package provides extra features: (1) instead of using Fortran, it is written in C, which makes the code more portable and easier to modify; (2) besides fitting Gaussian graphical models, it also provides functions for fitting high dimensional semiparametric Gaussian copula models; (3) more functions like data-dependent model selection, data generation and graph visualization; (4) a minor convergence problem of the graphical lasso algorithm is corrected; (5) the package allows the user to apply both lossless and lossy screening rules to scale up large-scale problems, making a tradeoff between computational and statistical efficiency. PMID:26834510
Evangelista, Francesco A
2011-06-14
We report a general implementation of alternative formulations of single-reference coupled cluster theory (extended, unitary, and variational) with arbitrary-order truncation of the cluster operator. These methods are applied to compute the energy of Ne and the equilibrium properties of HF and C2. Potential energy curves for the dissociation of HF and the BeH2 model computed with the extended, variational, and unitary coupled cluster approaches are compared to those obtained from the multireference coupled cluster approach of Mukherjee et al. [J. Chem. Phys. 110, 6171 (1999)] and the internally contracted multireference coupled cluster approach [F. A. Evangelista and J. Gauss, J. Chem. Phys. 134, 114102 (2011)]. In the case of Ne, HF, and C2, the alternative coupled cluster approaches yield almost identical bond length, harmonic vibrational frequency, and anharmonic constant, which are more accurate than those from traditional coupled cluster theory. For potential energy curves, the alternative coupled cluster methods are found to be more accurate than traditional coupled cluster theory, but are three to ten times less accurate than multireference coupled cluster approaches. The most challenging benchmark, the BeH2 model, highlights the strong dependence of the alternative coupled cluster theories on the choice of the Fermi vacuum. When evaluated by the accuracy-to-cost ratio, the alternative coupled cluster methods are not competitive with traditional CC theory; in other words, the simplest theory is found to be the most effective one.
ClusterSculptor: Software for Expert-Steered Classification of Single Particle Mass Spectra
Zelenyuk, Alla; Imre, Dan G.; Nam, Eun Ju; Han, Yiping; Mueller, Klaus
2008-08-01
Taking full advantage of the vast amount of highly detailed data acquired by single particle mass spectrometers requires that the data be organized according to some rules that have the potential to be insightful. Most commonly, statistical tools are used to cluster the individual particle mass spectra on the basis of their similarity. Cluster analysis is a powerful strategy for the exploration of high-dimensional data in the absence of a priori hypotheses or data classification models, and the results of cluster analysis can then be used to form such models. More often than not, when examining the data clustering results we find that many clusters contain particles of different types and that many particles of one type end up in a number of separate clusters. Our experience with cluster analysis shows that we have a vast amount of non-compiled knowledge and intuition that should be brought to bear in this effort. We will present new software we call ClusterSculptor that provides a comprehensive and intuitive framework to aid scientists in data classification. ClusterSculptor uses k-means as the overall clustering engine, but allows tuning its parameters interactively, based on a non-distorted compact visual presentation of the inherent characteristics of the data in high-dimensional space. ClusterSculptor provides all the tools necessary for a high-dimensional activity we call cluster sculpting. ClusterSculptor is designed to be coupled to SpectraMiner, our data mining and visualization software package. The data are first visualized with SpectraMiner and identified problems are exported to ClusterSculptor, where the user steers the reclassification and recombination of clusters of tens of thousands of particle mass spectra in real time. The resulting sculpted clusters can then be imported back into SpectraMiner. Here we will demonstrate greatly improved single particle chemical speciation in an example of application of this new tool to a number of particle types of atmospheric
Convex Clustering: An Attractive Alternative to Hierarchical Clustering
Chen, Gary K.; Chi, Eric C.; Ranola, John Michael O.; Lange, Kenneth
2015-01-01
The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering operates on multiple scales simultaneously. This is essential, for instance, in transcriptome data, where one may be interested in making qualitative inferences about how lower-order relationships like gene modules lead to higher-order relationships like pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. The current paper derives and tests a novel proximal distance algorithm for minimizing the objective function of convex clustering. The algorithm separates parameters, accommodates missing data, and supports prior information on relationships. Our program CONVEXCLUSTER incorporating the algorithm is implemented on ATI and nVidia graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems. CONVEXCLUSTER can be freely downloaded from the UCLA Human Genetics web site at http://www.genetics.ucla.edu/software/ PMID:25965340
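For orientation, the objective function that convex clustering minimizes is usually written as below. The abstract does not state it, so this is the standard form from the convex clustering literature rather than a formula taken from the paper: the x_i are the n data points, the u_i are their cluster centroids (one per point), the w_ij are nonnegative pairwise weights, and gamma is the regularization parameter that traces out the solution path.

```latex
F_\gamma(U) \;=\; \frac{1}{2}\sum_{i=1}^{n} \lVert x_i - u_i \rVert_2^2
\;+\; \gamma \sum_{i<j} w_{ij}\, \lVert u_i - u_j \rVert ,
\qquad \gamma \ge 0,\; w_{ij} \ge 0 .
```

At gamma = 0 every point is its own centroid (u_i = x_i); as gamma grows the penalty pulls centroids together until they coalesce, and the order in which they merge is what yields the tree-like solution path.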
Platon, Ludovic; Pejoski, David; Gautreau, Guillaume; Targat, Brice; Le Grand, Roger; Beignon, Anne-Sophie; Tchitchek, Nicolas
2017-09-14
Cytometry is an experimental technique used to measure molecules expressed by cells at single-cell resolution. Recently, several technological improvements have made it possible to greatly increase the number of cell markers that can be measured simultaneously. Many computational methods have been proposed to identify clusters of cells having similar phenotypes. Nevertheless, only a limited number of computational methods allow comparison of the phenotypes of the cell clusters identified by different clustering approaches. These phenotypic comparisons are necessary to choose the appropriate clustering methods and settings. Because of this lack of tools, comparisons of cell cluster phenotypes are often performed manually, a highly biased and time-consuming process. We designed CytoCompare, an R package that performs comparisons between the phenotypes of cell clusters with the purpose of identifying similar and different ones, based on the distribution of marker expressions. For each phenotype comparison of two cell clusters, CytoCompare provides a distance measure as well as a p-value asserting the statistical significance of the difference. CytoCompare can import clustering results from various algorithms including SPADE, viSNE/ACCENSE, and Citrus, currently the most widely used algorithms. Additionally, CytoCompare can generate parallel coordinates, parallel heatmaps, multidimensional scaling or circular graph representations to easily visualize cell cluster phenotypes and the comparison results. CytoCompare is a flexible analysis pipeline for comparing the phenotypes of cell clusters identified by automatic gating algorithms in high-dimensional cytometry data. This R package is ideal for benchmarking different clustering algorithms and associated parameters. CytoCompare is freely distributed under the GPL-3 license and is available on https://github.com/tchitchek-lab/CytoCompare. Copyright © 2017 The Authors. Published by Elsevier Inc. All rights reserved.
Autonomous mental development in high dimensional context and action spaces.
Joshi, Ameet; Weng, Juyang
2003-01-01
Autonomous Mental Development (AMD) of robots opened a new paradigm for developing machine intelligence using neural-network-type techniques, and it fundamentally changed the way an intelligent machine is developed, from manual to autonomous. The work presented here is part of the SAIL (Self-Organizing Autonomous Incremental Learner) project, which deals with the autonomous development of a humanoid robot with vision, audition, manipulation and locomotion. The major issue addressed here is the challenge of a high dimensional action space (5-10) in addition to the high dimensional context space (hundreds to thousands and beyond) typically required by an AMD machine. This is the first work that studies a high dimensional (numeric) action space in conjunction with a high dimensional perception (context state) space under the AMD mode. Two new learning algorithms, Direct Update on Direction Cosines (DUDC) and High-Dimensional Conjugate Gradient Search (HCGS), are developed, implemented and tested. The convergence properties of both algorithms and their targeted applications are discussed. Autonomous learning of speech production under reinforcement learning is studied as an example.
Harnessing high-dimensional hyperentanglement through a biphoton frequency comb
NASA Astrophysics Data System (ADS)
Xie, Zhenda; Zhong, Tian; Shrestha, Sajan; Xu, Xinan; Liang, Junlin; Gong, Yan-Xiao; Bienfang, Joshua C.; Restelli, Alessandro; Shapiro, Jeffrey H.; Wong, Franco N. C.; Wei Wong, Chee
2015-08-01
Quantum entanglement is a fundamental resource for secure information processing and communications, and hyperentanglement or high-dimensional entanglement has been separately proposed for its high data capacity and error resilience. The continuous-variable nature of the energy-time entanglement makes it an ideal candidate for efficient high-dimensional coding with minimal limitations. Here, we demonstrate the first simultaneous high-dimensional hyperentanglement using a biphoton frequency comb to harness the full potential in both the energy and time domain. Long-postulated Hong-Ou-Mandel quantum revival is exhibited, with up to 19 time-bins and 96.5% visibilities. We further witness the high-dimensional energy-time entanglement through Franson revivals, observed periodically at integer time-bins, with 97.8% visibility. This qudit state is observed to simultaneously violate the generalized Bell inequality by up to 10.95 standard deviations while observing recurrent Clauser-Horne-Shimony-Holt S-parameters up to 2.76. Our biphoton frequency comb provides a platform for photon-efficient quantum communications towards the ultimate channel capacity through energy-time-polarization high-dimensional encoding.
Optimally splitting cases for training and testing high dimensional classifiers
2011-01-01
Background: We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set. How does this proportion impact the mean squared error (MSE) of the prediction accuracy estimate? Results: We develop a non-parametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study for the purpose of better understanding the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive component parts. Conclusions: By applying these approaches to a number of synthetic and real microarray datasets we show that for linear classifiers the optimal proportion depends on the overall number of samples available and the degree of differential expression between the classes. The optimal proportion was found to depend on the full dataset size (n) and classification accuracy, with higher accuracy and smaller n resulting in more cases assigned to the training set. The commonly used strategy of allocating 2/3 of cases for training was close to optimal for reasonably sized datasets (n ≥ 100) with strong signals (i.e. 85% or greater full dataset accuracy). In general, we recommend use of our nonparametric resampling approach for determining the optimal split. This approach can be applied to any dataset, using any predictor development method, to determine the best split. PMID:21477282
An Effective Parameter Screening Strategy for High Dimensional Watershed Models
NASA Astrophysics Data System (ADS)
Khare, Y. P.; Martinez, C. J.; Munoz-Carpena, R.
2014-12-01
Watershed simulation models can assess the impacts of natural and anthropogenic disturbances on natural systems. These models have become important tools for tackling a range of water resources problems through their implementation in the formulation and evaluation of Best Management Practices, Total Maximum Daily Loads, and Basin Management Action Plans. For accurate applications of watershed models they need to be thoroughly evaluated through global uncertainty and sensitivity analyses (UA/SA). However, due to the high dimensionality of these models such evaluation becomes extremely time- and resource-consuming. Parameter screening, the qualitative separation of important parameters, has been suggested as an essential step before applying rigorous evaluation techniques such as the Sobol' and Fourier Amplitude Sensitivity Test (FAST) methods in the UA/SA framework. The method of elementary effects (EE) (Morris, 1991) is one of the most widely used screening methodologies. Some of the common parameter sampling strategies for EE, e.g. Optimized Trajectories [OT] (Campolongo et al., 2007) and Modified Optimized Trajectories [MOT] (Ruano et al., 2012), suffer from inconsistencies in the generated parameter distributions, infeasible sample generation time, etc. In this work, we have formulated a new parameter sampling strategy - Sampling for Uniformity (SU) - for parameter screening which is based on the principles of the uniformity of the generated parameter distributions and the spread of the parameter sample. A rigorous multi-criteria evaluation (time, distribution, spread and screening efficiency) of OT, MOT, and SU indicated that SU is superior to other sampling strategies. Comparison of the EE-based parameter importance rankings with those of Sobol' helped to quantify the qualitativeness of the EE parameter screening approach, reinforcing the fact that one should use EE only to reduce the resource burden required by FAST/Sobol' analyses but not to replace it.
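The method of elementary effects mentioned above can be sketched in a few lines. This is an illustrative sketch only: it draws simple random one-at-a-time trajectories on the unit hypercube, not the OT, MOT or SU sampling strategies compared in the abstract, and the function name `elementary_effects`, the step size `delta` and the toy model are assumptions made for the example. The mean absolute effect (often written mu*) ranks parameter importance, while the standard deviation of the effects hints at nonlinearity or interactions.

```python
import numpy as np

def elementary_effects(model, n_params, n_trajectories=20, delta=0.25, seed=0):
    """Screen parameters of `model` (a function of a vector in [0, 1]^n_params)."""
    rng = np.random.default_rng(seed)
    effects = np.empty((n_trajectories, n_params))
    for t in range(n_trajectories):
        # Random base point, leaving room for a +delta step in every coordinate.
        x = rng.uniform(0.0, 1.0 - delta, size=n_params)
        y = model(x)
        # Perturb each parameter once, in random order, along one trajectory.
        for i in rng.permutation(n_params):
            x_next = x.copy()
            x_next[i] += delta
            y_next = model(x_next)
            effects[t, i] = (y_next - y) / delta
            x, y = x_next, y_next
    mu_star = np.abs(effects).mean(axis=0)   # mean absolute elementary effect
    sigma = effects.std(axis=0, ddof=1)      # spread of the effects
    return mu_star, sigma

# Hypothetical stand-in for a watershed model: strong, weak and inert parameters.
def toy_model(x):
    return 5.0 * x[0] + 0.1 * x[1] + 0.0 * x[2]

mu_star, sigma = elementary_effects(toy_model, n_params=3)
print("mu*  :", mu_star)
print("sigma:", sigma)
print("ranking (most to least important):", np.argsort(-mu_star))
```

Each trajectory costs n_params + 1 model runs, which is why the approach is attractive as a screening step before Sobol' or FAST analyses.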
Designing Progressive and Interactive Analytics Processes for High-Dimensional Data Analysis.
Turkay, Cagatay; Kaya, Erdem; Balcisoy, Selim; Hauser, Helwig
2017-01-01
In interactive data analysis processes, the dialogue between the human and the computer is the enabling mechanism that can lead to actionable observations about the phenomena being investigated. It is of paramount importance that this dialogue is not interrupted by slow computational mechanisms that do not consider any known temporal human-computer interaction characteristics that prioritize the perceptual and cognitive capabilities of the users. In cases where the analysis involves an integrated computational method, for instance to reduce the dimensionality of the data or to perform clustering, such non-optimal processes are often likely. To remedy this, progressive computations, where results are iteratively improved, are attracting increasing interest in visual analytics. In this paper, we present techniques and design considerations to incorporate progressive methods within interactive analysis processes that involve high-dimensional data. We define methodologies to facilitate processes that adhere to the perceptual characteristics of users and describe how online algorithms can be incorporated within these. A set of design recommendations and corresponding methods to support analysts in accomplishing high-dimensional data analysis tasks are then presented. Our arguments and decisions here are informed by observations gathered over a series of analysis sessions with analysts from finance. We document observations and recommendations from this study and present evidence on how our approach contributes to the efficiency and productivity of interactive visual analysis sessions involving high-dimensional data.
An Overview of Air Pollution Problem in Megacities and City Clusters in China
NASA Astrophysics Data System (ADS)
Tang, X.
2007-05-01
China has experienced rapid economic growth over the last twenty years. City clusters, which consist of one or several megacities in close vicinity and many satellite cities and towns, are playing a leading role in Chinese economic growth, owing to their collective economic capacity and interdependency. However, along with the economic boom, population growth and increased energy consumption, air quality has been degrading over the past two decades. Air pollution in those areas is characterized by the concurrent occurrence of high concentrations of multiple primary pollutants, leading to a complex secondary pollution problem. After decades-long efforts to control air pollution, both the government and the scientific community have realized that controlling regional-scale air pollution requires regional efforts. Field experiments covering regions such as the Pearl River Delta and Beijing City with its surrounding areas are critical to understanding the chemical and physical processes leading to the formation of regional-scale air pollution. In order to formulate policy suggestions for air quality attainment during the 2008 Beijing Olympic Games and to propose objectives for air quality attainment in Beijing in 2010, CAREBEIJING (Campaigns of Air Quality Research in Beijing and Surrounding Region) was organized by Peking University in 2006 to learn the current air pollution situation of the region and to identify the transport and transformation processes that lead to the impact of the surrounding area on air quality in Beijing. With the same purpose of understanding the chemical and physical processes at the regional scale, fall and summer campaigns were carried out in the Pearl River Delta in 2004 and 2006. More than 16 domestic and foreign institutions were involved in these campaigns. The background, current status, problems, and some results of these campaigns will be introduced in this presentation.
Resolving the timing problem of the globular clusters orbiting the Fornax dwarf galaxy
NASA Astrophysics Data System (ADS)
Angus, G. W.; Diaferio, A.
2009-06-01
We re-investigate the old problem of the survival of the five globular clusters (GCs) orbiting the Fornax dwarf galaxy in both standard and modified Newtonian dynamics (MOND). For the first time in the history of the topic, we use accurate mass models for the Fornax dwarf, obtained through Jeans modelling of the recently published line-of-sight (LOS) velocity dispersion data, and we are also not resigned to circular orbits for the GCs. Previously conceived problems stem from fixing the starting distances of the globulars to be less than half the tidal radius. We relax this constraint since there is absolutely no evidence for it and show that the dark matter (DM) paradigm, with either cusped or cored DM profiles, has no trouble sustaining the orbits of the two least massive GCs for a Hubble time almost regardless of their initial distance from Fornax. The three most massive globulars can remain in orbit as long as their starting distances are marginally outside the tidal radius. The outlook for MOND is also not nearly as bleak as previously reported. Although dynamical friction (DF) inside the tidal radius is far stronger in MOND, outside DF is negligible due to the absence of stars. This allows highly radial orbits to survive, but more importantly circular orbits at distances more than 85 per cent of Fornax's tidal radius to survive indefinitely. The probability of the GCs being on circular orbits at this distance compared with their current projected distances is discussed and shown to be plausible. Finally, if we ignore the presence of the most massive globular (giving it a large LOS distance), we demonstrate that the remaining four globulars can survive within the tidal radius for the Hubble time with perfectly sensible orbits.
Modified Dendrogram of High-dimensional Feature Space for Transfer Function Design
Wang, Lei; Zhao, Xin; Kaufman, Arie
2010-01-01
We introduce a modified dendrogram (MD) (with sub-trees to represent the feature space clusters) and display it in continuous space for multi-dimensional transfer function (TF) design and modification. Such a TF for direct volume rendering often employs a multi-dimensional feature space. In an n-dimensional (nD) feature space, each voxel is described using n attributes and represented by a vector of n values. The MD reveals the hierarchical structure information of the high-dimensional feature space clusters. Using the MD user interface (UI), the user can design and modify the TF in 2D in an intuitive and informative manner instead of designing it directly in multi-dimensional space, where it is complicated and harder to understand the relationship of the feature space vectors. In addition, we provide the capability to interactively change the granularity of the MD. The coarse-grained MD shows primarily the global information of the feature space while the fine-grained MD reveals the finer details, and the separation ability of the high-dimensional feature space is completely preserved in the finest granularity. With this so-called multi-grained method, the user can efficiently create a TF using the coarse-grained MD, then fine-tune it with the finer-grained MDs to improve the quality of the volume rendering. Furthermore, we propose a fast interactive hierarchical clustering (FIHC) algorithm for accelerating the MD computation and supporting the interactive multi-grained TF design. In the FIHC, the finest-grained MD is established by linking the feature space vectors; the feature space vectors that are the leaves of this tree are then clustered using a hierarchical leaf clustering (HLC) algorithm, forming a leaf vector hierarchical tree (LVHT). The granularity of the MD can be changed by setting the precision of the LVHT. Our method is independent of the type of the attributes and supports feature spaces of arbitrary dimension. PMID:26279612
Hypergraph-based anomaly detection of high-dimensional co-occurrences.
Silva, Jorge; Willett, Rebecca
2009-03-01
This paper addresses the problem of detecting anomalous multivariate co-occurrences using a limited number of unlabeled training observations. A novel method based on using a hypergraph representation of the data is proposed to deal with this very high-dimensional problem. Hypergraphs constitute an important extension of graphs which allow edges to connect more than two vertices simultaneously. A variational Expectation-Maximization algorithm for detecting anomalies directly on the hypergraph domain without any feature selection or dimensionality reduction is presented. The resulting estimate can be used to calculate a measure of anomalousness based on the False Discovery Rate. The algorithm has O(np) computational complexity, where n is the number of training observations and p is the number of potential participants in each co-occurrence event. This efficiency makes the method ideally suited for very high-dimensional settings, and requires no tuning, bandwidth or regularization parameters. The proposed approach is validated on both high-dimensional synthetic data and the Enron email database, where p > 75,000, and it is shown that it can outperform other state-of-the-art methods.
Stability Approach to Regularization Selection (StARS) for High Dimensional Graphical Models.
Liu, Han; Roeder, Kathryn; Wasserman, Larry
2010-12-31
A challenging problem in estimating high-dimensional graphical models is to choose the regularization parameter in a data-dependent way. The standard techniques include K-fold cross-validation (K-CV), Akaike information criterion (AIC), and Bayesian information criterion (BIC). Though these methods work well for low-dimensional problems, they are not suitable in high dimensional settings. In this paper, we present StARS: a new stability-based method for choosing the regularization parameter in high dimensional inference for undirected graphs. The method has a clear interpretation: we use the least amount of regularization that simultaneously makes a graph sparse and replicable under random sampling. This interpretation requires essentially no conditions. Under mild conditions, we show that StARS is partially sparsistent in terms of graph estimation: i.e. with high probability, all the true edges will be included in the selected model even when the graph size diverges with the sample size. Empirically, the performance of StARS is compared with the state-of-the-art model selection procedures, including K-CV, AIC, and BIC, on both synthetic data and a real microarray dataset. StARS outperforms all these competing procedures.
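The selection rule described above (use the least regularization that still gives a graph that is replicable under random subsampling) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the graph estimator is a hypothetical stand-in (thresholded absolute correlation, where a larger threshold means more regularization), and the edge instability measure 2f(1 - f), the cutoff `beta=0.05`, the half-sample subsample size, and all function names are assumptions of this sketch.

```python
import random
from itertools import combinations


def corr(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def edges_at(rows, threshold):
    """Stand-in graph estimator: keep edge (s, t) if |corr| exceeds threshold."""
    cols = list(zip(*rows))
    return {
        (s, t)
        for s, t in combinations(range(len(cols)), 2)
        if abs(corr(cols[s], cols[t])) > threshold
    }


def stars_select(rows, thresholds, beta=0.05, n_subsamples=20, seed=0):
    """Return the smallest threshold whose monotonized instability is <= beta."""
    rng = random.Random(seed)
    p = len(rows[0])
    n_pairs = p * (p - 1) // 2
    size = len(rows) // 2  # subsample size: an assumption of this sketch
    subsamples = [rng.sample(rows, size) for _ in range(n_subsamples)]
    chosen, running_max = None, 0.0
    # Walk from the most regularized (sparsest) graph to the least regularized.
    for thr in sorted(thresholds, reverse=True):
        counts = {}
        for sub in subsamples:
            for edge in edges_at(sub, thr):
                counts[edge] = counts.get(edge, 0) + 1
        # Edges that never appear have frequency 0 and contribute nothing.
        freqs = [c / n_subsamples for c in counts.values()]
        instability = sum(2 * f * (1 - f) for f in freqs) / n_pairs
        running_max = max(running_max, instability)  # keep the curve monotone
        if running_max <= beta:
            chosen = thr
    return chosen


def make_demo_rows(n=60, seed=1):
    """Synthetic data: variables 0 and 1 are strongly linked, 2 and 3 are noise."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x0 = rng.gauss(0, 1)
        rows.append([x0, x0 + rng.gauss(0, 0.2), rng.gauss(0, 1), rng.gauss(0, 1)])
    return rows
```

Calling `stars_select(make_demo_rows(), [0.9, 0.7, 0.5, 0.3, 0.1])` returns one value from the grid. In a real high-dimensional setting the thresholding step would be replaced by a sparse graph estimator with a regularization path.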
Understanding 3D human torso shape via manifold clustering
NASA Astrophysics Data System (ADS)
Li, Sheng; Li, Peng; Fu, Yun
2013-05-01
Discovering the variations in human torso shape plays a key role in many design-oriented applications, such as suit design. With recent advances in 3D surface imaging technologies, people can obtain 3D human torso data that provide more information than traditional measurements. However, how to find different human shapes from 3D torso data is still an open problem. In this paper, we propose to use a spectral clustering approach on the torso manifold to address this problem. We first represent high-dimensional torso data in a low-dimensional space using a manifold learning algorithm. Spectral clustering is then performed to obtain several disjoint clusters. Experimental results show that the clusters discovered by our approach can describe the discrepancies in both gender and body shape, and that our approach achieves better performance than the compared clustering method.
Minimax Rate-optimal Estimation of High-dimensional Covariance Matrices with Incomplete Data.
Cai, T Tony; Zhang, Anru
2016-09-01
Missing data occur frequently in a wide range of applications. In this paper, we consider estimation of high-dimensional covariance matrices in the presence of missing observations under a general missing completely at random model in the sense that the missingness is not dependent on the values of the data. Based on incomplete data, estimators for bandable and sparse covariance matrices are proposed and their theoretical and numerical properties are investigated. Minimax rates of convergence are established under the spectral norm loss and the proposed estimators are shown to be rate-optimal under mild regularity conditions. Simulation studies demonstrate that the estimators perform well numerically. The methods are also illustrated through an application to data from four ovarian cancer studies. The key technical tools developed in this paper are of independent interest and potentially useful for a range of related problems in high-dimensional statistical inference with missing data.
Predicting Time Series from Short-Term High-Dimensional Data
NASA Astrophysics Data System (ADS)
Ma, Huanfei; Zhou, Tianshou; Aihara, Kazuyuki; Chen, Luonan
The prediction of future values of time series is a challenging task in many fields. In particular, making prediction based on short-term data is believed to be difficult. Here, we propose a method to predict systems' low-dimensional dynamics from high-dimensional but short-term data. Intuitively, it can be considered as a transformation from the inter-variable information of the observed high-dimensional data into the corresponding low-dimensional but long-term data, thereby equivalent to prediction of time series data. Technically, this method can be viewed as an inverse implementation of delayed embedding reconstruction. Both methods and algorithms are developed. To demonstrate the effectiveness of the theoretical result, benchmark examples and real-world problems from various fields are studied.
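As background for the phrase "inverse implementation of delayed embedding reconstruction", the standard forward delay embedding can be sketched as below. This is only the textbook construction that the abstract refers to, not the proposed prediction method; the function name `delay_embed` and the parameters `dim` and `lag` are illustrative assumptions.

```python
def delay_embed(series, dim, lag=1):
    """Build delay vectors [x(t), x(t - lag), ..., x(t - (dim - 1) * lag)]."""
    start = (dim - 1) * lag  # first index with a full history available
    return [
        [series[t - k * lag] for k in range(dim)]
        for t in range(start, len(series))
    ]


if __name__ == "__main__":
    x = [0, 1, 2, 3, 4, 5]
    for vector in delay_embed(x, dim=3, lag=2):
        print(vector)  # [4, 2, 0] then [5, 3, 1]
```

Each delay vector turns a stretch of one long scalar series into a point in a `dim`-dimensional space; the abstract describes going the other way, from many simultaneously observed variables to a longer low-dimensional series.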
Wahid, Abdul; Khan, Dost Muhammad; Hussain, Ijaz
2017-01-01
High dimensional data are commonly encountered in various scientific fields and pose great challenges to modern statistical analysis. To address this issue, different penalized regression procedures have been introduced in the literature, but these methods cannot cope with the problem of outliers and leverage points in heavy-tailed high dimensional data. For this purpose, a new Robust Adaptive Lasso (RAL) method is proposed, which is based on a Pearson residuals weighting scheme. The weight function determines the compatibility of each observation with the assumed model and downweights the observation if it is inconsistent with that model. It is observed that the RAL estimator can correctly select the covariates with non-zero coefficients and can estimate parameters simultaneously, not only in the presence of influential observations but also in the presence of high multicollinearity. We also discuss the model selection oracle property and the asymptotic normality of the RAL. Simulation findings and real data examples also demonstrate the better performance of the proposed penalized regression approach.
Minimax Rate-optimal Estimation of High-dimensional Covariance Matrices with Incomplete Data*
Cai, T. Tony; Zhang, Anru
2016-01-01
Missing data occur frequently in a wide range of applications. In this paper, we consider estimation of high-dimensional covariance matrices in the presence of missing observations under a general missing completely at random model in the sense that the missingness is not dependent on the values of the data. Based on incomplete data, estimators for bandable and sparse covariance matrices are proposed and their theoretical and numerical properties are investigated. Minimax rates of convergence are established under the spectral norm loss and the proposed estimators are shown to be rate-optimal under mild regularity conditions. Simulation studies demonstrate that the estimators perform well numerically. The methods are also illustrated through an application to data from four ovarian cancer studies. The key technical tools developed in this paper are of independent interest and potentially useful for a range of related problems in high-dimensional statistical inference with missing data. PMID:27777471
A High-Dimensional Nonparametric Multivariate Test for Mean Vector.
Wang, Lan; Peng, Bo; Li, Runze
This work is concerned with testing the population mean vector of nonnormal high-dimensional multivariate data. Several tests for high-dimensional mean vector, based on modifying the classical Hotelling T² test, have been proposed in the literature. Despite their usefulness, they tend to have unsatisfactory power performance for heavy-tailed multivariate data, which frequently arise in genomics and quantitative finance. This paper proposes a novel high-dimensional nonparametric test for the population mean vector for a general class of multivariate distributions. With the aid of new tools in modern probability theory, we proved that the limiting null distribution of the proposed test is normal under mild conditions when p is substantially larger than n. We further study the local power of the proposed test and compare its relative efficiency with a modified Hotelling T² test for high-dimensional data. An interesting finding is that the newly proposed test can have even more substantial power gain with large p than the traditional nonparametric multivariate test does with finite fixed p. We study the finite sample performance of the proposed test via Monte Carlo simulations. We further illustrate its application by an empirical analysis of a genomics data set.
High-Dimensional Statistical Learning: Roots, Justifications, and Potential Machineries
Zollanvari, Amin
2015-01-01
High-dimensional data generally refer to data in which the number of variables is larger than the sample size. Analyzing such datasets poses great challenges for classical statistical learning because the finite-sample performance of methods developed within classical statistical learning does not live up to classical asymptotic premises in which the sample size unboundedly grows for a fixed dimensionality of observations. Much work has been done in developing mathematical–statistical techniques for analyzing high-dimensional data. Despite remarkable progress in this field, many practitioners still utilize classical methods for analyzing such datasets. This state of affairs can be attributed, in part, to a lack of knowledge and, in part, to the ready-to-use computational and statistical software packages that are well developed for classical techniques. Moreover, many scientists working in a specific field of high-dimensional statistical learning are either not aware of other existing machineries in the field or are not willing to try them out. The primary goal in this work is to bring together various machineries of high-dimensional analysis, give an overview of the important results, and present the operating conditions upon which they are grounded. When appropriate, readers are referred to relevant review articles for more information on a specific subject. PMID:27081307
Querying Patterns in High-Dimensional Heterogenous Datasets
ERIC Educational Resources Information Center
Singh, Vishwakarma
2012-01-01
The recent technological advancements have led to the availability of a plethora of heterogenous datasets, e.g., images tagged with geo-location and descriptive keywords. An object in these datasets is described by a set of high-dimensional feature vectors. For example, a keyword-tagged image is represented by a color-histogram and a…
Multiple imputation in the presence of high-dimensional data.
Zhao, Yize; Long, Qi
2016-10-01
Missing data are frequently encountered in biomedical, epidemiologic and social research. It is well known that a naive analysis without adequate handling of missing data may lead to bias and/or loss of efficiency. Partly due to its ease of use, multiple imputation has become increasingly popular in practice for handling missing data. However, it is unclear what is the best strategy to conduct multiple imputation in the presence of high-dimensional data. To answer this question, we investigate several approaches of using regularized regression and Bayesian lasso regression to impute missing values in the presence of high-dimensional data. We compare the performance of these methods through numerical studies, in which we also evaluate the impact of the dimension of the data, the size of the true active set for imputation, and the strength of correlation. Our numerical studies show that in the presence of high-dimensional data the standard multiple imputation approach performs poorly and the imputation approach using Bayesian lasso regression achieves, in most cases, better performance than the other imputation methods including the standard imputation approach using the correctly specified imputation model. Our results suggest that Bayesian lasso regression and its extensions are better suited for multiple imputation in the presence of high-dimensional data than the other regression methods. © The Author(s) 2013.
High-Dimensional Statistical Learning: Roots, Justifications, and Potential Machineries.
Zollanvari, Amin
2015-01-01
High-dimensional data generally refer to data in which the number of variables is larger than the sample size. Analyzing such datasets poses great challenges for classical statistical learning because the finite-sample performance of methods developed within classical statistical learning does not live up to classical asymptotic premises in which the sample size unboundedly grows for a fixed dimensionality of observations. Much work has been done in developing mathematical-statistical techniques for analyzing high-dimensional data. Despite remarkable progress in this field, many practitioners still utilize classical methods for analyzing such datasets. This state of affairs can be attributed, in part, to a lack of knowledge and, in part, to the ready-to-use computational and statistical software packages that are well developed for classical techniques. Moreover, many scientists working in a specific field of high-dimensional statistical learning are either not aware of other existing machineries in the field or are not willing to try them out. The primary goal in this work is to bring together various machineries of high-dimensional analysis, give an overview of the important results, and present the operating conditions upon which they are grounded. When appropriate, readers are referred to relevant review articles for more information on a specific subject.
A High-Dimensional Nonparametric Multivariate Test for Mean Vector
Wang, Lan; Peng, Bo; Li, Runze
2015-01-01
This work is concerned with testing the population mean vector of nonnormal high-dimensional multivariate data. Several tests for high-dimensional mean vector, based on modifying the classical Hotelling T² test, have been proposed in the literature. Despite their usefulness, they tend to have unsatisfactory power performance for heavy-tailed multivariate data, which frequently arise in genomics and quantitative finance. This paper proposes a novel high-dimensional nonparametric test for the population mean vector for a general class of multivariate distributions. With the aid of new tools in modern probability theory, we proved that the limiting null distribution of the proposed test is normal under mild conditions when p is substantially larger than n. We further study the local power of the proposed test and compare its relative efficiency with a modified Hotelling T² test for high-dimensional data. An interesting finding is that the newly proposed test can have even more substantial power gain with large p than the traditional nonparametric multivariate test does with finite fixed p. We study the finite sample performance of the proposed test via Monte Carlo simulations. We further illustrate its application by an empirical analysis of a genomics data set. PMID:26848205
Cluster headache - a symptom of different problems or a primary form? A case report.
Domitrz, Izabela; Gaweł, Małgorzata; Maj, Edyta
2013-01-01
Headache with severe, strictly one-sided unilateral attacks of pain in orbital, supraorbital, temporal localisation lasting 15-180 minutes occurring from once every two days to 8 times daily, typically with one or more autonomic symptoms, is recognized as cluster headache (CH). Headache with normal neurological examination and abnormal neuroimaging studies, mimicking cluster headache, is reported by several authors. We present an elderly woman with a cluster-like headache probably associated with other comorbidities. We differentiate between primary, but 'atypical' CH and symptomatic cluster headache due to frontal sinusitis, pontine venous angioma or vascular compression of the trigeminal nerve root. This headache is not so rare in the general population and its secondary causes must be ruled out before the diagnosis of a primary headache as cluster headache is made.
ERIC Educational Resources Information Center
Dry, Matthew J.; Preiss, Kym; Wagemans, Johan
2012-01-01
We investigated human performance on the Euclidean Traveling Salesperson Problem (TSP) and Euclidean Minimum Spanning Tree Problem (MST-P) in regards to a factor that has previously received little attention within the literature: the spatial distributions of TSP and MST-P stimuli. First, we describe a method for quantifying the relative degree of…
Reduced nonlinear prognostic model construction from high-dimensional data
NASA Astrophysics Data System (ADS)
Gavrilov, Andrey; Mukhin, Dmitry; Loskutov, Evgeny; Feigin, Alexander
2017-04-01
Construction of a data-driven model of evolution operator using universal approximating functions can only be statistically justified when the dimension of its phase space is small enough, especially in the case of short time series. At the same time in many applications real-measured data is high-dimensional, e.g. it is space-distributed and multivariate in climate science. Therefore it is necessary to use efficient dimensionality reduction methods which are also able to capture key dynamical properties of the system from observed data. To address this problem we present a Bayesian approach to an evolution operator construction which incorporates two key reduction steps. First, the data is decomposed into a set of certain empirical modes, such as standard empirical orthogonal functions or recently suggested nonlinear dynamical modes (NDMs) [1], and the reduced space of corresponding principal components (PCs) is obtained. Then, the model of evolution operator for PCs is constructed which maps a number of states in the past to the current state. The second step is to reduce this time-extended space in the past using appropriate decomposition methods. Such a reduction allows us to capture only the most significant spatio-temporal couplings. The functional form of the evolution operator includes separately linear, nonlinear (based on artificial neural networks) and stochastic terms. Explicit separation of the linear term from the nonlinear one allows us to more easily interpret degree of nonlinearity as well as to deal better with smooth PCs which can naturally occur in the decompositions like NDM, as they provide a time scale separation. Results of application of the proposed method to climate data are demonstrated and discussed. The study is supported by Government of Russian Federation (agreement #14.Z50.31.0033 with the Institute of Applied Physics of RAS). 1. Mukhin, D., Gavrilov, A., Feigin, A., Loskutov, E., & Kurths, J. (2015). Principal nonlinear dynamical
Semi-supervised clustering algorithm for haplotype assembly problem based on MEC model.
Xu, Xin-Shun; Li, Ying-Xin
2012-01-01
Haplotype assembly aims to infer a pair of haplotypes from localized polymorphism data. In this paper, a semi-supervised clustering algorithm, SSK (semi-supervised K-means), is proposed for this problem; to our knowledge, it is the first semi-supervised clustering method for haplotype assembly. In SSK, some positive information is first extracted. The information is then used to help k-means cluster all SNP fragments into two sets, from which two haplotypes can be reconstructed. The performance of SSK is tested on both real data and simulated data. The results show that it outperforms several state-of-the-art algorithms on the minimum error correction (MEC) model.
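To make the clustering step and the MEC objective concrete, here is a minimal sketch of a plain (unsupervised) two-cluster k-means-style loop over SNP fragments. It deliberately omits the semi-supervised part, since the abstract does not say how the "positive information" is extracted or used. The fragment encoding ('0', '1', and '-' for uncovered sites), the seed haplotypes, the tie-breaking rule, and all function names are assumptions of this sketch.

```python
def hamming(fragment, haplotype):
    """Count mismatches, ignoring sites the fragment does not cover ('-')."""
    return sum(1 for f, h in zip(fragment, haplotype) if f != "-" and f != h)


def consensus(fragments, length):
    """Majority vote per SNP site; ties and uncovered sites default to '0'."""
    hap = []
    for j in range(length):
        ones = sum(1 for f in fragments if f[j] == "1")
        zeros = sum(1 for f in fragments if f[j] == "0")
        hap.append("1" if ones > zeros else "0")
    return "".join(hap)


def two_means_mec(fragments, seed_a, seed_b, max_iter=20):
    """Alternate assignment and consensus steps; return haplotypes and MEC."""
    length = len(fragments[0])
    hap_a, hap_b = seed_a, seed_b
    for _ in range(max_iter):
        # Assignment step: each fragment joins the closer haplotype.
        group_a = [f for f in fragments if hamming(f, hap_a) <= hamming(f, hap_b)]
        group_b = [f for f in fragments if hamming(f, hap_a) > hamming(f, hap_b)]
        # Update step: recompute each haplotype as the consensus of its group.
        new_a, new_b = consensus(group_a, length), consensus(group_b, length)
        if (new_a, new_b) == (hap_a, hap_b):
            break
        hap_a, hap_b = new_a, new_b
    # MEC score: minimum number of corrections so every fragment fits one haplotype.
    mec = sum(min(hamming(f, hap_a), hamming(f, hap_b)) for f in fragments)
    return hap_a, hap_b, mec


if __name__ == "__main__":
    reads = ["010--", "-1010", "0111-", "101--", "--101", "-0101"]
    print(two_means_mec(reads, "00000", "11111"))  # ('01010', '10101', 1)
```

In the toy input, the third fragment carries a single sequencing error, so the reconstructed pair has an MEC score of 1.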
Velocity Bias from Merging in Clusters of Galaxies: The beta < 1 Problem
NASA Astrophysics Data System (ADS)
Fusco-Femiano, R.; Menci, N.
1995-08-01
We study the evolution of the galaxy velocity distribution in galaxy clusters under binary aggregations. Starting with an initial Maxwell distribution, we solve the complete Boltzmann-Liouville equation including collisions. We find an asymptotic distribution characterized by a galaxy velocity dispersion smaller than that of the dark matter. This is due to the transfer from orbital to internal energy occurring in galaxy merging, which is not completely balanced by the galaxy response to the cluster gravitational field. As a consequence, the value of the parameter β that enters in the standard hydrostatic isothermal β-model is less than 1, as determined from the fits to the X-ray surface brightness data. The result is robust with respect to different shapes of the cluster mass distribution. The dependence of β on the cluster velocity dispersion and size is computed and discussed.
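For orientation, β is commonly defined as the ratio of the galaxies' specific kinetic energy to that of the gas, β = μ m_p σ² / (kT), where σ is the line-of-sight galaxy velocity dispersion and T the gas temperature. The short sketch below evaluates that ratio; the definition is quoted as general background rather than taken from this abstract, and the mean molecular weight `mu=0.6` and the input values are purely illustrative assumptions.

```python
M_P = 1.67262192e-27   # proton mass in kg
KEV = 1.602176634e-16  # joules per keV


def beta_spec(sigma_km_s, kT_keV, mu=0.6):
    """Ratio mu * m_p * sigma^2 / (k T) for a velocity dispersion and gas temperature."""
    sigma = sigma_km_s * 1.0e3  # convert km/s to m/s
    return mu * M_P * sigma ** 2 / (kT_keV * KEV)


if __name__ == "__main__":
    # Hypothetical cluster: sigma = 1000 km/s, kT = 8 keV.
    print(round(beta_spec(1000.0, 8.0), 3))
```

With these illustrative inputs the ratio comes out below 1, the regime the abstract is concerned with.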
High-dimensional statistical inference: From vector to matrix
NASA Astrophysics Data System (ADS)
Zhang, Anru
Statistical inference for sparse signals or low-rank matrices in high-dimensional settings is of significant interest in a range of contemporary applications. It has attracted significant recent attention in many fields including statistics, applied mathematics and electrical engineering. In this thesis, we consider several problems including sparse signal recovery (compressed sensing under restricted isometry) and low-rank matrix recovery (matrix recovery via rank-one projections and structured matrix completion). The first part of the thesis discusses compressed sensing and affine rank minimization in both noiseless and noisy cases and establishes sharp restricted isometry conditions for sparse signal and low-rank matrix recovery. The analysis relies on a key technical tool which represents points in a polytope by convex combinations of sparse vectors. The technique is elementary yet leads to sharp results. It is shown that, in compressed sensing, $\delta_k^A < 1/3$, $\delta_k^A + \theta_{k,k}^A < 1$, or $\delta_{tk}^A < \sqrt{(t-1)/t}$ for any given constant $t \ge 4/3$ guarantee the exact recovery of all $k$-sparse signals in the noiseless case through the constrained $\ell_1$ minimization, and similarly in affine rank minimization $\delta_r^M < 1/3$, $\delta_r^M + \theta_{r,r}^M < 1$, or $\delta_{tr}^M < \sqrt{(t-1)/t}$ ensure the exact reconstruction of all matrices with rank at most $r$ in the noiseless case via the constrained nuclear norm minimization. Moreover, for any $\epsilon > 0$, $\delta_k^A < 1/3 + \epsilon$, $\delta_k^A + \theta_{k,k}^A < 1 + \epsilon$, or $\delta_{tk}^A < \sqrt{(t-1)/t} + \epsilon$ are not sufficient to guarantee the exact recovery of all $k$-sparse signals for large $k$. A similar result also holds for matrix recovery. In addition, the conditions $\delta_k^A < 1/3$, $\delta_k^A + \theta_{k,k}^A < 1$, $\delta_{tk}^A < \sqrt{(t-1)/t}$ and $\delta_r^M < 1/3$, $\delta_r^M + \theta_{r,r}^M < 1$, $\delta_{tr}^M < \sqrt{(t-1)/t}$ are also shown to be sufficient respectively for stable recovery of approximately sparse signals and low-rank matrices in the noisy case
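A quick numerical check of the third condition, taking the bound on the order-tk restricted isometry constant to be sqrt((t - 1)/t) for t ≥ 4/3, may help fix ideas. The helper name `rip_bound` is hypothetical, and the snippet only evaluates the threshold; it does not compute restricted isometry constants of any matrix.

```python
import math


def rip_bound(t):
    """Threshold sqrt((t - 1) / t) on the order-tk restricted isometry constant."""
    if t < 4.0 / 3.0:
        raise ValueError("the stated condition applies only for t >= 4/3")
    return math.sqrt((t - 1.0) / t)


if __name__ == "__main__":
    for t in (4.0 / 3.0, 2.0, 4.0):
        print(f"t = {t:.3f}: bound = {rip_bound(t):.4f}")
```

At t = 4/3 the threshold is 1/2, at t = 2 it is about 0.7071, and it increases toward 1 as t grows, so larger t allows a weaker requirement on a higher-order constant.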
ERIC Educational Resources Information Center
Brusco, Michael J.
2007-01-01
The study of human performance on discrete optimization problems has a considerable history that spans various disciplines. The two most widely studied problems are the Euclidean traveling salesperson problem and the quadratic assignment problem. The purpose of this paper is to outline a program of study for the measurement of human performance on…
Reinforcement learning on slow features of high-dimensional input streams.
Legenstein, Robert; Wilbert, Niko; Wiskott, Laurenz
2010-08-19
Humans and animals are able to learn complex behaviors based on a massive stream of sensory information from different modalities. Early animal studies have identified learning mechanisms that are based on reward and punishment such that animals tend to avoid actions that lead to punishment whereas rewarded actions are reinforced. However, most algorithms for reward-based learning are only applicable if the dimensionality of the state-space is sufficiently small or its structure is sufficiently simple. Therefore, the question arises how the problem of learning on high-dimensional data is solved in the brain. In this article, we propose a biologically plausible generic two-stage learning system that can directly be applied to raw high-dimensional input streams. The system is composed of a hierarchical slow feature analysis (SFA) network for preprocessing and a simple neural network on top that is trained based on rewards. We demonstrate by computer simulations that this generic architecture is able to learn quite demanding reinforcement learning tasks on high-dimensional visual input streams in a time that is comparable to the time needed when an explicit highly informative low-dimensional state-space representation is given instead of the high-dimensional visual input. The learning speed of the proposed architecture in a task similar to the Morris water maze task is comparable to that found in experimental studies with rats. This study thus supports the hypothesis that slowness learning is one important unsupervised learning principle utilized in the brain to form efficient state representations for behavioral learning.
Reinforcement Learning on Slow Features of High-Dimensional Input Streams
Legenstein, Robert; Wilbert, Niko; Wiskott, Laurenz
2010-01-01
Humans and animals are able to learn complex behaviors based on a massive stream of sensory information from different modalities. Early animal studies have identified learning mechanisms that are based on reward and punishment such that animals tend to avoid actions that lead to punishment whereas rewarded actions are reinforced. However, most algorithms for reward-based learning are only applicable if the dimensionality of the state-space is sufficiently small or its structure is sufficiently simple. Therefore, the question arises how the problem of learning on high-dimensional data is solved in the brain. In this article, we propose a biologically plausible generic two-stage learning system that can directly be applied to raw high-dimensional input streams. The system is composed of a hierarchical slow feature analysis (SFA) network for preprocessing and a simple neural network on top that is trained based on rewards. We demonstrate by computer simulations that this generic architecture is able to learn quite demanding reinforcement learning tasks on high-dimensional visual input streams in a time that is comparable to the time needed when an explicit highly informative low-dimensional state-space representation is given instead of the high-dimensional visual input. The learning speed of the proposed architecture in a task similar to the Morris water maze task is comparable to that found in experimental studies with rats. This study thus supports the hypothesis that slowness learning is one important unsupervised learning principle utilized in the brain to form efficient state representations for behavioral learning. PMID:20808883
Janes, C R; Ames, G M
1992-10-01
We examine the clustering of attendance, illness, and accidental injury problems in a large unionized manufacturing plant using both quantitative and qualitative methods. We find that the distribution of workers into problem groups is related to 1) conflicts over seniority, 2) physical stressors and their influence on perceived desirability of certain kinds of jobs, and 3) organizational conditions and environments congenial to the development of distinct occupational "subcultures." We suggest that the case study approach we apply in this paper is critical to the design of programs of preventive intervention and complements the more commonly applied multiple-site and individually focused, survey approaches.
Lee, Jenny Hyunjung; McDonnell, Kevin T; Zelenyuk, Alla; Imre, Dan; Mueller, Klaus
2013-07-11
Although the Euclidean distance does well in measuring data distances within high-dimensional clusters, it does poorly when it comes to gauging inter-cluster distances. This significantly impacts the quality of global, low-dimensional space embedding procedures such as the popular multi-dimensional scaling (MDS) where one can often observe non-intuitive layouts. We were inspired by the perceptual processes evoked in the method of parallel coordinates which enables users to visually aggregate the data by the patterns the polylines exhibit across the dimension axes. We call the path of such a polyline its structure and suggest a metric that captures this structure directly in high-dimensional space. This allows us to better gauge the distances of spatially distant data constellations and so achieve data aggregations in MDS plots that are more cognizant of existing high-dimensional structure similarities. Our bi-scale framework distinguishes far-distances from near-distances. The coarser scale uses the structural similarity metric to separate data aggregates obtained by prior classification or clustering, while the finer scale employs the appropriate Euclidean distance.
Partially supervised speaker clustering.
Tang, Hao; Chu, Stephen Mingyu; Hasegawa-Johnson, Mark; Huang, Thomas S
2012-05-01
Content-based multimedia indexing, retrieval, and processing as well as multimedia databases demand the structuring of the media content (image, audio, video, text, etc.), one significant goal being to associate the identity of the content to the individual segments of the signals. In this paper, we specifically address the problem of speaker clustering, the task of assigning every speech utterance in an audio stream to its speaker. We offer a complete treatment to the idea of partially supervised speaker clustering, which refers to the use of our prior knowledge of speakers in general to assist the unsupervised speaker clustering process. By means of an independent training data set, we encode the prior knowledge at the various stages of the speaker clustering pipeline via 1) learning a speaker-discriminative acoustic feature transformation, 2) learning a universal speaker prior model, and 3) learning a discriminative speaker subspace, or equivalently, a speaker-discriminative distance metric. We study the directional scattering property of the Gaussian mixture model (GMM) mean supervector representation of utterances in the high-dimensional space, and advocate exploiting this property by using the cosine distance metric instead of the euclidean distance metric for speaker clustering in the GMM mean supervector space. We propose to perform discriminant analysis based on the cosine distance metric, which leads to a novel distance metric learning algorithm—linear spherical discriminant analysis (LSDA). We show that the proposed LSDA formulation can be systematically solved within the elegant graph embedding general dimensionality reduction framework. Our speaker clustering experiments on the GALE database clearly indicate that 1) our speaker clustering methods based on the GMM mean supervector representation and vector-based distance metrics outperform traditional speaker clustering methods based on the “bag of acoustic features” representation and statistical
Dimensionality reduction for registration of high-dimensional data sets.
Xu, Min; Chen, Hao; Varshney, Pramod K
2013-08-01
Registration of two high-dimensional data sets often involves dimensionality reduction to yield a single-band image from each data set followed by pairwise image registration. We develop a new application-specific algorithm for dimensionality reduction of high-dimensional data sets such that the weighted harmonic mean of Cramér-Rao lower bounds for the estimation of the transformation parameters for registration is minimized. The performance of the proposed dimensionality reduction algorithm is evaluated using three remote sensing data sets. The experimental results using a mutual information-based pairwise registration technique demonstrate that our proposed dimensionality reduction algorithm combines the original data sets to obtain an image pair with more texture, resulting in improved image registration.
Structural analysis of high-dimensional basins of attraction.
Martiniani, Stefano; Schrenk, K Julian; Stevenson, Jacob D; Wales, David J; Frenkel, Daan
2016-09-01
We propose an efficient Monte Carlo method for the computation of the volumes of high-dimensional bodies with arbitrary shape. We start with a region of known volume within the interior of the manifold and then use the multistate Bennett acceptance-ratio method to compute the dimensionless free-energy difference between a series of equilibrium simulations performed within this object. The method produces results that are in excellent agreement with thermodynamic integration, as well as a direct estimate of the associated statistical uncertainties. The histogram method also allows us to directly obtain an estimate of the interior radial probability density profile, thus yielding useful insight into the structural properties of such a high-dimensional body. We illustrate the method by analyzing the effect of structural disorder on the basins of attraction of mechanically stable packings of soft repulsive spheres.
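For context on why a dedicated method is needed, the sketch below shows the naive hit-or-miss Monte Carlo volume estimate for a unit ball and how its acceptance fraction collapses as the dimension grows. This is not the multistate Bennett acceptance-ratio approach described in the abstract; the function names and the choice of a ball inside a cube are illustrative assumptions.

```python
import math
import random


def ball_volume_exact(dim: int, radius: float = 1.0) -> float:
    """Closed-form volume of a dim-dimensional ball."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius ** dim


def ball_volume_hit_or_miss(dim: int, samples: int, seed: int = 0) -> float:
    """Estimate the unit-ball volume by sampling the enclosing cube [-1, 1]^dim."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        # A sample counts as a hit when it falls inside the unit ball.
        if sum(rng.uniform(-1.0, 1.0) ** 2 for _ in range(dim)) <= 1.0:
            hits += 1
    return (2.0 ** dim) * hits / samples


def acceptance_fraction(dim: int) -> float:
    """Fraction of uniform cube samples expected to land inside the ball."""
    return ball_volume_exact(dim) / 2.0 ** dim
```

In two dimensions roughly three quarters of the samples are hits, whereas in twenty dimensions fewer than one sample in ten million is, which is why plain rejection sampling is impractical for high-dimensional bodies.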
Quantum Teleportation of High-dimensional Atomic Momenta State
NASA Astrophysics Data System (ADS)
Qurban, Misbah; Abbas, Tasawar; Rameez-ul-Islam; Ikram, Manzoor
2016-06-01
Atomic momentum states of neutral atoms are known to be decoherence resistant and therefore present a viable solution for most quantum information tasks, including quantum teleportation. We present a systematic protocol for teleporting high-dimensional quantized atomic momentum states to the field state inside cavities by applying standard cavity QED techniques. The proposal can be executed under prevailing experimental conditions.
Solvent states and spectroscopy of doped helium clusters as a quantum-chemistry-like problem.
Aguirre, Néstor F; Villarreal, Pablo; Delgado-Barrio, Gerardo; Mitrushchenkov, Alexander O; de Lara-Castells, María Pilar
2013-07-07
The Full-Configuration-Interaction Nuclear-Orbital (FCI-NO) approach [J. Chem. Phys., 2009, 131, 19401], as the implementation of the quantum-chemistry ansatz, is overviewed and applied to (He)_N-Cl2(X) clusters (N ≤ 4). The ground and excited states of both fermionic ³He and bosonic ⁴He [see also, J. Phys. Chem. Lett., 2012, 2, 2145] clusters are studied. It is shown that the FCI-NO approach allows us to overcome three main difficulties: (1) the Fermi-Dirac (Bose-Einstein) nuclear statistics; (2) the wide (highly anharmonic) amplitudes of the He-dopant and He-He motions; and (3) both the weakly attractive (long-range) and the strongly repulsive (short-range) interaction between the helium atoms. Special emphasis is placed on the dependence of the cluster properties on the number of helium atoms, and on the comparison between the two helium isotopes. In particular, we analyze the analogies between quantum rings comprising electrons and ³He atoms. The synthetic vibro-rotational Raman spectra of Cl2(X) immersed in (³,⁴He)_N clusters (N ≤ 4) are discussed as a function of the cluster size and the nuclear statistics. It is shown that the Coriolis couplings play a key role in modifying the spectral dopant profile in ³He. Finally, we point out possible directions for future research using the quantum-chemistry ansatz.
High-dimensional quantum cloning and applications to quantum hacking
Bouchard, Frédéric; Fickler, Robert; Boyd, Robert W.; Karimi, Ebrahim
2017-01-01
Attempts at cloning a quantum system result in the introduction of imperfections in the state of the copies. This is a consequence of the no-cloning theorem, which is a fundamental law of quantum physics and the backbone of security for quantum communications. Although perfect copies are prohibited, a quantum state may be copied with maximal accuracy via various optimal cloning schemes. Optimal quantum cloning, which lies at the border of the physical limit imposed by the no-signaling theorem and the Heisenberg uncertainty principle, has been experimentally realized for low-dimensional photonic states. However, an increase in the dimensionality of quantum systems is greatly beneficial to quantum computation and communication protocols. Nonetheless, no experimental demonstration of optimal cloning machines has hitherto been shown for high-dimensional quantum systems. We perform optimal cloning of high-dimensional photonic states by means of the symmetrization method. We show the universality of our technique by conducting cloning of numerous arbitrary input states and fully characterize our cloning machine by performing quantum state tomography on cloned photons. In addition, a cloning attack on a Bennett and Brassard (BB84) quantum key distribution protocol is experimentally demonstrated to reveal the robustness of high-dimensional states in quantum cryptography. PMID:28168219
Cell Fate Decision as High-Dimensional Critical State Transition
Zhou, Joseph; Castaño, Ivan G.; Leong-Quong, Rebecca Y. Y.; Chang, Hannah; Trachana, Kalliopi; Giuliani, Alessandro; Huang, Sui
2016-01-01
Cell fate choice and commitment of multipotent progenitor cells to a differentiated lineage requires broad changes of their gene expression profile. But how progenitor cells overcome the stability of their gene expression configuration (attractor) to exit the attractor in one direction remains elusive. Here we show that commitment of blood progenitor cells to the erythroid or myeloid lineage is preceded by the destabilization of their high-dimensional attractor state, such that differentiating cells undergo a critical state transition. Single-cell resolution analysis of gene expression in populations of differentiating cells affords a new quantitative index for predicting critical transitions in a high-dimensional state space based on decrease of correlation between cells and concomitant increase of correlation between genes as cells approach a tipping point. The detection of “rebellious cells” that enter the fate opposite to the one intended corroborates the model of preceding destabilization of a progenitor attractor. Thus, early warning signals associated with critical transitions can be detected in statistical ensembles of high-dimensional systems, offering a formal theory-based approach for analyzing single-cell molecular profiles that goes beyond current computational pattern recognition, does not require knowledge of specific pathways, and could be used to predict impending major shifts in development and disease. PMID:28027308
Cell Fate Decision as High-Dimensional Critical State Transition.
Mojtahedi, Mitra; Skupin, Alexander; Zhou, Joseph; Castaño, Ivan G; Leong-Quong, Rebecca Y Y; Chang, Hannah; Trachana, Kalliopi; Giuliani, Alessandro; Huang, Sui
2016-12-01
Cell fate choice and commitment of multipotent progenitor cells to a differentiated lineage requires broad changes of their gene expression profile. But how progenitor cells overcome the stability of their gene expression configuration (attractor) to exit the attractor in one direction remains elusive. Here we show that commitment of blood progenitor cells to the erythroid or myeloid lineage is preceded by the destabilization of their high-dimensional attractor state, such that differentiating cells undergo a critical state transition. Single-cell resolution analysis of gene expression in populations of differentiating cells affords a new quantitative index for predicting critical transitions in a high-dimensional state space based on decrease of correlation between cells and concomitant increase of correlation between genes as cells approach a tipping point. The detection of "rebellious cells" that enter the fate opposite to the one intended corroborates the model of preceding destabilization of a progenitor attractor. Thus, early warning signals associated with critical transitions can be detected in statistical ensembles of high-dimensional systems, offering a formal theory-based approach for analyzing single-cell molecular profiles that goes beyond current computational pattern recognition, does not require knowledge of specific pathways, and could be used to predict impending major shifts in development and disease.
High-dimensional quantum cloning and applications to quantum hacking.
Bouchard, Frédéric; Fickler, Robert; Boyd, Robert W; Karimi, Ebrahim
2017-02-01
Attempts at cloning a quantum system result in the introduction of imperfections in the state of the copies. This is a consequence of the no-cloning theorem, which is a fundamental law of quantum physics and the backbone of security for quantum communications. Although perfect copies are prohibited, a quantum state may be copied with maximal accuracy via various optimal cloning schemes. Optimal quantum cloning, which lies at the border of the physical limit imposed by the no-signaling theorem and the Heisenberg uncertainty principle, has been experimentally realized for low-dimensional photonic states. However, an increase in the dimensionality of quantum systems is greatly beneficial to quantum computation and communication protocols. Nonetheless, no experimental demonstration of optimal cloning machines has hitherto been shown for high-dimensional quantum systems. We perform optimal cloning of high-dimensional photonic states by means of the symmetrization method. We show the universality of our technique by conducting cloning of numerous arbitrary input states and fully characterize our cloning machine by performing quantum state tomography on cloned photons. In addition, a cloning attack on a Bennett and Brassard (BB84) quantum key distribution protocol is experimentally demonstrated to reveal the robustness of high-dimensional states in quantum cryptography.
REVIEWS OF TOPICAL PROBLEMS: Hadron clusters and half-dressed particles in quantum field theory
NASA Astrophysics Data System (ADS)
Feĭnberg, E. L.
1980-10-01
Accelerator experiments show that multiple production of hadrons in high-energy collisions of particles involves the formation of unstable intermediate entities, which subsequently decay into the final hadrons. These entities are apparently not only the comparatively light resonances with which we are already familiar but also heavy nonresonant clusters (with a mass above 2-5 GeV). The cluster concept was introduced previously in cosmic-ray physics, under the name "fireballs". To determine what these clusters are from the standpoint of quantum field theory, a detailed and thorough analysis is made of some analogous processes in quantum electrodynamics which are amenable to calculation. The QED analogs of the nonresonant clusters are "half-dressed" electrons and heavy photons. The half-dressed electrons decay into photons and electrons and are completely observable entities, whose interaction properties distinguish them from dressed electrons. In other words, the nonresonant particles are generally off-shell particles (the excursion from the mass shell is in the timelike direction). The assumption that hadron clusters are only resonances would be equivalent to a very specialized assumption regarding the nature of the spectral function of the hadron propagator; it would be different from that in electrodynamics, where the spectral function can be calculated. Nonresonant hadron clusters thus fit naturally into hadron field theory and are nonequilibrium hadrons far from the mass shell in the timelike direction. (In certain cases, their structural distortion is of the same nature as that of a half-dressed electron, so that this term can be conventionally applied to them as well.
NASA Astrophysics Data System (ADS)
Arca-Sedda, Manuel; Capuzzo-Dolcetta, Roberto
2017-01-01
One of the leading scenarios for the formation of nuclear star clusters in galaxies is related to the orbital decay of globular clusters (GCs) and their subsequent merging, though alternative theories are currently debated. The availability of high-quality data for structural and orbital parameters of GCs allows us to test different nuclear star cluster formation scenarios. The Fornax dwarf spheroidal (dSph) galaxy is the heaviest satellite of the Milky Way and it is the only known dSph hosting five GCs, whereas there are no clear signatures for the presence of a central massive black hole. For this reason, it represents a suitable place to study the orbital decay process in dwarf galaxies. In this paper, we model the future evolution of the Fornax GCs, simulating them and the host galaxy by means of direct N-body simulations. Our simulations also take into account the gravitational field generated by the Milky Way. We found that if the Fornax galaxy is embedded in a standard cold dark matter halo, the nuclear cluster formation would be significantly hampered by the high central galactic mass density. In this context, we discuss the possibility that infalling GCs drive the flattening of the galactic density profile, giving a possible alternative explanation for the so-called cusp/core problem. Moreover, we briefly discuss the link between the GC infall process and the absence of massive black holes in the centre of dSphs.
High-dimensional entropy estimation for finite accuracy data: R-NN entropy estimator.
Kybic, Jan
2007-01-01
We address the problem of entropy estimation for high-dimensional finite-accuracy data. Our main application is evaluating high-order mutual information image similarity criteria for multimodal image registration. The basis of our method is an estimator based on k-th nearest neighbor (NN) distances, modified so that only distances greater than some constant R are evaluated. This modification requires a correction which is found numerically in a preprocessing step using quadratic programming. We experimentally compare our new method with k-NN and histogram estimators on synthetic data and on the evaluation of mutual information for image similarity.
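As background for the k-NN baseline mentioned above, here is a minimal sketch of the standard Kozachenko-Leonenko nearest-neighbor entropy estimator, restricted to one-dimensional data for brevity. It is not the R-NN estimator of the abstract: the distance threshold R and its numerically derived correction are not reproduced, and the function names are illustrative assumptions. The sketch assumes continuous-valued samples with no repeated values, since a zero neighbor distance would make the logarithm undefined.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n: int) -> float:
    """Digamma at a positive integer: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def knn_entropy_1d(samples, k: int = 3) -> float:
    """Kozachenko-Leonenko entropy estimate (in nats) for 1-D samples; needs len(samples) > k."""
    xs = sorted(samples)
    n = len(xs)
    log_dist_sum = 0.0
    for i, x in enumerate(xs):
        # In sorted 1-D data the k nearest neighbours lie within k positions of i.
        lo, hi = max(0, i - k), min(n, i + k + 1)
        dists = sorted(abs(x - xs[j]) for j in range(lo, hi) if j != i)
        log_dist_sum += math.log(dists[k - 1])  # raises ValueError on ties (distance 0)
    unit_ball_volume = 2.0  # length of the 1-D "ball" [-1, 1]
    return (digamma_int(n) - digamma_int(k)
            + math.log(unit_ball_volume) + log_dist_sum / n)
```

For standard normal samples the estimate should land near the analytic value 0.5 * log(2 * pi * e), about 1.42 nats.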
Geraci, Joseph; Dharsee, Moyez; Nuin, Paulo; Haslehurst, Alexandria; Koti, Madhuri; Feilotter, Harriet E; Evans, Ken
2014-03-01
We introduce a novel method for visualizing high dimensional data via a discrete dynamical system. This method provides a 2D representation of the relationship between subjects according to a set of variables without geometric projections, transformed axes or principal components. The algorithm exploits a memory-type mechanism inherent in a certain class of discrete dynamical systems collectively referred to as the chaos game that are closely related to iterative function systems. The goal of the algorithm was to create a human readable representation of high dimensional patient data that was capable of detecting unrevealed subclusters of patients from within anticipated classifications. This provides a mechanism to further pursue a more personalized exploration of pathology when used with medical data. For clustering and classification protocols, the dynamical system portion of the algorithm is designed to come after some feature selection filter and before some model evaluation (e.g. clustering accuracy) protocol. In the version given here, a univariate feature selection step is performed (in practice more complex feature selection methods are used), a discrete dynamical system is driven by this reduced set of variables (which results in a set of 2D cluster models), these models are evaluated for their accuracy (according to a user-defined binary classification) and finally a visual representation of the top classification models is returned. Thus, in addition to the visualization component, this methodology can be used for both supervised and unsupervised machine learning as the top performing models are returned in the protocol we describe here. Butterfly, the algorithm we introduce and provide working code for, uses a discrete dynamical system to classify high dimensional data and provide a 2D representation of the relationship between subjects. We report results on three datasets (two in the article; one in the appendix) including a public lung cancer
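The memory-type mechanism of the chaos game can be illustrated with a small sketch. This is a generic chaos game on the unit square, not the Butterfly algorithm itself; the vertex labels, the contraction ratio of 0.5, and the function name are assumptions made for illustration.

```python
# Hypothetical labelling of the four corners of the unit square.
VERTICES = {"a": (0.0, 0.0), "b": (0.0, 1.0), "c": (1.0, 1.0), "d": (1.0, 0.0)}


def chaos_game(symbols, start=(0.5, 0.5), ratio=0.5):
    """Move a point part of the way toward the vertex named by each symbol in turn."""
    x, y = start
    path = []
    for s in symbols:
        vx, vy = VERTICES[s]
        x += ratio * (vx - x)
        y += ratio * (vy - y)
        path.append((x, y))
    return path
```

Each step halves the influence of everything that came before, so two input sequences that share their last m symbols end within 2^-m of each other in each coordinate, whatever their earlier history. In this sense the final 2D position acts as a fading memory of the driving sequence.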
Analog computation through high-dimensional physical chaotic neuro-dynamics
NASA Astrophysics Data System (ADS)
Horio, Yoshihiko; Aihara, Kazuyuki
2008-07-01
Conventional von Neumann computers have difficulty in solving complex and ill-posed real-world problems. However, living organisms often face such problems in real life, and must quickly obtain suitable solutions through physical, dynamical, and collective computations involving vast assemblies of neurons. These highly parallel computations through high-dimensional dynamics (computation through dynamics) are completely different from the numerical computations on von Neumann computers (computation through algorithms). In this paper, we explore a novel computational mechanism with high-dimensional physical chaotic neuro-dynamics. We physically constructed two hardware prototypes using analog chaotic-neuron integrated circuits. These systems combine analog computations with chaotic neuro-dynamics and digital computation through algorithms. We used quadratic assignment problems (QAPs) as benchmarks. The first prototype utilizes an analog chaotic neural network with 800-dimensional dynamics. An external algorithm constructs a solution for a QAP using the internal dynamics of the network. In the second system, 300-dimensional analog chaotic neuro-dynamics drive a tabu-search algorithm. We demonstrate experimentally that both systems efficiently solve QAPs through physical chaotic dynamics. We also qualitatively analyze the underlying mechanism of the highly parallel and collective analog computations by observing global and local dynamics. Furthermore, we introduce spatial and temporal mutual information to quantitatively evaluate the system dynamics. The experimental results confirm the validity and efficiency of the proposed computational paradigm with the physical analog chaotic neuro-dynamics.
Ma, Huanfei; Lin, Wei; Lai, Ying-Cheng
2013-05-01
Detecting unstable periodic orbits (UPOs) in chaotic systems based solely on time series is a fundamental but extremely challenging problem in nonlinear dynamics. Previous approaches were mostly applicable only to low-dimensional chaotic systems. We develop a framework, integrating approximation theory of neural networks and adaptive synchronization, to address the problem of time-series-based detection of UPOs in high-dimensional chaotic systems. An example of finding UPOs from the classic Mackey-Glass equation is presented.
Some Unsolved Problems, Questions, and Applications of the Brightsen Nucleon Cluster Model
NASA Astrophysics Data System (ADS)
Smarandache, Florentin
2010-10-01
The Brightsen Model is opposite to the Standard Model, and it was built on John Wheeler's Resonating Group Structure Model and on Linus Pauling's Close-Packed Spheron Model. Among the Brightsen Model's predictions and applications, we cite the following: it derives the average number of prompt neutrons per fission event; it provides a theoretical way of understanding low temperature / low energy reactions and of approaching artificially induced fission; it predicts that forces within nucleon clusters are stronger than forces between such clusters within isotopes; it predicts unmatter entities inside nuclei that result from the stable and neutral union of matter and antimatter; and so on. These predictions, however, have yet to be tested at the new CERN laboratory.
Improving clustering by imposing network information
Gerber, Susanne; Horenko, Illia
2015-01-01
Cluster analysis is one of the most popular data analysis tools in a wide range of applied disciplines. We propose and justify a computationally efficient and straightforward-to-implement way of imposing the available information from networks/graphs (a priori available in many application areas) on a broad family of clustering methods. The introduced approach is illustrated on the problem of a noninvasive unsupervised brain signal classification. This task is faced with several challenging difficulties such as nonstationary noisy signals and a small sample size, combined with a high-dimensional feature space and huge noise-to-signal ratios. Applying this approach results in an exact unsupervised classification of very short signals, opening new possibilities for clustering methods in the area of a noninvasive brain-computer interface. PMID:26601225
McGrath, L M; Mustanski, B; Metzger, A; Pine, D S; Kistner-Griffin, E; Cook, E; Wakschlag, L S
2012-08-01
This study illustrates the application of a latent modeling approach to genotype-phenotype relationships and gene × environment interactions, using a novel, multidimensional model of adult female problem behavior, including maternal prenatal smoking. The gene of interest is the monoamine oxidase A (MAOA) gene which has been well studied in relation to antisocial behavior. Participants were adult women (N = 192) who were sampled from a prospective pregnancy cohort of non-Hispanic, white individuals recruited from a neighborhood health clinic. Structural equation modeling was used to model a female problem behavior phenotype, which included conduct problems, substance use, impulsive-sensation seeking, interpersonal aggression, and prenatal smoking. All of the female problem behavior dimensions clustered together strongly, with the exception of prenatal smoking. A main effect of MAOA genotype and a MAOA × physical maltreatment interaction were detected with the Conduct Problems factor. Our phenotypic model showed that prenatal smoking is not simply a marker of other maternal problem behaviors. The risk variant in the MAOA main effect and interaction analyses was the high activity MAOA genotype, which is discrepant from consensus findings in male samples. This result contributes to an emerging literature on sex-specific interaction effects for MAOA.
Maximally Informative Hierarchical Representations of High-Dimensional Data
2015-05-11
will be considered discrete but the domain of the X_i's is not restricted. Entropy is defined in the usual way as H(X) ≡ E_X[log 1/p(x)]. We use...natural logarithms so that the unit of information is nats. Higher-order entropies can be constructed in various ways from this standard definition. For...sense, not truly high-dimensional and can be characterized separately. On the other hand, the entropy of X, H(X), can naively be considered the
Hawking radiation of a high-dimensional rotating black hole
NASA Astrophysics Data System (ADS)
Ren, Zhao; Lichun, Zhang; Huaifan, Li; Yueqin, Wu
2010-01-01
We extend the classical Damour-Ruffini method and discuss the Hawking radiation spectrum of a high-dimensional rotating black hole, using a tortoise coordinate transformation defined by taking the reaction of the radiation on the spacetime into consideration. Under the condition that energy and angular momentum are conserved, and taking the self-gravitation action into account, we derive Hawking radiation spectra that satisfy the unitarity principle of quantum mechanics. It is shown that the process by which the black hole radiates particles with energy ω is a continuous tunneling process. We provide a theoretical basis for further study of the physical mechanism of black-hole radiation.
Grellmann, Claudia; Neumann, Jane; Bitzer, Sebastian; Kovacs, Peter; Tönjes, Anke; Westlye, Lars T.; Andreassen, Ole A.; Stumvoll, Michael; Villringer, Arno; Horstmann, Annette
2016-01-01
In recent years, the advent of great technological advances has produced a wealth of very high-dimensional data, and combining high-dimensional information from multiple sources is becoming increasingly important in an expanding range of scientific disciplines. Partial Least Squares Correlation (PLSC) is a frequently used method for multivariate multimodal data integration. It is, however, computationally expensive in applications involving large numbers of variables, as required, for example, in genetic neuroimaging. To handle high-dimensional problems, dimension reduction might be implemented as a pre-processing step. We propose a new approach that incorporates Random Projection (RP) for dimensionality reduction into PLSC to efficiently solve high-dimensional multimodal problems like genotype-phenotype associations. We name our new method PLSC-RP. Using simulated and experimental data sets containing whole genome SNP measures as genotypes and whole brain neuroimaging measures as phenotypes, we demonstrate that PLSC-RP is drastically faster than traditional PLSC while providing statistically equivalent results. We also provide evidence that dimensionality reduction using RP is data type independent. Therefore, PLSC-RP opens up a wide range of possible applications. It can be used for any integrative analysis that combines information from multiple sources. PMID:27375677
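The idea behind random projection can be sketched in a few lines. The example below uses a dense Gaussian projection matrix scaled by 1/sqrt(n_out), which is one common construction; the abstract does not say which construction PLSC-RP uses, so the matrix type, the function names, and the dimensions are assumptions for illustration only.

```python
import math
import random


def random_projection_matrix(n_in, n_out, seed=0):
    """Dense Gaussian matrix scaled so squared lengths are preserved in expectation."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_out)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_in)] for _ in range(n_out)]


def project(matrix, vector):
    """Apply the projection: one dot product per output dimension."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Projecting two 500-dimensional vectors down to 300 dimensions with this matrix should leave their distance approximately unchanged, which is the property that lets a downstream analysis run on the smaller representation.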
NASA Astrophysics Data System (ADS)
Rabbani, Masoud; Farrokhi-Asl, Hamed; Asgarian, Bahare
2017-10-01
It is observed that designing depot locations and customer service routes separately often yields a suboptimal solution, so solving the location and routing problems simultaneously can achieve better results. In this paper, the waste collection problem is considered with regard to economic and societal objective functions. A non-dominated sorting genetic algorithm (NSGA-II) is used to locate depots and treatment facilities and to design the routes that start from depots to serve customers. A new mathematical model is proposed with two objective functions: an economic objective (the opening cost of depots and treatment facilities plus the transportation cost) and a societal objective (the negative impact of treatment facilities located close to towns). A straightforward order-based solution representation is applied for coding solutions of the problem, and a clustering approach is used to generate appropriate initial solutions. Moreover, three multi-objective decomposition methods, namely weighted sum, goal programming, and goal attainment, are applied to validate the performance of the proposed algorithm. A number of test problems are solved, and the results obtained by the algorithms are compared with respect to several comparison metrics. Finally, the experimental results show that the proposed hybrid NSGA-II outperforms all decomposition methods, although the computational times of the decomposition methods are lower than that of NSGA-II.
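The non-dominated sorting step at the core of NSGA-II can be sketched as follows for a minimization problem with two or more objectives. This is a simple quadratic-time front-peeling version, not the bookkeeping-based fast sort of the published NSGA-II, and it omits crowding distance and the genetic operators; the function names and the sample cost pairs in the usage note are illustrative assumptions.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def non_dominated_fronts(points):
    """Peel off successive Pareto fronts; returns a list of lists of point indices."""
    remaining = list(range(len(points)))
    fronts = []
    while remaining:
        # A point belongs to the current front if nothing still remaining dominates it.
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts
```

For hypothetical (economic cost, societal cost) pairs such as [(1, 5), (2, 3), (4, 1), (3, 4), (5, 5)], the first three points trade off against one another and form the first front, while (3, 4) and (5, 5) fall into later fronts because (2, 3) is better on both objectives.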
Rupp, Matthias; Schneider, Petra; Schneider, Gisbert
2009-11-15
Measuring the (dis)similarity of molecules is important for many cheminformatics applications like compound ranking, clustering, and property prediction. In this work, we focus on real-valued vector representations of molecules (as opposed to the binary spaces of fingerprints). We demonstrate the influence which the choice of (dis)similarity measure can have on results, and provide recommendations for such choices. We review the mathematical concepts used to measure (dis)similarity in vector spaces, namely norms, metrics, inner products, and similarity coefficients, as well as the relationships between them, employing (dis)similarity measures commonly used in cheminformatics as examples. We present several phenomena (empty space phenomenon, sphere volume related phenomena, distance concentration) in high-dimensional descriptor spaces which are not encountered in two and three dimensions. These phenomena are theoretically characterized and illustrated on both artificial and real (bioactivity) data.
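Distance concentration, one of the phenomena listed above, is easy to observe numerically. The sketch below measures the relative contrast (d_max - d_min) / d_min of Euclidean distances from a random query point to uniformly distributed random points; the uniform data, the sample size, and the function name are assumptions chosen for illustration and are not taken from the paper's experiments.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(d_max - d_min) / d_min for distances from a random query to uniform random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.sqrt(sum((p - q) ** 2 for p, q in zip(point, query))))
    return (max(dists) - min(dists)) / min(dists)
```

In two dimensions the farthest point is many times farther away than the nearest one, whereas in a thousand dimensions all points sit at nearly the same distance from the query, so the contrast shrinks toward zero and nearest-neighbor rankings become less informative.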
Exploring High-Dimensional Data Space: Identifying Optimal Process Conditions in Photovoltaics
Suh, C.; Biagioni, D.; Glynn, S.; Scharf, J.; Contreras, M. A.; Noufi, R.; Jones, W. B.
2011-01-01
We demonstrate how advanced exploratory data analysis coupled to data-mining techniques can be used to scrutinize the high-dimensional data space of photovoltaics in the context of thin films of Al-doped ZnO (AZO), which are essential materials as a transparent conducting oxide (TCO) layer in CuInₓGa₁₋ₓSe₂ (CIGS) solar cells. AZO data space, wherein each sample is synthesized from a different process history and assessed with various characterizations, is transformed, reorganized, and visualized in order to extract optimal process conditions. The data-analysis methods used include parallel coordinates, diffusion maps, and hierarchical agglomerative clustering algorithms combined with diffusion map embedding.
Suh, C.; Glynn, S.; Scharf, J.; Contreras, M. A.; Noufi, R.; Jones, W. B.; Biagioni, D.
2011-07-01
We demonstrate how advanced exploratory data analysis coupled to data-mining techniques can be used to scrutinize the high-dimensional data space of photovoltaics in the context of thin films of Al-doped ZnO (AZO), which are essential materials as a transparent conducting oxide (TCO) layer in CuInₓGa₁₋ₓSe₂ (CIGS) solar cells. AZO data space, wherein each sample is synthesized from a different process history and assessed with various characterizations, is transformed, reorganized, and visualized in order to extract optimal process conditions. The data-analysis methods used include parallel coordinates, diffusion maps, and hierarchical agglomerative clustering algorithms combined with diffusion map embedding.
An adaptive ANOVA-based PCKF for high-dimensional nonlinear inverse modeling
Li, Weixuan; Lin, Guang; Zhang, Dongxiao
2014-02-01
The probabilistic collocation-based Kalman filter (PCKF) is a recently developed approach for solving inverse problems. It resembles the ensemble Kalman filter (EnKF) in every aspect—except that it represents and propagates model uncertainty by polynomial chaos expansion (PCE) instead of an ensemble of model realizations. Previous studies have shown PCKF is a more efficient alternative to EnKF for many data assimilation problems. However, the accuracy and efficiency of PCKF depends on an appropriate truncation of the PCE series. Having more polynomial chaos basis functions in the expansion helps to capture uncertainty more accurately but increases computational cost. Selection of basis functions is particularly important for high-dimensional stochastic problems because the number of polynomial chaos basis functions required to represent model uncertainty grows dramatically as the number of input parameters (random dimensions) increases. In classic PCKF algorithms, the PCE basis functions are pre-set based on users' experience. Also, for sequential data assimilation problems, the basis functions kept in PCE expression remain unchanged in different Kalman filter loops, which could limit the accuracy and computational efficiency of classic PCKF algorithms. To address this issue, we present a new algorithm that adaptively selects PCE basis functions for different problems and automatically adjusts the number of basis functions in different Kalman filter loops. The algorithm is based on adaptive functional ANOVA (analysis of variance) decomposition, which approximates a high-dimensional function with the summation of a set of low-dimensional functions. Thus, instead of expanding the original model into PCE, we implement the PCE expansion on these low-dimensional functions, which is much less costly. We also propose a new adaptive criterion for ANOVA that is more suited for solving inverse problems. The new algorithm was tested with different examples and demonstrated
An Adaptive ANOVA-based PCKF for High-Dimensional Nonlinear Inverse Modeling
LI, Weixuan; Lin, Guang; Zhang, Dongxiao
2014-02-01
The probabilistic collocation-based Kalman filter (PCKF) is a recently developed approach for solving inverse problems. It resembles the ensemble Kalman filter (EnKF) in every aspect—except that it represents and propagates model uncertainty by polynomial chaos expansion (PCE) instead of an ensemble of model realizations. Previous studies have shown PCKF is a more efficient alternative to EnKF for many data assimilation problems. However, the accuracy and efficiency of PCKF depends on an appropriate truncation of the PCE series. Having more polynomial chaos bases in the expansion helps to capture uncertainty more accurately but increases computational cost. Bases selection is particularly important for high-dimensional stochastic problems because the number of polynomial chaos bases required to represent model uncertainty grows dramatically as the number of input parameters (random dimensions) increases. In classic PCKF algorithms, the PCE bases are pre-set based on users’ experience. Also, for sequential data assimilation problems, the bases kept in PCE expression remain unchanged in different Kalman filter loops, which could limit the accuracy and computational efficiency of classic PCKF algorithms. To address this issue, we present a new algorithm that adaptively selects PCE bases for different problems and automatically adjusts the number of bases in different Kalman filter loops. The algorithm is based on adaptive functional ANOVA (analysis of variance) decomposition, which approximates a high-dimensional function with the summation of a set of low-dimensional functions. Thus, instead of expanding the original model into PCE, we implement the PCE expansion on these low-dimensional functions, which is much less costly. We also propose a new adaptive criterion for ANOVA that is more suited for solving inverse problems. The new algorithm is tested with different examples and demonstrated great effectiveness in comparison with non-adaptive PCKF and En
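The growth in the number of polynomial chaos basis functions mentioned above can be made concrete with a short count. The sketch assumes a total-degree truncation, for which the basis size is the binomial coefficient C(n + p, p); the second function is a hypothetical illustration of how keeping only low-order ANOVA components shrinks the basis, and is not the paper's adaptive criterion.

```python
from math import comb


def pce_basis_count(n_inputs: int, degree: int) -> int:
    """Number of multivariate polynomials of total degree <= degree in n_inputs variables."""
    return comb(n_inputs + degree, degree)


def anova_basis_count(n_inputs: int, degree: int, n_pairs: int) -> int:
    """Basis size when only the constant term, all first-order components,
    and n_pairs selected second-order components are expanded (assumed scheme)."""
    first_order = n_inputs * degree            # univariate terms of degree 1..degree
    per_pair = comb(degree, 2)                 # x_i^a * x_j^b with a, b >= 1 and a + b <= degree
    return 1 + first_order + n_pairs * per_pair


for n in (2, 10, 100):
    print(n, pce_basis_count(n, 3), anova_basis_count(n, 3, n_pairs=5))
```

With 100 random dimensions and degree 3, the full expansion already needs well over a hundred thousand basis functions, while the low-order count grows only linearly in the number of inputs plus the number of retained pairs.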
Asymptotic Stability of High-dimensional Zakharov-Kuznetsov Solitons
NASA Astrophysics Data System (ADS)
Côte, Raphaël; Muñoz, Claudio; Pilod, Didier; Simpson, Gideon
2016-05-01
We prove that solitons (or solitary waves) of the Zakharov-Kuznetsov (ZK) equation, a physically relevant high dimensional generalization of the Korteweg-de Vries (KdV) equation appearing in Plasma Physics, and having mixed KdV and nonlinear Schrödinger (NLS) dynamics, are strongly asymptotically stable in the energy space. We also prove that the sum of well-arranged solitons is stable in the same space. Orbital stability of ZK solitons is well-known since the work of de Bouard [Proc R Soc Edinburgh 126:89-112, 1996]. Our proofs follow the ideas of Martel [SIAM J Math Anal 157:759-781, 2006] and Martel and Merle [Math Ann 341:391-427, 2008], applied for generalized KdV equations in one dimension. In particular, we extend to the high dimensional case several monotonicity properties for suitable half-portions of mass and energy; we also prove a new Liouville type property that characterizes ZK solitons, and a key Virial identity for the linear and nonlinear part of the ZK dynamics, obtained independently of the mixed KdV-NLS dynamics. This last Virial identity relies on a simple sign condition which is numerically tested for the two and three dimensional cases with no additional spectral assumptions required. Possible extensions to higher dimensions and different nonlinearities could be obtained after a suitable local well-posedness theory in the energy space, and the verification of a corresponding sign condition.
Power Enhancement in High Dimensional Cross-Sectional Tests
Fan, Jianqing; Liao, Yuan; Yao, Jiawei
2016-01-01
We propose a novel technique to boost the power of testing a high-dimensional vector H : θ = 0 against sparse alternatives where the null hypothesis is violated only by a couple of components. Existing tests based on quadratic forms such as the Wald statistic often suffer from low powers due to the accumulation of errors in estimating high-dimensional parameters. More powerful tests for sparse alternatives such as thresholding and extreme-value tests, on the other hand, require either stringent conditions or bootstrap to derive the null distribution and often suffer from size distortions due to the slow convergence. Based on a screening technique, we introduce a “power enhancement component”, which is zero under the null hypothesis with high probability, but diverges quickly under sparse alternatives. The proposed test statistic combines the power enhancement component with an asymptotically pivotal statistic, and strengthens the power under sparse alternatives. The null distribution does not require stringent regularity conditions, and is completely determined by that of the pivotal statistic. As specific applications, the proposed methods are applied to testing the factor pricing models and validating the cross-sectional independence in panel data models. PMID:26778846
Sample size requirements for training high-dimensional risk predictors.
Dobbin, Kevin K; Song, Xiao
2013-09-01
A common objective of biomarker studies is to develop a predictor of patient survival outcome. Determining the number of samples required to train a predictor from survival data is important for designing such studies. Existing sample size methods for training studies use parametric models for the high-dimensional data and cannot handle a right-censored dependent variable. We present a new training sample size method that is non-parametric with respect to the high-dimensional vectors, and is developed for a right-censored response. The method can be applied to any prediction algorithm that satisfies a set of conditions. The sample size is chosen so that the expected performance of the predictor is within a user-defined tolerance of optimal. The central method is based on a pilot dataset. To quantify uncertainty, a method to construct a confidence interval for the tolerance is developed. Adequacy of the size of the pilot dataset is discussed. An alternative model-based version of our method for estimating the tolerance when no adequate pilot dataset is available is presented. The model-based method requires a covariance matrix be specified, but we show that the identity covariance matrix provides adequate sample size when the user specifies three key quantities. Application of the sample size method to two microarray datasets is discussed.
New data assimilation system DNDAS for high-dimensional models
NASA Astrophysics Data System (ADS)
Qun-bo, Huang; Xiao-qun, Cao; Meng-bin, Zhu; Wei-min, Zhang; Bai-nian, Liu
2016-05-01
Tangent linear (TL) models and adjoint (AD) models make the development of variational data assimilation systems very difficult. It may be impossible to develop them perfectly without great effort, whether by hand or with automatic differentiation tools. To overcome these limitations, a new data assimilation system, the dual-number data assimilation system (DNDAS), is designed based on the principles of dual-number automatic differentiation. We investigate the performance of DNDAS with two different optimization schemes and subsequently discuss whether DNDAS is appropriate for high-dimensional forecast models. The new data assimilation system avoids the complicated reverse integration of the adjoint model; it needs only the forward integration in dual-number space to obtain the cost function and its gradient vector concurrently. To verify the correctness and effectiveness of DNDAS, we implemented DNDAS on a simple ordinary differential model and the Lorenz-63 model with different optimization methods. We then concentrate on the adaptability of DNDAS to the Lorenz-96 model with high-dimensional state variables. The results indicate that, whether the system is simple or nonlinear, DNDAS can accurately reconstruct the initial condition for the forecast model and has a strong anti-noise characteristic. Given adequate computing resources, the quasi-Newton optimization method performs better than the conjugate gradient method in DNDAS. Project supported by the National Natural Science Foundation of China (Grant Nos. 41475094 and 41375113).
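The idea of obtaining a cost function and its derivative from a single forward integration can be sketched with a minimal dual-number class. This is not DNDAS itself: the `Dual` class, the scalar decay model dx/dt = -rate*x, the forward-Euler scheme and the observations are all assumptions chosen to keep the example small, and a vector state would need one derivative slot per component.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """Dual number val + der*eps with eps**2 == 0; der carries the derivative."""
    val: float
    der: float = 0.0

    def __add__(self, other):
        other = lift(other)
        return Dual(self.val + other.val, self.der + other.der)

    __radd__ = __add__

    def __sub__(self, other):
        other = lift(other)
        return Dual(self.val - other.val, self.der - other.der)

    def __mul__(self, other):
        other = lift(other)
        # Product rule: (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps
        return Dual(self.val * other.val, self.val * other.der + self.der * other.val)

    __rmul__ = __mul__


def lift(x):
    """Promote a plain number to a Dual with zero derivative."""
    return x if isinstance(x, Dual) else Dual(float(x))


def cost(x0, observations, dt=0.1, rate=0.5):
    """Squared misfit between a forward-Euler run of dx/dt = -rate*x and the observations."""
    x, total = x0, 0.0
    for y in observations:
        misfit = x - y
        total = total + misfit * misfit
        x = x + (-rate * dt) * x   # one forward-Euler step of the toy model
    return total


observations = [1.0, 0.9, 0.8]                # made-up data
result = cost(Dual(1.2, 1.0), observations)   # seed the derivative with d(x0)/d(x0) = 1
print(result.val, result.der)                 # cost and d(cost)/d(x0) from one forward pass
```

Because the same `cost` function also accepts plain floats, the derivative can be cross-checked against a finite difference without writing any adjoint code.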
Elucidating high-dimensional cancer hallmark annotation via enriched ontology.
Yan, Shankai; Wong, Ka-Chun
2017-09-01
Cancer hallmark annotation is a promising technique that could discover novel knowledge about cancer from the biomedical literature. The automated annotation of cancer hallmarks could reveal relevant cancer transformation processes in the literature or extract the articles that correspond to the cancer hallmark of interest. It acts as a complementary approach that can retrieve knowledge from massive text information, advancing numerous focused studies in cancer research. Nonetheless, the high-dimensional nature of cancer hallmark annotation imposes a unique challenge. To address the curse of dimensionality, we compared multiple cancer hallmark annotation methods on 1580 PubMed abstracts. Based on the insights, a novel approach, UDT-RF, which makes use of ontological features is proposed. It expands the feature space via the Medical Subject Headings (MeSH) ontology graph and utilizes novel feature selections for elucidating the high-dimensional cancer hallmark annotation space. To demonstrate its effectiveness, state-of-the-art methods are compared and evaluated by a multitude of performance metrics, revealing the full performance spectrum on the full set of cancer hallmarks. Several case studies are conducted, demonstrating how the proposed approach could reveal novel insights into cancers. https://github.com/cskyan/chmannot.
High-dimensional camera shake removal with given depth map.
Yue, Tao; Suo, Jinli; Dai, Qionghai
2014-06-01
Camera motion blur is drastically nonuniform for large-depth-range scenes: the nonuniformity caused by camera translation is depth dependent, whereas that caused by camera rotation is not. To restore the blurry images of large-depth-range scenes deteriorated by arbitrary camera motion, we build an image blur model that considers the 6 degrees of freedom (DoF) of camera motion with a given scene depth map. To make this 6D depth-aware model tractable, we propose a novel parametrization strategy to reduce the number of variables, as well as an effective method to estimate high-dimensional camera motion. The number of variables is reduced by a temporal sampling motion function, which describes the 6-DoF camera motion by sampling the camera trajectory uniformly in the time domain. To effectively estimate the high-dimensional camera motion parameters, we construct the probabilistic motion density function (PMDF) to describe the probability distribution of camera poses during exposure, and apply it as a unified constraint to guide the convergence of the iterative deblurring algorithm. Specifically, PMDF is computed through a back projection from 2D local blur kernels to the 6D camera motion parameter space, followed by robust voting. We conduct a series of experiments on both synthetic and real captured data, and validate that our method achieves better performance than existing uniform and nonuniform methods on large-depth-range scenes.
Mapping morphological shape as a high-dimensional functional curve.
Fu, Guifang; Huang, Mian; Bo, Wenhao; Hao, Han; Wu, Rongling
2017-01-06
Detecting how genes regulate biological shape has become a multidisciplinary research interest because of its wide application in many disciplines. Despite its fundamental importance, the challenges of accurately extracting information from an image, statistically modeling the high-dimensional shape and meticulously locating shape quantitative trait loci (QTL) affect the progress of this research. In this article, we propose a novel integrated framework that incorporates shape analysis, statistical curve modeling and genetic mapping to detect significant QTLs regulating variation of biological shape traits. After quantifying morphological shape via a radius centroid contour approach, each shape, as a phenotype, was characterized as a high-dimensional curve, varying as angle θ runs clockwise with the first point starting from angle zero. We then modeled the dynamic trajectories of three mean curves and variation patterns as functions of θ. Our framework led to the detection of a few significant QTLs regulating the variation of leaf shape collected from a natural population of poplar, Populus szechuanica var tibetica. This population, distributed at altitudes 2000-4500 m above sea level, is an evolutionarily important plant species. This is the first work in the quantitative genetic shape mapping area that emphasizes a sense of 'function' instead of decomposing the shape into a few discrete principal components, as the majority of shape studies do.
A reduced-order model from high-dimensional frictional hysteresis
Biswas, Saurabh; Chatterjee, Anindya
2014-01-01
Hysteresis in material behaviour includes both signum nonlinearities as well as high dimensionality. Available models for component-level hysteretic behaviour are empirical. Here, we derive a low-order model for rate-independent hysteresis from a high-dimensional massless frictional system. The original system, being given in terms of signs of velocities, is first solved incrementally using a linear complementarity problem formulation. From this numerical solution, to develop a reduced-order model, basis vectors are chosen using the singular value decomposition. The slip direction in generalized coordinates is identified as the minimizer of a dissipation-related function. That function includes terms for frictional dissipation through signum nonlinearities at many friction sites. Luckily, it allows a convenient analytical approximation. Upon solution of the approximated minimization problem, the slip direction is found. A final evolution equation for a few states is then obtained that gives a good match with the full solution. The model obtained here may lead to new insights into hysteresis as well as better empirical modelling thereof. PMID:24910522
Arif, Muhammad
2012-06-01
Feature extraction is an important step in pattern classification problems, and the quality of the features in discriminating between classes plays an important role. In real life, pattern classification may require a high-dimensional feature space, and it is impossible to visualize the feature space if its dimension is greater than four. In this paper, we propose a similarity-dissimilarity plot that projects a high-dimensional space onto a two-dimensional space while retaining the important characteristics required to assess the discrimination quality of the features. The similarity-dissimilarity plot can reveal information about the amount of overlap between the features of different classes. Separable data points of different classes, which can be classified correctly using an appropriate classifier, are also visible on the plot. Hence, approximate classification accuracy can be predicted. Moreover, it is possible to see with which class the misclassified data points will be confused by the classifier. Outlier data points can also be located on the similarity-dissimilarity plot. Various examples of synthetic data are used to highlight important characteristics of the proposed plot. Some real-life examples from biomedical data are also used for the analysis. The proposed plot is independent of the number of dimensions of the feature space.
A reduced-order model from high-dimensional frictional hysteresis.
Biswas, Saurabh; Chatterjee, Anindya
2014-06-08
Hysteresis in material behaviour includes both signum nonlinearities as well as high dimensionality. Available models for component-level hysteretic behaviour are empirical. Here, we derive a low-order model for rate-independent hysteresis from a high-dimensional massless frictional system. The original system, being given in terms of signs of velocities, is first solved incrementally using a linear complementarity problem formulation. From this numerical solution, to develop a reduced-order model, basis vectors are chosen using the singular value decomposition. The slip direction in generalized coordinates is identified as the minimizer of a dissipation-related function. That function includes terms for frictional dissipation through signum nonlinearities at many friction sites. Luckily, it allows a convenient analytical approximation. Upon solution of the approximated minimization problem, the slip direction is found. A final evolution equation for a few states is then obtained that gives a good match with the full solution. The model obtained here may lead to new insights into hysteresis as well as better empirical modelling thereof.
Centroid estimation in discrete high-dimensional spaces with applications in biology.
Carvalho, Luis E; Lawrence, Charles E
2008-03-04
Maximum likelihood estimators and other direct optimization-based estimators dominated statistical estimation and prediction for decades. Yet, the principled foundations supporting their dominance do not apply to the discrete high-dimensional inference problems of the 21st century. As is well known, statistical decision theory shows that maximum likelihood and related estimators use data only to identify the single most probable solution. Accordingly, unless this one solution so dominates the immense ensemble of all solutions that its probability is near one, there is no principled reason to expect such an estimator to be representative of the posterior-weighted ensemble of solutions, and thus represent inferences drawn from the data. We employ statistical decision theory to find more representative estimators, centroid estimators, in a general high-dimensional discrete setting by using a family of loss functions with penalties that increase with the number of differences in components. We show that centroid estimates are obtained by maximizing the marginal probabilities of the solution components for unconstrained ensembles and for an important class of problems, including sequence alignment and the prediction of RNA secondary structure, whose ensembles contain exclusivity constraints. Three genomics examples are described that show that these estimators substantially improve predictions of ground-truth reference sets.
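A toy example shows how a centroid estimate can differ from the single most probable solution. The sketch assumes the simplest loss of the kind described, Hamming distance on binary vectors, and an unconstrained ensemble, in which case the centroid keeps every component whose marginal probability exceeds one half. The posterior below is invented for illustration.

```python
from typing import Dict, Tuple

Solution = Tuple[int, ...]


def map_estimate(posterior: Dict[Solution, float]) -> Solution:
    """Single most probable solution."""
    return max(posterior, key=posterior.get)


def centroid_estimate(posterior: Dict[Solution, float]) -> Solution:
    """Componentwise estimate: set a component to 1 when its marginal probability exceeds 0.5."""
    n = len(next(iter(posterior)))
    marginals = [sum(p for s, p in posterior.items() if s[i] == 1) for i in range(n)]
    return tuple(1 if m > 0.5 else 0 for m in marginals)


def expected_hamming_loss(estimate: Solution, posterior: Dict[Solution, float]) -> float:
    """Posterior-expected number of components in which the estimate differs from the truth."""
    return sum(p * sum(a != b for a, b in zip(estimate, s)) for s, p in posterior.items())


# Made-up posterior over binary vectors of length 3.
posterior = {
    (0, 0, 0): 0.30,
    (1, 1, 0): 0.25,
    (1, 0, 1): 0.25,
    (1, 1, 1): 0.20,
}
for name, est in (("MAP", map_estimate(posterior)), ("centroid", centroid_estimate(posterior))):
    print(name, est, round(expected_hamming_loss(est, posterior), 3))
```

Here the most probable vector sets the first component to 0 although 70% of the posterior mass has it at 1; the centroid follows the marginals instead and achieves a lower expected loss.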
Node Detection Using High-Dimensional Fuzzy Parcellation Applied to the Insular Cortex.
Vercelli, Ugo; Diano, Matteo; Costa, Tommaso; Nani, Andrea; Duca, Sergio; Geminiani, Giuliano; Vercelli, Alessandro; Cauda, Franco
2016-01-01
Several functional connectivity approaches require the definition of a set of regions of interest (ROIs) that act as network nodes. Different methods have been developed to define these nodes and to derive their functional and effective connections, most of which are rather complex. Here we aim to propose a relatively simple "one-step" border detection and ROI estimation procedure employing the fuzzy c-mean clustering algorithm. To test this procedure and to explore insular connectivity beyond the two/three-region model currently proposed in the literature, we parcellated the insular cortex of 20 healthy right-handed volunteers scanned in a resting state. By employing a high-dimensional functional connectivity-based clustering process, we confirmed the two patterns of connectivity previously described. This method revealed a complex pattern of functional connectivity where the two previously detected insular clusters are subdivided into several other networks, some of which are not commonly associated with the insular cortex, such as the default mode network and parts of the dorsal attentional network. Furthermore, the detection of nodes was reliable, as demonstrated by the confirmative analysis performed on a replication group of subjects.
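The fuzzy c-means clustering step can be sketched in pure Python. This is the textbook alternating update of centers and memberships, not the authors' pipeline: the fuzzifier m = 2, the random initialization, the fixed iteration count and the six two-dimensional points (standing in for voxel connectivity profiles) are all assumptions.

```python
import random
from typing import List, Sequence, Tuple


def fuzzy_c_means(
    points: Sequence[Sequence[float]],
    n_clusters: int,
    m: float = 2.0,
    n_iter: int = 100,
    seed: int = 0,
) -> Tuple[List[List[float]], List[List[float]]]:
    """Return (centers, memberships); memberships[i][k] is the degree to which point i belongs to cluster k."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships, each row normalized to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])
    centers: List[List[float]] = []
    for _ in range(n_iter):
        # Center update: mean of the points weighted by membership**m.
        centers = []
        for k in range(n_clusters):
            weights = [u[i][k] ** m for i in range(n)]
            wsum = sum(weights)
            centers.append(
                [sum(w * p[d] for w, p in zip(weights, points)) / wsum for d in range(dim)]
            )
        # Membership update: proportional to distance**(-2 / (m - 1)).
        for i, p in enumerate(points):
            dists = [
                max(sum((a - b) ** 2 for a, b in zip(p, c)) ** 0.5, 1e-12) for c in centers
            ]
            inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
            total = sum(inv)
            u[i] = [v / total for v in inv]
    return centers, u


# Hypothetical data: two well-separated groups of three points each.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centers, memberships = fuzzy_c_means(data, n_clusters=2)
hard_labels = [max(range(2), key=lambda k: row[k]) for row in memberships]
print(hard_labels)
```

Unlike hard k-means, every point keeps a graded membership in each cluster, so borders between regions can be read off from where the memberships cross rather than from a forced assignment.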
Node Detection Using High-Dimensional Fuzzy Parcellation Applied to the Insular Cortex
Vercelli, Ugo; Diano, Matteo; Costa, Tommaso; Nani, Andrea; Duca, Sergio; Geminiani, Giuliano; Vercelli, Alessandro; Cauda, Franco
2016-01-01
Several functional connectivity approaches require the definition of a set of regions of interest (ROIs) that act as network nodes. Different methods have been developed to define these nodes and to derive their functional and effective connections, most of which are rather complex. Here we aim to propose a relatively simple “one-step” border detection and ROI estimation procedure employing the fuzzy c-mean clustering algorithm. To test this procedure and to explore insular connectivity beyond the two/three-region model currently proposed in the literature, we parcellated the insular cortex of 20 healthy right-handed volunteers scanned in a resting state. By employing a high-dimensional functional connectivity-based clustering process, we confirmed the two patterns of connectivity previously described. This method revealed a complex pattern of functional connectivity where the two previously detected insular clusters are subdivided into several other networks, some of which are not commonly associated with the insular cortex, such as the default mode network and parts of the dorsal attentional network. Furthermore, the detection of nodes was reliable, as demonstrated by the confirmative analysis performed on a replication group of subjects. PMID:26881093
CyTOF workflow: differential discovery in high-throughput high-dimensional cytometry datasets.
Nowicka, Malgorzata; Krieg, Carsten; Weber, Lukas M; Hartmann, Felix J; Guglietta, Silvia; Becher, Burkhard; Levesque, Mitchell P; Robinson, Mark D
2017-01-01
High dimensional mass and flow cytometry (HDCyto) experiments have become a method of choice for high throughput interrogation and characterization of cell populations. Here, we present an R-based pipeline for differential analyses of HDCyto data, largely based on Bioconductor packages. We computationally define cell populations using FlowSOM clustering, and facilitate an optional but reproducible strategy for manual merging of algorithm-generated clusters. Our workflow offers different analysis paths, including association of cell type abundance with a phenotype or changes in signaling markers within specific subpopulations, or differential analyses of aggregated signals. Importantly, the differential analyses we show are based on regression frameworks where the HDCyto data is the response; thus, we are able to model arbitrary experimental designs, such as those with batch effects, paired designs and so on. In particular, we apply generalized linear mixed models to analyses of cell population abundance or cell-population-specific analyses of signaling markers, allowing overdispersion in cell count or aggregated signals across samples to be appropriately modeled. To support the formal statistical analyses, we encourage exploratory data analysis at every step, including quality control (e.g. multi-dimensional scaling plots), reporting of clustering results (dimensionality reduction, heatmaps with dendrograms) and differential analyses (e.g. plots of aggregated signals).
Baker-Henningham, Helen; Scott, Stephen; Jones, Kelvyn; Walker, Susan
2012-01-01
Background: There is an urgent need for effective, affordable interventions to prevent child mental health problems in low- and middle-income countries. Aims: To determine the effects of a universal pre-school-based intervention on child conduct problems and social skills at school and at home. Method: In a cluster randomised design, 24 community pre-schools in inner-city areas of Kingston, Jamaica, were randomly assigned to receive the Incredible Years Teacher Training intervention (n = 12) or to a control group (n = 12). Three children from each class with the highest levels of teacher-reported conduct problems were selected for evaluation, giving 225 children aged 3–6 years. The primary outcome was observed child behaviour at school. Secondary outcomes were child behaviour by parent and teacher report, child attendance and parents’ attitude to school. The study is registered as ISRCTN35476268. Results: Children in intervention schools showed significantly reduced conduct problems (effect size (ES) = 0.42) and increased friendship skills (ES = 0.74) through observation, significant reductions to teacher-reported (ES = 0.47) and parent-reported (ES = 0.22) behaviour difficulties and increases in teacher-reported social skills (ES = 0.59) and child attendance (ES = 0.30). Benefits to parents’ attitude to school were not significant. Conclusions: A low-cost, school-based intervention in a middle-income country substantially reduces child conduct problems and increases child social skills at home and at school. PMID:22500015
Nam, Julia EunJu; Mueller, Klaus
2013-02-01
Gaining a true appreciation of high-dimensional space remains difficult since all of the existing high-dimensional space exploration techniques serialize the space travel in some way. This is not so foreign to us since we, when traveling, also experience the world in a serial fashion. But we typically have access to a map to help with positioning, orientation, navigation, and trip planning. Here, we propose a multivariate data exploration tool that compares high-dimensional space navigation with a sightseeing trip. It decomposes this activity into five major tasks: 1) Identify the sights: use a map to identify the sights of interest and their location; 2) Plan the trip: connect the sights of interest along a specifiable path; 3) Go on the trip: travel along the route; 4) Hop off the bus: experience the location, look around, zoom into detail; and 5) Orient and localize: regain bearings in the map. We describe intuitive and interactive tools for all of these tasks, both global navigation within the map and local exploration of the data distributions. For the latter, we describe a polygonal touchpad interface which enables users to smoothly tilt the projection plane in high-dimensional space to produce multivariate scatterplots that best convey the data relationships under investigation. Motion parallax and illustrative motion trails aid in the perception of these transient patterns. We describe the use of our system within two applications: 1) the exploratory discovery of data configurations that best fit a personal preference in the presence of tradeoffs and 2) interactive cluster analysis via cluster sculpting in N-D.
Evangelista, Francesco A; Gauss, Jürgen
2010-07-28
In this communication we report the results of our studies on the orbital invariance properties of the state-specific multireference coupled cluster approach suggested by Mukherjee and co-workers (Mk-MRCC). In particular, we have gathered numerical evidence to show that even when the linear excitation manifold is modified in order to span the same space for each reference, the resulting method is not orbital invariant. In order to test this conjecture we have proposed a new truncation scheme (Mk-MRCCSDtq) which, in addition to full single and double excitations, contains partial triple and quadruple excitations. For a reference space generated by all possible combinations of two electrons in two orbitals, the linear excitation manifold of Mk-MRCCSDtq spans the same set for each reference determinant. Mk-MRCCSDtq is found to lack energy invariance for rotations among active molecular orbitals but it is less sensitive to orbital rotations than the conventional scheme which includes only singles and doubles (Mk-MRCCSD). Nevertheless, Mk-MRCCSDtq is a very accurate method, superior with respect to multireference configuration interaction approaches, and competitive with the active-space coupled cluster method and the MRexpT ansatz.
Problems in resumming interjet energy flows with kt clustering [rapid communication]
NASA Astrophysics Data System (ADS)
Banfi, A.; Dasgupta, M.
2005-11-01
We consider the energy flow into gaps between hard jets. It was previously believed that the accuracy of resummed predictions for such observables can be improved by employing the kt clustering procedure to define the gap energy in terms of a sum of energies of soft jets (rather than individual hadrons) in the gap. This significantly reduces the sensitivity to correlated soft large-angle radiation (non-global leading logs), numerically calculable only in the large Nc limit. While this is the case, as we demonstrate here, the use of kt clustering spoils the straightforward single-gluon Sudakov exponentiation that multiplies the non-global resummation. We carry out an O(αs²) calculation of the leading single-logarithmic terms and identify the piece that is omitted by straightforward exponentiation. We compare our results with the full O(αs²) result from the program EVENT2 to confirm our conclusions. For e+e- → 2 jets and DIS (1 + 1) jets one can numerically resum these additional contributions as we show, but for dijet photoproduction and hadron-hadron processes further studies are needed.
Nakano, Takashi; Otsuka, Makoto; Yoshimoto, Junichiro; Doya, Kenji
2015-01-01
A theoretical framework of reinforcement learning plays an important role in understanding action selection in animals. Spiking neural networks provide a theoretically grounded means to test computational hypotheses on neurally plausible algorithms of reinforcement learning through numerical simulation. However, most of these models cannot handle observations which are noisy, or occurred in the past, even though these are inevitable and constraining features of learning in real environments. This class of problem is formally known as partially observable reinforcement learning (PORL) problems. It provides a generalization of reinforcement learning to partially observable domains. In addition, observations in the real world tend to be rich and high-dimensional. In this work, we use a spiking neural network model to approximate the free energy of a restricted Boltzmann machine and apply it to the solution of PORL problems with high-dimensional observations. Our spiking network model solves maze tasks with perceptually ambiguous high-dimensional observations without knowledge of the true environment. An extended model with working memory also solves history-dependent tasks. The way spiking neural networks handle PORL problems may provide a glimpse into the underlying laws of neural information processing which can only be discovered through such a top-down approach. PMID:25734662
Syllabification of Final Consonant Clusters: A Salient Pronunciation Problem of Kurdish EFL Learners
ERIC Educational Resources Information Center
Keshavarz, Mohammad Hossein
2017-01-01
While there is a plethora of research on pronunciation problems of EFL learners with different L1 backgrounds, published empirical studies on syllabification errors of Iraqi Kurdish EFL learners are scarce. Therefore, to contribute to this line of research, the present study set out to investigate difficulties of this group of learners in the…
Additivity Principle in High-Dimensional Deterministic Systems
NASA Astrophysics Data System (ADS)
Saito, Keiji; Dhar, Abhishek
2011-12-01
The additivity principle (AP), conjectured by Bodineau and Derrida [Phys. Rev. Lett. 92, 180601 (2004), doi:10.1103/PhysRevLett.92.180601], is discussed for the case of heat conduction in three-dimensional disordered harmonic lattices to consider the effects of deterministic dynamics, higher dimensionality, and different transport regimes, i.e., ballistic, diffusive, and anomalous transport. The cumulant generating function (CGF) for heat transfer is accurately calculated and compared with the one given by the AP. In the diffusive regime, we find a clear agreement with the conjecture even if the system is high dimensional. Surprisingly, even in the anomalous regime the CGF is also well fitted by the AP. Lower-dimensional systems are also studied and the importance of three dimensionality for the validity is stressed.
Additivity principle in high-dimensional deterministic systems.
Saito, Keiji; Dhar, Abhishek
2011-12-16
The additivity principle (AP), conjectured by Bodineau and Derrida [Phys. Rev. Lett. 92, 180601 (2004)], is discussed for the case of heat conduction in three-dimensional disordered harmonic lattices to consider the effects of deterministic dynamics, higher dimensionality, and different transport regimes, i.e., ballistic, diffusive, and anomalous transport. The cumulant generating function (CGF) for heat transfer is accurately calculated and compared with the one given by the AP. In the diffusive regime, we find a clear agreement with the conjecture even if the system is high dimensional. Surprisingly, even in the anomalous regime the CGF is also well fitted by the AP. Lower-dimensional systems are also studied and the importance of three dimensionality for the validity is stressed.
A quantum router for high-dimensional entanglement
NASA Astrophysics Data System (ADS)
Erhard, Manuel; Malik, Mehul; Zeilinger, Anton
2017-03-01
In addition to being a workhorse for modern quantum technologies, entanglement plays a key role in fundamental tests of quantum mechanics. The entanglement of photons in multiple levels, or dimensions, explores the limits of how large an entangled state can be, while also greatly expanding its applications in quantum information. Here we show how a high-dimensional quantum state of two photons entangled in their orbital angular momentum can be split into two entangled states with a smaller dimensionality structure. Our work demonstrates that entanglement is a quantum property that can be subdivided into spatially separated parts. In addition, our technique has vast potential applications in quantum as well as classical communication systems.
Algorithmic Tools for Mining High-Dimensional Cytometry Data.
Chester, Cariad; Maecker, Holden T
2015-08-01
The advent of mass cytometry has led to an unprecedented increase in the number of analytes measured in individual cells, thereby increasing the complexity and information content of cytometric data. Although this technology is ideally suited to the detailed examination of the immune system, the applicability of the different methods for analyzing such complex data is less clear. Conventional data analysis by manual gating of cells in biaxial dot plots is often subjective, time consuming, and neglectful of much of the information contained in a highly dimensional cytometric dataset. Algorithmic data mining has the promise to eliminate these concerns, and several such tools have been applied recently to mass cytometry data. We review computational data mining tools that have been used to analyze mass cytometry data, outline their differences, and comment on their strengths and limitations. This review will help immunologists to identify suitable algorithmic tools for their particular projects. Copyright © 2015 by The American Association of Immunologists, Inc.
High-dimensional quantum nature of ghost angular Young's diffraction
Chen Lixiang; Leach, Jonathan; Jack, Barry; Padgett, Miles J.; Franke-Arnold, Sonja; She Weilong
2010-09-15
We propose a technique to characterize the dimensionality of entangled sources affected by any environment, including phase and amplitude masks or atmospheric turbulence. We illustrate this technique on the example of angular ghost diffraction using the orbital angular momentum (OAM) spectrum generated by a nonlocal double slit. We realize a nonlocal angular double slit by placing single angular slits in the paths of the signal and idler modes of the entangled light field generated by parametric down-conversion. Based on the observed OAM spectrum and the measured Shannon dimensionality spectrum of the possible quantum channels that contribute to Young's ghost diffraction, we calculate the associated dimensionality D_total. The measured D_total ranges between 1 and 2.74 depending on the opening angle of the angular slits. The ability to quantify the nature of high-dimensional entanglement is vital when considering quantum information protocols.
High dimensional reflectance analysis of soil organic matter
NASA Technical Reports Server (NTRS)
Henderson, T. L.; Baumgardner, M. F.; Franzmeier, D. P.; Stott, D. E.; Coster, D. C.
1992-01-01
Recent breakthroughs in remote-sensing technology have led to the development of high spectral resolution imaging sensors for observation of earth surface features. This research was conducted to evaluate the effects of organic matter content and composition on narrowband soil reflectance across the visible and reflective infrared spectral ranges. Organic matter from four Indiana agricultural soils, ranging in organic C content from 0.99 to 1.72 percent, was extracted, fractionated, and purified. Six components of each soil were isolated and prepared for spectral analysis. Reflectance was measured in 210 narrow bands in the 400- to 2500-nm wavelength range. Statistical analysis of reflectance values indicated the potential of high dimensional reflectance data in specific visible, near-infrared, and middle-infrared bands to provide information about soil organic C content, but not organic matter composition. These bands also responded significantly to Fe- and Mn-oxide content.
Modeling for Process Control: High-Dimensional Systems
Lev S. Tsimring
2008-09-15
Most other technologically important systems (among them, powders and other granular systems) are intrinsically nonlinear. This project is focused on building dynamical models for granular systems as a prototype for nonlinear high-dimensional systems exhibiting complex non-equilibrium phenomena. Granular materials present a unique opportunity to study these issues in a technologically important and yet fundamentally interesting setting. Granular systems exhibit a rich variety of regimes from gas-like to solid-like depending on the external excitation. Based on a combination of rigorous asymptotic analysis, available experimental data and nonlinear signal processing tools, we developed a multi-scale approach to the modeling of granular systems, from a detailed description of grain-grain interaction on the micro-scale to continuous modeling of large-scale granular flows with important geophysical applications.
Future of High-Dimensional Data-Driven Exoplanet Science
NASA Astrophysics Data System (ADS)
Ford, Eric B.
2016-03-01
The detection and characterization of exoplanets has come a long way since the 1990s. For example, instruments specifically designed for Doppler planet surveys feature environmental controls to minimize instrumental effects and advanced calibration systems. Combining these instruments with powerful telescopes, astronomers have detected thousands of exoplanets. The application of Bayesian algorithms has improved the quality and reliability with which astronomers characterize the masses and orbits of exoplanets. Thanks to continued improvements in instrumentation, the detection of extrasolar low-mass planets is now limited primarily by stellar activity, rather than observational uncertainties. This presents a new set of challenges which will require cross-disciplinary research to combine improved statistical algorithms with an astrophysical understanding of stellar activity and the details of astronomical instrumentation. I describe these challenges and outline the roles of parameter estimation over high-dimensional parameter spaces, marginalizing over uncertainties in stellar astrophysics and machine learning for the next generation of Doppler planet searches.
Statistical validation of high-dimensional models of growing networks
NASA Astrophysics Data System (ADS)
Medo, Matúš
2014-03-01
The abundance of models of complex networks and the current insufficient validation standards make it difficult to judge which models are strongly supported by data and which are not. We focus here on likelihood maximization methods for models of growing networks with many parameters and compare their performance on artificial and real datasets. While high dimensionality of the parameter space harms the performance of direct likelihood maximization on artificial data, this can be improved by introducing a suitable penalization term. Likelihood maximization on real data shows that the presented approach is able to discriminate among available network models. To make large-scale datasets accessible to this kind of analysis, we propose a subset sampling technique and show that it yields substantial model evidence in a fraction of time necessary for the analysis of the complete data.
Orbital angular momentum analysis of high-dimensional entanglement
Peeters, W. H.; Exter, M. P. van; Verstegen, E. J. K.
2007-10-15
We describe a simple experiment that is ideally suited to analyze the high-dimensional entanglement contained in the orbital angular momenta (OAM) of entangled photon pairs. For this purpose we use a two-photon interferometer with a built-in image rotator and measure the two-photon visibility versus rotation angle. Mode selection with apertures allows one to tune the dimensionality of the entanglement; mode selection with spiral phase plates and fibers allows detection of a single OAM mode. The experiment is analyzed in two different ways: either via the continuous two-photon amplitude function or via a discrete modal (Schmidt) decomposition of this function. The latter approach proves to be very fruitful, as it provides a complete characterization of the OAM entanglement.
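As a rough illustration of how a discrete modal (Schmidt) decomposition can quantify dimensionality, the sketch below computes the Schmidt number K = 1 / Σ p_l² from a list of modal weights. The function name and the example weights are hypothetical and are not taken from the paper; K is the commonly used effective-mode-count measure, and it may differ from the exact quantity the authors report.

```python
def schmidt_number(weights):
    """Effective number of entangled modes, K = 1 / sum(p_l^2).

    `weights` are non-negative modal weights (for example, the
    probability of finding the photon pair in each OAM mode).
    They are normalized to sum to 1 before K is computed.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    probs = [w / total for w in weights]
    return 1.0 / sum(p * p for p in probs)


# Hypothetical example: four equally weighted modes.
k_uniform = schmidt_number([1, 1, 1, 1])
```

With equal weights K equals the number of modes, and it drops toward 1 as a single mode dominates, which is the sense in which restricting the modes with apertures would tune the dimensionality.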
Spectral feature design in high dimensional multispectral data
NASA Technical Reports Server (NTRS)
Chen, Chih-Chien Thomas; Landgrebe, David A.
1988-01-01
The High resolution Imaging Spectrometer (HIRIS) is designed to acquire images simultaneously in 192 spectral bands in the 0.4 to 2.5 micrometers wavelength region. It will make possible the collection of essentially continuous reflectance spectra at a spectral resolution sufficient to extract significantly enhanced amounts of information from return signals as compared to existing systems. The advantages of such high dimensional data come at a cost of increased system and data complexity. For example, since the finer the spectral resolution, the higher the data rate, it becomes impractical to design the sensor to be operated continuously. It is essential to find new ways to preprocess the data which reduce the data rate while at the same time maintaining the information content of the high dimensional signal produced. Four spectral feature design techniques are developed from the Weighted Karhunen-Loeve Transforms: (1) non-overlapping band feature selection algorithm; (2) overlapping band feature selection algorithm; (3) Walsh function approach; and (4) infinite clipped optimal function approach. The infinite clipped optimal function approach is chosen since the features are easiest to find and their classification performance is the best. After the preprocessed data has been received at the ground station, canonical analysis is further used to find the best set of features under the criterion that maximal class separability is achieved. Both 100 dimensional vegetation data and 200 dimensional soil data were used to test the spectral feature design system. It was shown that the infinite clipped versions of the first 16 optimal features had excellent classification performance. The overall probability of correct classification is over 90 percent while providing for a reduced downlink data rate by a factor of 10.
Shaffer, Patrick; Valsson, Omar; Parrinello, Michele
2016-02-02
The capabilities of molecular simulations have been greatly extended by a number of widely used enhanced sampling methods that facilitate escaping from metastable states and crossing large barriers. Despite these developments there are still many problems which remain out of reach for these methods which has led to a vigorous effort in this area. One of the most important problems that remains unsolved is sampling high-dimensional free-energy landscapes and systems that are not easily described by a small number of collective variables. In this work we demonstrate a new way to compute free-energy landscapes of high dimensionality based on the previously introduced variationally enhanced sampling, and we apply it to the miniprotein chignolin.
Shaffer, Patrick; Valsson, Omar; Parrinello, Michele
2016-01-01
The capabilities of molecular simulations have been greatly extended by a number of widely used enhanced sampling methods that facilitate escaping from metastable states and crossing large barriers. Despite these developments there are still many problems which remain out of reach for these methods which has led to a vigorous effort in this area. One of the most important problems that remains unsolved is sampling high-dimensional free-energy landscapes and systems that are not easily described by a small number of collective variables. In this work we demonstrate a new way to compute free-energy landscapes of high dimensionality based on the previously introduced variationally enhanced sampling, and we apply it to the miniprotein chignolin. PMID:26787868
GX-Means: A model-based divide and merge algorithm for geospatial image clustering
Vatsavai, Raju; Symons, Christopher T; Chandola, Varun; Jun, Goo
2011-01-01
One of the practical issues in clustering is the specification of the appropriate number of clusters, which is not obvious when analyzing geospatial datasets, partly because they are huge (both in size and spatial extent) and high dimensional. In this paper we present a computationally efficient model-based split and merge clustering algorithm that incrementally finds model parameters and the number of clusters. Additionally, we attempt to provide insights into this problem and other data mining challenges that are encountered when clustering geospatial data. The basic algorithm we present is similar to the G-means and X-means algorithms; however, our proposed approach avoids certain limitations of these well-known clustering algorithms that are pertinent when dealing with geospatial data. We compare the performance of our approach with the G-means and X-means algorithms. Experimental evaluation on simulated data and on multispectral and hyperspectral remotely sensed image data demonstrates the effectiveness of our algorithm.
Franklin, Jessica M; Eddings, Wesley; Glynn, Robert J; Schneeweiss, Sebastian
2015-10-01
Selection and measurement of confounders is critical for successful adjustment in nonrandomized studies. Although the principles behind confounder selection are now well established, variable selection for confounder adjustment remains a difficult problem in practice, particularly in secondary analyses of databases. We present a simulation study that compares the high-dimensional propensity score algorithm for variable selection with approaches that utilize direct adjustment for all potential confounders via regularized regression, including ridge regression and lasso regression. Simulations were based on 2 previously published pharmacoepidemiologic cohorts and used the plasmode simulation framework to create realistic simulated data sets with thousands of potential confounders. Performance of methods was evaluated with respect to bias and mean squared error of the estimated effects of a binary treatment. Simulation scenarios varied the true underlying outcome model, treatment effect, prevalence of exposure and outcome, and presence of unmeasured confounding. Across scenarios, high-dimensional propensity score approaches generally performed better than regularized regression approaches. However, including the variables selected by lasso regression in a regular propensity score model also performed well and may provide a promising alternative variable selection method.
Yu, Hualong; Ni, Jun
2014-01-01
Training classifiers on skewed data can be technically challenging, and the task becomes more difficult still when the data are also high-dimensional. Skewed data are common in biomedicine. In this study, we address this problem by combining the asymmetric bagging ensemble classifier (asBagging) presented in previous work with an improved random subspace (RS) generation strategy called feature subspace (FSS). Specifically, FSS is a novel method for improving the balance between accuracy and diversity of the base classifiers in asBagging. In view of the strong generalization capability of the support vector machine (SVM), we adopt it as the base classifier. Extensive experiments on four benchmark biomedical data sets indicate that the proposed ensemble learning method outperforms many baseline approaches in terms of the Accuracy, F-measure, G-mean and AUC evaluation criteria; it can thus be regarded as an effective and efficient tool for dealing with high-dimensional and imbalanced biomedical data.
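The general idea of pairing asymmetric bagging with a feature subspace can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' FSS strategy: it uses a plain random subspace, the function and parameter names are hypothetical, and training the base classifier (an SVM in the paper) on each bag is left out.

```python
import random


def asymmetric_bags(labels, n_features, n_bags, subspace_size, minority=1, seed=0):
    """Build (row_indices, feature_indices) pairs for asymmetric bagging.

    Each bag keeps every minority-class row and adds an equally sized
    bootstrap sample (drawn with replacement) of majority-class rows,
    so each bag is class-balanced. A random feature subset (drawn
    without replacement) is attached to each bag.
    """
    rng = random.Random(seed)
    minority_rows = [i for i, y in enumerate(labels) if y == minority]
    majority_rows = [i for i, y in enumerate(labels) if y != minority]
    if not minority_rows or not majority_rows:
        raise ValueError("both classes must be present")
    bags = []
    for _ in range(n_bags):
        # One majority draw per minority row keeps the bag balanced.
        sampled_majority = [rng.choice(majority_rows) for _ in minority_rows]
        features = sorted(rng.sample(range(n_features), subspace_size))
        bags.append((minority_rows + sampled_majority, features))
    return bags


# Hypothetical toy data: 2 minority rows out of 8, 10 features.
toy_labels = [1, 0, 0, 0, 0, 0, 1, 0]
toy_bags = asymmetric_bags(toy_labels, n_features=10, n_bags=3, subspace_size=4)
```

A base classifier would then be fitted on each bag's rows restricted to that bag's features, and the ensemble prediction obtained by voting.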
Using High-Dimensional Image Models to Perform Highly Undetectable Steganography
NASA Astrophysics Data System (ADS)
Pevný, Tomáš; Filler, Tomáš; Bas, Patrick
This paper presents a complete methodology for designing practical and highly undetectable stegosystems for real digital media. The main design principle is to minimize a suitably defined distortion by means of an efficient coding algorithm. The distortion is defined as a weighted difference of extended state-of-the-art feature vectors already used in steganalysis. This allows us to "preserve" the model used by the steganalyst and thus be undetectable even for large payloads. This framework can be efficiently implemented even when the dimensionality of the feature set used by the embedder is larger than 10^7. The high-dimensional model is necessary to avoid known security weaknesses. Although high-dimensional models might be a problem in steganalysis, we explain why they are acceptable in steganography. As an example, we introduce HUGO, a new embedding algorithm for spatial-domain digital images, and we contrast its performance with LSB matching. On the BOWS2 image database, HUGO allows the embedder to hide a 7× longer message than LSB matching at the same security level.
A method for analysis of phenotypic change for phenotypes described by high-dimensional data
Collyer, M L; Sekora, D J; Adams, D C
2015-01-01
The analysis of phenotypic change is important for several evolutionary biology disciplines, including phenotypic plasticity, evolutionary developmental biology, morphological evolution, physiological evolution, evolutionary ecology and behavioral evolution. It is common for researchers in these disciplines to work with multivariate phenotypic data. When phenotypic variables exceed the number of research subjects—data called ‘high-dimensional data'—researchers are confronted with analytical challenges. Parametric tests that require high observation to variable ratios present a paradox for researchers, as eliminating variables potentially reduces effect sizes for comparative analyses, yet test statistics require more observations than variables. This problem is exacerbated with data that describe ‘multidimensional' phenotypes, whereby a description of phenotype requires high-dimensional data. For example, landmark-based geometric morphometric data use the Cartesian coordinates of (potentially) many anatomical landmarks to describe organismal shape. Collectively such shape variables describe organism shape, although the analysis of each variable, independently, offers little benefit for addressing biological questions. Here we present a nonparametric method of evaluating effect size that is not constrained by the number of phenotypic variables, and motivate its use with example analyses of phenotypic change using geometric morphometric data. Our examples contrast different characterizations of body shape for a desert fish species, associated with measuring and comparing sexual dimorphism between two populations. We demonstrate that using more phenotypic variables can increase effect sizes, and allow for stronger inferences. PMID:25204302
The detection of globular clusters in galaxies as a data mining problem
NASA Astrophysics Data System (ADS)
Brescia, Massimo; Cavuoti, Stefano; Paolillo, Maurizio; Longo, Giuseppe; Puzia, Thomas
2012-04-01
We present an application of self-adaptive supervised learning classifiers derived from the machine learning paradigm to the identification of candidate globular clusters in deep, wide-field, single-band Hubble Space Telescope (HST) images. Several methods provided by the DAta Mining and Exploration (DAME) web application were tested and compared on the NGC 1399 HST data described by Paolillo and collaborators in a companion paper. The best results were obtained using a multilayer perceptron with quasi-Newton learning rule which achieved a classification accuracy of 98.3 per cent, with a completeness of 97.8 per cent and contamination of 1.6 per cent. An extensive set of experiments revealed that the use of accurate structural parameters (effective radius, central surface brightness) does improve the final result, but only by ~5 per cent. It is also shown that the method is capable of retrieving even extreme sources (for instance, very extended objects) which are missed by more traditional approaches.
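The completeness and contamination figures quoted above can be related to classifier counts as in the sketch below. It assumes the definitions commonly used in astronomical classification (completeness as the fraction of true objects recovered, contamination as the fraction of selected objects that are spurious); the abstract does not spell out its definitions, and the counts in the example are hypothetical.

```python
def completeness_and_contamination(tp, fp, fn):
    """Return (completeness, contamination) from classification counts.

    tp: true globular clusters classified as clusters
    fp: non-clusters classified as clusters
    fn: true globular clusters that were missed
    """
    if tp + fn == 0 or tp + fp == 0:
        raise ValueError("need at least one true and one selected object")
    completeness = tp / (tp + fn)   # fraction of real clusters recovered
    contamination = fp / (tp + fp)  # fraction of the selected sample that is spurious
    return completeness, contamination


# Hypothetical counts, not taken from the paper.
comp, cont = completeness_and_contamination(tp=90, fp=10, fn=10)
```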
Random Walk on the High-Dimensional IIC
NASA Astrophysics Data System (ADS)
Heydenreich, Markus; van der Hofstad, Remco; Hulshof, Tim
2014-07-01
We study the asymptotic behavior of the exit times of random walk from Euclidean balls around the origin of the incipient infinite cluster in a manner inspired by Kumagai and Misumi (J Theor Probab 21:910-935,
NASA Technical Reports Server (NTRS)
Pinsonneault, Marc H.; Stauffer, John; Soderblom, David R.; King, Jeremy R.; Hanson, Robert B.
1998-01-01
Parallax data from the Hipparcos mission allow the direct distance to open clusters to be compared with the distance inferred from main-sequence (MS) fitting. There are surprising differences between the two distance measurements, indicating either the need for changes in the cluster compositions or reddening, underlying problems with the technique of MS fitting, or systematic errors in the Hipparcos parallaxes at the 1 mas level. We examine the different possibilities, focusing on MS fitting in both metallicity-sensitive B-V and metallicity-insensitive V-I for five well-studied systems (the Hyades, Pleiades, alpha Per, Praesepe, and Coma Ber). The Hipparcos distances to the Hyades and alpha Per are within 1 sigma of the MS-fitting distance in B-V and V-I, while the Hipparcos distances to Coma Ber and the Pleiades are in disagreement with the MS-fitting distance at more than the 3 sigma level. There are two Hipparcos measurements of the distance to Praesepe; one is in good agreement with the MS-fitting distance and the other disagrees at the 2 sigma level. The distance estimates from the different colors are in conflict with one another for Coma but in agreement for the Pleiades. Changes in the relative cluster metal abundances, age-related effects, helium, and reddening are shown to be unlikely to explain the puzzling behavior of the Pleiades. We present evidence for spatially dependent systematic errors at the 1 mas level in the parallaxes of Pleiades stars. The implications of this result are discussed.
Visualization of High-Dimensional Point Clouds Using Their Density Distribution's Topology.
Oesterling, P; Heine, C; Janicke, H; Scheuermann, G; Heyer, G
2011-11-01
We present a novel method to visualize multidimensional point clouds. While conventional visualization techniques, like scatterplot matrices or parallel coordinates, have issues with either overplotting of entities or handling many dimensions, we abstract the data using topological methods before presenting it. We assume the input points to be samples of a random variable with a high-dimensional probability distribution which we approximate using kernel density estimates on a suitably reconstructed mesh. From the resulting scalar field we extract the join tree and present it as a topological landscape, a visualization metaphor that utilizes the human capability of understanding natural terrains. In this landscape, dense clusters of points show up as hills. The nesting of hills indicates the nesting of clusters. We augment the landscape with the data points to allow selection and inspection of single points and point sets. We also present optimizations to make our algorithm applicable to large data sets and to allow interactive adaption of our visualization to the kernel window width used in the density estimation.
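To make the density-estimation step concrete, here is a minimal one-dimensional Gaussian kernel density estimate. It is only a sketch: the paper works with high-dimensional points on a reconstructed mesh, whereas this example is 1-D, and the function name, sample values, and bandwidth are hypothetical.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function x -> estimated density using Gaussian kernels.

    `bandwidth` plays the role of the kernel window width: larger
    values smooth the estimate and merge nearby "hills".
    """
    if not samples or bandwidth <= 0:
        raise ValueError("need at least one sample and a positive bandwidth")
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


# Hypothetical data: a dense cluster near 0 and a lone point at 5.
density = gaussian_kde([0.0, 0.1, -0.1, 5.0], bandwidth=0.5)
```

In this toy data the estimate is highest near 0, lower at the isolated point, and lowest in the gap between them, which corresponds to a tall hill, a small hill, and a valley in the landscape metaphor.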
Krause, Josua; Dasgupta, Aritra; Fekete, Jean-Daniel; Bertini, Enrico
2016-10-23
Dealing with the curse of dimensionality is a key challenge in high-dimensional data visualization. We present SeekAView to address three main gaps in the existing research literature. First, automated methods like dimensionality reduction or clustering suffer from a lack of transparency in letting analysts interact with their outputs in real-time to suit their exploration strategies. The results often suffer from a lack of interpretability, especially for domain experts not trained in statistics and machine learning. Second, exploratory visualization techniques like scatter plots or parallel coordinates suffer from a lack of visual scalability: it is difficult to present a coherent overview of interesting combinations of dimensions. Third, the existing techniques do not provide a flexible workflow that allows for multiple perspectives into the analysis process by automatically detecting and suggesting potentially interesting subspaces. In SeekAView we address these issues using suggestion based visual exploration of interesting patterns for building and refining multidimensional subspaces. Compared to the state-of-the-art in subspace search and visualization methods, we achieve higher transparency in showing not only the results of the algorithms, but also interesting dimensions calibrated against different metrics. We integrate a visually scalable design space with an iterative workflow guiding the analysts by choosing the starting points and letting them slice and dice through the data to find interesting subspaces and detect correlations, clusters, and outliers. We present two usage scenarios for demonstrating how SeekAView can be applied in real-world data analysis scenarios.
2012-01-01
Background Externalising and internalising problems affect one in seven school-aged children and are the single strongest predictor of mental health problems into early adolescence. As the burden of mental health problems persists globally, childhood prevention of mental health problems is paramount. Prevention can be offered to all children (universal) or to children at risk of developing mental health problems (targeted). The relative effectiveness and costs of a targeted only versus combined universal and targeted approach are unknown. This study aims to determine the effectiveness, costs and uptake of two approaches to early childhood prevention of mental health problems, i.e., a Combined universal-targeted approach versus a Targeted only approach, in comparison to current primary care services (Usual care). Methods/design Three armed, population-level cluster randomised trial (2010–2014) within the universal, well child Maternal Child Health system, attended by more than 80% of families in Victoria, Australia at infant age eight months. Participants were families of eight month old children from nine participating local government areas. Randomised to one of three groups: Combined, Targeted or Usual care. The interventions comprise (a) the Combined universal and targeted program where all families are offered the universal Toddlers Without Tears group parenting program followed by the targeted Family Check-Up one-on-one program or (b) the Targeted Family Check-Up program. The Family Check-Up program is only offered to children at risk of behavioural problems. Participants will be analysed according to the trial arm to which they were randomised, using logistic and linear regression models to compare primary and secondary outcomes. An economic evaluation (cost consequences analysis) will compare incremental costs to all incremental outcomes from a societal perspective. Discussion This trial will inform public health policy by making recommendations about the
Williams, Kristine; Herman, Ruth; Bontempo, Daniel
2014-01-01
Purpose of the study Assisted living (AL) residents are at risk for cognitive and functional declines that eventually reduce their ability to care for themselves, thereby triggering nursing home placement. In developing a method to slow this decline, the efficacy of Reasoning Exercises in Assisted Living (REAL), a cognitive training intervention that teaches everyday reasoning and problem-solving skills to AL residents, was tested. Design and methods At thirteen randomized Midwestern facilities, AL residents whose Mini Mental State Examination scores ranged from 19–29 either were trained in REAL or a vitamin education attention control program or received no treatment at all. For 3 weeks, treated groups received personal training in their respective programs. Results Scores on the Every Day Problems Test for Cognitively Challenged Elders (EPCCE) and on the Direct Assessment of Functional Status (DAFS) showed significant increases only for the REAL group. For EPCCE, change from baseline immediately postintervention was +3.10 (P<0.01), and there was significant retention at the 3-month follow-up (d=2.71; P<0.01). For DAFS, change from baseline immediately postintervention was +3.52 (P<0.001), although retention was not as strong. Neither the attention nor the no-treatment control groups had significant gains immediately postintervention or at follow-up assessments. Post hoc across-group comparison of baseline change also highlights the benefits of REAL training. For EPCCE, the magnitude of gain was significantly larger in the REAL group versus the no-treatment control group immediately postintervention (d=3.82; P<0.01) and at the 3-month follow-up (d=3.80; P<0.01). For DAFS, gain magnitude immediately postintervention for REAL was significantly greater than in the attention control group (d=4.73; P<0.01). Implications REAL improves skills in everyday problem solving, which may allow AL residents to maintain self-care and extend AL residency. This benefit
A qualitative numerical study of high dimensional dynamical systems
NASA Astrophysics Data System (ADS)
Albers, David James
Since Poincaré, the father of modern mathematical dynamical systems, much effort has been exerted to achieve a qualitative understanding of the physical world via a qualitative understanding of the functions we use to model the physical world. In this thesis, we construct a numerical framework suitable for a qualitative, statistical study of dynamical systems using the space of artificial neural networks. We analyze the dynamics along intervals in parameter space, separating the set of neural networks into roughly four regions: the fixed point to the first bifurcation; the route to chaos; the chaotic region; and a transition region between chaos and finite-state neural networks. The study is primarily with respect to high-dimensional dynamical systems. We make the following general conclusions as the dimension of the dynamical system is increased: the probability of the first bifurcation being of type Neimark-Sacker is greater than ninety percent; the most probable route to chaos is via a cascade of bifurcations of high-period periodic orbits, quasi-periodic orbits, and 2-tori; there exists an interval of parameter space such that hyperbolicity is violated on a countable, Lebesgue measure 0, "increasingly dense" subset; chaos is much more likely to persist with respect to parameter perturbation in the chaotic region of parameter space as the dimension is increased; moreover, as the number of positive Lyapunov exponents is increased, the likelihood that any significant portion of these positive exponents can be perturbed away decreases with increasing dimension. The maximum Kaplan-Yorke dimension and the maximum number of positive Lyapunov exponents increase linearly with dimension. The probability of a dynamical system being chaotic increases exponentially with dimension. The results with respect to the first bifurcation and the route to chaos comment on previous results of Newhouse, Ruelle, Takens, Broer, Chenciner, and Iooss.
Moreover, results regarding the high-dimensional
NASA Astrophysics Data System (ADS)
Borrás-Almenar, J. J.; Clemente, J. M.; Coronado, E.; Tsukerblat, B. S.
1995-06-01
A general approach to the vibronic problem of delocalized electronic pairs in mixed-valence compounds is developed and applied to understand the ways of electron delocalization in dodecanuclear polyoxometalate clusters containing two moving electrons. The interplay between electronic and vibronic interactions is examined. The electronic spectrum is shown to consist of two spin triplets ³T₁ and ³T₂ and three spin singlets ¹A₁, ¹E and ¹T₂ levels determined by the double-transfer processes (parameter P). Jahn-Teller and pseudo-Jahn-Teller problems (³T₁ + ³T₂) ⊗ (e + t₂) and (¹A₁ + ¹E + ¹T₂) ⊗ (e + t₂) have been considered in the framework of the Piepho-Krausz-Schatz model dealing with the only vibronic parameter. Several kinds of spatial electronic distribution have been found corresponding to the stable points of the energy surfaces. For spin-triplet states, potential surfaces contain six minima in e space corresponding to partially delocalized electronic pairs over four sides of the T_d structure (limiting case of weak coupling), or delocalized over two opposite sides (limiting case of strong coupling). The former situation restricts electron delocalization to two of the three metal octahedra of each M₃O₁₂ triad in such a way that each electron moves over a tetrameric unit in which the metal sites are alternatively sharing edges and corners. In the t₂ space the electronic pair can be either delocalized over three sides, giving rise to a trigonal-type distortion of the cluster and a partial electron delocalization over two opposite M₃O₁₂ triads (four trigonal minima in the case of strong transfer or relatively weak vibronic interaction), or be completely localized (case of strong vibronic coupling). For spin-singlet states the system possesses a stable point in the high-symmetrical nuclear configuration, corresponding to a full delocalization of the electronic pairs in the Keggin cluster. The influence of vibronic interaction on the nature of the
Sparse meta-analysis with high-dimensional data.
He, Qianchuan; Zhang, Hao Helen; Avery, Christy L; Lin, D Y
2016-04-01
Meta-analysis plays an important role in summarizing and synthesizing scientific evidence derived from multiple studies. With high-dimensional data, the incorporation of variable selection into meta-analysis improves model interpretation and prediction. Existing variable selection methods require direct access to raw data, which may not be available in practical situations. We propose a new approach, sparse meta-analysis (SMA), in which variable selection for meta-analysis is based solely on summary statistics and the effect sizes of each covariate are allowed to vary among studies. We show that the SMA enjoys the oracle property if the estimated covariance matrix of the parameter estimators from each study is available. We also show that our approach achieves selection consistency and estimation consistency even when summary statistics include only the variance estimators or no variance/covariance information at all. Simulation studies and applications to high-throughput genomics studies demonstrate the usefulness of our approach. © The Author 2015. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
HASE: Framework for efficient high-dimensional association analyses
Roshchupkin, G. V.; Adams, H. H. H.; Vernooij, M. W.; Hofman, A.; Van Duijn, C. M.; Ikram, M. A.; Niessen, W. J.
2016-01-01
High-throughput technology can now provide rich information on a person’s biological makeup and environmental surroundings. Important discoveries have been made by relating these data to various health outcomes in fields such as genomics, proteomics, and medical imaging. However, cross-investigations between several high-throughput technologies remain impractical due to demanding computational requirements (hundreds of years of computing resources) and unsuitability for collaborative settings (terabytes of data to share). Here we introduce the HASE framework that overcomes both of these issues. Our approach dramatically reduces computational time from years to only hours and also requires several gigabytes to be exchanged between collaborators. We implemented a novel meta-analytical method that yields identical power as pooled analyses without the need of sharing individual participant data. The efficiency of the framework is illustrated by associating 9 million genetic variants with 1.5 million brain imaging voxels in three cohorts (total N = 4,034) followed by meta-analysis, on a standard computational infrastructure. These experiments indicate that HASE facilitates high-dimensional association studies enabling large multicenter association studies for future discoveries. PMID:27782180
Mapping the High-Dimensional ISM with Kinetic Tomography
NASA Astrophysics Data System (ADS)
Zasowski, Gail; Peek, Joshua Eli Goldston; Tchernyshyov, Kirill
2017-01-01
The interstellar medium (ISM) of a galaxy plays a critical role in its chemical evolution, via flows of enriched material into and out of star-forming molecular clouds, and even more expansive flows on kiloparsec scales through the disk and halo. The Milky Way is the only large galaxy in which we can resolve these motions at the level of individual molecular clouds, measure the kinematics of interstellar dust, and map the full three-dimensional velocity field of multiple ISM components; all of these are necessary to understand the evolution of spiral arms and molecular clouds, along with the redistribution of heavy elements throughout the Galaxy. I will present early results from a novel technique called "kinetic tomography", in which we combine stellar reddening (from Pan-STARRS), interstellar emission (CO and HI), and interstellar absorption (from APOGEE) data into high-dimensional datasets, and then extract distance-resolved kinematic information on multiple phases of the ISM. These methods are providing new views on the evolution of molecular clouds and chemical mixing in the ISM.
Supervised Bayesian latent class models for high-dimensional data
Desantis, Stacia M.; Houseman, E. Andrés; Coull, Brent A.; Nutt, Catherine L.; Betensky, Rebecca A.
2013-01-01
High-grade gliomas are the most common primary brain tumors in adults and are typically diagnosed using histopathology. However, these diagnostic categories are highly heterogeneous and do not always correlate well with survival. In an attempt to refine these diagnoses, we make several immunohistochemical measurements of YKL-40, a gene previously shown to be differentially expressed between diagnostic groups. We propose two latent class models for classification and variable selection in the presence of high-dimensional binary data, fit by using Bayesian Markov chain Monte Carlo techniques. Penalization and model selection are incorporated in this setting via prior distributions on the unknown parameters. The methods provide valid parameter estimates under conditions in which standard supervised latent class models do not, and outperform two-stage approaches to variable selection and parameter estimation in a variety of settings. We study the properties of these methods in simulations, and apply these methodologies to the glioma study for which identifiable three-class parameter estimates cannot be obtained without penalization. With penalization, the resulting latent classes correlate well with clinical tumor grade and offer additional information on survival prognosis that is not captured by clinical diagnosis alone. The inclusion of YKL-40 features also increases the precision of survival estimates. Fitting models with and without YKL-40 highlights a subgroup of patients who have glioblastoma (GBM) diagnosis but appear to have better prognosis than the typical GBM patient. PMID:22495652
The Sparse Laplacian Shrinkage Estimator for High-Dimensional Regression
Huang, Jian; Ma, Shuangge; Li, Hongzhe; Zhang, Cun-Hui
2011-01-01
We propose a new penalized method for variable selection and estimation that explicitly incorporates the correlation patterns among predictors. This method is based on a combination of the minimax concave penalty and Laplacian quadratic associated with a graph as the penalty function. We call it the sparse Laplacian shrinkage (SLS) method. The SLS uses the minimax concave penalty for encouraging sparsity and Laplacian quadratic penalty for promoting smoothness among coefficients associated with the correlated predictors. The SLS has a generalized grouping property with respect to the graph represented by the Laplacian quadratic. We show that the SLS possesses an oracle property in the sense that it is selection consistent and equal to the oracle Laplacian shrinkage estimator with high probability. This result holds in sparse, high-dimensional settings with p ≫ n under reasonable conditions. We derive a coordinate descent algorithm for computing the SLS estimates. Simulation studies are conducted to evaluate the performance of the SLS method and a real data example is used to illustrate its application. PMID:22102764
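As a rough illustration of how the two penalty terms combine, the sketch below evaluates a minimax concave penalty (MCP) plus a Laplacian quadratic for a given coefficient vector. It assumes an unsigned graph with non-negative edge weights and the standard textbook MCP form; the names `mcp`, `laplacian_quadratic`, `sls_penalty`, and the tuning parameters `lam1`, `lam2`, `gamma` are illustrative choices, not taken from the paper, and the paper's exact weighting and sign conventions may differ.

```python
def mcp(t, lam, gamma):
    """Minimax concave penalty for one coefficient (standard textbook form)."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return 0.5 * gamma * lam * lam  # flat beyond gamma * lam

def laplacian_quadratic(beta, adjacency):
    """Sum over j < k of a_jk * (beta_j - beta_k)^2, i.e. beta' L beta with L = D - A."""
    p = len(beta)
    total = 0.0
    for j in range(p):
        for k in range(j + 1, p):
            total += adjacency[j][k] * (beta[j] - beta[k]) ** 2
    return total

def sls_penalty(beta, adjacency, lam1, lam2, gamma):
    """MCP term (sparsity) plus half-weighted Laplacian quadratic (smoothness)."""
    sparsity = sum(mcp(b, lam1, gamma) for b in beta)
    smoothness = 0.5 * lam2 * laplacian_quadratic(beta, adjacency)
    return sparsity + smoothness

# Illustrative graph: predictors 0 and 1 are connected, predictor 2 is isolated.
A = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 0]]
equal = sls_penalty([1.0, 1.0, 0.0], A, lam1=1.0, lam2=2.0, gamma=3.0)
unequal = sls_penalty([2.0, 0.0, 0.0], A, lam1=1.0, lam2=2.0, gamma=3.0)
```

Connected predictors with equal coefficients incur no smoothness cost, so the penalty favours solutions in which correlated predictors share similar coefficients while the MCP term still pushes small coefficients to zero.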
Multigroup Equivalence Analysis for High-Dimensional Expression Data
Yang, Celeste; Bartolucci, Alfred A.; Cui, Xiangqin
2015-01-01
Hypothesis tests of equivalence are typically known for their application in bioequivalence studies and acceptance sampling. Their application to gene expression data, in particular high-dimensional gene expression data, has only recently been studied. In this paper, we examine how two multigroup equivalence tests, the F-test and the range test, perform when applied to microarray expression data. We adapted these tests to a well-known equivalence criterion, the difference ratio. Our simulation results showed that both tests can achieve moderate power while controlling the type I error at nominal level for typical expression microarray studies with the benefit of easy-to-interpret equivalence limits. For the range of parameters simulated in this paper, the F-test is more powerful than the range test. However, for comparing three groups, their powers are similar. Finally, the two multigroup tests were applied to a prostate cancer microarray dataset to identify genes whose expression follows a prespecified trajectory across five prostate cancer stages. PMID:26628859
The flare Package for High Dimensional Linear Regression and Precision Matrix Estimation in R
Li, Xingguo; Zhao, Tuo; Yuan, Xiaoming; Liu, Han
2016-01-01
This paper describes an R package named flare, which implements a family of new high dimensional regression methods (LAD Lasso, SQRT Lasso, ℓq Lasso, and Dantzig selector) and their extensions to sparse precision matrix estimation (TIGER and CLIME). These methods exploit different nonsmooth loss functions to gain modeling flexibility, estimation robustness, and tuning insensitiveness. The developed solver is based on the alternating direction method of multipliers (ADMM), which is further accelerated by the multistage screening approach. The package flare is coded in double precision C, and called from R by a user-friendly interface. The memory usage is optimized by using the sparse matrix output. The experiments show that flare is efficient and can scale up to large problems.
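For orientation, the regression estimators named above are commonly written as follows. These are the usual textbook forms, given here as an assumption; the normalizing constants used inside the package may differ. Here y is the response vector, X the n × p design matrix, and λ > 0 the regularization parameter.

```latex
\begin{align*}
\text{LAD Lasso:}\quad
  & \hat\beta = \arg\min_{\beta}\ \tfrac{1}{n}\lVert y - X\beta\rVert_1 + \lambda\lVert\beta\rVert_1 \\
\text{SQRT Lasso:}\quad
  & \hat\beta = \arg\min_{\beta}\ \tfrac{1}{\sqrt{n}}\lVert y - X\beta\rVert_2 + \lambda\lVert\beta\rVert_1 \\
\ell_q\ \text{Lasso:}\quad
  & \hat\beta = \arg\min_{\beta}\ \tfrac{1}{\sqrt[q]{n}}\lVert y - X\beta\rVert_q + \lambda\lVert\beta\rVert_1 \\
\text{Dantzig selector:}\quad
  & \hat\beta = \arg\min_{\beta}\ \lVert\beta\rVert_1
    \quad\text{subject to}\quad
    \bigl\lVert \tfrac{1}{n} X^{\top}(y - X\beta) \bigr\rVert_\infty \le \lambda
\end{align*}
```

The first three differ only in the norm applied to the residual, which is what makes the loss nonsmooth; the Dantzig selector instead constrains the correlation between the residual and the predictors.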
Feature Selection Based on High Dimensional Model Representation for Hyperspectral Images.
Taskin Kaya, Gulsen; Kaya, Huseyin; Bruzzone, Lorenzo
2017-03-24
In hyperspectral image analysis, the classification task has generally been addressed jointly with dimensionality reduction due to both the high correlation between the spectral features and the noise present in spectral bands, which might significantly degrade classification performance. In supervised classification, limited training instances in proportion to the number of spectral features have negative impacts on the classification accuracy, which is known as the Hughes effect or curse of dimensionality in the literature. In this paper, we focus on the dimensionality reduction problem and propose a novel feature-selection algorithm based on the method called High Dimensional Model Representation. The proposed algorithm is tested on some toy examples and hyperspectral datasets in comparison to conventional feature-selection algorithms in terms of classification accuracy, stability of the selected features and computational time. The results showed that the proposed approach provides both high classification accuracy and robust features with a satisfactory computational time.
A Hyperspherical Adaptive Sparse-Grid Method for High-Dimensional Discontinuity Detection
Zhang, Guannan; Webster, Clayton G.; Gunzburger, Max D.; Burkardt, John V.
2015-06-24
This study proposes and analyzes a hyperspherical adaptive hierarchical sparse-grid method for detecting jump discontinuities of functions in high-dimensional spaces. The method is motivated by the theoretical and computational inefficiencies of well-known adaptive sparse-grid methods for discontinuity detection. Our novel approach constructs a function representation of the discontinuity hypersurface of an N-dimensional discontinuous quantity of interest, by virtue of a hyperspherical transformation. Then, a sparse-grid approximation of the transformed function is built in the hyperspherical coordinate system, whose value at each point is estimated by solving a one-dimensional discontinuity detection problem. Due to the smoothness of the hypersurface, the new technique can identify jump discontinuities with significantly reduced computational cost, compared to existing methods. In addition, hierarchical acceleration techniques are also incorporated to further reduce the overall complexity. Rigorous complexity analyses of the new method are provided as are several numerical examples that illustrate the effectiveness of the approach.
Neuro-Adaptive Consensus Tracking of Multiagent Systems With a High-Dimensional Leader.
Wen, Guanghui; Yu, Wenwu; Li, Zhongkui; Yu, Xinghuo; Cao, Jinde
2016-05-05
This paper is concerned with the distributed consensus tracking problem of uncertain multiagent systems with directed communication topology and a single high-dimensional leader. Compared with existing related works, the dynamics of each follower in the present framework are subject to unmodeled dynamics and unknown external disturbances, which is more practical in various applications. Furthermore, the dimensions of the leader's dynamics may be different from those of the followers' dynamics. Under the mild assumption that each follower can directly or indirectly sense the output information of the leader, a distributed robust adaptive neural network controller together with a local observer is designed for each follower to ensure that the states of each follower ultimately synchronize to the leader's output with bounded residual errors under a fixed topology. By appropriately constructing some multiple Lyapunov functions, the derived results are further extended to consensus tracking with switching directed communication topologies. The effectiveness of the analytical results is demonstrated via numerical simulations.
Change point estimation in high dimensional Markov random-field models.
Roy, Sandipan; Atchadé, Yves; Michailidis, George
2017-09-01
This paper investigates a change-point estimation problem in the context of high-dimensional Markov random field models. Change-points represent a key feature in many dynamically evolving network structures. The change-point estimate is obtained by maximizing a profile penalized pseudo-likelihood function under a sparsity assumption. We also derive a tight bound for the estimate, up to a logarithmic factor, even in settings where the number of possible edges in the network far exceeds the sample size. The performance of the proposed estimator is evaluated on synthetic data sets and is also used to explore voting patterns in the US Senate in the 1979-2012 period.
Huang, Shuai; Li, Jing; Ye, Jieping; Fleisher, Adam; Chen, Kewei; Wu, Teresa; Reiman, Eric
2013-06-01
Structure learning of Bayesian Networks (BNs) is an important topic in machine learning. Driven by modern applications in genetics and brain sciences, accurate and efficient learning of large-scale BN structures from high-dimensional data becomes a challenging problem. To tackle this challenge, we propose a Sparse Bayesian Network (SBN) structure learning algorithm that employs a novel formulation involving one L1-norm penalty term to impose sparsity and another penalty term to ensure that the learned BN is a Directed Acyclic Graph--a required property of BNs. Through both theoretical analysis and extensive experiments on 11 moderate and large benchmark networks with various sample sizes, we show that SBN leads to improved learning accuracy, scalability, and efficiency as compared with 10 existing popular BN learning algorithms. We apply SBN to a real-world application of brain connectivity modeling for Alzheimer's disease (AD) and reveal findings that could lead to advancements in AD research.
Linear mapping functions for high-dimensional indexing in image databases
NASA Astrophysics Data System (ADS)
Sumanasekara, Santha; Ramakrishna, Medahalli V.
2000-10-01
The problem of indexing high-dimensional data has received renewed interest because of its necessity in emerging multimedia databases. The limitations of traditional tree-based indexing and the dimensionality curse are well known. We have proposed a hierarchical indexing structure based on linear mapping functions, which is not tree-based, and not necessarily balanced. At each level of the hierarchy, a linear mapping function is used to distribute the data among buckets. The feasibility and the performance of the indexing structure depend on finding appropriate mapping functions for any given data set. In this paper we present the approach taken in arriving at a few classes of mapping functions. We have given a heuristic algorithm to determine the choice of the most appropriate mapping function for a given data set. The results of experiments with real life data are presented and they indicate that the proposed indexing structure with linear mapping functions is indeed practical.
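The idea of distributing data among buckets with a linear mapping function at each level can be sketched as below. This is a hypothetical illustration, not the authors' structure: the bucket count, the capacity rule that triggers a further level, the weight vectors, and the names `bucket_of`, `build_index`, and `lookup` are all assumptions made for the example.

```python
def bucket_of(point, weights, lo, hi, n_buckets):
    """Map a point to a bucket by scaling the linear value w.x onto [0, n_buckets)."""
    value = sum(w * x for w, x in zip(weights, point))
    if hi <= lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_buckets)
    return min(max(idx, 0), n_buckets - 1)  # clamp out-of-range values

def build_index(points, weight_levels, n_buckets=4, capacity=2, level=0):
    """Distribute points among buckets; overfull buckets are split at the next level."""
    if len(points) <= capacity or level >= len(weight_levels):
        return list(points)  # leaf bucket
    weights = weight_levels[level]
    values = [sum(w * x for w, x in zip(weights, p)) for p in points]
    lo, hi = min(values), max(values)
    buckets = {}
    for p in points:
        buckets.setdefault(bucket_of(p, weights, lo, hi, n_buckets), []).append(p)
    return {
        "weights": weights, "lo": lo, "hi": hi, "n_buckets": n_buckets,
        "children": {b: build_index(ps, weight_levels, n_buckets, capacity, level + 1)
                     for b, ps in buckets.items()},
    }

def lookup(index, point):
    """Follow the mapping functions down to the leaf bucket that would hold the point."""
    node = index
    while isinstance(node, dict):
        b = bucket_of(point, node["weights"], node["lo"], node["hi"], node["n_buckets"])
        node = node["children"].get(b, [])
    return node

# Illustrative 2-D data and one mapping function per level.
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (9, 9)]
index = build_index(points, weight_levels=[(1, 1), (1, -1)])
```

Only buckets that overflow are split further, so the resulting hierarchy need not be balanced, which matches the abstract's description. How well the data spread across buckets depends entirely on the chosen weights.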
High dimensional linear regression models under long memory dependence and measurement error
NASA Astrophysics Data System (ADS)
Kaul, Abhishek
This dissertation consists of three chapters. The first chapter introduces the models under consideration and motivates problems of interest. A brief literature review is also provided in this chapter. The second chapter investigates the properties of Lasso under long range dependent model errors. Lasso is a computationally efficient approach to model selection and estimation, and its properties are well studied when the regression errors are independent and identically distributed. We study the case where the regression errors form a long memory moving average process. We establish a finite sample oracle inequality for the Lasso solution. We then show the asymptotic sign consistency in this setup. These results are established in the high dimensional setup (p > n) where p can be increasing exponentially with n. Finally, we show the consistency, n^(1/2-d)-consistency of Lasso, along with the oracle property of adaptive Lasso, in the case where p is fixed. Here d is the memory parameter of the stationary error sequence. The performance of Lasso is also analysed in the present setup with a simulation study. The third chapter proposes and investigates the properties of a penalized quantile based estimator for measurement error models. Standard formulations of prediction problems in high dimension regression models assume the availability of fully observed covariates and sub-Gaussian and homogeneous model errors. This makes these methods inapplicable to measurement error models where covariates are unobservable and observations are possibly non sub-Gaussian and heterogeneous. We propose weighted penalized corrected quantile estimators for the regression parameter vector in linear regression models with additive measurement errors, where unobservable covariates are nonrandom. The proposed estimators forgo the need for the above mentioned model assumptions. We study these estimators in both the fixed dimension and high dimensional sparse setups, in the latter setup, the
Orion Multi-Purpose Crew Vehicle Solving and Mitigating the Two Main Cluster Pendulum Problem
NASA Technical Reports Server (NTRS)
Ali, Yasmin; Sommer, Bruce; Troung, Tuan; Anderson, Brian; Madsen, Christopher
2017-01-01
The Orion Multi-Purpose Crew Vehicle (MPCV) spacecraft will return humans from beyond Earth's orbit, including Mars, and will be required to land 20,000 pounds of mass safely in the ocean. The parachute system nominally lands under 3 main parachutes, but the system is designed to be fault tolerant and land under 2 main parachutes. During several of the parachute development tests, it was observed that a pendulum, or swinging, motion could develop while the Crew Module (CM) was descending under two parachutes. This pendulum effect had not been previously predicted by modeling. Landing impact analysis showed that the landing loads would double in some places across the spacecraft. The CM structural design limits would be exceeded upon landing if this pendulum motion were to occur. The Orion descent and landing team was faced with potentially millions of dollars in structural modifications and a severe mass increase. A multidisciplinary team was formed to determine root cause, model the pendulum motion, study alternate canopy planforms and assess alternate operational vehicle controls and operations, providing mitigation options resulting in a reliability level deemed safe for human spaceflight. The problem and solution are a balance of risk to a known solution versus a chance to improve the landing performance for the next human-rated spacecraft.
Representing potential energy surfaces by high-dimensional neural network potentials.
Behler, J
2014-05-07
The development of interatomic potentials employing artificial neural networks has seen tremendous progress in recent years. While until recently the applicability of neural network potentials (NNPs) has been restricted to low-dimensional systems, this limitation has now been overcome and high-dimensional NNPs can be used in large-scale molecular dynamics simulations of thousands of atoms. NNPs are constructed by adjusting a set of parameters using data from electronic structure calculations, and in many cases energies and forces can be obtained with very high accuracy. Therefore, NNP-based simulation results are often very close to those gained by a direct application of first-principles methods. In this review, the basic methodology of high-dimensional NNPs will be presented with a special focus on the scope and the remaining limitations of this approach. The development of NNPs requires substantial computational effort as typically thousands of reference calculations are required. Still, if the problem to be studied involves very large systems or long simulation times this overhead is regained quickly. Further, the method is still limited to systems containing about three or four chemical elements due to the rapidly increasing complexity of the configuration space, although many atoms of each species can be present. Due to the ability of NNPs to describe even extremely complex atomic configurations with excellent accuracy irrespective of the nature of the atomic interactions, they represent a general and therefore widely applicable technique, e.g. for addressing problems in materials science, for investigating properties of interfaces, and for studying solvation processes.
Graph Based Models for Unsupervised High Dimensional Data Clustering and Network Analysis
2015-01-01
developed and a Lyapunov functional is proven to decrease as the algorithm proceeds. Furthermore, to reduce the computational cost for large datasets...version of Mumford-Shah model using the MBO scheme. Theoretical analysis is developed and a Lyapunov functional is proven to decrease as the algorithm...
Chen, Yi; Jakeman, John; Gittelson, Claude; Xiu, Dongbin
2015-01-08
In this paper we present a localized polynomial chaos expansion for partial differential equations (PDE) with random inputs. In particular, we focus on time independent linear stochastic problems with high dimensional random inputs, where the traditional polynomial chaos methods, and most of the existing methods, incur prohibitively high simulation cost. Furthermore, the local polynomial chaos method employs a domain decomposition technique to approximate the stochastic solution locally. In each subdomain, a subdomain problem is solved independently and, more importantly, in a much lower dimensional random space. In a postprocessing stage, accurate samples of the original stochastic problems are obtained from the samples of the local solutions by enforcing the correct stochastic structure of the random inputs and the coupling conditions at the interfaces of the subdomains. Overall, the method is able to solve stochastic PDEs in very large dimensions by solving a collection of low dimensional local problems and can be highly efficient. In our paper we present the general mathematical framework of the methodology and use numerical examples to demonstrate the properties of the method.
Pfeiffer, Klaus; Hautzinger, Martin; Patak, Margarete; Grünwald, Julia; Becker, Clemens; Albrecht, Diana
2017-03-06
Despite the positive evaluation of various caregiver interventions over the past 3 decades, only very few intervention protocols have been translated to delivery in service contexts. The purpose of this study is to train care counsellors of statutory long-term care insurances in problem-solving and to evaluate this approach as an additional component in statutory care counselling in Germany. The study is a pragmatic cluster randomized controlled trial in which 38 sites with 58 care counsellors are randomly assigned to provide either routine counselling plus additional problem-solving for caregivers or routine counselling alone. The counsellor training comprises an initial 2-day training, a follow-up day after 4 months, and biweekly supervision contacts with a psychotherapist for 6 months over the phone. The agreed minimum counselling intensity is one initial face-to-face contact including a caregiver assessment and at least one telephone follow-up contact. Caregivers who are positively screened for significant strain in their role are followed up at 3 and 6 months after baseline assessment. The main outcome is caregivers' depressive symptoms. It is unclear whether the expected very low amount of additional counselling time is sufficient to yield any additional effects on caregiver depression, and it is also unclear whether the additional problem-solving component yields synergies with routine counselling that is based on information and case management. There are different potential individual and organisational barriers to a consistent intervention delivery, such as gratification for participation, time for extra work, or internal motivation to participate. ( ISRCTN23635523 ).
Genuinely high-dimensional nonlocality optimized by complementary measurements
NASA Astrophysics Data System (ADS)
Lim, James; Ryu, Junghee; Yoo, Seokwon; Lee, Changhyoup; Bang, Jeongho; Lee, Jinhyoung
2010-10-01
Qubits exhibit extreme nonlocality when their state is maximally entangled and this is observed by mutually unbiased local measurements. This criterion does not hold for the Bell inequalities of high-dimensional systems (qudits), recently proposed by Collins-Gisin-Linden-Massar-Popescu and Son-Lee-Kim. Taking an alternative approach, called the quantum-to-classical approach, we derive a series of Bell inequalities for qudits that satisfy the criterion as for the qubits. In the derivation each d-dimensional subsystem is assumed to be measured by one of d possible measurements with d being a prime integer. By applying to two qubits (d=2), we find that a derived inequality is reduced to the Clauser-Horne-Shimony-Holt inequality when the degree of nonlocality is optimized over all the possible states and local observables. Further applying to two and three qutrits (d=3), we find Bell inequalities that are violated for the three-dimensionally entangled states but are not violated by any two-dimensionally entangled states. In other words, the inequalities discriminate three-dimensional (3D) entanglement from two-dimensional (2D) entanglement and in this sense they are genuinely 3D. In addition, for the two qutrits we give a quantitative description of the relations among the three degrees of complementarity, entanglement and nonlocality. It is shown that the degree of complementarity jumps abruptly to very close to its maximum as nonlocality starts appearing. These characteristics imply that complementarity plays a more significant role in the present inequality compared with the previously proposed inequality.
Packing hyperspheres in high-dimensional Euclidean spaces.
Skoge, Monica; Donev, Aleksandar; Stillinger, Frank H; Torquato, Salvatore
2006-10-01
We present a study of disordered jammed hard-sphere packings in four-, five-, and six-dimensional Euclidean spaces. Using a collision-driven packing generation algorithm, we obtain the first estimates for the packing fractions of the maximally random jammed (MRJ) states for space dimensions $d = 4$, 5, and 6 to be $\phi_{\mathrm{MRJ}} \approx 0.46$, 0.31, and 0.20, respectively. To a good approximation, the MRJ density obeys the scaling form $\phi_{\mathrm{MRJ}} = c_1/2^d + (c_2 d)/2^d$, where $c_1 = -2.72$ and $c_2 = 2.56$, which appears to be consistent with the high-dimensional asymptotic limit, albeit with different coefficients. Calculations of the pair correlation function $g_2(r)$ and structure factor $S(k)$ for these states show that short-range ordering appreciably decreases with increasing dimension, consistent with a recently proposed "decorrelation principle," which, among other things, states that unconstrained correlations diminish as the dimension increases and vanish entirely in the limit $d \to \infty$. As in three dimensions (where $\phi_{\mathrm{MRJ}} \approx 0.64$), the packings show no signs of crystallization, are isostatic, and have a power-law divergence in $g_2(r)$ at contact with power-law exponent $\approx 0.4$. Across dimensions, the cumulative number of neighbors equals the kissing number of the conjectured densest packing close to where $g_2(r)$ has its first minimum. Additionally, we obtain estimates for the freezing and melting packing fractions for the equilibrium hard-sphere fluid-solid transition, $\phi_F \approx 0.32$ and $\phi_M \approx 0.39$, respectively, for $d = 4$, and $\phi_F \approx 0.20$ and $\phi_M \approx 0.25$, respectively, for $d = 5$. Although our results indicate the stable phase at high density is a crystalline solid, nucleation appears to be strongly suppressed with increasing dimension.
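As a quick check of the scaling form quoted in this abstract, the following minimal Python sketch evaluates it, read as c1/2^d + c2*d/2^d, with the quoted coefficients and prints the result next to the packing fractions reported above. The function and variable names are illustrative and not taken from the paper.

```python
# Coefficients of the scaling form as quoted in the abstract.
C1 = -2.72
C2 = 2.56

# MRJ packing fractions reported in the abstract (d = 3 is the known 3D value it cites).
REPORTED = {3: 0.64, 4: 0.46, 5: 0.31, 6: 0.20}


def phi_mrj_scaling(d: int) -> float:
    """Evaluate c1 / 2^d + c2 * d / 2^d for space dimension d."""
    return (C1 + C2 * d) / 2 ** d


if __name__ == "__main__":
    for d, reported in REPORTED.items():
        fitted = phi_mrj_scaling(d)
        print(f"d={d}: scaling form {fitted:.3f}, reported {reported:.2f}")
```

Under this reading, the scaling form stays within about 0.02 of each reported value for d = 3 to 6.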
Unsupervised clustering of over-the-counter healthcare products into product categories.
Wallstrom, Garrick L; Hogan, William R
2007-12-01
A general problem in biosurveillance is finding appropriate aggregates of elemental data to monitor for the detection of disease outbreaks. We developed an unsupervised clustering algorithm for aggregating over-the-counter healthcare (OTC) products into categories. This algorithm employs MCMC over hundreds of parameters in a Bayesian model to place products into clusters. Despite the high dimensionality, it still performs fast on hundreds of time series. The procedure was able to uncover a clinically significant distinction between OTC products intended for the treatment of allergy and OTC products intended for the treatment of cough, cold, and influenza symptoms.
High dimensional spatial modeling of extremes with applications to United States Rainfalls
NASA Astrophysics Data System (ADS)
Zhou, Jie
2007-12-01
Spatial statistical models are used to predict unobserved variables based on observed variables and to estimate unknown model parameters. Extreme value theory (EVT) is used to study large or small observations from a random phenomenon. Both spatial statistics and extreme value theory have been studied in many areas such as agriculture, finance, industry and environmental science. This dissertation proposes two spatial statistical models which concentrate on non-Gaussian probability densities with general spatial covariance structures. The two models are also applied in analyzing United States rainfall and, especially, rainfall extremes. When the data set is not too large, the first model is used. The model constructs a generalized linear mixed model (GLMM) which can be considered as an extension of Diggle's model-based geostatistical approach (Diggle et al. 1998). The approach improves conventional kriging with a form of generalized linear mixed structure. As for high-dimensional problems, two different methods are established to improve the computational efficiency of Markov Chain Monte Carlo (MCMC) implementation. The first method is based on a spectral representation of spatial dependence structures which provides good approximations on each MCMC iteration. The other method embeds high-dimensional covariance matrices in matrices with block circulant structures. The eigenvalues and eigenvectors of block circulant matrices can be calculated exactly by Fast Fourier Transforms (FFT). The computational efficiency is gained by transforming the posterior matrices into lower dimensional matrices. This method gives an exact update on each MCMC iteration. Future predictions are also made by keeping spatial dependence structures fixed and using the relationship between present days and future days provided by some Global Climate Model (GCM). The predictions are refined by sampling techniques. Both ways of handling high dimensional covariance matrices are novel to analyze large
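The circulant-embedding idea mentioned in this abstract rests on a standard fact: the eigenvalues of a circulant matrix are the discrete Fourier transform of its first column, and the eigenvectors are the Fourier vectors. The sketch below checks this numerically for a small one-dimensional circulant matrix. It is only an illustration under simplifying assumptions: the dissertation works with block circulant matrices (which would call for a multi-dimensional transform), a naive O(n^2) DFT stands in for the FFT, and all function names are hypothetical.

```python
import cmath


def circulant(first_col):
    """Build the circulant matrix whose (i, j) entry is first_col[(i - j) mod n]."""
    n = len(first_col)
    return [[first_col[(i - j) % n] for j in range(n)] for i in range(n)]


def dft(x):
    """Naive discrete Fourier transform (an FFT would compute the same values faster)."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]


def fourier_vector(n, k):
    """k-th Fourier vector, the candidate eigenvector of any n x n circulant matrix."""
    return [cmath.exp(2j * cmath.pi * j * k / n) for j in range(n)]


def matvec(a, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(a_ij * v_j for a_ij, v_j in zip(row, v)) for row in a]


def max_residual(first_col):
    """Largest entry of |C v_k - lambda_k v_k| over all k; near zero if the claim holds."""
    n = len(first_col)
    c = circulant(first_col)
    eigenvalues = dft(first_col)
    worst = 0.0
    for k in range(n):
        v = fourier_vector(n, k)
        cv = matvec(c, v)
        worst = max(worst, max(abs(a - eigenvalues[k] * b) for a, b in zip(cv, v)))
    return worst


if __name__ == "__main__":
    # A symmetric first column, as a covariance-like example.
    print(max_residual([4.0, 1.0, 0.5, 1.0]))
```

Because the eigendecomposition is available in closed form, no generic eigen-solver is needed, which is where the computational saving comes from.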
NASA Astrophysics Data System (ADS)
Taşkin Kaya, Gülşen
2013-10-01
-output relationships in high-dimensional systems for many problems in science and engineering. The HDMR method was developed to improve the efficiency of deducing high-dimensional behaviors. The method is formed by a particular organization of low-dimensional component functions, in which each function is the contribution of one or more input variables to the output variables.
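For orientation, the "particular organization of low-dimensional component functions" is usually written as the following hierarchical expansion. This is the standard textbook form of HDMR, given here as an assumption about what the truncated abstract refers to; the symbols $f_0$, $f_i$, $f_{ij}$ are conventional and do not come from the record itself.

```latex
f(x_1,\dots,x_n) = f_0
  + \sum_{i=1}^{n} f_i(x_i)
  + \sum_{1 \le i < j \le n} f_{ij}(x_i, x_j)
  + \cdots
  + f_{12\cdots n}(x_1,\dots,x_n)
```

Here $f_0$ is a constant, each $f_i$ captures the contribution of a single input acting alone, and each $f_{ij}$ captures the cooperative contribution of a pair of inputs. In practice the expansion is typically truncated after the first- or second-order terms.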
A Probabilistic Embedding Clustering Method for Urban Structure Detection
NASA Astrophysics Data System (ADS)
Lin, X.; Li, H.; Zhang, Y.; Gao, L.; Zhao, L.; Deng, M.
2017-09-01
Urban structure detection is a basic task in urban geography. Clustering is a core technique for detecting patterns of urban spatial structure, urban functional regions, and so on. In the big data era, diverse urban sensing datasets that record information such as human behaviour and social activity suffer from high dimensionality and high noise. Unfortunately, state-of-the-art clustering methods do not handle high dimensionality and high noise concurrently. In this paper, a probabilistic embedding clustering method is proposed. First, we introduce a Probabilistic Embedding Model (PEM) that finds latent features in high-dimensional urban sensing data by "learning" via a probabilistic model. The latent features capture essential patterns hidden in the high-dimensional data, and the probabilistic model reduces the uncertainty caused by high noise. Second, by tuning its parameters, the model can discover two kinds of urban structure, homophily and structural equivalence, that is, communities with intensive interaction or communities that play the same role in the urban structure. We evaluated the model in experiments on real-world data from Shanghai (China), which showed that the method can discover both kinds of urban structure.
Fast and accurate probability density estimation in large high dimensional astronomical datasets
NASA Astrophysics Data System (ADS)
Gupta, Pramod; Connolly, Andrew J.; Gardner, Jeffrey P.
2015-01-01
Astronomical surveys will generate measurements of hundreds of attributes (e.g. color, size, shape) on hundreds of millions of sources. Analyzing these large, high-dimensional data sets will require efficient algorithms for data analysis. An example of this is probability density estimation, which is at the heart of many classification problems such as the separation of stars and quasars based on their colors. Popular density estimation techniques use binning or kernel density estimation. Kernel density estimation has a small memory footprint but often requires large computational resources. Binning has small computational requirements, but it is usually implemented with multi-dimensional arrays, which leads to memory requirements that scale exponentially with the number of dimensions. Hence neither technique scales well to large data sets in high dimensions. We present an alternative approach of binning implemented with hash tables (BASH tables). This approach uses the sparseness of data in the high-dimensional space to ensure that the memory requirements are small. However, hashing requires some extra computation, so a priori it is not clear if the reduction in memory requirements will lead to increased computational requirements. Through an implementation of BASH tables in C++ we show that the additional computational requirements of hashing are negligible. Hence this approach has small memory and computational requirements. We apply our density estimation technique to photometric selection of quasars using non-parametric Bayesian classification and show that the accuracy of the classification is the same as the accuracy of earlier approaches. Since the BASH table approach is one to three orders of magnitude faster than the earlier approaches, it may be useful in various other applications of density estimation in astrostatistics.
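The core idea of binning with a hash table can be sketched in a few lines of Python: only occupied bins are stored, keyed by their integer bin coordinates, so memory grows with the number of occupied bins and not with the full grid. This is a minimal illustration and not the authors' C++ implementation; the equal-width cubic bins and all function names are assumptions made for the example.

```python
from collections import Counter
from math import floor


def bin_index(point, bin_width):
    """Map a point to the integer coordinates of the bin that contains it."""
    return tuple(floor(x / bin_width) for x in point)


def build_histogram(points, bin_width):
    """Count points per occupied bin; empty bins are never stored."""
    return Counter(bin_index(p, bin_width) for p in points)


def density(hist, point, bin_width, n_points):
    """Histogram density estimate: count / (N * bin volume)."""
    dim = len(point)
    count = hist.get(bin_index(point, bin_width), 0)
    return count / (n_points * bin_width ** dim)


if __name__ == "__main__":
    pts = [(0.10, 0.20, 0.30), (0.15, 0.25, 0.35), (5.0, 5.0, 5.0)]
    hist = build_histogram(pts, bin_width=0.5)
    print(len(hist), "occupied bins")
    print(density(hist, (0.2, 0.2, 0.2), 0.5, len(pts)))
```

A dense array covering the same range would need one cell per grid position in every dimension, whereas the dictionary here holds one entry per occupied bin.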
Song, Yang; Cai, Weidong; Huang, Heng; Feng, Dagan; Wang, Yue; Chen, Mei
2016-11-16
Bioimage classification is a fundamental problem for many important biological studies that require accurate cell phenotype recognition, subcellular localization, and histopathological classification. In this paper, we present a new bioimage classification method that can be generally applicable to a wide variety of classification problems. We propose to use a high-dimensional multi-modal descriptor that combines multiple texture features. We also design a novel subcategory discriminant transform (SDT) algorithm to further enhance the discriminative power of descriptors by learning convolution kernels to reduce the within-class variation and increase the between-class difference. We evaluate our method on eight different bioimage classification tasks using the publicly available IICBU 2008 database. Each task comprises a separate dataset, and the collection represents typical subcellular, cellular, and tissue level classification problems. Our method demonstrates improved classification accuracy (0.9 to 9%) on six tasks when compared to state-of-the-art approaches. We also find that SDT outperforms the well-known dimension reduction techniques, with for example 0.2 to 13% improvement over linear discriminant analysis. We present a general bioimage classification method, which comprises a highly descriptive visual feature representation and a learning-based discriminative feature transformation algorithm. Our evaluation on the IICBU 2008 database demonstrates improved performance over the state-of-the-art for six different classification tasks.
Two representations of a high-dimensional perceptual space.
Victor, Jonathan D; Rizvi, Syed M; Conte, Mary M
2017-08-01
A perceptual space is a mental workspace of points in a sensory domain that supports similarity and difference judgments and enables further processing such as classification and naming. Perceptual spaces are present across sensory modalities; examples include colors, faces, auditory textures, and odors. Color is perhaps the best-studied perceptual space, but it is atypical in two respects. First, the dimensions of color space are directly linked to the three cone absorption spectra, but the dimensions of generic perceptual spaces are not as readily traceable to single-neuron properties. Second, generic perceptual spaces have more than three dimensions. This is important because representing each distinguishable point in a high-dimensional space by a separate neuron or population is unwieldy; combinatorial strategies may be needed to overcome this hurdle. To study the representation of a complex perceptual space, we focused on a well-characterized 10-dimensional domain of visual textures. Within this domain, we determine perceptual distances in a threshold task (segmentation) and a suprathreshold task (border salience comparison). In N=4 human observers, we find both quantitative and qualitative differences between these sets of measurements. Quantitatively, observers' segmentation thresholds were inconsistent with their uncertainty determined from border salience comparisons. Qualitatively, segmentation thresholds suggested that distances are determined by a coordinate representation with Euclidean geometry. Border salience comparisons, in contrast, indicated a global curvature of the space, and that distances are determined by activity patterns across broadly tuned elements. Thus, our results indicate two representations of this perceptual space, and suggest that they use differing combinatorial strategies. To move from sensory signals to decisions and actions, the brain carries out a sequence of transformations. An important stage in this process is the
Unfold High-Dimensional Clouds for Exhaustive Gating of Flow Cytometry Data.
Qiu, Peng
2014-01-01
Flow cytometry is able to measure the expressions of multiple proteins simultaneously at the single-cell level. A flow cytometry experiment on one biological sample provides measurements of several protein markers on or inside a large number of individual cells in that sample. Analysis of such data often aims to identify subpopulations of cells with distinct phenotypes. Currently, the most widely used analytical approach in the flow cytometry community is manual gating on a sequence of nested biaxial plots, which is highly subjective, labor intensive, and not exhaustive. To address those issues, a number of methods have been developed to automate the gating analysis by clustering algorithms. However, completely removing the subjectivity can be quite challenging. This paper describes an alternative approach. Instead of automating the analysis, we develop novel visualizations to facilitate manual gating. The proposed method views single-cell data of one biological sample as a high-dimensional point cloud of cells, derives the skeleton of the cloud, and unfolds the skeleton to generate 2D visualizations. We demonstrate the utility of the proposed visualization using real data, and provide quantitative comparison to visualizations generated from principal component analysis and multidimensional scaling.
NASA Astrophysics Data System (ADS)
Campbell, S. W.; Lattanzio, J. C.; Elliott, L. M.
On reading an old paper about galactic globular cluster abundance observations (of NGC 6752) we came across an intriguing result. \\citet{NCF81} found that there was a distinct lack of cyanogen-strong (CN-strong) stars in their sample of AGB stars, as compared to their sample of RGB stars (which had roughly equal numbers of CN-normal and CN-strong stars). Further reading revealed that similar features have been discovered in the AGB populations of other clusters. Recently, \\citet{SIK00} followed up on this possibility (and considered other proton-capture products) by compiling the existing data at the time and came to a similar conclusion for two more clusters. Unfortunately all of these studies suffer from low AGB star counts so the conclusions are not necessarily robust -- larger, statistically significant, sample sizes are needed. In this conference paper, presented at the Eighth Torino Workshop on Nucleosynthesis in AGB Stars (Universidad de Granada, Spain, 2006), we outline the results of a literature search for relevant CN observations and describe our observing proposal to test the suggestion that there are substantial abundance differences between the AGB and RGB in galactic globular clusters. The literature search revealed that the AGB star counts for all studies (which are not, in general, studies about AGB stars in particular) are low, usually being < 10. The search also revealed that the picture may not be consistent between clusters. Although most clusters appear to have CN-weak AGBs, at least two seem to have CN-strong AGBs (M5 & 47 Tuc). To further complicate the picture, clusters often appear to have a combination of both CN-strong and CN-weak stars on their AGBs - although one population tends to dominate. Again, all these assertions are however based on small sample sizes. We aim to increase the sample sizes by an order of magnitude using existing high quality photometry in which the AGB and RGB can be reliably separated. For the observations we
Gude, Tore; Hoffart, Asle
2008-04-01
The aim was to study whether patients with panic disorder with agoraphobia and co-occurring Cluster C traits would respond differently regarding change in interpersonal problems as part of their personality functioning when receiving two different treatment modalities. Two cohorts of patients were followed through three-month in-patient treatment programs and assessed at follow-up one year after the end of treatment. One cohort comprised 18 patients treated with "treatment as usual" according to psychodynamic principles; the other comprised 24 patients treated in a cognitive agoraphobia and schema-focused therapy program. Patients in the cognitive condition showed greater improvement in interpersonal problems than patients in the treatment as usual condition. Although this quasi-experimental study has serious limitations, the results may indicate that agoraphobic patients with Cluster C traits should be treated in cognitive agoraphobia and schema-focused programs rather than in psychodynamic treatment as usual programs in order to reduce their level of interpersonal problems.
Ge, Yongchao; Sealfon, Stuart C
2012-08-01
For flow cytometry data, there are two common approaches to the unsupervised clustering problem: one is based on the finite mixture model and the other on spatial exploration of the histograms. The former is computationally slow and has difficulty identifying clusters of irregular shapes. The latter approach cannot be applied directly to high-dimensional data, as the computational time and memory become unmanageable and the estimated histogram is unreliable. An algorithm without these two problems would be very useful. In this article, we combine ideas from the finite mixture model and histogram spatial exploration. This new algorithm, which we call flowPeaks, can be applied directly to high-dimensional data and can identify irregularly shaped clusters. The algorithm first uses the K-means algorithm with a large K to partition the cell population into many small clusters. These partitioned data allow the generation of a smoothed density function using the finite mixture model. All local peaks are exhaustively searched by exploring the density function, and the cells are clustered by the associated local peak. The algorithm flowPeaks is automatic, fast, reliable, and robust to cluster shape and outliers. This algorithm has been applied to flow cytometry data and has been compared with state-of-the-art algorithms, including Misty Mountain, FLOCK, flowMeans, flowMerge and FLAME. The R package flowPeaks is available at https://github.com/yongchao/flowPeaks. Contact: yongchao.ge@mssm.edu. Supplementary data are available at Bioinformatics online.
Flamm, Christoph; Graef, Andreas; Pirker, Susanne; Baumgartner, Christoph; Deistler, Manfred
2013-01-01
Granger causality is a useful concept for studying causal relations in networks. However, numerical problems occur when applying the corresponding methodology to high-dimensional time series showing co-movement, e.g. EEG recordings or economic data. In order to deal with these shortcomings, we propose a novel method for the causal analysis of such multivariate time series based on Granger causality and factor models. We present the theoretical background, successfully assess our methodology with the help of simulated data and show a potential application in EEG analysis of epileptic seizures. PMID:23354014
Kane, J C; Luitel, N P; Jordans, M J D; Kohrt, B A; Weissbecker, I; Tol, W A
2017-01-09
Two large earthquakes in 2015 caused widespread destruction in Nepal. This study aimed to examine frequency of common mental health and psychosocial problems and their correlates following the earthquakes. A stratified multi-stage cluster sampling design was employed to randomly select 513 participants (aged 16 and above) from three earthquake-affected districts in Nepal: Kathmandu, Gorkha and Sindhupalchowk, 4 months after the second earthquake. Outcomes were selected based on qualitative preparatory research and included symptoms of depression and anxiety (Hopkins Symptom Checklist-25); post-traumatic stress disorder (PTSD Checklist-Civilian); hazardous alcohol use (AUDIT-C); symptoms indicating severe psychological distress (WHO-UNHCR Assessment Schedule of Serious Symptoms in Humanitarian Settings (WASSS)); suicidal ideation (Composite International Diagnostic Interview); perceived needs (Humanitarian Emergency Settings Perceived Needs Scale (HESPER)); and functional impairment (locally developed scale). A substantial percentage of participants scored above validated cut-off scores for depression (34.3%, 95% CI 28.4-40.4) and anxiety (33.8%, 95% CI 27.6-40.6). Hazardous alcohol use was reported by 20.4% (95% CI 17.1-24.3) and 10.9% (95% CI 8.8-13.5) reported suicidal ideation. Forty-two percent reported that 'distress' was a serious problem in their community. Anger that was out of control (symptom from the WASSS) was reported by 33.7% (95% CI 29.5-38.2). Fewer people had elevated rates of PTSD symptoms above a validated cut-off score (5.2%, 95% CI 3.9-6.8), and levels of functional impairment were also relatively low. Correlates of elevated symptom scores were female gender, lower caste and greater number of perceived needs. Residing in Gorkha and Sindhupalchowk districts and lower caste were also associated with greater perceived needs. Higher levels of impaired functioning were associated with greater odds of depression and anxiety symptoms; impaired
Adaptive Virtual Support Vector Machine for the Reliability Analysis of High-Dimensional Problems
2011-08-01
eds.), Academic Press, 1978. [32] Byrd, R.H., Gilbert, J.C., and Nocedal, J., "A Trust Region Method Based on Interior Point Techniques for...Nonlinear Programming," Mathematical Programming, Vol.89, No.1, pp.149-185, 2000. [33] Waltz, R.A., Morales, J.L., Nocedal, J., and Orban, D., "An
Teschke, Rolf; Schwarzenboeck, Alexander; Frenzel, Christian; Schulze, Johannes; Eickhoff, Axel; Wolff, Albrecht
2016-01-01
In the fall of 2013, the US Centers for Disease Control and Prevention (CDC) published a preliminary report on a cluster of liver disease cases that emerged in Hawaii in the summer of 2013. This report claimed a temporal association as sufficient evidence that OxyELITE Pro (OEP), a dietary supplement (DS) mainly for weight loss, was the cause of this mysterious cluster. However, the presented data were inconsistent and required a thorough reanalysis. To further investigate the cause(s) of this cluster, we critically evaluated redacted raw clinical data of the cluster patients, as the CDC report received tremendous publicity in local and nationwide newspapers and television. This attention put regulators and physicians from the medical center in Honolulu that reported the cluster under enormous pressure to succeed, risking biased evaluations and hasty conclusions. We noted pervasive bias in the documentation, conclusions, and public statements, as well as poor quality of case management. Among the cases we reviewed, many causes unrelated to any DS were evident, including decompensated liver cirrhosis, acute liver failure by acetaminophen overdose, acute cholecystitis with gallstones, resolving acute hepatitis B, acute HSV and VZV hepatitis, hepatitis E suspected after consumption of wild hog meat, and hepatotoxicity by acetaminophen or ibuprofen. Causality assessments based on the updated CIOMS scale confirmed the lack of evidence for any DS including OEP as culprit for the cluster. Thus, the Hawaii liver disease cluster is now best explained by various liver diseases rather than any DS, including OEP.
Arif, Muhammad; Basalamah, Saleh
2013-06-01
In real-life biomedical classification applications, the feature space is difficult to visualize because of its high dimensionality. In this paper, we propose a 3D similarity-dissimilarity plot that projects the high-dimensional space onto a three-dimensional space from which important information about the feature space can be extracted in the context of pattern classification. The plot makes it possible to visualize good data points (points nearer to their own class than to other classes), bad data points (points far away from their own class), and outlier points (points away from both their own class and other classes). Hence the separation of classes can easily be visualized. The density of data points near each other can provide useful information about the compactness of the clusters within a class. Moreover, we propose an index, the percentage of data points above the similarity-dissimilarity line (PAS), defined as the fraction of data points that lie above that line. Several synthetic and real-life biomedical datasets are used to show the effectiveness of the proposed 3D similarity-dissimilarity plot.
NASA Astrophysics Data System (ADS)
Friedenberg, David
2010-10-01
the rate of falsely detected active regions. Additionally we examine the more general field of clustering and develop a framework for clustering algorithms based around diffusion maps. Diffusion maps can be used to project high-dimensional data into a lower dimensional space while preserving much of the structure in the data. We demonstrate how diffusion maps can be used to solve clustering problems and examine the influence of tuning parameters on the results. We introduce two novel methods, the self-tuning diffusion map which replaces the global scaling parameter in the typical diffusion map framework with a local scaling parameter and an algorithm for automatically selecting tuning parameters based on a cross-validation style score called prediction strength. The methods are tested on several example datasets.
Finite-key analysis of a practical decoy-state high-dimensional quantum key distribution
NASA Astrophysics Data System (ADS)
Bao, Haize; Bao, Wansu; Wang, Yang; Zhou, Chun; Chen, Ruike
2016-05-01
Compared with two-level quantum key distribution (QKD), high-dimensional QKD enables two distant parties to share a secret key at a higher rate. We provide a finite-key security analysis for the recently proposed practical high-dimensional decoy-state QKD protocol based on time-energy entanglement. We employ two methods to estimate the statistical fluctuation of the postselection probability and give a tighter bound on the secure-key capacity. By numerical evaluation, we show the finite-key effect on the secure-key capacity in different conditions. Moreover, our approach could be used to optimize parameters in practical implementations of high-dimensional QKD.
The problem of the structure (state of helium) in small He$_N$-CO clusters
Potapov, A. V.; Panfilov, V. A.; Surin, L. A.; Dumesh, B. S.
2010-11-15
A second-order perturbation theory, developed for calculating the energy levels of the He-CO binary complex, is applied to small He$_N$-CO clusters with $N = 2$-$4$, the helium atoms being considered as a single bound object. The interaction potential between the CO molecule and He$_N$ is represented as a linear expansion in Legendre polynomials, in which the free rotation limit is chosen as the zero approximation and the angular dependence of the interaction is considered as a small perturbation. By fitting calculated rotational transitions to experimental values it was possible to determine the optimal parameters of the potential and to achieve good agreement (to within less than 1%) between calculated and experimental energy levels. As a result, the shape of the angular anisotropy of the interaction potential is obtained for various clusters. It turns out that the minimum of the potential energy is smoothly shifted from an angle between the axes of the CO molecule and the cluster of $\theta = 100^\circ$ in He-CO to $\theta = 180^\circ$ (the oxygen end) in He$_3$-CO and He$_4$-CO clusters. Under the assumption that the distribution of helium atoms with respect to the cluster axis is cylindrically symmetric, the structure of the cluster can be represented as a pyramid with the CO molecule at the vertex.
Defining and Evaluating Classification Algorithm for High-Dimensional Data Based on Latent Topics
Luo, Le; Li, Li
2014-01-01
Automatic text categorization is one of the key techniques in information retrieval and the data mining field. The classification is usually time-consuming when the training dataset is large and high-dimensional. Many methods have been proposed to solve this problem, but few can achieve satisfactory efficiency. In this paper, we present a method which combines the Latent Dirichlet Allocation (LDA) algorithm and the Support Vector Machine (SVM). LDA is first used to generate reduced dimensional representation of topics as feature in VSM. It is able to reduce features dramatically but keeps the necessary semantic information. The Support Vector Machine (SVM) is then employed to classify the data based on the generated features. We evaluate the algorithm on 20 Newsgroups and Reuters-21578 datasets, respectively. The experimental results show that the classification based on our proposed LDA+SVM model achieves high performance in terms of precision, recall and F1 measure. Further, it can achieve this within a much shorter time-frame. Our process improves greatly upon the previous work in this field and displays strong potential to achieve a streamlined classification process for a wide range of applications. PMID:24416136
A decision-theory approach to interpretable set analysis for high-dimensional data.
Boca, Simina M; Bravo, Héctor Corrada; Caffo, Brian; Leek, Jeffrey T; Parmigiani, Giovanni
2013-09-01
A key problem in high-dimensional significance analysis is to find pre-defined sets that show enrichment for a statistical signal of interest; the classic example is the enrichment of gene sets for differentially expressed genes. Here, we propose a new decision-theory approach to the analysis of gene sets which focuses on estimating the fraction of non-null variables in a set. We introduce the idea of "atoms," non-overlapping sets based on the original pre-defined set annotations. Our approach focuses on finding the union of atoms that minimizes a weighted average of the number of false discoveries and missed discoveries. We introduce a new false discovery rate for sets, called the atomic false discovery rate (afdr), and prove that the optimal estimator in our decision-theory framework is to threshold the afdr. These results provide a coherent and interpretable framework for the analysis of sets that addresses the key issues of overlapping annotations and difficulty in interpreting p values in both competitive and self-contained tests. We illustrate our method and compare it to a popular existing method using simulated examples, as well as gene-set and brain ROI data analyses. © 2013, The International Biometric Society.
Modeling of stochastic dynamics of time-dependent flows under high-dimensional random forcing
NASA Astrophysics Data System (ADS)
Babaee, Hessam; Karniadakis, George
2016-11-01
In this numerical study the effect of high-dimensional stochastic forcing in time-dependent flows is investigated. To efficiently quantify the evolution of stochasticity in such a system, the dynamically orthogonal method is used. In this methodology, the solution is approximated by a generalized Karhunen-Loeve (KL) expansion in the form of $u(x,t;\omega) = \bar{u}(x,t) + \sum_{i=1}^{N} y_i(t;\omega)\,u_i(x,t)$, in which $\bar{u}(x,t)$ is the stochastic mean, the set of $u_i(x,t)$ is a deterministic orthogonal basis and the $y_i(t;\omega)$ are the stochastic coefficients. Explicit evolution equations for $\bar{u}$, $u_i$ and $y_i$ are formulated. The elements of the basis $u_i(x,t)$ remain orthogonal for all times and they evolve according to the system dynamics to capture the energetically dominant stochastic subspace. We consider two classical fluid dynamics problems: (1) flow over a cylinder, and (2) flow over an airfoil under up to one-hundred dimensional random forcing. We explore the interaction of intrinsic with extrinsic stochasticity in these flows. DARPA N66001-15-2-4055, Office of Naval Research N00014-14-1-0166.
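To make the structure of the truncated KL expansion concrete, the following is a toy sketch that evaluates one realization as mean plus a sum of coefficient-weighted modes. The mean field, the sine modes, and the Gaussian coefficients are all hypothetical choices; the sketch does not implement the dynamically orthogonal evolution equations from the study.

```python
import math
import random

def kl_sample(x, t, mean, modes, coeffs):
    """One realization: mean(x, t) + sum_i y_i * u_i(x, t)."""
    return mean(x, t) + sum(y * u(x, t) for y, u in zip(coeffs, modes))

# Hypothetical mean field and two orthonormal sine modes on [0, 1].
mean = lambda x, t: math.exp(-t) * x
modes = [
    (lambda x, t, k=k: math.sqrt(2.0) * math.sin(k * math.pi * x))
    for k in (1, 2)
]

# Zero-mean stochastic coefficients y_i, one list per realization (assumed Gaussian).
rng = random.Random(0)
realizations = [
    [rng.gauss(0.0, 0.5 / (i + 1)) for i in range(len(modes))]
    for _ in range(2000)
]

x0, t0 = 0.3, 1.0
samples = [kl_sample(x0, t0, mean, modes, y) for y in realizations]
ensemble_mean = sum(samples) / len(samples)  # should sit close to mean(x0, t0)
```

Because the coefficients have zero mean, the ensemble average of the samples approaches the mean field, which is why the first term of the expansion is interpreted as the stochastic mean.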
A Hyperspherical Adaptive Sparse-Grid Method for High-Dimensional Discontinuity Detection
Zhang, Guannan; Webster, Clayton G.; Gunzburger, Max D.; ...
2015-06-24
This study proposes and analyzes a hyperspherical adaptive hierarchical sparse-grid method for detecting jump discontinuities of functions in high-dimensional spaces. The method is motivated by the theoretical and computational inefficiencies of well-known adaptive sparse-grid methods for discontinuity detection. Our novel approach constructs a function representation of the discontinuity hypersurface of an N-dimensional discontinuous quantity of interest, by virtue of a hyperspherical transformation. Then, a sparse-grid approximation of the transformed function is built in the hyperspherical coordinate system, whose value at each point is estimated by solving a one-dimensional discontinuity detection problem. Due to the smoothness of the hypersurface, the new technique can identify jump discontinuities with significantly reduced computational cost, compared to existing methods. In addition, hierarchical acceleration techniques are also incorporated to further reduce the overall complexity. Rigorous complexity analyses of the new method are provided as are several numerical examples that illustrate the effectiveness of the approach.
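The one-dimensional subproblem mentioned in this abstract can be pictured as locating the jump along a single ray from a chosen origin. The sketch below uses plain bisection for that step, which is an assumption made for illustration (the abstract does not name the 1-D solver), and the test function is a hypothetical indicator of the unit ball. The sparse-grid interpolation over directions is not shown.

```python
import math

def jump_radius(f, center, direction, r_max, tol=1e-8):
    """Radius r in [0, r_max] where f jumps along center + r * direction.

    Assumes f(center) differs from f at r_max and that there is exactly one
    jump in between; bisection then brackets it to within tol.
    """
    inside_value = f(center)
    lo, hi = 0.0, r_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        point = [c + mid * d for c, d in zip(center, direction)]
        if f(point) == inside_value:
            lo = mid   # still on the origin's side of the jump
        else:
            hi = mid   # already past the jump
    return 0.5 * (lo + hi)

# Hypothetical discontinuous function: indicator of the unit ball in 5 dimensions.
def indicator(p):
    return 1.0 if sum(v * v for v in p) <= 1.0 else 0.0

n = 5
center = [0.0] * n
direction = [1.0 / math.sqrt(n)] * n   # unit vector along the diagonal
radius = jump_radius(indicator, center, direction, r_max=2.0)
```

Repeating this for many directions gives samples of the radius as a function of direction, which is the smooth hypersurface representation that the method approximates.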
Huang, Shuai; Li, Jing; Ye, Jieping; Fleisher, Adam; Chen, Kewei; Wu, Teresa; Reiman, Eric
2014-01-01
Structure learning of Bayesian Networks (BNs) is an important topic in machine learning. Driven by modern applications in genetics and brain sciences, accurate and efficient learning of large-scale BN structures from high-dimensional data becomes a challenging problem. To tackle this challenge, we propose a Sparse Bayesian Network (SBN) structure learning algorithm that employs a novel formulation involving one L1-norm penalty term to impose sparsity and another penalty term to ensure that the learned BN is a Directed Acyclic Graph (DAG)—a required property of BNs. Through both theoretical analysis and extensive experiments on 11 moderate and large benchmark networks with various sample sizes, we show that SBN leads to improved learning accuracy, scalability, and efficiency as compared with 10 existing popular BN learning algorithms. We apply SBN to a real-world application of brain connectivity modeling for Alzheimer’s disease (AD) and reveal findings that could lead to advancements in AD research. PMID:22665720
Wang, Ying; Fan, Yong; Bhatt, Priyanka; Davatzikos, Christos
2010-01-01
This paper presents a general methodology for high-dimensional pattern regression on medical images via machine learning techniques. Compared with pattern classification studies, pattern regression considers the problem of estimating continuous rather than categorical variables, and can be more challenging. It is also clinically important, since it can be used to estimate disease stage and predict clinical progression from images. In this work, adaptive regional feature extraction approach is used along with other common feature extraction methods, and feature selection technique is adopted to produce a small number of discriminative features for optimal regression performance. Then the Relevance Vector Machine (RVM) is used to build regression models based on selected features. To get stable regression models from limited training samples, a bagging framework is adopted to build ensemble basis regressors derived from multiple bootstrap training samples, and thus to alleviate the effects of outliers as well as facilitate the optimal model parameter selection. Finally, this regression scheme is tested on simulated data and real data via cross-validation. Experimental results demonstrate that this regression scheme achieves higher estimation accuracy and better generalizing ability than Support Vector Regression (SVR). PMID:20056158
Individual-based models for adaptive diversification in high-dimensional phenotype spaces.
Ispolatov, Iaroslav; Madhok, Vaibhav; Doebeli, Michael
2016-02-07
Most theories of evolutionary diversification are based on equilibrium assumptions: they are either based on optimality arguments involving static fitness landscapes, or they assume that populations first evolve to an equilibrium state before diversification occurs, as exemplified by the concept of evolutionary branching points in adaptive dynamics theory. Recent results indicate that adaptive dynamics may often not converge to equilibrium points and instead generate complicated trajectories if evolution takes place in high-dimensional phenotype spaces. Even though some analytical results on diversification in complex phenotype spaces are available, to study this problem in general we need to reconstruct individual-based models from the adaptive dynamics generating the non-equilibrium dynamics. Here we first provide a method to construct individual-based models such that they faithfully reproduce the given adaptive dynamics attractor without diversification. We then show that a propensity to diversify can be introduced by adding Gaussian competition terms that generate frequency dependence while still preserving the same adaptive dynamics. For sufficiently strong competition, the disruptive selection generated by frequency-dependence overcomes the directional evolution along the selection gradient and leads to diversification in phenotypic directions that are orthogonal to the selection gradient. Copyright © 2015 Elsevier Ltd. All rights reserved.
Wang, Ying; Fan, Yong; Bhatt, Priyanka; Davatzikos, Christos
2010-05-01
This paper presents a general methodology for high-dimensional pattern regression on medical images via machine learning techniques. Compared with pattern classification studies, pattern regression considers the problem of estimating continuous rather than categorical variables, and can be more challenging. It is also clinically important, since it can be used to estimate disease stage and predict clinical progression from images. In this work, adaptive regional feature extraction approach is used along with other common feature extraction methods, and feature selection technique is adopted to produce a small number of discriminative features for optimal regression performance. Then the Relevance Vector Machine (RVM) is used to build regression models based on selected features. To get stable regression models from limited training samples, a bagging framework is adopted to build ensemble basis regressors derived from multiple bootstrap training samples, and thus to alleviate the effects of outliers as well as facilitate the optimal model parameter selection. Finally, this regression scheme is tested on simulated data and real data via cross-validation. Experimental results demonstrate that this regression scheme achieves higher estimation accuracy and better generalizing ability than Support Vector Regression (SVR). 2009 Elsevier Inc. All rights reserved.
A hyper-spherical adaptive sparse-grid method for high-dimensional discontinuity detection
Zhang, Guannan; Webster, Clayton G; Gunzburger, Max D; Burkardt, John V
2014-03-01
This work proposes and analyzes a hyper-spherical adaptive hierarchical sparse-grid method for detecting jump discontinuities of functions in high-dimensional spaces. The method is motivated by the theoretical and computational inefficiencies of well-known adaptive sparse-grid methods for discontinuity detection. Our novel approach constructs a function representation of the discontinuity hyper-surface of an N-dimensional discontinuous quantity of interest, by virtue of a hyper-spherical transformation. Then, a sparse-grid approximation of the transformed function is built in the hyper-spherical coordinate system, whose value at each point is estimated by solving a one-dimensional discontinuity detection problem. Due to the smoothness of the hyper-surface, the new technique can identify jump discontinuities with significantly reduced computational cost, compared to existing methods. Moreover, hierarchical acceleration techniques are also incorporated to further reduce the overall complexity. Rigorous error estimates and complexity analyses of the new method are provided, as are several numerical examples that illustrate the effectiveness of the approach.
A simple new filter for nonlinear high-dimensional data assimilation
NASA Astrophysics Data System (ADS)
Tödter, Julian; Kirchgessner, Paul; Ahrens, Bodo
2015-04-01
performance with a realistic ensemble size. The results confirm that, in principle, it can be applied successfully and as simple as the ETKF in high-dimensional problems without further modifications of the algorithm, even though it is only based on the particle weights. This proves that the suggested method constitutes a useful filter for nonlinear, high-dimensional data assimilation, and is able to overcome the curse of dimensionality even in deterministic systems.
High-dimensional Cox models: the choice of penalty as part of the model building process.
Benner, Axel; Zucknick, Manuela; Hielscher, Thomas; Ittrich, Carina; Mansmann, Ulrich
2010-02-01
The Cox proportional hazards regression model is the most popular approach to model covariate information for survival times. In this context, the development of high-dimensional models where the number of covariates is much larger than the number of observations (p>n) is an ongoing challenge. A practicable approach is to use ridge penalized Cox regression in such situations. Beside focussing on finding the best prediction rule, one is often interested in determining a subset of covariates that are the most important ones for prognosis. This could be a gene set in the biostatistical analysis of microarray data. Covariate selection can then, for example, be done by L(1)-penalized Cox regression using the lasso (Tibshirani (1997). Statistics in Medicine 16, 385-395). Several approaches beyond the lasso, that incorporate covariate selection, have been developed in recent years. This includes modifications of the lasso as well as nonconvex variants such as smoothly clipped absolute deviation (SCAD) (Fan and Li (2001). Journal of the American Statistical Association 96, 1348-1360; Fan and Li (2002). The Annals of Statistics 30, 74-99). The purpose of this article is to implement them practically into the model building process when analyzing high-dimensional data with the Cox proportional hazards model. To evaluate penalized regression models beyond the lasso, we included SCAD variants and the adaptive lasso (Zou (2006). Journal of the American Statistical Association 101, 1418-1429). We compare them with "standard" applications such as ridge regression, the lasso, and the elastic net. Predictive accuracy, features of variable selection, and estimation bias will be studied to assess the practical use of these methods. We observed that the performance of SCAD and adaptive lasso is highly dependent on nontrivial preselection procedures. A practical solution to this problem does not yet exist. Since there is high risk of missing relevant covariates when using SCAD or
ERIC Educational Resources Information Center
Jitendra, Asha K.; Harwell, Michael R.; Dupuis, Danielle N.; Karl, Stacy R.; Lein, Amy E.; Simonson, Gregory; Slater, Susan C.
2015-01-01
This experimental study evaluated the effectiveness of a research-based intervention, schema-based instruction (SBI), on students' proportional problem solving. SBI emphasizes the underlying mathematical structure of problems, uses schematic diagrams to represent information in the problem text, provides explicit problem-solving and metacognitive…
Charte, Francisco; Rivera, Antonio J; del Jesus, María J; Herrera, Francisco
2014-10-01
Multilabel classification (MLC) has generated considerable research interest in recent years, as a technique that can be applied to many real-world scenarios. To process them with binary or multiclass classifiers, methods for transforming multilabel data sets (MLDs) have been proposed, as well as adapted algorithms able to work with this type of data sets. However, until now, few studies have addressed the problem of how to deal with MLDs having a large number of labels. This characteristic can be defined as high dimensionality in the label space (output attributes), in contrast to the traditional high dimensionality problem, which is usually focused on the feature space (by means of feature selection) or sample space (by means of instance selection). The purpose of this paper is to analyze dimensionality in the label space in MLDs, and to present a transformation methodology based on the use of association rules to discover label dependencies. These dependencies are used to reduce the label space, to ease the work of any MLC algorithm, and to infer the deleted labels in a final postprocessing stage. The proposed process is validated in an extensive experimentation with several MLDs and classification algorithms, resulting in a statistically significant improvement of performance in some cases, as will be shown.
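The idea of shrinking the label space with association rules and restoring the removed labels afterwards can be sketched as follows. This is a simplified illustration under stated assumptions: only single-label rules of the form a -> b are considered, a greedy selection is used, and the tiny data set and function names are hypothetical. It is not the rule-mining procedure of the paper.

```python
def rule_confidence(label_sets, antecedent, consequent):
    """Confidence of antecedent -> consequent, i.e. P(consequent | antecedent)."""
    with_antecedent = [s for s in label_sets if antecedent in s]
    if not with_antecedent:
        return 0.0
    hits = sum(1 for s in with_antecedent if consequent in s)
    return hits / len(with_antecedent)

def select_rules(label_sets, min_conf=1.0):
    """Greedily pick rules a -> b; b is removed, a must stay in the data."""
    labels = sorted(set().union(*label_sets))
    rules = {}            # removed label -> antecedent label that implies it
    antecedents = set()   # labels that must be kept because a rule relies on them
    for b in labels:
        if b in antecedents:
            continue
        for a in labels:
            if a == b or a in rules:
                continue
            if rule_confidence(label_sets, a, b) >= min_conf:
                rules[b] = a
                antecedents.add(a)
                break
    return rules

def reduce_labels(label_sets, rules):
    """Drop every label that a rule can re-infer later."""
    return [s - set(rules) for s in label_sets]

def restore_labels(predicted_sets, rules):
    """Postprocessing: add back each removed label whose antecedent is present."""
    return [s | {b for b, a in rules.items() if a in s} for s in predicted_sets]

# Hypothetical multilabel data set.
data = [
    {"beach", "sea"},
    {"beach", "sea", "sunset"},
    {"urban"},
    {"urban", "sunset"},
    {"sunset"},
]
rules = select_rules(data)
reduced = reduce_labels(data, rules)
restored = restore_labels(reduced, rules)
```

In this toy data "sea" always implies "beach", so "beach" is removed before training and re-added afterwards. With a confidence threshold below 1.0 the restoration would be approximate rather than exact.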
NASA Technical Reports Server (NTRS)
Soderblom, David R.; King, Jeremy R.; Hanson, Robert B.; Jones, Burton F.; Fischer, Debra; Stauffer, John R.; Pinsonneault, Marc H.
1998-01-01
This paper examines the discrepancy between distances to nearby open clusters as determined by parallaxes from Hipparcos compared to traditional main-sequence fitting. The biggest difference is seen for the Pleiades, and our hypothesis is that if the Hipparcos distance to the Pleiades is correct, then similar subluminous zero-age main-sequence (ZAMS) stars should exist elsewhere, including in the immediate solar neighborhood. We examine a color-magnitude diagram of very young and nearby solar-type stars and show that none of them lie below the traditional ZAMS, despite the fact that the Hipparcos Pleiades parallax would place its members 0.3 mag below that ZAMS. We also present analyses and observations of solar-type stars that do lie below the ZAMS, and we show that they are subluminous because of low metallicity and that they have the kinematics of old stars.
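As a worked piece of arithmetic, the 0.3 mag offset quoted above can be translated into a distance ratio with the standard distance-modulus relation. The symbols $d_{\mathrm{MS}}$ (main-sequence-fitting distance) and $d_{\mathrm{Hip}}$ (Hipparcos distance) are labels introduced here for illustration and do not appear in the abstract.

```latex
\Delta(m - M) = 5 \log_{10}\frac{d_{\mathrm{MS}}}{d_{\mathrm{Hip}}} = 0.3
\quad\Longrightarrow\quad
\frac{d_{\mathrm{MS}}}{d_{\mathrm{Hip}}} = 10^{0.3/5} = 10^{0.06} \approx 1.15
```

In other words, a 0.3 mag discrepancy corresponds to the parallax-based distance being roughly 13% shorter than the main-sequence-fitting distance.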
Approximating high-dimensional dynamics by barycentric coordinates with linear programming
Hirata, Yoshito; Aihara, Kazuyuki; Suzuki, Hideyuki; Shiro, Masanori; Takahashi, Nozomu; Mas, Paloma
2015-01-15
The increasing development of novel methods and techniques facilitates the measurement of high-dimensional time series but challenges our ability for accurate modeling and predictions. The use of a general mathematical model requires the inclusion of many parameters, which are difficult to be fitted for relatively short high-dimensional time series observed. Here, we propose a novel method to accurately model a high-dimensional time series. Our method extends the barycentric coordinates to high-dimensional phase space by employing linear programming, and allowing the approximation errors explicitly. The extension helps to produce free-running time-series predictions that preserve typical topological, dynamical, and/or geometric characteristics of the underlying attractors more accurately than the radial basis function model that is widely used. The method can be broadly applied, from helping to improve weather forecasting, to creating electronic instruments that sound more natural, and to comprehensively understanding complex biological data.
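The core idea of predicting with barycentric coordinates can be sketched in two dimensions: express the current state as a weighted combination of neighboring past states, then apply the same weights to the successors of those neighbors. The sketch below solves for the weights exactly on a triangle, which is a simplification; the paper instead uses linear programming with explicit approximation errors in high-dimensional phase space. The map `step` and all points are hypothetical.

```python
def barycentric_weights(p, a, b, c):
    """Weights (wa, wb, wc) with wa + wb + wc = 1 and wa*a + wb*b + wc*c = p."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    wa = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    wb = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return wa, wb, 1.0 - wa - wb

def predict_next(p, neighbors, successors):
    """Apply the barycentric weights of p to the successors of its neighbors."""
    weights = barycentric_weights(p, *neighbors)
    return tuple(
        sum(w * s[k] for w, s in zip(weights, successors)) for k in range(2)
    )

# Hypothetical dynamics: a damped rotation, used only to generate successors.
def step(state):
    x, y = state
    return (0.9 * x - 0.2 * y, 0.2 * x + 0.9 * y)

neighbors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
successors = [step(s) for s in neighbors]
query = (0.25, 0.25)
weights = barycentric_weights(query, *neighbors)
prediction = predict_next(query, neighbors, successors)
```

Because the toy map is linear, the barycentric prediction reproduces `step(query)` exactly. For nonlinear dynamics the match is only approximate and improves as the neighbors get closer to the query.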
Approximating high-dimensional dynamics by barycentric coordinates with linear programming.
Hirata, Yoshito; Shiro, Masanori; Takahashi, Nozomu; Aihara, Kazuyuki; Suzuki, Hideyuki; Mas, Paloma
2015-01-01
The increasing development of novel methods and techniques facilitates the measurement of high-dimensional time series but challenges our ability for accurate modeling and predictions. The use of a general mathematical model requires the inclusion of many parameters, which are difficult to be fitted for relatively short high-dimensional time series observed. Here, we propose a novel method to accurately model a high-dimensional time series. Our method extends the barycentric coordinates to high-dimensional phase space by employing linear programming, and allowing the approximation errors explicitly. The extension helps to produce free-running time-series predictions that preserve typical topological, dynamical, and/or geometric characteristics of the underlying attractors more accurately than the radial basis function model that is widely used. The method can be broadly applied, from helping to improve weather forecasting, to creating electronic instruments that sound more natural, and to comprehensively understanding complex biological data.
Experimental ladder proof of Hardy's nonlocality for high-dimensional quantum systems
NASA Astrophysics Data System (ADS)
Chen, Lixiang; Zhang, Wuhong; Wu, Ziwen; Wang, Jikang; Fickler, Robert; Karimi, Ebrahim
2017-08-01
Recent years have witnessed a rapidly growing interest in high-dimensional quantum entanglement for fundamental studies as well as towards novel applications. Therefore, the ability to verify entanglement between physical qudits, d -dimensional quantum systems, is of crucial importance. To show nonclassicality, Hardy's paradox represents "the best version of Bell's theorem" without using inequalities. However, so far it has only been tested experimentally for bidimensional vector spaces. Here, we formulate a theoretical framework to demonstrate the ladder proof of Hardy's paradox for arbitrary high-dimensional systems. Furthermore, we experimentally demonstrate the ladder proof by taking advantage of the orbital angular momentum of high-dimensionally entangled photon pairs. We perform the ladder proof of Hardy's paradox for dimensions 3 and 4, both with the ladder up to the third step. Our paper paves the way towards a deeper understanding of the nature of high-dimensionally entangled quantum states and may find applications in quantum information science.
Hagen, Nathan; Kester, Robert T.; Gao, Liang; Tkaczyk, Tomasz S.
2012-01-01
The snapshot advantage is a large increase in light collection efficiency available to high-dimensional measurement systems that avoid filtering and scanning. After discussing this advantage in the context of imaging spectrometry, where the greatest effort towards developing snapshot systems has been made, we describe the types of measurements where it is applicable. We then generalize it to the larger context of high-dimensional measurements, where the advantage increases geometrically with measurement dimensionality. PMID:22791926
Nonlocality of high-dimensional two-photon orbital angular momentum states
Aiello, A.; Oemrawsingh, S. S. R.; Eliel, E. R.; Woerdman, J. P.
2005-11-15
We propose an interferometric method to investigate the nonlocality of high-dimensional two-photon orbital angular momentum states generated by spontaneous parametric down conversion. We incorporate two half-integer spiral phase plates and a variable-reflectivity output beam splitter into a Mach-Zehnder interferometer to build an orbital angular momentum analyzer. This setup enables testing the nonlocality of high-dimensional two-photon states by repeated use of the Clauser-Horne-Shimony-Holt inequality.
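For readers unfamiliar with the Clauser-Horne-Shimony-Holt (CHSH) inequality, the sketch below evaluates the CHSH quantity for the textbook two-qubit singlet state, whose correlation is E(a, b) = -cos(a - b). This is the standard two-dimensional case, chosen as an assumption for illustration; it does not model the spiral phase plates or the high-dimensional orbital angular momentum states of the proposal.

```python
import math

def correlation(a, b):
    """Quantum prediction E(a, b) = -cos(a - b) for the two-qubit singlet state."""
    return -math.cos(a - b)

def chsh(a, a_prime, b, b_prime):
    """CHSH quantity S = |E(a,b) - E(a,b') + E(a',b) + E(a',b')|."""
    return abs(
        correlation(a, b)
        - correlation(a, b_prime)
        + correlation(a_prime, b)
        + correlation(a_prime, b_prime)
    )

# Analyzer settings that maximize S for the singlet state.
s_value = chsh(0.0, math.pi / 2, math.pi / 4, 3 * math.pi / 4)
local_bound = 2.0                    # limit for any local hidden-variable model
tsirelson_bound = 2 * math.sqrt(2)   # maximum allowed by quantum mechanics
```

Any value of S above 2 is incompatible with local hidden-variable models; the settings used here reach the quantum maximum of 2√2.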
Particle filtering in high-dimensional chaotic systems.
Lingala, Nishanth; Sri Namachchivaya, N; Perkowski, Nicolas; Yeong, Hoong C
2012-12-01
We present an efficient particle filtering algorithm for multiscale systems, which is adapted for simple atmospheric dynamics models that are inherently chaotic. Particle filters represent the posterior conditional distribution of the state variables by a collection of particles, which evolves and adapts recursively as new information becomes available. The difference between the estimated state and the true state of the system constitutes the error in specifying or forecasting the state, which is amplified in chaotic systems that have a number of positive Lyapunov exponents. In this paper, we propose a reduced-order particle filtering algorithm based on the homogenized multiscale filtering framework developed in Imkeller et al. "Dimensional reduction in nonlinear filtering: A homogenization approach," Ann. Appl. Probab. (to be published). In order to adapt the proposed algorithm to chaotic signals, importance sampling and control theoretic methods are employed for the construction of the proposal density for the particle filter. Finally, we apply the general homogenized particle filtering algorithm developed here to the Lorenz'96 [E. N. Lorenz, "Predictability: A problem partly solved," in Predictability of Weather and Climate, ECMWF, 2006 (ECMWF, 2006), pp. 40-58] atmospheric model that mimics mid-latitude atmospheric dynamics with microscopic convective processes.
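The description of particles that evolve and are reweighted as observations arrive corresponds to the basic propagate, weight, resample loop shown below. This is a plain bootstrap particle filter on a hypothetical one-dimensional linear Gaussian model, used here as a stand-in; it is not the homogenized reduced-order filter of the paper, and it does not use the Lorenz'96 model or the control-based proposal density.

```python
import math
import random

def bootstrap_filter(observations, n_particles=500, a=0.9, q=0.3, r=1.0, seed=0):
    """Bootstrap filter for x_k = a*x_{k-1} + N(0, q^2), y_k = x_k + N(0, r^2)."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # Propagate each particle through the (assumed) state dynamics.
        particles = [a * x + rng.gauss(0.0, q) for x in particles]
        # Weight by the Gaussian observation likelihood.
        weights = [math.exp(-0.5 * ((y - x) / r) ** 2) for x in particles]
        total = sum(weights)
        if total == 0.0:  # guard against numerical underflow
            weights, total = [1.0] * n_particles, float(n_particles)
        weights = [w / total for w in weights]
        # Posterior mean estimate, then multinomial resampling.
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates

def rmse(values, reference):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((v - t) ** 2 for v, t in zip(values, reference)) / len(values))

# Simulate a hypothetical truth trajectory and noisy observations of it.
sim = random.Random(1)
truth, observations = [], []
state = 0.0
for _ in range(200):
    state = 0.9 * state + sim.gauss(0.0, 0.3)
    truth.append(state)
    observations.append(state + sim.gauss(0.0, 1.0))

estimates = bootstrap_filter(observations)
filter_error = rmse(estimates, truth)
raw_error = rmse(observations, truth)
```

In this toy setting the filtered estimate tracks the hidden state more closely than the raw observations do. In chaotic, high-dimensional systems the simple prior proposal used here tends to degenerate, which is the motivation for the more careful proposal construction the abstract mentions.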
EPA released a problem formulation for TBBPA and related chemicals used as flame retardants in plastics/printed circuit boards for electronics. The goal of this problem formulation was to identify scenarios where further risk analysis may be necessary.
Scalable collaborative targeted learning for high-dimensional data.
Ju, Cheng; Gruber, Susan; Lendle, Samuel D; Chambaz, Antoine; Franklin, Jessica M; Wyss, Richard; Schneeweiss, Sebastian; van der Laan, Mark J
2017-01-01
Robust inference of a low-dimensional parameter in a large semi-parametric model relies on external estimators of infinite-dimensional features of the distribution of the data. Typically, only one of the latter is optimized for the sake of constructing a well-behaved estimator of the low-dimensional parameter of interest. Optimizing more than one of them for the sake of achieving a better bias-variance trade-off in the estimation of the parameter of interest is the core idea driving the general template of the collaborative targeted minimum loss-based estimation procedure. The original instantiation of the collaborative targeted minimum loss-based estimation template can be presented as a greedy forward stepwise collaborative targeted minimum loss-based estimation algorithm. It does not scale well when the number p of covariates increases drastically. This motivates the introduction of a novel instantiation of the collaborative targeted minimum loss-based estimation template where the covariates are pre-ordered. Its time complexity is [Formula: see text] as opposed to the original [Formula: see text], a remarkable gain. We propose two pre-ordering strategies and suggest a rule of thumb to develop other meaningful strategies. Because it is usually unclear a priori which pre-ordering strategy to choose, we also introduce another instantiation called SL-C-TMLE algorithm that enables the data-driven choice of the better pre-ordering strategy given the problem at hand. Its time complexity is [Formula: see text] as well. The computational burden and relative performance of these algorithms were compared in simulation studies involving fully synthetic data or partially synthetic data based on a real world large electronic health database; and in analyses of three real, large electronic health databases. In all analyses involving electronic health databases, the greedy collaborative targeted minimum loss-based estimation algorithm is unacceptably slow. Simulation studies
Nadeau, Robert Michael
1995-10-01
This document contains information about the characterization and application of microearthquake clusters and fault zone dynamics. Topics discussed include: Seismological studies; fault-zone dynamics; periodic recurrence; scaling of microearthquakes to large earthquakes; implications of fault mechanics and seismic hazards; and wave propagation and temporal changes.
NASA Astrophysics Data System (ADS)
Piotto, G.; Zoccali, M.; King, I. R.; Djorgovski, S. G.; Sosin, C.; Rich, R. M.; Meylan, G.
1999-10-01
We present observations of the center of the Galactic globular cluster NGC 6273, obtained with the Hubble Space Telescope Wide Field Planetary Camera 2 as part of the snapshot program GO-7470. A BV color-magnitude diagram (CMD) for ~28,000 stars is presented and discussed. The most prominent feature of the CMD, identified for the first time in this paper, is the extended horizontal-branch blue tail (EBT) with a clear double-peaked distribution and a significant gap. The EBT of NGC 6273 is compared with the EBTs of seven other globular clusters for which we have a CMD in the same photometric system. From this comparison, we conclude that all the globular clusters in our sample with an EBT show at least one gap along the horizontal branch, which could have similar origins. A comparison with theoretical models suggests that at least some of these gaps may be occurring at a particular value of the stellar mass, common to a number of different clusters. From the CMD of NGC 6273 we obtain a distance modulus (m-M)_V=16.27+/-0.20. We also estimate an average reddening E(B-V)=0.47+/-0.03, though the CMD is strongly affected by differential reddening, with the relative reddening spanning a ΔE(B-V)~0.2 mag in the WFPC2 field. A luminosity function for the evolved stars in NGC 6273 is also presented and compared with the most recent evolutionary models.
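The quoted apparent distance modulus and reddening can be turned into a rough distance with a few lines of arithmetic. The ratio of total to selective extinction R_V = 3.1 is an assumption introduced here (a commonly used Galactic value); the abstract does not state which value the authors adopted, so the result is indicative only.

```python
def extinction_v(reddening, r_v=3.1):
    """Visual extinction A_V = R_V * E(B-V); R_V = 3.1 is an assumed value."""
    return r_v * reddening

def distance_pc(apparent_modulus, reddening, r_v=3.1):
    """Distance in parsecs from (m-M)_V after removing extinction."""
    true_modulus = apparent_modulus - extinction_v(reddening, r_v)
    return 10 ** (true_modulus / 5.0 + 1.0)

apparent_modulus = 16.27   # (m-M)_V from the abstract
reddening = 0.47           # average E(B-V) from the abstract

a_v = extinction_v(reddening)
distance = distance_pc(apparent_modulus, reddening)
```

Under this assumption the extinction is about 1.46 mag and the distance comes out at roughly 9 kpc. The quoted uncertainties and the strong differential reddening would widen that figure considerably.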
Tripathy, Rohit; Bilionis, Ilias; Gonzalez, Marcial
2016-09-15
Uncertainty quantification (UQ) tasks, such as model calibration, uncertainty propagation, and optimization under uncertainty, typically require several thousand evaluations of the underlying computer codes. To cope with the cost of simulations, one replaces the real response surface with a cheap surrogate based, e.g., on polynomial chaos expansions, neural networks, support vector machines, or Gaussian processes (GP). However, the number of simulations required to learn a generic multivariate response grows exponentially as the input dimension increases. This curse of dimensionality can only be addressed, if the response exhibits some special structure that can be discovered and exploited. A wide range of physical responses exhibit a special structure known as an active subspace (AS). An AS is a linear manifold of the stochastic space characterized by maximal response variation. The idea is that one should first identify this low dimensional manifold, project the high-dimensional input onto it, and then link the projection to the output. If the dimensionality of the AS is low enough, then learning the link function is a much easier problem than the original problem of learning a high-dimensional function. The classic approach to discovering the AS requires gradient information, a fact that severely limits its applicability. Furthermore, and partly because of its reliance to gradients, it is not able to handle noisy observations. The latter is an essential trait if one wants to be able to propagate uncertainty through stochastic simulators, e.g., through molecular dynamics codes. In this work, we develop a probabilistic version of AS which is gradient-free and robust to observational noise. Our approach relies on a novel Gaussian process regression with built-in dimensionality reduction. In particular, the AS is represented as an orthogonal projection matrix that serves as yet another covariance function hyper-parameter to be estimated from the data. To train the
NASA Astrophysics Data System (ADS)
Tripathy, Rohit; Bilionis, Ilias; Gonzalez, Marcial
2016-09-01
Uncertainty quantification (UQ) tasks, such as model calibration, uncertainty propagation, and optimization under uncertainty, typically require several thousand evaluations of the underlying computer codes. To cope with the cost of simulations, one replaces the real response surface with a cheap surrogate based, e.g., on polynomial chaos expansions, neural networks, support vector machines, or Gaussian processes (GP). However, the number of simulations required to learn a generic multivariate response grows exponentially as the input dimension increases. This curse of dimensionality can only be addressed if the response exhibits some special structure that can be discovered and exploited. A wide range of physical responses exhibit a special structure known as an active subspace (AS). An AS is a linear manifold of the stochastic space characterized by maximal response variation. The idea is that one should first identify this low-dimensional manifold, project the high-dimensional input onto it, and then link the projection to the output. If the dimensionality of the AS is low enough, then learning the link function is a much easier problem than the original problem of learning a high-dimensional function. The classic approach to discovering the AS requires gradient information, a fact that severely limits its applicability. Furthermore, and partly because of its reliance on gradients, it is not able to handle noisy observations. The latter is an essential trait if one wants to be able to propagate uncertainty through stochastic simulators, e.g., through molecular dynamics codes. In this work, we develop a probabilistic version of AS which is gradient-free and robust to observational noise. Our approach relies on a novel Gaussian process regression with built-in dimensionality reduction. In particular, the AS is represented as an orthogonal projection matrix that serves as yet another covariance function hyper-parameter to be estimated from the data. To train the
Segmentation-based clustering of hyperspectral images using local band selection
NASA Astrophysics Data System (ADS)
Mehta, Anand; Dikshit, Onkar
2017-01-01
This study addresses the problems that the high dimensionality of hyperspectral images poses for clustering. A new local band selection approach is developed that takes both relevancy and redundancy among the bands into account while obtaining multiple relevant sets of bands. The local band selection approach is then incorporated within a multistage clustering framework that includes three stages: segmentation, region merging, and projected clustering. First, k-means is used to produce initial segments/regions. Then, in the region merging stage, the modified local mutually best region merging strategy is applied to the initial segments to produce the refined segmentation map. Finally, an improved projected clustering technique is used to group these segments into a fixed number of clusters. Further, the main principle of projected clustering, that different sets of points may cluster better for different subsets of dimensions, is extended to region merging by incorporating the suggested local band selection approach. The framework requires input for two major parameters: the number of clusters (k) and the number of relevant bands (l). The framework is tested on two hyperspectral images and compared with other clustering frameworks. The experimental results confirm the effectiveness of the proposed framework.
PenPC: A Two-step Approach to Estimate the Skeletons of High Dimensional Directed Acyclic Graphs
Ha, Min Jin; Sun, Wei; Xie, Jichun
2015-01-01
Summary: Estimation of the skeleton of a directed acyclic graph (DAG) is of great importance for understanding the underlying DAG, and causal effects can be assessed from the skeleton when the DAG is not identifiable. We propose a novel method named PenPC to estimate the skeleton of a high-dimensional DAG by a two-step approach. We first estimate the non-zero entries of a concentration matrix using penalized regression, and then fix the difference between the concentration matrix and the skeleton by evaluating a set of conditional independence hypotheses. For high-dimensional problems where the number of vertices p is in polynomial or exponential scale of sample size n, we study the asymptotic property of PenPC on two types of graphs: traditional random graphs where all the vertices have the same expected number of neighbors, and scale-free graphs where a few vertices may have a large number of neighbors. As illustrated by extensive simulations and applications on gene expression data of cancer patients, PenPC has higher sensitivity and specificity than the state-of-the-art method, the PC-stable algorithm. PMID:26406114
Awate, Suyash P; Yushkevich, Paul; Song, Zhuang; Licht, Daniel; Gee, James C
2009-01-01
The paper presents a novel statistical framework for cortical folding pattern analysis that relies on a rich multivariate descriptor of folding patterns in a region of interest (ROI). The ROI-based approach avoids problems faced by spatial-normalization-based approaches stemming from the severe deficiency of homologous features between typical human cerebral cortices. Unlike typical ROI-based methods that summarize folding complexity or shape by a single number, the proposed descriptor unifies complexity and shape of the surface in a high-dimensional space. In this way, the proposed framework couples the reliability of ROI-based analysis with the richness of the novel cortical folding pattern descriptor. Furthermore, the descriptor can easily incorporate additional variables, e.g. cortical thickness. The paper proposes a novel application of a nonparametric permutation-based approach for statistical hypothesis testing for any multivariate high-dimensional descriptor. While the proposed framework has a rigorous theoretical underpinning, it is straightforward to implement. The framework is validated via simulated and clinical data. The paper is the first to quantitatively evaluate cortical folding in neonates with complex congenital heart disease.
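As a rough illustration of the permutation-based hypothesis testing mentioned in the abstract above, the following minimal Python sketch compares two groups of multivariate descriptors. The test statistic (Euclidean distance between group mean vectors), the function names, and the synthetic data are assumptions made here for illustration; the abstract does not specify the paper's actual descriptor or statistic.

```python
import random

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def mean_distance(a, b):
    """Euclidean distance between the mean vectors of two groups (assumed statistic)."""
    ma, mb = mean_vector(a), mean_vector(b)
    return sum((x - y) ** 2 for x, y in zip(ma, mb)) ** 0.5

def permutation_test(a, b, n_perm=999, seed=0):
    """Return the observed statistic and a permutation p-value."""
    rng = random.Random(seed)
    observed = mean_distance(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the subjects
        if mean_distance(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)

# Synthetic three-dimensional descriptors for two hypothetical groups.
data_rng = random.Random(1)
group_a = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(10)]
group_b = [[data_rng.gauss(4, 1) for _ in range(3)] for _ in range(10)]
stat, p_value = permutation_test(group_a, group_b)
print(stat, p_value)
```

Because the null distribution is built by relabelling the subjects, the same procedure applies to a descriptor of any dimension without distributional assumptions.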
NASA Astrophysics Data System (ADS)
Wagstaff, Kiri L.
2012-03-01
particular application involves considerations of the kind of data being analyzed, algorithm runtime efficiency, and how much prior knowledge is available about the problem domain, which can dictate the nature of clusters sought. Fundamentally, the clustering method and its representations of clusters carries with it a definition of what a cluster is, and it is important that this be aligned with the analysis goals for the problem at hand. In this chapter, I emphasize this point by identifying for each algorithm the cluster representation as a model, m_j , even for algorithms that are not typically thought of as creating a “model.” This chapter surveys a basic collection of clustering methods useful to any practitioner who is interested in applying clustering to a new data set. The algorithms include k-means (Section 25.2), EM (Section 25.3), agglomerative (Section 25.4), and spectral (Section 25.5) clustering, with side mentions of variants such as kernel k-means and divisive clustering. The chapter also discusses each algorithm’s strengths and limitations and provides pointers to additional in-depth reading for each subject. Section 25.6 discusses methods for incorporating domain knowledge into the clustering process. This chapter concludes with a brief survey of interesting applications of clustering methods to astronomy data (Section 25.7). The chapter begins with k-means because it is both generally accessible and so widely used that understanding it can be considered a necessary prerequisite for further work in the field. EM can be viewed as a more sophisticated version of k-means that uses a generative model for each cluster and probabilistic item assignments. Agglomerative clustering is the most basic form of hierarchical clustering and provides a basis for further exploration of algorithms in that vein. Spectral clustering permits a departure from feature-vector-based clustering and can operate on data sets instead represented as affinity, or similarity
Chapman, Benjamin P; Weiss, Alexander; Duberstein, Paul R
2016-12-01
Statistical learning theory (SLT) is the statistical formulation of machine learning theory, a body of analytic methods common in "big data" problems. Regression-based SLT algorithms seek to maximize predictive accuracy for some outcome, given a large pool of potential predictors, without overfitting the sample. Research goals in psychology may sometimes call for high dimensional regression. One example is criterion-keyed scale construction, where a scale with maximal predictive validity must be built from a large item pool. Using this as a working example, we first introduce a core principle of SLT methods: minimization of expected prediction error (EPE). Minimizing EPE is fundamentally different than maximizing the within-sample likelihood, and hinges on building a predictive model of sufficient complexity to predict the outcome well, without undue complexity leading to overfitting. We describe how such models are built and refined via cross-validation. We then illustrate how 3 common SLT algorithms (supervised principal components, regularization, and boosting) can be used to construct a criterion-keyed scale predicting all-cause mortality, using a large personality item pool within a population cohort. Each algorithm illustrates a different approach to minimizing EPE. Finally, we consider broader applications of SLT predictive algorithms, both as supportive analytic tools for conventional methods, and as primary analytic tools in discovery phase research. We conclude that despite their differences from the classic null-hypothesis testing approach, or perhaps because of them, SLT methods may hold value as a statistically rigorous approach to exploratory regression.
Boonstra, Philip S; Taylor, Jeremy M G; Mukherjee, Bhramar
2013-04-01
With advancement in genomic technologies, it is common that two high-dimensional datasets are available, both measuring the same underlying biological phenomenon with different techniques. We consider predicting a continuous outcome Y using X, a set of p markers which is the best available measure of the underlying biological process. This same biological process may also be measured by W, coming from a prior technology but correlated with X. On a moderately sized sample, we have (Y,X,W), and on a larger sample we have (Y,W). We utilize the data on W to boost the prediction of Y by X. When p is large and the subsample containing X is small, this is a p>n situation. When p is small, this is akin to the classical measurement error problem; however, ours is not the typical goal of calibrating W for use in future studies. We propose to shrink the regression coefficients β of Y on X toward different targets that use information derived from W in the larger dataset. We compare these proposals with the classical ridge regression of Y on X, which does not use W. We also unify all of these methods as targeted ridge estimators. Finally, we propose a hybrid estimator which is a linear combination of multiple estimators of β. With an optimal choice of weights, the hybrid estimator balances efficiency and robustness in a data-adaptive way to theoretically yield a smaller prediction error than any of its constituents. The methods, including a fully Bayesian alternative, are evaluated via simulation studies. We also apply them to a gene-expression dataset. mRNA expression measured via quantitative real-time polymerase chain reaction is used to predict survival time in lung cancer patients, with auxiliary information from microarray technology available on a larger sample.
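The idea of shrinking regression coefficients toward a target, described in the abstract above, can be sketched with the standard closed form for a ridge penalty centred on a target vector: minimising ||y - Xb||^2 + lam * ||b - target||^2 gives b = (X'X + lam*I)^(-1) (X'y + lam*target). The Python sketch below assumes this generic form; the function name, the synthetic data, and the target vector (standing in for information derived from W) are illustrative assumptions, not the authors' estimators.

```python
import numpy as np

def targeted_ridge(X, y, target, lam):
    """Ridge estimate shrunk toward `target` instead of toward zero."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y + lam * target)

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 4))
beta_true = np.array([1.0, -2.0, 0.5, 0.0])
y = X @ beta_true + 0.1 * rng.standard_normal(30)

# Hypothetical target, standing in for coefficients informed by the auxiliary data.
target = np.array([0.8, -1.5, 0.0, 0.0])

beta_ols = targeted_ridge(X, y, target, lam=0.0)    # no shrinkage: ordinary least squares
beta_heavy = targeted_ridge(X, y, target, lam=1e8)  # heavy shrinkage: essentially the target
print(beta_ols, beta_heavy)
```

Setting the target to the zero vector recovers classical ridge regression, which is the comparison method named in the abstract.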
Chapman, Benjamin P.; Weiss, Alexander; Duberstein, Paul
2016-01-01
Statistical learning theory (SLT) is the statistical formulation of machine learning theory, a body of analytic methods common in “big data” problems. Regression-based SLT algorithms seek to maximize predictive accuracy for some outcome, given a large pool of potential predictors, without overfitting the sample. Research goals in psychology may sometimes call for high dimensional regression. One example is criterion-keyed scale construction, where a scale with maximal predictive validity must be built from a large item pool. Using this as a working example, we first introduce a core principle of SLT methods: minimization of expected prediction error (EPE). Minimizing EPE is fundamentally different than maximizing the within-sample likelihood, and hinges on building a predictive model of sufficient complexity to predict the outcome well, without undue complexity leading to overfitting. We describe how such models are built and refined via cross-validation. We then illustrate how three common SLT algorithms (Supervised Principal Components, Regularization, and Boosting) can be used to construct a criterion-keyed scale predicting all-cause mortality, using a large personality item pool within a population cohort. Each algorithm illustrates a different approach to minimizing EPE. Finally, we consider broader applications of SLT predictive algorithms, both as supportive analytic tools for conventional methods, and as primary analytic tools in discovery phase research. We conclude that despite their differences from the classic null-hypothesis testing approach, or perhaps because of them, SLT methods may hold value as a statistically rigorous approach to exploratory regression. PMID:27454257
Efficient characterization of high-dimensional parameter spaces for systems biology
2011-01-01
Background A biological system's robustness to mutations and its evolution are influenced by the structure of its viable space, the region of its space of biochemical parameters where it can exert its function. In systems with a large number of biochemical parameters, viable regions with potentially complex geometries fill a tiny fraction of the whole parameter space. This hampers explorations of the viable space based on "brute force" or Gaussian sampling. Results We here propose a novel algorithm to characterize viable spaces efficiently. The algorithm combines global and local explorations of a parameter space. The global exploration involves an out-of-equilibrium adaptive Metropolis Monte Carlo method aimed at identifying poorly connected viable regions. The local exploration then samples these regions in detail by a method we call multiple ellipsoid-based sampling. Our algorithm explores efficiently nonconvex and poorly connected viable regions of different test-problems. Most importantly, its computational effort scales linearly with the number of dimensions, in contrast to "brute force" sampling that shows an exponential dependence on the number of dimensions. We also apply this algorithm to a simplified model of a biochemical oscillator with positive and negative feedback loops. A detailed characterization of the model's viable space captures well known structural properties of circadian oscillators. Concretely, we find that model topologies with an essential negative feedback loop and a nonessential positive feedback loop provide the most robust fixed period oscillations. Moreover, the connectedness of the model's viable space suggests that biochemical oscillators with varying topologies can evolve from one another. Conclusions Our algorithm permits an efficient analysis of high-dimensional, nonconvex, and poorly connected viable spaces characteristic of complex biological circuitry. It allows a systematic use of robustness as a tool for model
Fickler, Robert; Lapkiewicz, Radek; Huber, Marcus; Lavery, Martin P J; Padgett, Miles J; Zeilinger, Anton
2014-07-30
Photonics has become a mature field of quantum information science, where integrated optical circuits offer a way to scale the complexity of the set-up as well as the dimensionality of the quantum state. On photonic chips, paths are the natural way to encode information. To distribute those high-dimensional quantum states over large distances, transverse spatial modes, like orbital angular momentum possessing Laguerre Gauss modes, are favourable as flying information carriers. Here we demonstrate a quantum interface between these two vibrant photonic fields. We create three-dimensional path entanglement between two photons in a nonlinear crystal and use a mode sorter as the quantum interface to transfer the entanglement to the orbital angular momentum degree of freedom. Thus our results show a flexible way to create high-dimensional spatial mode entanglement. Moreover, they pave the way to implement broad complex quantum networks where high-dimensionally entangled states could be distributed over distant photonic chips.
Clarke, Robert; Ressom, Habtom W.; Wang, Antai; Xuan, Jianhua; Liu, Minetta C.; Gehan, Edmund A.; Wang, Yue
2007-01-01
High-throughput genomic and proteomic technologies are widely used in cancer research to build better predictive models of diagnosis, prognosis and therapy, to identify and characterize key signalling networks and to find new targets for drug development. These technologies present investigators with the task of extracting meaningful statistical and biological information from high-dimensional data spaces, wherein each sample is defined by hundreds or thousands of measurements, usually concurrently obtained. The properties of high dimensionality are often poorly understood or overlooked in data modelling and analysis. From the perspective of translational science, this Review discusses the properties of high-dimensional data spaces that arise in genomic and proteomic studies and the challenges they can pose for data analysis and interpretation. PMID:18097463
Machine learning etudes in astrophysics: selection functions for mock cluster catalogs
Hajian, Amir; Alvarez, Marcelo A.; Bond, J. Richard
2015-01-01
Making mock simulated catalogs is an important component of astrophysical data analysis. Selection criteria for observed astronomical objects are often too complicated to be derived from first principles. However, the existence of an observed group of objects is a well-suited problem for machine learning classification. In this paper we use one-class classifiers to learn the properties of an observed catalog of clusters of galaxies from ROSAT and to pick clusters from mock simulations that resemble the observed ROSAT catalog. We show how this method can be used to study the cross-correlations of thermal Sunyaev-Zeldovich signals with number density maps of X-ray selected cluster catalogs. The method reduces the bias due to hand-tuning the selection function and is readily scalable to large catalogs with a high-dimensional space of astrophysical features.
Reduced basis ANOVA methods for partial differential equations with high-dimensional random inputs
NASA Astrophysics Data System (ADS)
Liao, Qifeng; Lin, Guang
2016-07-01
In this paper we present a reduced basis ANOVA approach for partial differential equations (PDEs) with random inputs. The ANOVA method combined with stochastic collocation methods provides model reduction in high-dimensional parameter space through decomposing high-dimensional inputs into unions of low-dimensional inputs. In this work, to further reduce the computational cost, we investigate spatial low-rank structures in the ANOVA-collocation method, and develop efficient spatial model reduction techniques using hierarchically generated reduced bases. We present a general mathematical framework of the methodology, validate its accuracy and demonstrate its efficiency with numerical experiments.
Reduced basis ANOVA methods for partial differential equations with high-dimensional random inputs
Liao, Qifeng; Lin, Guang
2016-07-15
In this paper we present a reduced basis ANOVA approach for partial differential equations (PDEs) with random inputs. The ANOVA method combined with stochastic collocation methods provides model reduction in high-dimensional parameter space through decomposing high-dimensional inputs into unions of low-dimensional inputs. In this work, to further reduce the computational cost, we investigate spatial low-rank structures in the ANOVA-collocation method, and develop efficient spatial model reduction techniques using hierarchically generated reduced bases. We present a general mathematical framework of the methodology, validate its accuracy and demonstrate its efficiency with numerical experiments.
Metamodel-based global optimization using fuzzy clustering for design space reduction
NASA Astrophysics Data System (ADS)
Li, Yulin; Liu, Li; Long, Teng; Dong, Weili
2013-09-01
High-fidelity analyses are utilized in modern engineering design optimization problems which involve expensive black-box models. For computation-intensive engineering design problems, efficient global optimization methods must be developed to relieve the computational burden. A new metamodel-based global optimization method using fuzzy clustering for design space reduction (MGO-FCR) is presented. The uniformly distributed initial sample points are generated by Latin hypercube design to construct the radial basis function metamodel, whose accuracy is improved gradually as the number of sample points increases. The fuzzy c-means method and the Gath-Geva clustering method are applied to divide the design space into several small interesting cluster spaces for low- and high-dimensional problems, respectively. Modeling efficiency and accuracy are directly related to the design space, so unconcerned spaces are eliminated by the proposed reduction principle and two pseudo reduction algorithms. The reduction principle is developed to determine whether the current design space should be reduced and which space is eliminated. The first pseudo reduction algorithm improves the speed of clustering, while the second pseudo reduction algorithm ensures that the design space is reduced. Through several numerical benchmark functions, comparative studies with the adaptive response surface method, the approximated unimodal region elimination method and mode-pursuing sampling are carried out. The optimization results reveal that this method captures the real global optimum for all the numerical benchmark functions. The number of function evaluations shows that the efficiency of this method is favorable, especially for high-dimensional problems. Based on this global design optimization method, a design optimization of a lifting surface in high-speed flow is carried out, and this method saves about 10 h compared with genetic algorithms. This method possesses favorable performance on efficiency, robustness
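The Latin hypercube design used to generate the initial sample points can be sketched as follows. This is a minimal illustration of the general sampling technique on the unit hypercube, assuming NumPy; the function name and the sample sizes are hypothetical, and the sketch does not reproduce the MGO-FCR method itself.

```python
import numpy as np

def latin_hypercube(n_samples, n_dims, rng):
    """Draw n_samples points in [0, 1)^n_dims with one point per stratum in every dimension."""
    jitter = rng.random((n_samples, n_dims))  # position inside each stratum
    samples = np.empty((n_samples, n_dims))
    for d in range(n_dims):
        strata = rng.permutation(n_samples)   # independent stratum order per dimension
        samples[:, d] = (strata + jitter[:, d]) / n_samples
    return samples

rng = np.random.default_rng(0)
samples = latin_hypercube(8, 3, rng)

# Index of the stratum each point falls into, per dimension.
strata_hit = np.floor(samples * 8).astype(int)
print(strata_hit)
```

Each column of `strata_hit` is a permutation of 0 to 7, which is the property that spreads a small number of expensive simulations evenly along every design variable.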
A rough set based rational clustering framework for determining correlated genes.
Jeyaswamidoss, Jeba Emilyn; Thangaraj, Kesavan; Ramar, Kadarkarai; Chitra, Muthusamy
2016-06-01
Cluster analysis plays a foremost role in identifying groups of genes that show similar behavior under a set of experimental conditions. Several clustering algorithms have been proposed for identifying gene behaviors and to understand their significance. The principal aim of this work is to develop an intelligent rough clustering technique, which will efficiently remove the irrelevant dimensions in a high-dimensional space and obtain appropriate meaningful clusters. This paper proposes a novel biclustering technique that is based on rough set theory. The proposed algorithm uses correlation coefficient as a similarity measure to simultaneously cluster both the rows and columns of a gene expression data matrix and mean squared residue to generate the initial biclusters. Furthermore, the biclusters are refined to form the lower and upper boundaries by determining the membership of the genes in the clusters using mean squared residue. The algorithm is illustrated with yeast gene expression data and the experiment proves the effectiveness of the method. The main advantage is that it overcomes the problem of selection of initial clusters and also the restriction of one object belonging to only one cluster by allowing overlapping of biclusters.
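The mean squared residue used above to score biclusters is usually defined, following Cheng and Church, as the average of (a_ij - rowmean_i - colmean_j + overallmean)^2 over the cells of the submatrix. The short Python sketch below assumes that standard definition; the function name and the toy matrices are illustrative and are not taken from the paper.

```python
import numpy as np

def mean_squared_residue(block):
    """Mean squared residue of a submatrix (rows = genes, columns = conditions)."""
    row_means = block.mean(axis=1, keepdims=True)
    col_means = block.mean(axis=0, keepdims=True)
    residue = block - row_means - col_means + block.mean()
    return float((residue ** 2).mean())

# A perfectly additive pattern (row effect + column effect) has zero residue.
row_effect = np.array([0.0, 1.0, 3.0])
col_effect = np.array([2.0, 5.0, 6.0, 9.0])
additive = row_effect[:, None] + col_effect[None, :]

# Perturbing a single cell breaks the pattern and raises the score.
noisy = additive.copy()
noisy[0, 0] += 2.0

print(mean_squared_residue(additive), mean_squared_residue(noisy))
```

A low score therefore indicates genes whose expression rises and falls together across the selected conditions, even when their absolute levels differ.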
Rafi, Jonas; Ivanova, Ekaterina; Rozental, Alexander; Carlbring, Per
2017-09-25
Despite being considered a public health problem, no prevention programme for problem gambling in workplace settings has been scientifically evaluated. This study aims to fill a critical gap in the field of problem gambling by implementing and evaluating a large-scale prevention programme in organisations. Ten organisations, with a total of n=549 managers and n=8572 employees, will be randomised to either receiving a prevention programme or to a waitlist control condition. Measurements will be collected at the baseline and 3, 12 and 24 months after intervention. The primary outcome of interest is the managers' inclination to act when worried or suspicious about an employee's problem gambling or other harmful use. Additional outcomes of interest include the Problem Gambling Severity Index and gambling habits in both managers and employees. Furthermore, qualitative analyses of the responses from semistructured interviews with managers will be performed. This study has been approved by the regional ethics board of Stockholm, Sweden, and it will contribute to the body of knowledge concerning prevention of problem gambling. The findings will be published in peer-reviewed, open-access journals. NCT02925286; Pre-results.
NASA Astrophysics Data System (ADS)
Vaverka, Jakub; Pellinen-Wannberg, Asta; Kero, Johan; Mann, Ingrid; De Spiegeleer, Alexandre; Hamrin, Maria; Norberg, Carol; Pitkänen, Timo
2017-04-01
Detection of hypervelocity dust impacts on a spacecraft body by electric field instruments has been reported by several missions, such as Voyager, WIND, Cassini, and STEREO. The mechanism of this detection is still not completely understood and is under intensive laboratory investigation. A commonly accepted theory is based on re-collection of plasma cloud particles generated by a hypervelocity dust impact by a spacecraft surface and an electric field antenna, resulting in a fast change in the potential of the spacecraft body and antenna. These changes can be detected as a short pulse measured by the electric field instrument. We present the first detection of dust impacts on the Earth-orbiting MMS and Cluster satellites. Each of the four MMS spacecraft provides probe-to-spacecraft potential measurements for its six electric field antennas. This gives a unique view of signals generated by dust impacts and allows their reliable identification, which is not possible, for example, on the Cluster spacecraft. We discuss various instrumental effects and solitary waves, commonly present in the Earth's magnetosphere, which can be easily misinterpreted as dust impacts. We show the influence of the local plasma environment on dust impact detection for satellites crossing various regions of the Earth's magnetosphere where the concentration and the temperature of plasma particles change significantly.
Sciberras, E; Mulraney, M; Heussler, H; Rinehart, N; Schuster, T; Gold, L; Hayes, N; Hiscock, H
2017-01-01
Introduction Up to 70% of children with attention-deficit/hyperactivity disorder (ADHD) experience sleep problems. We have demonstrated the efficacy of a brief behavioural intervention for children with ADHD in a large randomised controlled trial (RCT) and now aim to examine whether this intervention is effective in real-life clinical settings when delivered by paediatricians or psychologists. We will also assess the cost-effectiveness of the intervention. Methods and analysis Children aged 5–12 years with ADHD (n=320) are being recruited for this translational cluster RCT through paediatrician practices in Victoria and Queensland, Australia. Children are eligible if they meet criteria for ADHD, have a moderate/severe sleep problem and meet American Academy of Sleep Medicine criteria for either chronic insomnia disorder or delayed sleep–wake phase disorder; or are experiencing sleep-related anxiety. Clinicians are randomly allocated at the level of the paediatrician to either receive the sleep training or not. The behavioural intervention comprises 2 consultations covering sleep hygiene and standardised behavioural strategies. The primary outcome is change in the proportion of children with moderate/severe sleep problems from moderate/severe to no/mild by parent report at 3 months postintervention. Secondary outcomes include a range of child (eg, sleep severity, ADHD symptoms, quality of life, behaviour, working memory, executive functioning, learning, academic achievement) and primary caregiver (mental health, parenting, work attendance) measures. Analyses will address clustering at the level of the paediatrician using linear mixed effect models adjusting for potential a priori confounding variables. Ethics and dissemination Ethics approval has been granted. Findings will determine whether the benefits of an efficacy trial can be realised more broadly at the population level and will inform the development of clinical guidelines for managing sleep problems
Siriwardena, A Niroshan; Apekey, Tanefa; Tilling, Michelle; Harrison, Andrew; Dyas, Jane V; Middleton, Hugh C; Ørner, Roderick; Sach, Tracey; Dewey, Michael; Qureshi, Zubair M
2009-01-01
Background Sleep problems are common, affecting over a third of adults in the United Kingdom and leading to reduced productivity and impaired health-related quality of life. Many of those whose lives are affected seek medical help from primary care. Drug treatment is ineffective long term. Psychological methods for managing sleep problems, including cognitive behavioural therapy for insomnia (CBTi) have been shown to be effective and cost effective but have not been widely implemented or evaluated in a general practice setting where they are most likely to be needed and most appropriately delivered. This paper outlines the protocol for a pilot study designed to evaluate the effectiveness and cost-effectiveness of an educational intervention for general practitioners, primary care nurses and other members of the primary care team to deliver problem focused therapy to adult patients presenting with sleep problems due to lifestyle causes, pain or mild to moderate depression or anxiety. Methods and design This will be a pilot cluster randomised controlled trial of a complex intervention. General practices will be randomised to an educational intervention for problem focused therapy which includes a consultation approach comprising careful assessment (using assessment of secondary causes, sleep diaries and severity) and use of modified CBTi for insomnia in the consultation compared with usual care (general advice on sleep hygiene and pharmacotherapy with hypnotic drugs). Clinicians randomised to the intervention will receive an educational intervention (2 × 2 hours) to implement a complex intervention of problem focused therapy. Clinicians randomised to the control group will receive reinforcement of usual care with sleep hygiene advice. Outcomes will be assessed via self-completion questionnaires and telephone interviews of patients and staff as well as clinical records for interventions and prescribing. Discussion Previous studies in adults have shown that
NASA Astrophysics Data System (ADS)
Ikeda, K.
1982-08-01
The radius of convergence of the cluster series (expressing the equation of state) is discussed in connection with the distribution of zeros of the grand partition function on the complex z (= activity) plane, by giving various examples of circular distribution. Anomalous phase transitions and phase transitions of third order are considered by showing some examples of circular distribution of zeros. For the ideal Fermi-Dirac gas, the distribution function of zeros, lying on the part of the negative real axis from $-\lambda^{-3}$ to $-\infty$ [where $\lambda = h(2\pi mkT)^{-1/2}$], is calculated, and the function-theoretical structure of the equation of state is investigated. The distribution of zeros for this gas is compared with that for Tonks' gas (having purely repulsive interparticle forces). The two-dimensional and one-dimensional Fermi-Dirac gases are dealt with from the point of view of the distribution of zeros.
A practical scheme of the sigma-point Kalman filter for high-dimensional systems
NASA Astrophysics Data System (ADS)
Tang, Youmin; Deng, Ziwang; Manoj, K. K.; Chen, Dake
2014-03-01
When applying a sigma-point Kalman filter (SPKF) to a high-dimensional system such as the oceanic general circulation model (OGCM), a major challenge is to reduce its heavy burden of storage memory and costly computation. In this study, we propose a new scheme for SPKF to address these issues. First, a reduced rank SPKF was introduced on the high-dimensional model state space using the truncated singular value decomposition (TSVD) method (T-SPKF). Second, the relationship of SVDs between the model state space and a low-dimensional ensemble space is used to construct sigma points on the ensemble space (ET-SPKF). As such, this new scheme greatly reduces the demand of memory storage and computational cost and makes the SPKF method applicable to high-dimensional systems. Two numerical models are used to test and validate the ET-SPKF algorithm. The first model is the 40-variable Lorenz model, which has been a test bed of new assimilation algorithms. The second model is a realistic OGCM for the assimilation of actual observations, including Argo and in situ observations over the Pacific Ocean. The experiments show that ET-SPKF is computationally feasible for high-dimensional systems and capable of precise analyses. In particular, for realistic oceanic assimilations, the ET-SPKF algorithm can significantly improve oceanic analysis and improve ENSO prediction. A comparison between the ET-SPKF algorithm and EnKF (ensemble Kalman filter) is also tribally conducted using the OGCM and actual observations.
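The reduced-rank idea in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' T-SPKF or ET-SPKF code. The function names `truncated_svd` and `svd_via_ensemble_space`, the matrix sizes, and the rank `r` are all assumptions chosen for illustration. The second function uses one standard relation between the SVD of a tall anomaly matrix and the eigendecomposition of its small Gram matrix; whether this is the exact relation used in the paper is not stated in the abstract.

```python
import numpy as np

def truncated_svd(anomalies, r):
    """Leading r singular triplets of an (n_state, n_members) matrix."""
    u, s, vt = np.linalg.svd(anomalies, full_matrices=False)
    return u[:, :r], s[:r], vt[:r, :]

def svd_via_ensemble_space(anomalies, r):
    """Same triplets, obtained from the small (n_members, n_members) Gram matrix."""
    gram = anomalies.T @ anomalies
    evals, evecs = np.linalg.eigh(gram)        # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:r]        # indices of the r largest
    s = np.sqrt(np.clip(evals[order], 0.0, None))
    v = evecs[:, order]
    u = anomalies @ v / s                      # map back to the state space
    return u, s, v.T

# Hypothetical test data: anomalies of exact rank r (illustrative sizes only).
rng = np.random.default_rng(0)
n_state, n_members, r = 200, 20, 5
anomalies = rng.standard_normal((n_state, r)) @ rng.standard_normal((r, n_members))

u_direct, s_direct, vt_direct = truncated_svd(anomalies, r)
u_small, s_small, vt_small = svd_via_ensemble_space(anomalies, r)

approx = (u_small * s_small) @ vt_small
rel_err = np.linalg.norm(anomalies - approx) / np.linalg.norm(anomalies)
```

In this sketch the eigenproblem is solved on a 20 x 20 matrix rather than on the 200-dimensional state space, which is the kind of saving the abstract attributes to working in the ensemble space.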
A latent factor linear mixed model for high-dimensional longitudinal data analysis.
An, Xinming; Yang, Qing; Bentler, Peter M
2013-10-30
High-dimensional longitudinal data involving latent variables such as depression and anxiety that cannot be quantified directly are often encountered in biomedical and social sciences. Multiple responses are used to characterize these latent quantities, and repeated measures are collected to capture their trends over time. Furthermore, substantive research questions may concern issues such as interrelated trends among latent variables that can only be addressed by modeling them jointly. Although statistical analysis of univariate longitudinal data has been well developed, methods for modeling multivariate high-dimensional longitudinal data are still under development. In this paper, we propose a latent factor linear mixed model (LFLMM) for analyzing this type of data. This model is a combination of the factor analysis and multivariate linear mixed models. Under this modeling framework, we reduced the high-dimensional responses to low-dimensional latent factors by the factor analysis model, and then we used the multivariate linear mixed model to study the longitudinal trends of these latent factors. We developed an expectation-maximization algorithm to estimate the model. We used simulation studies to investigate the computational properties of the expectation-maximization algorithm and compare the LFLMM model with other approaches for high-dimensional longitudinal data analysis. We used a real data example to illustrate the practical usefulness of the model. Copyright © 2013 John Wiley & Sons, Ltd.
NASA Astrophysics Data System (ADS)
Ding, Yunhong; Bacco, Davide; Dalgaard, Kjeld; Cai, Xinlun; Zhou, Xiaoqi; Rottwitt, Karsten; Oxenløwe, Leif Katsuo
2017-06-01
Quantum key distribution provides an efficient means to exchange information in an unconditionally secure way. Historically, quantum key distribution protocols have been based on binary signal formats, such as two polarization states, and the transmitted information efficiency of the quantum key is intrinsically limited to 1 bit/photon. Here we propose and experimentally demonstrate, for the first time, a high-dimensional quantum key distribution protocol based on space division multiplexing in multicore fiber using silicon photonic integrated lightwave circuits. We successfully realized three mutually unbiased bases in a four-dimensional Hilbert space, and achieved low and stable quantum bit error rate well below both the coherent attack and individual attack limits. Compared to previous demonstrations, the use of a multicore fiber in our protocol provides a much more efficient way to create high-dimensional quantum states, and enables breaking the information efficiency limit of traditional quantum key distribution protocols. In addition, the silicon photonic circuits used in our work integrate variable optical attenuators, highly efficient multicore fiber couplers, and Mach-Zehnder interferometers, enabling the manipulation of high-dimensional quantum states in a compact and stable manner. Our demonstration paves the way to utilize state-of-the-art multicore fibers for noise-tolerant high-dimensional quantum key distribution, and boost silicon photonics for high information efficiency quantum communications.
High-Dimensional Explanatory Random Item Effects Models for Rater-Mediated Assessments
ERIC Educational Resources Information Center
Kelcey, Ben; Wang, Shanshan; Cox, Kyle
2016-01-01
Valid and reliable measurement of unobserved latent variables is essential to understanding and improving education. A common and persistent approach to assessing latent constructs in education is the use of rater inferential judgment. The purpose of this study is to develop high-dimensional explanatory random item effects models designed for…
High-Dimensional Exploratory Item Factor Analysis by a Metropolis-Hastings Robbins-Monro Algorithm
ERIC Educational Resources Information Center
Cai, Li
2010-01-01
A Metropolis-Hastings Robbins-Monro (MH-RM) algorithm for high-dimensional maximum marginal likelihood exploratory item factor analysis is proposed. The sequence of estimates from the MH-RM algorithm converges with probability one to the maximum likelihood solution. Details on the computer implementation of this algorithm are provided. The…
Cui, Yunduan; Matsubara, Takamitsu; Sugimoto, Kenji
2017-06-29
We propose a new value function approach for model-free reinforcement learning in Markov decision processes involving high dimensional states that addresses the issues of brittleness and intractable computational complexity, thereby rendering value function based reinforcement learning algorithms applicable to high dimensional systems. Our new algorithm, Kernel Dynamic Policy Programming (KDPP), smoothly updates the value function in accordance with the Kullback-Leibler divergence between current and updated policies. Stabilizing the learning in this manner enables the application of the kernel trick to value function approximation, which greatly reduces computational requirements for learning in high dimensional state spaces. The performance of KDPP against other kernel trick based value function approaches is first investigated in a simulated n DOF manipulator reaching task, where only KDPP efficiently learned a viable policy at n=40. As an application to a real world high dimensional robot system, KDPP successfully learned the task of unscrewing a bottle cap via a Pneumatic Artificial Muscle (PAM) driven robotic hand with tactile sensors; a system with a state space of 32 dimensions, while given limited samples and with ordinary computing resources. Copyright © 2017 Elsevier Ltd. All rights reserved.
Lalonde, Lyne; Quintana-Bárcena, Patricia; Lord, Anne; Bell, Robert; Clément, Valérie; Daigneault, Anne-Marie; Legris, Marie-Ève; Letendre, Sara; Mouchbahani, Marie; Jouini, Ghaya; Azar, Joëlle; Martin, Élisabeth; Berbiche, Djamal; Beaulieu, Stephanie; Beaunoyer, Sébastien; Bertin, Émilie; Bouvrette, Marianne; Charbonneau-Séguin, Noémie; Desrochers, Jean-François; Desforges, Katherine; Dumoulin-Charette, Ariane; Dupuis, Sébastien; El Bouchikhi, Maryame; Forget, Roxanne; Guay, Marianne; Lemieux, Jean-Phillippe; Morin-Bélanger, Claudia; Noël, Isabelle; Ricard, Stephanie; Sauvé, Patricia; Ste-Marie Paradis, François
2017-09-01
Appropriate training for community pharmacists may improve the quality of medication use. Few studies have reported the impact of such programs on medication management for patients with chronic kidney disease (CKD). Multicenter, cluster-randomized, controlled trial. Patients with CKD stage 3a, 3b, or 4 from 6 CKD clinics (Quebec, Canada) and their community pharmacies. Each cluster (a pharmacy and its patients) was randomly assigned to either ProFiL, a training-and-communication network program, or the control group. ProFiL pharmacists completed a 90-minute interactive web-based training program on use of medications in CKD and received a clinical guide, patients' clinical summaries, and facilitated access to the CKD clinic. Drug-related problems (primary outcome), pharmacists' knowledge and clinical skills, and patients' clinical attributes (eg, blood pressure and glycated hemoglobin concentration). Drug-related problems were evaluated the year before and after the recruitment of patients using a validated set of significant drug-related problems, the Pharmacotherapy Assessment in Chronic Renal Disease (PAIR) criteria. Pharmacists' questionnaires were completed at baseline and after 1 year. Clinical attributes were documented at baseline and after 1 year using available information in medical charts. 207 community pharmacies, 494 pharmacists, and 442 patients with CKD participated. After 1 year, the mean number of drug-related problems per patient decreased from 2.16 to 1.60 and from 1.70 to 1.62 in the ProFiL and control groups, respectively. The difference in reduction of drug-related problems per patient between the ProFiL and control groups was -0.32 (95% CI, -0.63 to -0.01). Improvements in knowledge (difference, 4.5%; 95% CI, 1.6%-7.4%) and clinical competencies (difference, 7.4%; 95% CI, 3.5%-11.3%) were observed among ProFiL pharmacists. No significant differences in clinical attributes were observed across the groups. High proportion of missing data
Jiang, Xia; Neapolitan, Richard E.
2012-01-01
Background The interaction between loci to affect phenotype is called epistasis. It is strict epistasis if no proper subset of the interacting loci exhibits a marginal effect. For many diseases, it is likely that unknown epistatic interactions affect disease susceptibility. A difficulty when mining epistatic interactions from high-dimensional datasets concerns the curse of dimensionality. There are too many combinations of SNPs to perform an exhaustive search. A method that could locate strict epistasis without an exhaustive search can be considered the brass ring of methods for analyzing high-dimensional datasets. Methodology/Findings A SNP pattern is a Bayesian network representing SNP-disease relationships. The Bayesian score for a SNP pattern is the probability of the data given the pattern, and has been used to learn SNP patterns. We identified a bound for the score of a SNP pattern. The bound provides an upper limit on the Bayesian score of any pattern that could be obtained by expanding a given pattern. We felt that the bound might enable the data to say something about the promise of expanding a 1-SNP pattern even when there are no marginal effects. We tested the bound using simulated datasets and semi-synthetic high-dimensional datasets obtained from GWAS datasets. We found that the bound was able to dramatically reduce the search time for strict epistasis. Using an Alzheimer's dataset, we showed that it is possible to discover an interaction involving the APOE gene based on its score because of its large marginal effect, but that the bound is most effective at discovering interactions without marginal effects. Conclusions/Significance We conclude that the bound appears to ameliorate the curse of dimensionality in high-dimensional datasets. This is a very consequential result and could be pivotal in our efforts to reveal the dark matter of genetic disease risk from high-dimensional datasets. PMID:23071633
Kandrup, H.E.; Morrison, P.J. (Inst. for Fusion Studies)
1992-11-01
The Hamiltonian formulation of the Vlasov-Einstein system, which is appropriate for collisionless, self-gravitating systems like clusters of stars that are so dense that gravity must be described by the Einstein equation, is presented. In particular, it is demonstrated explicitly in the context of a 3 + 1 splitting that, for spherically symmetric configurations, the Vlasov-Einstein system can be viewed as a Hamiltonian system, where the dynamics is generated by a noncanonical Poisson bracket, with the Hamiltonian generating the evolution of the distribution function f (a noncanonical variable) being the conserved ADM mass-energy $H_{\mathrm{ADM}}$. An explicit expression is derived for the energy $\delta^{(2)}H_{\mathrm{ADM}}$ associated with an arbitrary phase space preserving perturbation of an arbitrary spherical equilibrium, and it is shown that the equilibrium must be linearly stable if $\delta^{(2)}H_{\mathrm{ADM}}$ is positive semi-definite. Insight into the Hamiltonian reformulation is provided by a description of general finite degree of freedom systems.
Sun, Hokeun; Wang, Shuang
2013-05-30
The matched case-control designs are commonly used to control for potential confounding factors in genetic epidemiology studies especially epigenetic studies with DNA methylation. Compared with unmatched case-control studies with high-dimensional genomic or epigenetic data, there have been few variable selection methods for matched sets. In an earlier paper, we proposed the penalized logistic regression model for the analysis of unmatched DNA methylation data using a network-based penalty. However, for popularly applied matched designs in epigenetic studies that compare DNA methylation between tumor and adjacent non-tumor tissues or between pre-treatment and post-treatment conditions, applying ordinary logistic regression ignoring matching is known to bring serious bias in estimation. In this paper, we developed a penalized conditional logistic model using the network-based penalty that encourages a grouping effect of (1) linked Cytosine-phosphate-Guanine (CpG) sites within a gene or (2) linked genes within a genetic pathway for analysis of matched DNA methylation data. In our simulation studies, we demonstrated the superiority of using conditional logistic model over unconditional logistic model in high-dimensional variable selection problems for matched case-control data. We further investigated the benefits of utilizing biological group or graph information for matched case-control data. We applied the proposed method to a genome-wide DNA methylation study on hepatocellular carcinoma (HCC) where we investigated the DNA methylation levels of tumor and adjacent non-tumor tissues from HCC patients by using the Illumina Infinium HumanMethylation27 Beadchip. Several new CpG sites and genes known to be related to HCC were identified but were missed by the standard method in the original paper. Copyright © 2012 John Wiley & Sons, Ltd.
Xue, Hongqi; Wu, Yichao; Wu, Hulin
2013-01-01
In many regression problems, the relations between the covariates and the response may be nonlinear. Motivated by the application of reconstructing a gene regulatory network, we consider a sparse high-dimensional additive model with the additive components being some known nonlinear functions with unknown parameters. To identify the subset of important covariates, we propose a new method for simultaneous variable selection and parameter estimation by iteratively combining a large-scale variable screening (the nonlinear independence screening, NLIS) and a moderate-scale model selection (the nonnegative garrote, NNG) for the nonlinear additive regressions. We have shown that the NLIS procedure possesses the sure screening property and it is able to handle problems with non-polynomial dimensionality; and for finite dimension problems, the NNG for the nonlinear additive regressions has selection consistency for the unimportant covariates and also estimation consistency for the parameter estimates of the important covariates. The proposed method is applied to simulated data and a real data example for identifying gene regulations to illustrate its numerical performance. PMID:25170239
van der Leij, Aryan; Blok, Henk; de Jong, Peter F.
2010-01-01
This study investigated the role of speech perception accuracy and speed in fluent word decoding of reading disabled (RD) children. A same-different phoneme discrimination task with natural speech tested the perception of single consonants and consonant clusters by young but persistent RD children. RD children were slower than chronological age (CA) controls in recognizing identical sounds, suggesting less distinct phonemic categories. In addition, after controlling for phonetic similarity Tallal’s (Brain Lang 9:182–198, 1980) fast transitions account of RD children’s speech perception problems was contrasted with Studdert-Kennedy’s (Read Writ Interdiscip J 15:5–14, 2002) similarity explanation. Results showed no specific RD deficit in perceiving fast transitions. Both phonetic similarity and fast transitions influenced accurate speech perception for RD children as well as CA controls. PMID:20652455
NASA Astrophysics Data System (ADS)
Denis, Pablo A.
2014-04-01
By means of coupled cluster theory and correlation consistent basis sets we investigated the thermochemistry of dimethyl sulphide (DMS), dimethyl disulphide (DMDS) and four closely related sulphur-containing molecules: CH3SS, CH3S, CH3SH and CH3CH2SH. For the four closed-shell molecules studied, their enthalpies of formation (EOFs) were derived using bomb calorimetry. We found that the deviation of the EOF with respect to experiment was 0.96, 0.65, 1.24 and 1.29 kcal/mol, for CH3SH, CH3CH2SH, DMS and DMDS, respectively, when ΔHf,0 = 65.6 kcal/mol was utilised (JANAF value). However, if the recently proposed ΔHf,0 = 66.2 kcal/mol was used to estimate EOF, the errors dropped to 0.36, 0.05, 0.64 and 0.09 kcal/mol, respectively. In contrast, for the CH3SS radical, a better agreement with experiment was obtained if the 65.6 kcal/mol value was used. To compare with experiment avoiding the problem of the ΔHf,0 (S), we determined the CH3-S and CH3-SS bond dissociation energies (BDEs) in CH3S and CH3SS. At the coupled cluster with singles doubles and perturbative triples correction level of theory, these values are 48.0 and 71.4 kcal/mol, respectively. The latter BDEs are 1.5 and 1.2 kcal/mol larger than the experimental values. The agreement can be considered to be acceptable if we take into consideration that these two radicals present important challenges when determining their EOFs. It is our hope that this work stimulates new studies which help elucidate the problem of the EOF of atomic sulphur.
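The two sets of deviations quoted in the abstract above can be cross-checked with simple arithmetic. The short Python sketch below assumes that each computed enthalpy of formation shifts by the change in the atomic sulphur value (66.2 - 65.6 = 0.6 kcal/mol) once per sulphur atom in the molecule, and that the shift reduces the deviation. That linear dependence is an assumption made here for illustration and is not stated in the abstract. The dictionary names are hypothetical; the numbers are those given in the abstract.

```python
# Number of sulphur atoms per molecule (DMS = CH3SCH3, DMDS = CH3SSCH3).
n_sulphur = {"CH3SH": 1, "CH3CH2SH": 1, "DMS": 1, "DMDS": 2}

# Deviations from experiment in kcal/mol, as quoted in the abstract.
err_janaf = {"CH3SH": 0.96, "CH3CH2SH": 0.65, "DMS": 1.24, "DMDS": 1.29}  # with 65.6
err_new = {"CH3SH": 0.36, "CH3CH2SH": 0.05, "DMS": 0.64, "DMDS": 0.09}    # with 66.2

# Assumed shift per sulphur atom when the atomic value changes.
shift_per_s = 66.2 - 65.6

predicted = {
    name: err_janaf[name] - n_sulphur[name] * shift_per_s
    for name in n_sulphur
}

for name in n_sulphur:
    print(f"{name}: predicted {predicted[name]:.2f}, quoted {err_new[name]:.2f}")
```

Under this assumption the predicted deviations match the quoted ones for all four closed-shell molecules, including DMDS, where the shift applies twice.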
Elfwing, Stefan; Uchibe, Eiji; Doya, Kenji
2013-01-01
Free-energy based reinforcement learning (FERL) was proposed for learning in high-dimensional state and action spaces, which cannot be handled by standard function approximation methods. In this study, we propose a scaled version of free-energy based reinforcement learning to achieve more robust and more efficient learning performance. The action-value function is approximated by the negative free-energy of a restricted Boltzmann machine, divided by a constant scaling factor that is related to the size of the Boltzmann machine (the square root of the number of state nodes in this study). Our first task is a digit floor gridworld task, where the states are represented by images of handwritten digits from the MNIST data set. The purpose of the task is to investigate the proposed method's ability, through the extraction of task-relevant features in the hidden layer, to cluster images of the same digit and to cluster images of different digits that correspond to states with the same optimal action. We also test the method's robustness with respect to different exploration schedules, i.e., different settings of the initial temperature and the temperature discount rate in softmax action selection. Our second task is a robot visual navigation task, where the robot can learn its position by the different colors of the lower part of four landmarks and it can infer the correct corner goal area by the color of the upper part of the landmarks. The state space consists of binarized camera images with, at most, nine different colors, which is equal to 6642 binary states. For both tasks, the learning performance is compared with standard FERL and with function approximation where the action-value function is approximated by a two-layered feedforward neural network. PMID:23450126
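A minimal Python sketch of the scaled action-value described in the abstract above may help make the idea concrete. It is not the authors' implementation. It assumes a restricted Boltzmann machine with binary hidden units, one-hot action nodes, and no visible biases (a simplification), so the free energy reduces to a negative sum of softplus terms. The function names, layer sizes, and random weights are hypothetical. The scaling by the square root of the number of state nodes and the temperature-based softmax selection follow the abstract.

```python
import numpy as np

def scaled_q_values(state, weights_s, weights_a, hidden_bias):
    """Q(s, a) = -F(s, a) / sqrt(number of state nodes) for every action a."""
    n_actions = weights_a.shape[1]
    scale = np.sqrt(state.size)
    q = np.empty(n_actions)
    for a in range(n_actions):
        # A one-hot action vector selects column a of the action weights.
        pre_activation = weights_s @ state + weights_a[:, a] + hidden_bias
        # Free energy with binary hidden units summed out (visible biases omitted).
        free_energy = -np.sum(np.logaddexp(0.0, pre_activation))
        q[a] = -free_energy / scale
    return q

def softmax_policy(q, temperature):
    """Softmax action selection; lower temperature gives greedier behaviour."""
    z = (q - q.max()) / temperature   # subtract the maximum for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical sizes and weights, for illustration only.
rng = np.random.default_rng(1)
n_state, n_hidden, n_actions = 16, 8, 4
state = rng.integers(0, 2, size=n_state).astype(float)
weights_s = 0.1 * rng.standard_normal((n_hidden, n_state))
weights_a = 0.1 * rng.standard_normal((n_hidden, n_actions))
hidden_bias = np.zeros(n_hidden)

q = scaled_q_values(state, weights_s, weights_a, hidden_bias)
probs_hot = softmax_policy(q, temperature=1.0)
probs_cold = softmax_policy(q, temperature=0.1)
```

An exploration schedule of the kind mentioned in the abstract would start from an initial temperature and lower it over time, moving the policy from `probs_hot`-like behaviour towards `probs_cold`-like behaviour.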
Data clustering and visualization via energy minimization
NASA Astrophysics Data System (ADS)
Andrecut, M.
2011-09-01
We discuss a stochastic method for configurational energy minimization, with applications to high-dimensional data clustering and visualization. Also, we demonstrate numerically the ability of the method to capture meaningful biological information from cancer-related microarray data, and to differentiate between different leukemia cancer subtypes.
Distribution of high-dimensional entanglement via an intra-city free-space link.
Steinlechner, Fabian; Ecker, Sebastian; Fink, Matthias; Liu, Bo; Bavaresco, Jessica; Huber, Marcus; Scheidl, Thomas; Ursin, Rupert
2017-07-24
Quantum entanglement is a fundamental resource in quantum information processing and its distribution between distant parties is a key challenge in quantum communications. Increasing the dimensionality of entanglement has been shown to improve robustness and channel capacities in secure quantum communications. Here we report on the distribution of genuine high-dimensional entanglement via a 1.2-km-long free-space link across Vienna. We exploit hyperentanglement, that is, simultaneous entanglement in polarization and energy-time bases, to encode quantum information, and observe high-visibility interference for successive correlation measurements in each degree of freedom. These visibilities impose lower bounds on entanglement in each subspace individually and certify four-dimensional entanglement for the hyperentangled system. The high-fidelity transmission of high-dimensional entanglement under real-world atmospheric link conditions represents an important step towards long-distance quantum communications with more complex quantum systems and the implementation of advanced quantum experiments with satellite links.
Luan, Xiaoli; Chen, Qiang; Liu, Fei
2014-09-01
This article presents a new scheme to design full matrix controller for high dimensional multivariable processes based on equivalent transfer function (ETF). Differing from existing ETF method, the proposed ETF is derived directly by exploiting the relationship between the equivalent closed-loop transfer function and the inverse of open-loop transfer function. Based on the obtained ETF, the full matrix controller is designed utilizing the existing PI tuning rules. The new proposed ETF model can more accurately represent the original processes. Furthermore, the full matrix centralized controller design method proposed in this paper is applicable to high dimensional multivariable systems with satisfactory performance. Comparison with other multivariable controllers shows that the designed ETF based controller is superior with respect to design-complexity and obtained performance.
A Shell Multi-dimensional Hierarchical Cubing Approach for High-Dimensional Cube
NASA Astrophysics Data System (ADS)
Zou, Shuzhi; Zhao, Li; Hu, Kongfa
The pre-computation of data cubes is critical for improving the response time of OLAP systems and accelerating data mining tasks in large data warehouses. However, as the sizes of data warehouses grow, the time it takes to perform this pre-computation becomes a significant performance bottleneck. In a high dimensional data warehouse, it might not be practical to build all these cuboids and their indices. In this paper, we propose a shell multi-dimensional hierarchical cubing algorithm, based on an extension of the previous minimal cubing approach. This method partitions the high dimensional data cube into low multi-dimensional hierarchical cubes. Experimental results show that the proposed method is significantly more efficient than other existing cubing methods.
Distribution of high-dimensional entanglement via an intra-city free-space link
NASA Astrophysics Data System (ADS)
Steinlechner, Fabian; Ecker, Sebastian; Fink, Matthias; Liu, Bo; Bavaresco, Jessica; Huber, Marcus; Scheidl, Thomas; Ursin, Rupert
2017-07-01
Quantum entanglement is a fundamental resource in quantum information processing and its distribution between distant parties is a key challenge in quantum communications. Increasing the dimensionality of entanglement has been shown to improve robustness and channel capacities in secure quantum communications. Here we report on the distribution of genuine high-dimensional entanglement via a 1.2-km-long free-space link across Vienna. We exploit hyperentanglement, that is, simultaneous entanglement in polarization and energy-time bases, to encode quantum information, and observe high-visibility interference for successive correlation measurements in each degree of freedom. These visibilities impose lower bounds on entanglement in each subspace individually and certify four-dimensional entanglement for the hyperentangled system. The high-fidelity transmission of high-dimensional entanglement under real-world atmospheric link conditions represents an important step towards long-distance quantum communications with more complex quantum systems and the implementation of advanced quantum experiments with satellite links.
Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data.
Deng, Yi; Chang, Changgee; Ido, Moges Seyoum; Long, Qi
2016-02-12
Multiple imputation (MI) has been widely used for handling missing data in biomedical research. In the presence of high-dimensional data, regularized regression has been used as a natural strategy for building imputation models, but limited research has been conducted for handling general missing data patterns where multiple variables have missing values. Using the idea of multiple imputation by chained equations (MICE), we investigate two approaches of using regularized regression to impute missing values of high-dimensional data that can handle general missing data patterns. We compare our MICE methods with several existing imputation methods in simulation studies. Our simulation results demonstrate the superiority of the proposed MICE approach based on an indirect use of regularized regression in terms of bias. We further illustrate the proposed methods using two data examples.
Multiple Imputation for General Missing Data Patterns in the Presence of High-dimensional Data
NASA Astrophysics Data System (ADS)
Deng, Yi; Chang, Changgee; Ido, Moges Seyoum; Long, Qi
2016-02-01
Multiple imputation (MI) has been widely used for handling missing data in biomedical research. In the presence of high-dimensional data, regularized regression has been used as a natural strategy for building imputation models, but limited research has been conducted for handling general missing data patterns where multiple variables have missing values. Using the idea of multiple imputation by chained equations (MICE), we investigate two approaches of using regularized regression to impute missing values of high-dimensional data that can handle general missing data patterns. We compare our MICE methods with several existing imputation methods in simulation studies. Our simulation results demonstrate the superiority of the proposed MICE approach based on an indirect use of regularized regression in terms of bias. We further illustrate the proposed methods using two data examples.
Su, Yapeng; Shi, Qihui; Wei, Wei
2017-02-01
New insights on cellular heterogeneity in the last decade provoke the development of a variety of single cell omics tools at a lightning pace. The resultant high-dimensional single cell data generated by these tools require new theoretical approaches and analytical algorithms for effective visualization and interpretation. In this review, we briefly survey the state-of-the-art single cell proteomic tools with a particular focus on data acquisition and quantification, followed by an elaboration of a number of statistical and computational approaches developed to date for dissecting the high-dimensional single cell data. The underlying assumptions, unique features, and limitations of the analytical methods with the designated biological questions they seek to answer will be discussed. Particular attention will be given to those information theoretical approaches that are anchored in a set of first principles of physics and can yield detailed (and often surprising) predictions.
Efficient uncertainty quantification methodologies for high-dimensional climate land models
Sargsyan, Khachik; Safta, Cosmin; Berry, Robert Dan; Ray, Jaideep; Debusschere, Bert J.; Najm, Habib N.
2011-11-01
In this report, we proposed, examined and implemented approaches for performing efficient uncertainty quantification (UQ) in climate land models. Specifically, we applied a Bayesian compressive sensing framework to polynomial chaos spectral expansions, enhanced it with an iterative algorithm of basis reduction, and investigated the results on test models as well as on the community land model (CLM). Furthermore, we discussed construction of efficient quadrature rules for forward propagation of uncertainties from high-dimensional, constrained input space to output quantities of interest. The work lays the groundwork for efficient forward UQ for high-dimensional, strongly non-linear and computationally costly climate models. Moreover, to investigate parameter inference approaches, we have applied two variants of the Markov chain Monte Carlo (MCMC) method to a soil moisture dynamics submodel of the CLM. The evaluation of these algorithms gave us a good foundation for further building out the Bayesian calibration framework towards the goal of robust component-wise calibration.
Amniotic fluid: the use of high-dimensional biology to understand fetal well-being.
Kamath-Rayne, Beena D; Smith, Heather C; Muglia, Louis J; Morrow, Ardythe L
2014-01-01
Our aim was to review the use of high-dimensional biology techniques, specifically transcriptomics, proteomics, and metabolomics, in amniotic fluid to elucidate the mechanisms behind preterm birth or assessment of fetal development. We performed a comprehensive MEDLINE literature search on the use of transcriptomic, proteomic, and metabolomic technologies for amniotic fluid analysis. All abstracts were reviewed for pertinence to preterm birth or fetal maturation in human subjects. Nineteen articles qualified for inclusion. Most articles described the discovery of biomarker candidates, but few larger, multicenter replication or validation studies have been done. We conclude that the use of high-dimensional systems biology techniques to analyze amniotic fluid has significant potential to elucidate the mechanisms of preterm birth and fetal maturation. However, further multicenter collaborative efforts are needed to replicate and validate candidate biomarkers before they can become useful tools for clinical practice. Ideally, amniotic fluid biomarkers should be translated to a noninvasive test performed in maternal serum or urine.
Multivariate Boosting for Integrative Analysis of High-Dimensional Cancer Genomic Data
Xiong, Lie; Kuan, Pei-Fen; Tian, Jianan; Keles, Sunduz; Wang, Sijian
2015-01-01
In this paper, we propose a novel multivariate component-wise boosting method for fitting multivariate response regression models under the high-dimension, low sample size setting. Our method is motivated by modeling the association among different biological molecules based on multiple types of high-dimensional genomic data. In particular, we are interested in two applications: studying the influence of DNA copy number alterations on RNA transcript levels and investigating the association between DNA methylation and gene expression. For this purpose, we model the dependence of the RNA expression levels on DNA copy number alterations and the dependence of gene expression on DNA methylation through multivariate regression models, and utilize a boosting-type method to handle the high dimensionality as well as to model the possible nonlinear associations. The performance of the proposed method is demonstrated through simulation studies. Finally, our multivariate boosting method is applied to two breast cancer studies. PMID:26609213
On landmark selection and sampling in high-dimensional data analysis
Belabbas, Mohamed-Ali; Wolfe, Patrick J.
2009-01-01
In recent years, the spectral analysis of appropriately defined kernel matrices has emerged as a principled way to extract the low-dimensional structure often prevalent in high-dimensional data. Here, we provide an introduction to spectral methods for linear and nonlinear dimension reduction, emphasizing ways to overcome the computational limitations currently faced by practitioners with massive datasets. In particular, a data subsampling or landmark selection process is often employed to construct a kernel based on partial information, followed by an approximate spectral analysis termed the Nyström extension. We provide a quantitative framework to analyse this procedure, and use it to demonstrate algorithmic performance bounds on a range of practical approaches designed to optimize the landmark selection process. We compare the practical implications of these bounds by way of real-world examples drawn from the field of computer vision, whereby low-dimensional manifold structure is shown to emerge from high-dimensional video data streams. PMID:19805446
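The landmark idea in the abstract above can be sketched in a few lines. The sketch below approximates a full kernel matrix from two landmark points using the standard Nyström form K ≈ C W^{-1} C^T, where W holds kernel values among the landmarks and C holds kernel values between every point and the landmarks. The function names, the linear kernel, and the fixed choice of exactly two landmarks are assumptions made for brevity; the landmark selection strategies analysed in the paper are not reproduced here.

```python
def linear_kernel(a, b):
    """Plain dot product; stands in for whatever kernel is actually used."""
    return sum(x * y for x, y in zip(a, b))


def nystrom_two_landmarks(points, landmarks, kernel):
    """Approximate the full kernel matrix from two landmark points.

    Uses K ~= C * W^{-1} * C^T with a hand-written 2x2 inverse, so this
    illustrative version only supports exactly two landmarks.
    """
    i, j = landmarks
    # W: kernel values among the landmarks (symmetric 2x2 matrix).
    w11 = kernel(points[i], points[i])
    w12 = kernel(points[i], points[j])
    w22 = kernel(points[j], points[j])
    det = w11 * w22 - w12 * w12
    if det == 0:
        raise ValueError("landmark kernel matrix is singular")
    w_inv = [[w22 / det, -w12 / det], [-w12 / det, w11 / det]]
    # C: kernel values between every point and the two landmarks.
    c = [[kernel(p, points[i]), kernel(p, points[j])] for p in points]
    n = len(points)
    return [
        [
            sum(c[r][a] * w_inv[a][b] * c[s][b] for a in range(2) for b in range(2))
            for s in range(n)
        ]
        for r in range(n)
    ]
```

With a linear kernel on two-dimensional points the true kernel matrix has rank two, so two linearly independent landmarks recover it exactly; for higher-rank kernels the result is only an approximation whose quality depends on which landmarks are chosen.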
Compressively Characterizing High-Dimensional Entangled States with Complementary, Random Filtering
NASA Astrophysics Data System (ADS)
Howland, Gregory A.; Knarr, Samuel H.; Schneeloch, James; Lum, Daniel J.; Howell, John C.
2016-04-01
The resources needed to conventionally characterize a quantum system are overwhelmingly large for high-dimensional systems. This obstacle may be overcome by abandoning traditional cornerstones of quantum measurement, such as general quantum states, strong projective measurement, and assumption-free characterization. Following this reasoning, we demonstrate an efficient technique for characterizing high-dimensional, spatial entanglement with one set of measurements. We recover sharp distributions with local, random filtering of the same ensemble in momentum followed by position—something the uncertainty principle forbids for projective measurements. Exploiting the expectation that entangled signals are highly correlated, we use fewer than 5000 measurements to characterize a 65,536-dimensional state. Finally, we use entropic inequalities to witness entanglement without a density matrix. Our method represents the sea change unfolding in quantum measurement, where methods influenced by the information theory and signal-processing communities replace unscalable, brute-force techniques—a progression previously followed by classical sensing.
Humoral fingerprinting of immune responses: “super-resolution”, high-dimensional serology
Lau, William W.; Tsang, John S.
2016-01-01
In a recent study, Chung et al. report the development of a high-dimensional approach to assess humoral responses to immune perturbation that goes beyond antibody neutralization and titers. This approach enables the identification of potentially novel correlates and mechanisms of protective immunity to HIV vaccination, thus offering a glimpse of how dense phenotyping of serological responses coupled with bioinformatics analysis could lead to much-sought-after markers of protective vaccination responses. PMID:26830541
Prediction of high-dimensional states subject to respiratory motion: a manifold learning approach.
Liu, Wenyang; Sawant, Amit; Ruan, Dan
2016-07-07
The development of high-dimensional imaging systems in image-guided radiotherapy provides important pathways to the ultimate goal of real-time full volumetric motion monitoring. Effective motion management during radiation treatment usually requires prediction to account for system latency and extra signal/image processing time. It is challenging to predict high-dimensional respiratory motion due to the complexity of the motion pattern combined with the curse of dimensionality. Linear dimension reduction methods such as PCA have been used to construct a linear subspace from the high-dimensional data, followed by efficient predictions on the lower-dimensional subspace. In this study, we extend such rationale to a more general manifold and propose a framework for high-dimensional motion prediction with manifold learning, which allows one to learn more descriptive features compared to linear methods with comparable dimensions. Specifically, a kernel PCA is used to construct a proper low-dimensional feature manifold, where accurate and efficient prediction can be performed. A fixed-point iterative pre-image estimation method is used to recover the predicted value in the original state space. We evaluated and compared the proposed method with a PCA-based approach on level-set surfaces reconstructed from point clouds captured by a 3D photogrammetry system. The prediction accuracy was evaluated in terms of root-mean-squared error. Our proposed method achieved consistently higher prediction accuracy (sub-millimeter) for both 200 ms and 600 ms lookahead lengths compared to the PCA-based approach, and the performance gain was statistically significant.
A New Color Image Encryption Based on High-Dimensional Chaotic Systems
NASA Astrophysics Data System (ADS)
Li, Pi; Wang, Xing-Yuan; Fu, Hong-Jing; Xu, Da-Hai; Wang, Xiu-Kun
2014-12-01
High-dimensional chaotic systems (HDCS) have several advantages over low-dimensional chaotic systems (LDCS), such as more varied mechanisms, a larger key space, and less regular time series of the system variables. Thus, a novel encryption scheme using the Lorenz system is suggested. Moreover, we use a substitution-diffusion architecture to improve the security of the scheme. The theoretical and experimental results show that the suggested cryptosystem has higher security.
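To make the role of the chaotic system concrete, here is a toy sketch of how a Lorenz trajectory can be turned into a keystream that masks data bytes. It is not the authors' substitution-diffusion scheme, whose details are not given in the abstract: the function names, the Euler integration step, the burn-in length, and the byte-extraction rule are all assumptions for illustration, and the result is not suitable for real security use.

```python
def lorenz_keystream(n, x=0.1, y=0.0, z=0.0, dt=0.01, burn_in=1000):
    """Generate n pseudo-random bytes from a Lorenz trajectory.

    The initial state (x, y, z) plays the role of the secret key.
    Classic Lorenz parameters and a simple Euler step are assumed.
    """
    sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0
    out = []
    for step in range(burn_in + n):
        dx = sigma * (y - x)
        dy = x * (rho - z) - y
        dz = x * y - beta * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        if step >= burn_in:
            # Hypothetical extraction rule: low-order digits of |x|.
            out.append(int(abs(x) * 1e6) % 256)
    return bytes(out)


def xor_cipher(data, key_state):
    """XOR data with the keystream; applying it twice restores the input."""
    stream = lorenz_keystream(len(data), *key_state)
    return bytes(b ^ k for b, k in zip(data, stream))
```

Because XOR is its own inverse, calling `xor_cipher` again with the same initial state decrypts the data; sensitivity of the trajectory to that initial state is what makes the key space large in chaos-based schemes.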
Prediction of high-dimensional states subject to respiratory motion: a manifold learning approach
NASA Astrophysics Data System (ADS)
Liu, Wenyang; Sawant, Amit; Ruan, Dan
2016-07-01
The development of high-dimensional imaging systems in image-guided radiotherapy provides important pathways to the ultimate goal of real-time full volumetric motion monitoring. Effective motion management during radiation treatment usually requires prediction to account for system latency and extra signal/image processing time. It is challenging to predict high-dimensional respiratory motion due to the complexity of the motion pattern combined with the curse of dimensionality. Linear dimension reduction methods such as PCA have been used to construct a linear subspace from the high-dimensional data, followed by efficient predictions on the lower-dimensional subspace. In this study, we extend such rationale to a more general manifold and propose a framework for high-dimensional motion prediction with manifold learning, which allows one to learn more descriptive features compared to linear methods with comparable dimensions. Specifically, a kernel PCA is used to construct a proper low-dimensional feature manifold, where accurate and efficient prediction can be performed. A fixed-point iterative pre-image estimation method is used to recover the predicted value in the original state space. We evaluated and compared the proposed method with a PCA-based approach on level-set surfaces reconstructed from point clouds captured by a 3D photogrammetry system. The prediction accuracy was evaluated in terms of root-mean-squared error. Our proposed method achieved consistently higher prediction accuracy (sub-millimeter) for both 200 ms and 600 ms lookahead lengths compared to the PCA-based approach, and the performance gain was statistically significant.
Controlling chaos in low and high dimensional systems with periodic parametric perturbations
Mirus, K.A.; Sprott, J.C.
1998-06-01
The effect of applying a periodic perturbation to an accessible parameter of various chaotic systems is examined. Numerical results indicate that perturbation frequencies near the natural frequencies of the unstable periodic orbits of the chaotic systems can result in limit cycles for relatively small perturbations. Such perturbations can also control or significantly reduce the dimension of high-dimensional systems. Initial application to the control of fluctuations in a prototypical magnetic fusion plasma device will be reviewed.
Hirata, Yoshito; Aihara, Kazuyuki
2012-06-01
We introduce a low-dimensional description for a high-dimensional system, which is a piecewise affine model whose state space is divided by permutations. We show that the proposed model tends to predict wind speeds and photovoltaic outputs for the time scales from seconds to 100 s better than by global affine models. In addition, computations using the piecewise affine model are much faster than those of usual nonlinear models such as radial basis function models.
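The phrase "state space divided by permutations" can be illustrated with ordinal patterns: each window of recent values is assigned to a region according to the rank order of its entries, and a separate affine predictor would then be fitted per region. The sketch below covers only the partitioning step; the helper names and the window-based embedding are assumptions, and the per-region affine fit described in the abstract is not implemented.

```python
from collections import defaultdict


def ordinal_pattern(window):
    """Return the permutation that sorts the window (indices by rising value)."""
    return tuple(sorted(range(len(window)), key=lambda i: window[i]))


def partition_by_pattern(series, dim):
    """Group (window, next value) pairs by the ordinal pattern of the window.

    Each key is one region of the state space; a piecewise affine model
    would fit one affine map per key using the pairs stored under it.
    """
    regions = defaultdict(list)
    for t in range(len(series) - dim):
        window = series[t:t + dim]
        regions[ordinal_pattern(window)].append((window, series[t + dim]))
    return dict(regions)
```

For a window of length `dim` there are at most `dim!` regions, so the number of local models stays manageable for small window lengths.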
Plurigon: three dimensional visualization and classification of high-dimensionality data.
Martin, Bronwen; Chen, Hongyu; Daimon, Caitlin M; Chadwick, Wayne; Siddiqui, Sana; Maudsley, Stuart
2013-01-01
High-dimensionality data is rapidly becoming the norm for biomedical sciences and many other analytical disciplines. Not only is the collection and processing time for such data becoming problematic, but it has become increasingly difficult to form a comprehensive appreciation of high-dimensionality data. Though data analysis methods for coping with multivariate data are well-documented in technical fields such as computer science, little effort is currently being expended to condense data vectors that exist beyond the realm of physical space into an easily interpretable and aesthetic form. To address this important need, we have developed Plurigon, a data visualization and classification tool for the integration of high-dimensionality visualization algorithms with a user-friendly, interactive graphical interface. Unlike existing data visualization methods, which are focused on an ensemble of data points, Plurigon places a strong emphasis upon the visualization of a single data point and its determining characteristics. Multivariate data vectors are represented in the form of a deformed sphere with a distinct topology of hills, valleys, plateaus, peaks, and crevices. The gestalt structure of the resultant Plurigon object generates an easily-appreciable model. User interaction with the Plurigon is extensive; zoom, rotation, axial and vector display, feature extraction, and anaglyph stereoscopy are currently supported. With Plurigon and its ability to analyze high-complexity data, we hope to see a unification of biomedical and computational sciences as well as practical applications in a wide array of scientific disciplines. Increased accessibility to the analysis of high-dimensionality data may increase the number of new discoveries and breakthroughs, ranging from drug screening to disease diagnosis to medical literature mining.
Implementation of High Dimensional Feature Map for Segmentation of MR Images
He, Renjie; Sajja, Balasrinivasa Rao; Narayana, Ponnada A.
2005-01-01
A method that considerably reduces the computational and memory complexities associated with the generation of high-dimensional (≥3) feature maps for image segmentation is described. The method is based on K-nearest neighbor (KNN) classification and consists of two parts: preprocessing of the feature space and fast KNN. This technique is implemented on a PC and applied to generate three- and four-dimensional feature maps for segmenting MR brain images of multiple sclerosis patients. PMID:16240091
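For reference, the baseline that such a method speeds up is brute-force KNN classification, sketched below. This is a generic illustration, not the paper's preprocessing or fast-KNN procedure; the function name and the use of squared Euclidean distance are assumptions.

```python
from collections import Counter


def knn_classify(query, samples, labels, k=3):
    """Label a query point by majority vote among its k nearest samples.

    Brute force: every sample is compared with the query, which is the
    cost that a feature-map method would need to reduce.
    """
    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(samples[i], query))

    nearest = sorted(range(len(samples)), key=sq_dist)[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

A feature map in this setting amounts to precomputing the label for every cell of a discretized feature space, so that segmentation becomes a table lookup rather than a search per voxel.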
Towards reliable multi-pathogen biosensors using high-dimensional encoding and decoding techniques
NASA Astrophysics Data System (ADS)
Chakrabartty, Shantanu; Liu, Yang
2008-08-01
Advances in micro-nano-biosensor fabrication are enabling technology that can integrate a large number of biological recognition elements within a single package. As a result, hundreds to millions of tests can be performed simultaneously and can facilitate rapid detection of multiple pathogens in a given sample. However, it is an open question as to how to exploit the high-dimensional nature of multi-pathogen testing to improve the detection reliability of a typical biosensor system. In this paper, we discuss two complementary high-dimensional encoding/decoding methods for improving the reliability of multi-pathogen detection. The first method uses a support vector machine (SVM) to learn the non-linear detection boundaries in the high-dimensional measurement space. The second method uses a forward error correcting (FEC) technique to synthetically introduce redundant patterns on the biosensor which can then be efficiently decoded. In this paper, experimental and simulation studies are based on a model conductimetric lateral flow immunoassay that uses antigen-antibody interaction in conjunction with a polyaniline transducer to detect the presence or absence of a pathogen in a given sample. Our results show that both SVM and FEC techniques can improve the detection performance by exploiting cross-reaction amongst multiple recognition sites on the biosensor. This is contrary to many existing methods used in pathogen detection technology, where the main emphasis has been on reducing the effects of cross-reaction and coupling instead of exploiting them as side information.
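The redundancy idea behind forward error correction can be shown with the simplest possible code, a repetition code with majority-vote decoding. The abstract does not say which FEC code the authors used, so this is only a stand-in: each presence/absence bit is imagined as being measured at several redundant recognition sites, and the function names and repetition factor are assumptions.

```python
def encode(bits, r=3):
    """Repeat every bit r times (r redundant sites per pathogen bit)."""
    return [b for b in bits for _ in range(r)]


def decode(received, r=3):
    """Recover each bit by majority vote over its r noisy readings."""
    return [
        1 if 2 * sum(received[i:i + r]) > r else 0
        for i in range(0, len(received), r)
    ]
```

With r = 3, any single faulty reading within a group of three is corrected; practical FEC codes achieve similar protection with far less redundancy.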
A non-parametric method for building predictive genetic tests on high-dimensional data.
Ye, Chengyin; Cui, Yuehua; Wei, Changshuai; Elston, Robert C; Zhu, Jun; Lu, Qing
2011-01-01
Predictive tests that capitalize on emerging genetic findings hold great promise for enhanced personalized healthcare. With the emergence of a large amount of data from genome-wide association studies (GWAS), interest has shifted towards high-dimensional risk prediction. To form predictive genetic tests on high-dimensional data, we propose a non-parametric method, called the 'forward ROC method'. The method adopts a computationally efficient algorithm to search for environment risk factors, genetic predictors on the entire genome, and their possible interactions for an optimal risk prediction model, without relying on prior knowledge of known risk factors. An efficient yet powerful procedure is also incorporated into the method to handle missing data. Through simulations and real data applications, we found our proposed method outperformed the existing approaches. We applied the new method to the Wellcome Trust rheumatoid arthritis GWAS dataset with a total of 460,547 markers. The results from the risk prediction analysis suggested important roles of HLA-DRB1 and PTPN22 in predicting rheumatoid arthritis. We proposed a powerful and robust approach for high-dimensional risk prediction. The new method will facilitate future risk prediction that considers a large number of predictors and their interaction for improved performance. Copyright © 2011 S. Karger AG, Basel.
Lessons learned in the analysis of high-dimensional data in vaccinomics
Oberg, Ann L.; McKinney, Brett A.; Schaid, Daniel J.; Pankratz, V. Shane; Kennedy, Richard B.; Poland, Gregory A.
2015-01-01
The field of vaccinology is increasingly moving toward the generation, analysis, and modeling of extremely large and complex high-dimensional datasets. We have used data such as these in the development and advancement of the field of vaccinomics to enable prediction of vaccine responses and to develop new vaccine candidates. However, the application of systems biology to what has been termed “big data,” or “high-dimensional data,” is not without significant challenges—chief among them a paucity of gold standard analysis and modeling paradigms with which to interpret the data. In this article, we relate some of the lessons we have learned over the last decade of working with high-dimensional, high-throughput data as applied to the field of vaccinomics. The value of such efforts, however, is ultimately to better understand the immune mechanisms by which protective and non-protective responses to vaccines are generated, and to use this information to support a personalized vaccinology approach in creating better, and safer, vaccines for the public health. PMID:25957070
Algamal, Zakariya Yahya; Lee, Muhammad Hisyam
2015-12-01
Cancer classification and gene selection in high-dimensional data have been popular research topics in genetics and molecular biology. Recently, adaptive regularized logistic regression using elastic net regularization, which is called the adaptive elastic net, has been successfully applied in high-dimensional cancer classification to estimate the gene coefficients and perform gene selection simultaneously. The adaptive elastic net originally used elastic net estimates as the initial weight; however, using this weight may not be preferable for two reasons. First, the elastic net estimator is biased in selecting genes. Second, it does not perform well when the pairwise correlations between variables are not high. Adjusted adaptive regularized logistic regression (AAElastic) is proposed to address these issues and encourage grouping effects simultaneously. The real data results indicate that AAElastic is significantly more consistent in selecting genes than the other three competitor regularization methods. Additionally, the classification performance of AAElastic is comparable to the adaptive elastic net and better than other regularization methods. Thus, we can conclude that AAElastic is a reliable adaptive regularized logistic regression method in the field of high-dimensional cancer classification. Copyright © 2015 Elsevier Ltd. All rights reserved.
Schuster, Tibor; Pang, Menglan; Platt, Robert W
2015-09-01
The high-dimensional propensity score algorithm attempts to improve control of confounding in typical treatment effect studies in pharmacoepidemiology and is increasingly being used for the analysis of large administrative databases. Within this multi-step variable selection algorithm, the marginal prevalence of non-zero covariate values is considered to be an indicator for a count variable's potential confounding impact. We investigate the role of the marginal prevalence of confounder variables on potentially caused bias magnitudes when estimating risk ratios in point exposure studies with binary outcomes. We apply the law of total probability in conjunction with an established bias formula to derive and illustrate relative bias boundaries with respect to marginal confounder prevalence. We show that maximum possible bias magnitudes can occur at any marginal prevalence level of a binary confounder variable. In particular, we demonstrate that, in case of rare or very common exposures, low and high prevalent confounder variables can still have large confounding impact on estimated risk ratios. Covariate pre-selection by prevalence may lead to sub-optimal confounder sampling within the high-dimensional propensity score algorithm. While we believe that the high-dimensional propensity score has important benefits in large-scale pharmacoepidemiologic studies, we recommend omitting the prevalence-based empirical identification of candidate covariates. Copyright © 2015 John Wiley & Sons, Ltd.
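The argument above can be checked numerically. The abstract does not name the "established bias formula", so the sketch below assumes the classical expression for a single binary confounder C with no effect modification: the ratio of crude to adjusted risk ratio equals [p1(RR_CD - 1) + 1] / [p0(RR_CD - 1) + 1], where p1 and p0 are the confounder prevalences among the exposed and unexposed and RR_CD is the confounder-outcome risk ratio. The law of total probability links p1 and p0 to the marginal prevalence. Function and parameter names are hypothetical.

```python
def marginal_prevalence(p_c_exposed, p_c_unexposed, p_exposure):
    """P(C=1) via the law of total probability over exposure status."""
    return p_c_exposed * p_exposure + p_c_unexposed * (1.0 - p_exposure)


def confounding_bias_factor(p_c_exposed, p_c_unexposed, rr_cd):
    """Ratio of crude to adjusted risk ratio for one binary confounder.

    Assumes the classical formula with no effect modification; a value
    of 1.0 means the confounder introduces no bias.
    """
    numerator = p_c_exposed * (rr_cd - 1.0) + 1.0
    denominator = p_c_unexposed * (rr_cd - 1.0) + 1.0
    return numerator / denominator
```

As an extreme case, take a rare exposure (1% of subjects), a confounder present in all exposed and no unexposed subjects, and RR_CD = 3. The marginal confounder prevalence is only 0.01, yet the bias factor reaches its maximum of 3, which is the kind of situation in which prevalence-based pre-selection would discard an important confounder.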
Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models.
Binder, Harald; Schumacher, Martin
2008-01-10
When predictive survival models are built from high-dimensional data, there are often additional covariates, such as clinical scores, that must be included in the final model. While there are several techniques for fitting sparse high-dimensional survival models by penalized parameter estimation, none allows for explicit consideration of such mandatory covariates. We introduce a new boosting algorithm for censored time-to-event data that shares the favorable properties of existing approaches, i.e., it results in sparse models with good prediction performance, but uses an offset-based update mechanism. The latter allows for tailored penalization of the covariates under consideration. Specifically, unpenalized mandatory covariates can be introduced. Microarray survival data from patients with diffuse large B-cell lymphoma, in combination with the recent, bootstrap-based prediction error curve technique, are used to illustrate the advantages of the new procedure. It is demonstrated that it can be highly beneficial in terms of prediction performance to use an estimation procedure that incorporates mandatory covariates into high-dimensional survival models. The new approach also makes it possible to answer the question of whether improved predictions are obtained by including microarray features in addition to classical clinical criteria.
NASA Astrophysics Data System (ADS)
Taşkin, Gülşen
2016-05-01
Recently, information extraction from hyperspectral images (HI) has become an attractive research area for many practical applications in earth observation, because HI provides valuable information across a huge number of spectral bands. Traditional methods may not perform satisfactorily on such a large amount of data because most of them do not account for its high dimensionality, which causes the curse of dimensionality, also known as the Hughes phenomenon. In supervised classification, this leads to poor generalization performance when only limited training samples are available. Therefore, advanced methods accounting for the high dimensionality need to be developed in order to obtain good generalization capability. In this work, High Dimensional Model Representation (HDMR) was utilized for dimensionality reduction, and a novel feature selection method was introduced based on global sensitivity analysis. Several experiments were conducted with hyperspectral images in comparison to state-of-the-art feature selection algorithms in terms of classification accuracy, and the results showed that the proposed method outperforms the other feature selection methods with all considered classifiers, namely support vector machines, Bayes, and the J48 decision tree.
High-dimensional decoy-state quantum key distribution over multicore telecommunication fibers
NASA Astrophysics Data System (ADS)
Cañas, G.; Vera, N.; Cariñe, J.; González, P.; Cardenas, J.; Connolly, P. W. R.; Przysiezna, A.; Gómez, E. S.; Figueroa, M.; Vallone, G.; Villoresi, P.; da Silva, T. Ferreira; Xavier, G. B.; Lima, G.
2017-08-01
Multiplexing is a strategy to augment the transmission capacity of a communication system. It consists of combining multiple signals over the same data channel and it has been very successful in classical communications. However, the use of enhanced channels has only reached limited practicality in quantum communications (QC) as it requires the manipulation of quantum systems of higher dimensions. Considerable effort is being made towards QC using high-dimensional quantum systems encoded into the transverse momentum of single photons, but so far no approach has been proven to be fully compatible with the existing telecommunication fibers. Here we overcome such a challenge and demonstrate a secure high-dimensional decoy-state quantum key distribution session over a 300-m-long multicore optical fiber. The high-dimensional quantum states are defined in terms of the transverse core modes available for the photon transmission over the fiber, and theoretical analyses show that positive secret key rates can be achieved through metropolitan distances.
Runcie, Daniel E.; Mukherjee, Sayan
2013-01-01
Quantitative genetic studies that model complex, multivariate phenotypes are important for both evolutionary prediction and artificial selection. For example, changes in gene expression can provide insight into developmental and physiological mechanisms that link genotype and phenotype. However, classical analytical techniques are poorly suited to quantitative genetic studies of gene expression where the number of traits assayed per individual can reach many thousand. Here, we derive a Bayesian genetic sparse factor model for estimating the genetic covariance matrix (G-matrix) of high-dimensional traits, such as gene expression, in a mixed-effects model. The key idea of our model is that we need consider only G-matrices that are biologically plausible. An organism’s entire phenotype is the result of processes that are modular and have limited complexity. This implies that the G-matrix will be highly structured. In particular, we assume that a limited number of intermediate traits (or factors, e.g., variations in development or physiology) control the variation in the high-dimensional phenotype, and that each of these intermediate traits is sparse – affecting only a few observed traits. The advantages of this approach are twofold. First, sparse factors are interpretable and provide biological insight into mechanisms underlying the genetic architecture. Second, enforcing sparsity helps prevent sampling errors from swamping out the true signal in high-dimensional data. We demonstrate the advantages of our model on simulated data and in an analysis of a published Drosophila melanogaster gene expression data set. PMID:23636737
High-Dimensional Function Approximation With Neural Networks for Large Volumes of Data.
Andras, Peter
2017-01-25
Approximation of high-dimensional functions is a challenge for neural networks due to the curse of dimensionality. Often the data for which the approximated function is defined resides on a low-dimensional manifold, and in principle the approximation of the function over this manifold should improve the approximation performance. It has been shown that projecting the data manifold into a lower-dimensional space, followed by the neural network approximation of the function over this space, provides a more precise approximation of the function than the approximation of the function with neural networks in the original data space. However, if the data volume is very large, the projection into the low-dimensional space has to be based on a limited sample of the data. Here, we investigate the nature of the approximation error of neural networks trained over the projection space. We show that such neural networks should have better approximation performance than neural networks trained on high-dimensional data even if the projection is based on a relatively sparse sample of the data manifold. We also find that it is preferable to use a uniformly distributed sparse sample of the data for the purpose of the generation of the low-dimensional projection. We illustrate these results by considering the practical neural network approximation of a set of functions defined on high-dimensional data, including real-world data.
Unbiased feature selection in learning random forests for high-dimensional data.
Nguyen, Thanh-Tung; Huang, Joshua Zhexue; Nguyen, Thuy Thi
2015-01-01
Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting. This makes RFs have poor accuracy when working with high-dimensional data. Besides that, RFs have bias in the feature selection process where multivalued features are favored. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features in learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees, while allowing one to reduce dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets including image datasets. The experimental results have shown that RFs with the proposed approach outperformed the existing random forests in increasing the accuracy and the AUC measures.
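The screening and two-subset sampling steps described above might look roughly like the following. This is a loose sketch and not the xRF algorithm itself: the p-value thresholds, the names `split_features` and `sample_features`, and the use of uniform sampling within each subset (in place of the paper's feature weighting, which the abstract does not detail) are all assumptions.

```python
import random


def split_features(p_values, alpha=0.05, strong_cut=0.001):
    """Drop uninformative features, then split the rest into two subsets.

    p_values maps feature name -> p-value; features with p > alpha are
    discarded. Both thresholds are illustrative choices.
    """
    strong = [f for f, p in p_values.items() if p <= strong_cut]
    weak = [f for f, p in p_values.items() if strong_cut < p <= alpha]
    return strong, weak


def sample_features(strong, weak, n_strong, n_weak, rng):
    """Draw candidate split features from both subsets for one tree node."""
    picked = rng.sample(strong, min(n_strong, len(strong)))
    picked += rng.sample(weak, min(n_weak, len(weak)))
    return picked
```

Drawing from both subsets keeps highly informative features available at every node while still leaving room for weaker features, which is one way to preserve diversity among the trees.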
Greiff, Victor; Weber, Cédric R; Palme, Johannes; Bodenhofer, Ulrich; Miho, Enkelejda; Menzel, Ulrike; Reddy, Sai T
2017-09-18
Recent studies have revealed that immune repertoires contain a substantial fraction of public clones, which may be defined as Ab or TCR clonal sequences shared across individuals. It has remained unclear whether public clones possess predictable sequence features that differentiate them from private clones, which are believed to be generated largely stochastically. This knowledge gap represents a lack of insight into the shaping of immune repertoire diversity. Leveraging a machine learning approach capable of capturing the high-dimensional compositional information of each clonal sequence (defined by CDR3), we detected predictive public clone and private clone-specific immunogenomic differences concentrated in CDR3's N1-D-N2 region, which allowed the prediction of public and private status with 80% accuracy in humans and mice. Our results unexpectedly demonstrate that public, as well as private, clones possess predictable high-dimensional immunogenomic features. Our support vector machine model could be trained effectively on large published datasets (3 million clonal sequences) and was sufficiently robust for public clone prediction across individuals and studies prepared with different library preparation and high-throughput sequencing protocols. In summary, we have uncovered the existence of high-dimensional immunogenomic rules that shape immune repertoire diversity in a predictable fashion. Our approach may pave the way for the construction of a comprehensive atlas of public mouse and human immune repertoires with potential applications in rational vaccine design and immunotherapeutics. Copyright © 2017 by The American Association of Immunologists, Inc.
Entropy-based consensus clustering for patient stratification.
Liu, Hongfu; Zhao, Rui; Fang, Hongsheng; Cheng, Feixiong; Fu, Yun; Liu, Yang-Yu
2017-09-01
Patient stratification or disease subtyping is crucial for precision medicine and personalized treatment of complex diseases. The increasing availability of high-throughput molecular data provides a great opportunity for patient stratification. Many clustering methods have been employed to tackle this problem in a purely data-driven manner. Yet, existing methods leveraging high-throughput molecular data often suffer from various limitations, e.g. noise, data heterogeneity, high dimensionality or poor interpretability. Here we introduced an Entropy-based Consensus Clustering (ECC) method that overcomes those limitations altogether. Our ECC method employs an entropy-based utility function to fuse many basic partitions to a consensus one that agrees with the basic ones as much as possible. Maximizing the utility function in ECC has a much more meaningful interpretation than in any other consensus clustering method. Moreover, we exactly map the complex utility maximization problem to the classic K-means clustering problem, which can then be efficiently solved with linear time and space complexity. Our ECC method can also naturally integrate multiple molecular data types measured from the same set of subjects, and easily handle missing values without any imputation. We applied ECC to 110 synthetic and 48 real datasets, including 35 cancer gene expression benchmark datasets and 13 cancer types with four molecular data types from The Cancer Genome Atlas. We found that ECC shows superior performance against existing clustering methods. Our results clearly demonstrate the power of ECC in clinically relevant patient stratification. The Matlab package is available at http://scholar.harvard.edu/yyl/ecc. Contact: yunfu@ece.neu.edu or yyl@channing.harvard.edu. Supplementary data are available at Bioinformatics online.
NASA Astrophysics Data System (ADS)
Merkurjev, Ekaterina; Bertozzi, Andrea; Yan, Xiaoran; Lerman, Kristina
2017-07-01
Recent advances in clustering have included continuous relaxations of the Cheeger cut problem and those which address its linear approximation using the graph Laplacian. In this paper, we show how to use the graph Laplacian to solve the fully nonlinear Cheeger cut problem, as well as the ratio cut optimization task. Both problems are connected to total variation minimization, and the related Ginzburg-Landau functional is used in the derivation of the methods. The graph framework discussed in this paper is undirected. The resulting algorithms are efficient ways to cluster the data into two classes, and they can be easily extended to the case of multiple classes, or used on a multiclass data set via recursive bipartitioning. In addition to showing results on benchmark data sets, we also show an application of the algorithm to hyperspectral video data.
Ge, Yongchao; Sealfon, Stuart C.
2012-01-01
Motivation: For flow cytometry data, there are two common approaches to the unsupervised clustering problem: one is based on the finite mixture model and the other on spatial exploration of the histograms. The former is computationally slow and has difficulty identifying clusters of irregular shape. The latter approach cannot be applied directly to high-dimensional data as the computational time and memory become unmanageable and the estimated histogram is unreliable. An algorithm without these two problems would be very useful. Results: In this article, we combine ideas from the finite mixture model and histogram spatial exploration. This new algorithm, which we call flowPeaks, can be applied directly to high-dimensional data and identify irregularly shaped clusters. The algorithm first uses the K-means algorithm with a large K to partition the cell population into many small clusters. These partitioned data allow the generation of a smoothed density function using the finite mixture model. All local peaks are exhaustively searched by exploring the density function and the cells are clustered by the associated local peak. The algorithm flowPeaks is automatic, fast, reliable, and robust to cluster shape and outliers. This algorithm has been applied to flow cytometry data and it has been compared with state of the art algorithms, including Misty Mountain, FLOCK, flowMeans, flowMerge and FLAME. Availability: The R package flowPeaks is available at https://github.com/yongchao/flowPeaks. Contact: yongchao.ge@mssm.edu Supplementary information: Supplementary data are available at Bioinformatics online. PMID:22595209
NASA Astrophysics Data System (ADS)
Regis, Rommel G.; Shoemaker, Christine A.
2013-05-01
This article presents the DYCORS (DYnamic COordinate search using Response Surface models) framework for surrogate-based optimization of HEB (High-dimensional, Expensive, and Black-box) functions that incorporates an idea from the DDS (Dynamically Dimensioned Search) algorithm. The iterate is selected from random trial solutions obtained by perturbing only a subset of the coordinates of the current best solution. Moreover, the probability of perturbing a coordinate decreases as the algorithm reaches the computational budget. Two DYCORS algorithms that use RBF (Radial Basis Function) surrogates are developed: DYCORS-LMSRBF is a modification of the LMSRBF algorithm while DYCORS-DDSRBF is an RBF-assisted DDS. Numerical results on a 14-D watershed calibration problem and on eleven 30-D and 200-D test problems show that DYCORS algorithms are generally better than EGO, DDS, LMSRBF, MADS with kriging, SQP, an RBF-assisted evolution strategy, and a genetic algorithm. Hence, DYCORS is a promising approach for watershed calibration and for HEB optimization.
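The coordinate-perturbation idea above can be sketched as follows. This is a minimal illustration under stated assumptions, not the DYCORS code: the logarithmic decay schedule is modeled on the one commonly associated with DDS, and the names `perturb_probability` and `trial_point`, the Gaussian step with standard deviation `sigma`, and the clamping of steps to the bounds are choices made for this example. The surrogate-based selection among trial points is omitted.

```python
import math
import random


def perturb_probability(n, n0, n_max, phi0=1.0):
    """Per-coordinate perturbation probability at evaluation n.

    Assumed schedule: starts at phi0 when n == n0 and decays
    logarithmically to 0 when n == n_max - 1 (requires n_max - n0 > 1).
    """
    return phi0 * (1.0 - math.log(n - n0 + 1) / math.log(n_max - n0))


def trial_point(best, prob, sigma, lower, upper, rng):
    """Perturb a random subset of the coordinates of the current best point."""
    d = len(best)
    chosen = [j for j in range(d) if rng.random() < prob]
    if not chosen:
        # Always perturb at least one coordinate.
        chosen = [rng.randrange(d)]
    point = list(best)
    for j in chosen:
        step = rng.gauss(0.0, sigma)
        # Clamp to the box bounds (a simplification for this sketch).
        point[j] = min(upper[j], max(lower[j], point[j] + step))
    return point
```

Early in the run nearly every coordinate is perturbed, so the search is global; as the budget runs out, only a few coordinates change per trial point, which keeps candidates close to the current best solution in high dimensions.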
NASA Astrophysics Data System (ADS)
Laloy, Eric; Vrugt, Jasper A.
2012-01-01
Spatially distributed hydrologic models are increasingly being used to study and predict soil moisture flow, groundwater recharge, surface runoff, and river discharge. The usefulness and applicability of such complex models is increasingly held back by the potentially many hundreds (thousands) of parameters that require calibration against some historical record of data. The current generation of search and optimization algorithms is typically not powerful enough to deal with a very large number of variables and summarize parameter and predictive uncertainty. We have previously presented a general-purpose Markov chain Monte Carlo (MCMC) algorithm for Bayesian inference of the posterior probability density function of hydrologic model parameters. This method, entitled differential evolution adaptive Metropolis (DREAM), runs multiple different Markov chains in parallel and uses a discrete proposal distribution to evolve the sampler to the posterior distribution. The DREAM approach maintains detailed balance and shows excellent performance on complex, multimodal search problems. Here we present our latest algorithmic developments and introduce MT-DREAM(ZS), which combines the strengths of multiple-try sampling, snooker updating, and sampling from an archive of past states. This new code is especially designed to solve high-dimensional search problems and receives particularly spectacular performance improvement over other adaptive MCMC approaches when using distributed computing. Four different case studies with increasing dimensionality up to 241 parameters are used to illustrate the advantages of MT-DREAM(ZS).
NASA Astrophysics Data System (ADS)
Mandrà, Salvatore; Zhu, Zheng; Wang, Wenlong; Perdomo-Ortiz, Alejandro; Katzgraber, Helmut G.
2016-08-01
To date, a conclusive detection of quantum speedup remains elusive. Recently, a team by Google Inc. [V. S. Denchev et al., Phys. Rev. X 6, 031015 (2016), 10.1103/PhysRevX.6.031015] proposed a weak-strong cluster model tailored to have tall and narrow energy barriers separating local minima, with the aim to highlight the value of finite-range tunneling. More precisely, results from quantum Monte Carlo simulations as well as the D-Wave 2X quantum annealer scale considerably better than state-of-the-art simulated annealing simulations. Moreover, the D-Wave 2X quantum annealer is ~10^8 times faster than simulated annealing on conventional computer hardware for problems with approximately 10^3 variables. Here, an overview of different sequential, nontailored, as well as specialized tailored algorithms on the Google instances is given. We show that the quantum speedup is limited to sequential approaches and study the typical complexity of the benchmark problems using insights from the study of spin glasses.
A new approach to the optimal target selection problem
NASA Astrophysics Data System (ADS)
Elson, E. C.; Bassett, B. A.; van der Heyden, K.; Vilakazi, Z. Z.
2007-03-01
Context: This paper addresses a common problem in astronomy and cosmology: to optimally select a subset of targets from a larger catalog. A specific example is the selection of targets from an imaging survey for multi-object spectrographic follow-up. Aims: We present a new heuristic optimisation algorithm, HYBRID, for this purpose and undertake detailed studies of its performance. Methods: HYBRID combines elements of the simulated annealing, MCMC and particle-swarm methods and is particularly successful in cases where the survey landscape has multiple curvature or clustering scales. Results: HYBRID consistently outperforms the other methods, especially in high-dimensionality spaces with many extrema. This means many fewer simulations must be run to reach a given performance confidence level and implies very significant advantages in solving complex or computationally expensive optimisation problems. Conclusions: HYBRID outperforms both MCMC and SA in all cases including optimisation of high dimensional continuous surfaces indicating that HYBRID is useful far beyond the specific problem of optimal target selection. Future work will apply HYBRID to target selection for the new 10 m Southern African Large Telescope in South Africa.
Motivation: Molecular pathways and networks play a key role in basic and disease biology. An emerging notion is that networks encoding patterns of molecular interplay may themselves differ between contexts, such as cell type, tissue or disease (sub)type. However, while statistical testing of differences in mean expression levels has been extensively studied, testing of network differences remains challenging.
Haug, Severin; Kowatsch, Tobias; Castro, Raquel Paz; Filler, Andreas; Schaub, Michael P
2014-08-07
Problem drinking, particularly risky single-occasion drinking is widespread among adolescents and young adults in most Western countries. Mobile phone text messaging allows a proactive and cost-effective delivery of short messages at any time and place and allows the delivery of individualised information at times when young people typically drink alcohol. The main objective of the planned study is to test the efficacy of a combined web- and text messaging-based intervention to reduce problem drinking in young people with heterogeneous educational level. A two-arm cluster-randomised controlled trial with one follow-up assessment after 6 months will be conducted to test the efficacy of the intervention in comparison to assessment only. The fully-automated intervention program will provide an online feedback based on the social norms approach as well as individually tailored mobile phone text messages to stimulate (1) positive outcome expectations to drink within low-risk limits, (2) self-efficacy to resist alcohol and (3) planning processes to translate intentions to resist alcohol into action. Program participants will receive up to two weekly text messages over a time period of 3 months. Study participants will be 934 students from approximately 93 upper secondary and vocational schools in Switzerland. Main outcome criterion will be risky single-occasion drinking in the past 30 days preceding the follow-up assessment. This is the first study testing the efficacy of a combined web- and text messaging-based intervention to reduce problem drinking in young people. Given that this intervention approach proves to be effective, it could be easily implemented in various settings, and it could reach large numbers of young people in a cost-effective way. Current Controlled Trials ISRCTN59944705.
NASA Astrophysics Data System (ADS)
Gavrishchaka, Valeriy; Ganguli, Supriya
2001-10-01
Predictive capabilities of the data-driven models of the systems with complex multi-scale dynamics depend on the quality and amount of the available data and on the algorithms used to extract generalized mappings. Availability of the real-time high-resolution data constantly increases in many fields of practical interest. However, the majority of advanced nonlinear algorithms, including neural networks (NN), can encounter a set of problems called "dimensionality curse" when applied to high-dimensional data. Nonstationarity of the system can also impose significant limitations on the size of training set which leads to poor generalization ability of the model. A very promising algorithm that combines the power of the best nonlinear techniques and tolerance to high-dimensional and incomplete data is support vector machine (SVM). We have summarized and demonstrated advantages of the SVM by applying it to two important and challenging problems: substorm forecasting from solar wind data and volatility forecasting from multi-scale stock and exchange market data. We have shown that performance of the SVM model for substorm prediction can be comparable to or be superior to that of the best existing models including NNs. The advantages of the SVM-based techniques are expected to be much more pronounced in future space-weather forecasting models, which will incorporate many types of high-dimensional, multi-scale input data once real-time availability of this information becomes technologically feasible. We have also demonstrated encouraging performance of the SVM in application to volatility prediction using S&P 500 stock index and USD-DM exchange rate data. Future applications of the SVM in the emerging field of high-frequency finance and its relation to existing models are also discussed.
Unified tests for fine-scale mapping and identifying sparse high-dimensional sequence associations
Cao, Shaolong; Qin, Huaizhen; Gossmann, Alexej; Deng, Hong-Wen; Wang, Yu-Ping
2016-01-01
Motivation: In searching for genetic variants for complex diseases with deep sequencing data, genomic marker sets of high-dimensional genotypic data and sparse functional variants are quite common. Existing sequence association tests are incapable of identifying such marker sets or individual causal loci, although they appeared powerful to identify small marker sets with dense functional variants. In sequence association studies of admixed individuals, cryptic relatedness and population structure are known to confound the association analyses. Method: We here propose a unified marker wise test (uFineMap) to accurately localize causal loci and a unified high-dimensional set based test (uHDSet) to identify high-dimensional sparse associations in deep sequencing genomic data of multi-ethnic individuals with random relatedness. These two novel tests are based on scaled sparse linear mixed regressions with Lp (0 < p < 1) norm regularization. They jointly adjust for cryptic relatedness, population structure and other confounders to prevent false discoveries and improve statistical power for identifying promising individual markers and marker sets that harbor functional genetic variants of a complex trait. Results: With large scale simulation data and real data analyses, the proposed tests appropriately controlled Type I error rates and appeared to be more powerful than several prominent methods. We illustrated their practical utilities by the applications to DNA sequence data of Framingham Heart Study for osteoporosis. The proposed tests identified 11 novel significant genes that were missed by the prominent famSKAT and GEMMA. In particular, four out of six most significant pathways identified by the uHDSet but missed by famSKAT have been reported to be related to BMD or osteoporosis in the literature. Availability and implementation: The computational toolkit is available for academic use: https://sites.google.com/site/shaolongscode/home/uhdset Contact: wyp
On-chip generation of high-dimensional entangled quantum states and their coherent control
NASA Astrophysics Data System (ADS)
Kues, Michael; Reimer, Christian; Roztocki, Piotr; Cortés, Luis Romero; Sciara, Stefania; Wetzel, Benjamin; Zhang, Yanbing; Cino, Alfonso; Chu, Sai T.; Little, Brent E.; Moss, David J.; Caspani, Lucia; Azaña, José; Morandotti, Roberto
2017-06-01
Optical quantum states based on entangled photons are essential for solving questions in fundamental physics and are at the heart of quantum information science. Specifically, the realization of high-dimensional states (D-level quantum systems, that is, qudits, with D > 2) and their control are necessary for fundamental investigations of quantum mechanics, for increasing the sensitivity of quantum imaging schemes, for improving the robustness and key rate of quantum communication protocols, for enabling a richer variety of quantum simulations, and for achieving more efficient and error-tolerant quantum computation. Integrated photonics has recently become a leading platform for the compact, cost-efficient, and stable generation and processing of non-classical optical states. However, so far, integrated entangled quantum sources have been limited to qubits (D = 2). Here we demonstrate on-chip generation of entangled qudit states, where the photons are created in a coherent superposition of multiple high-purity frequency modes. In particular, we confirm the realization of a quantum system with at least one hundred dimensions, formed by two entangled qudits with D = 10. Furthermore, using state-of-the-art, yet off-the-shelf telecommunications components, we introduce a coherent manipulation platform with which to control frequency-entangled states, capable of performing deterministic high-dimensional gate operations. We validate this platform by measuring Bell inequality violations and performing quantum state tomography. Our work enables the generation and processing of high-dimensional quantum states in a single spatial mode.
Optimal cellular preservation for high dimensional flow cytometric analysis of multicentre trials.
Ng, Amanda A P; Lee, Bernett T K; Teo, Timothy S Y; Poidinger, Michael; Connolly, John E
2012-11-30
High dimensional flow cytometry is best served by centralized facilities. However, the difficulties around sample processing, storage and shipment make large scale international studies impractical. We therefore sought to identify optimized fixation procedures which fully leverage the analytical capability of high dimensional flow cytometry without the need for complex cell processing or a sustained cold chain. A whole blood staining procedure was employed to investigate the applicability of fixatives including Cyto-Chex® Blood Collection tube (Streck), Transfix® (Cytomark), 1% and 4% paraformaldehyde to centralized analysis of field trial samples. Samples were subjected to environmental conditions which mimic field studies, without refrigerated shipment and analyzed across 10 days, based on cell count and marker expression. This study showed that Cyto-Chex® demonstrated the least variability in absolute cell count relative to samples analyzed directly from donors in the absence of fixation. Transfix® was better at preserving the marker expression among all fixatives. However, Transfix® caused markedly increased cell membrane permeabilization and was detrimental to intracellular marker identification. Paraformaldehyde fixation, at either 1% or 4% concentrations, was unfavorable for cell preservation under the conditions tested and thus not recommended. Using these data, we have created an online interactive tool which enables researchers to evaluate the impact of different fixatives on their panel of interest. In this study, we have identified Cyto-Chex® as the optimal cellular preservative for high dimensional flow cytometry in large scale studies for shipped whole blood samples, even in the absence of a sustained cold chain. Copyright © 2012 Elsevier B.V. All rights reserved.
Jiang, Xia; Cai, Binghuang; Xue, Diyang; Lu, Xinghua; Cooper, Gregory F; Neapolitan, Richard E
2014-10-01
The objective of this investigation is to evaluate binary prediction methods for predicting disease status using high-dimensional genomic data. The central hypothesis is that the Bayesian network (BN)-based method called efficient Bayesian multivariate classifier (EBMC) will do well at this task because EBMC builds on BN-based methods that have performed well at learning epistatic interactions. We evaluate how well eight methods perform binary prediction using high-dimensional discrete genomic datasets containing epistatic interactions. The methods are as follows: naive Bayes (NB), model averaging NB (MANB), feature selection NB (FSNB), EBMC, logistic regression (LR), support vector machines (SVM), Lasso, and extreme learning machines (ELM). We use a hundred 1000-single nucleotide polymorphism (SNP) simulated datasets, ten 10,000-SNP datasets, six semi-synthetic sets, and two real genome-wide association studies (GWAS) datasets in our evaluation. In fivefold cross-validation studies, the SVM performed best on the 1000-SNP dataset, while the BN-based methods performed best on the other datasets, with EBMC exhibiting the best overall performance. In-sample testing indicates that LR, SVM, Lasso, ELM, and NB tend to overfit the data. EBMC performed better than NB when there are several strong predictors, whereas NB performed better when there are many weak predictors. Furthermore, for all BN-based methods, prediction capability did not degrade as the dimension increased. Our results support the hypothesis that EBMC performs well at binary outcome prediction using high-dimensional discrete datasets containing epistatic-like interactions. Future research using more GWAS datasets is needed to further investigate the potential of EBMC. Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.
Power calculation for overall hypothesis testing with high-dimensional commensurate outcomes
Chi, Yueh-Yun; Gribbin, Matthew J.; Johnson, Jacqueline L.; Muller, Keith E.
2014-01-01
The complexity of system biology means that any metabolic, genetic, or proteomic pathway typically includes so many components (e.g., molecules) that statistical methods specialized for overall testing of high-dimensional and commensurate outcomes are required. While many overall tests have been proposed, very few have power and sample size methods. We develop accurate power and sample size methods and software to facilitate study planning for high-dimensional pathway analysis. With an account of any complex correlation structure between high-dimensional outcomes, the new methods allow power calculation even when the sample size is less than the number of variables. We derive the exact (finite-sample) and approximate non-null distributions of the ‘univariate’ approach to repeated measures test statistic, as well as power-equivalent scenarios useful to generalize our numerical evaluations. Extensive simulations of group comparisons support the accuracy of the approximations even when the ratio of number of variables to sample size is large. We derive a minimum set of constants and parameters sufficient and practical for power calculation. Using the new methods and specifying the minimum set to determine power for a study of metabolic consequences of vitamin B6 deficiency helps illustrate the practical value of the new results. Free software implementing the power and sample size methods applies to a wide range of designs, including one group pre-intervention and post-intervention comparisons, multiple parallel group comparisons with one-way or factorial designs, and the adjustment and evaluation of covariate effects. PMID:24122945
Scale-Invariant Sparse PCA on High Dimensional Meta-elliptical Data.
Han, Fang; Liu, Han
2014-01-01
We propose a semiparametric method for conducting scale-invariant sparse principal component analysis (PCA) on high dimensional non-Gaussian data. Compared with sparse PCA, our method has weaker modeling assumptions and is more robust to possible data contamination. Theoretically, the proposed method achieves a parametric rate of convergence in estimating the parameters of interest under a flexible semiparametric distribution family; computationally, the proposed method exploits a rank-based procedure and is as efficient as sparse PCA; empirically, our method outperforms most competing methods on both synthetic and real-world datasets.
Non-Asymptotic Oracle Inequalities for the High-Dimensional Cox Regression via Lasso
Kong, Shengchun; Nan, Bin
2013-01-01
We consider finite sample properties of the regularized high-dimensional Cox regression via lasso. Existing literature focuses on linear models or generalized linear models with Lipschitz loss functions, where the empirical risk functions are the summations of independent and identically distributed (iid) losses. The summands in the negative log partial likelihood function for censored survival data, however, are neither iid nor Lipschitz. We first approximate the negative log partial likelihood function by a sum of iid non-Lipschitz terms, then derive the non-asymptotic oracle inequalities for the lasso penalized Cox regression using pointwise arguments to tackle the difficulties caused by lacking iid Lipschitz losses. PMID:24516328
Fast time-series prediction using high-dimensional data: evaluating confidence interval credibility.
Hirata, Yoshito
2014-05-01
I propose an index for evaluating the credibility of confidence intervals for future observables predicted from high-dimensional time-series data. The index evaluates the distance from the current state to the data manifold. I demonstrate the index with artificial datasets generated from the Lorenz'96 II model [Lorenz, in Proceedings of the Seminar on Predictability, Vol. 1 (ECMWF, Reading, UK, 1996), p. 1], the Lorenz'96 I model [Hansen and Smith, J. Atmos. Sci. 57, 2859 (2000)].
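The notion of "distance from the current state to the data manifold" can be illustrated with a nearest-neighbour sketch. The abstract does not give the exact definition of the index, so everything here is an assumption: the stored states in `library` stand in for the data manifold, Euclidean distance is used, and the optional normalisation by the typical spacing of the library is a hypothetical convenience.

```python
import math


def distance_to_data(current, library):
    """Euclidean distance from the current state to its nearest stored state."""
    return min(math.dist(current, p) for p in library)


def typical_spacing(library):
    """Median nearest-neighbour distance among the stored states."""
    nearest = []
    for i, p in enumerate(library):
        nearest.append(
            min(math.dist(p, q) for j, q in enumerate(library) if j != i)
        )
    nearest.sort()
    return nearest[len(nearest) // 2]


def relative_distance(current, library):
    """Distance to the data, in units of the library's typical spacing."""
    return distance_to_data(current, library) / typical_spacing(library)
```

Under this reading, a large value signals that the current state lies far from anything seen in the training data, so a confidence interval predicted from that data deserves less trust.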
High-dimensional, massive sample-size Cox proportional hazards regression for survival analysis.
Mittal, Sushil; Madigan, David; Burd, Randall S; Suchard, Marc A
2014-04-01
Survival analysis endures as an old, yet active research field with applications that spread across many domains. Continuing improvements in data acquisition techniques pose constant challenges in applying existing survival analysis methods to these emerging data sets. In this paper, we present tools for fitting regularized Cox survival analysis models on high-dimensional, massive sample-size (HDMSS) data using a variant of the cyclic coordinate descent optimization technique tailored for the sparsity that HDMSS data often present. Experiments on two real data examples demonstrate that efficient analyses of HDMSS data using these tools result in improved predictive performance and calibration.
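Cyclic coordinate descent, the optimization technique named above, can be sketched on a simpler problem. The example below swaps in lasso-penalized least squares for the Cox partial likelihood, so it shows only the coordinate-wise soft-thresholding update, not the authors' algorithm; it also uses dense Python lists and does not exploit sparsity. The function names and the objective scaling, (1/(2n))·||y − Xβ||² + λ·||β||₁, are assumptions for this illustration.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, sweeps=100):
    """Cyclic coordinate descent for lasso-penalized least squares."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X @ beta, kept up to date
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Inner product of column j with the partial residual
            # (the residual with feature j's contribution added back).
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n))
            new = soft_threshold(rho, n * lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta
```

Each update touches one coefficient and only the rows where that column is nonzero matter, which is why the approach suits sparse, high-dimensional design matrices; a sparse implementation would iterate over stored nonzeros instead of all rows.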
NASA Astrophysics Data System (ADS)
Ceotto, Michele; Di Liberto, Giovanni; Conte, Riccardo
2017-07-01
A new semiclassical "divide-and-conquer" method is presented with the aim of demonstrating that quantum dynamics simulations of high dimensional molecular systems are doable. The method is first tested by calculating the quantum vibrational power spectra of water, methane, and benzene (three molecules of increasing dimensionality for which benchmark quantum results are available) and then applied to C60, a system characterized by 174 vibrational degrees of freedom. Results show that the approach can accurately account for quantum anharmonicities, purely quantum features like overtones, and the removal of degeneracy when the molecular symmetry is broken.
NASA Astrophysics Data System (ADS)
Chen, Peng; Quarteroni, Alfio
2015-10-01
In this work we develop an adaptive and reduced computational algorithm based on dimension-adaptive sparse grid approximation and reduced basis methods for solving high-dimensional uncertainty quantification (UQ) problems. In order to tackle the computational challenge of "curse of dimensionality" commonly faced by these problems, we employ a dimension-adaptive tensor-product algorithm [16] and propose a verified version to enable effective removal of the stagnation phenomenon besides automatically detecting the importance and interaction of different dimensions. To reduce the heavy computational cost of UQ problems modelled by partial differential equations (PDE), we adopt a weighted reduced basis method [7] and develop an adaptive greedy algorithm in combination with the previous verified algorithm for efficient construction of an accurate reduced basis approximation. The efficiency and accuracy of the proposed algorithm are demonstrated by several numerical experiments.
NASA Astrophysics Data System (ADS)
Lestari, A. W.; Rustam, Z.
2017-07-01
In the last decade, breast cancer has become a focus of world attention because it is one of the leading causes of death for women. Correct precautions and treatment are therefore necessary. In previous studies, the Fuzzy Kernel K-Medoids algorithm has been used for multi-class data. This paper proposes an algorithm to classify high-dimensional breast cancer data using Fuzzy Possibilistic C-Means (FPCM) and a new clustering-based method, Normed Kernel Function-Based Fuzzy Possibilistic C-Means (NKFPCM). The objective of this paper is to obtain the best accuracy in the classification of breast cancer data. To improve the accuracy of the two methods, candidate features are evaluated by feature selection using the Laplacian Score. The results compare the accuracy and running time of FPCM and NKFPCM with and without feature selection.
ERIC Educational Resources Information Center
Dishion, Thomas J.; Ha, Thao; Veronneau, Marie-Helene
2012-01-01
The authors propose that peer relationships should be included in a life history perspective on adolescent problem behavior. Longitudinal analyses were used to examine deviant peer clustering as the mediating link between attenuated family ties, peer marginalization, and social disadvantage in early adolescence and sexual promiscuity in middle…
Chiu, Mei Choi; Pun, Chi Seng; Wong, Hoi Ying
2017-08-01
Investors interested in the global financial market must analyze financial securities internationally. Making an optimal global investment decision involves processing a huge amount of data for a high-dimensional portfolio. This article investigates the big data challenges of two mean-variance optimal portfolios: continuous-time precommitment and constant-rebalancing strategies. We show that both optimized portfolios implemented with the traditional sample estimates converge to the worst performing portfolio when the portfolio size becomes large. The crux of the problem is the estimation error accumulated from the huge dimension of stock data. We then propose a linear programming optimal (LPO) portfolio framework, which applies a constrained ℓ1 minimization to the theoretical optimal control to mitigate the risk associated with the dimensionality issue. The resulting portfolio becomes a sparse portfolio that selects stocks with a data-driven procedure and hence offers a stable mean-variance portfolio in practice. When the number of observations becomes large, the LPO portfolio converges to the oracle optimal portfolio, which is free of estimation error, even though the number of stocks grows faster than the number of observations. Our numerical and empirical studies demonstrate the superiority of the proposed approach. © 2017 Society for Risk Analysis.
NASA Astrophysics Data System (ADS)
Storm, Emma; Weniger, Christoph; Calore, Francesca
2017-08-01
We present SkyFACT (Sky Factorization with Adaptive Constrained Templates), a new approach for studying, modeling and decomposing diffuse gamma-ray emission. Like most previous analyses, the approach relies on predictions from cosmic-ray propagation codes like GALPROP and DRAGON. However, in contrast to previous approaches, we account for the fact that models are not perfect and allow for a very large number (≳ 10^5) of nuisance parameters to parameterize these imperfections. We combine methods of image reconstruction and adaptive spatio-spectral template regression in one coherent hybrid approach. To this end, we use penalized Poisson likelihood regression, with regularization functions that are motivated by the maximum entropy method. We introduce methods to efficiently handle the high dimensionality of the convex optimization problem as well as the associated semi-sparse covariance matrix, using the L-BFGS-B algorithm and Cholesky factorization. We test the method both on synthetic data and on gamma-ray emission from the inner Galaxy, |l| < 90° and |b| < 20°, as observed by the Fermi Large Area Telescope. We finally define a simple reference model that removes most of the residual emission from the inner Galaxy, based on conventional diffuse emission components as well as components for the Fermi bubbles, the Fermi Galactic center excess, and extended sources along the Galactic disk. Variants of this reference model can serve as a basis for future studies of diffuse emission in and outside the Galactic disk.
Matlab Cluster Ensemble Toolbox
Sapio, Vincent De; Kegelmeyer, Philip
2009-04-27
This is a Matlab toolbox for investigating the application of cluster ensembles to data classification, with the objective of improving the accuracy and/or speed of clustering. The toolbox divides the cluster ensemble problem into four areas, providing functionality for each: (1) synthetic data generation, (2) clustering to generate individual data partitions and similarity matrices, (3) consensus function generation and final clustering to generate ensemble data partitioning, and (4) implementation of accuracy metrics. With regard to data generation, Gaussian data of arbitrary dimension can be generated. The kcenters algorithm can then be used to generate individual data partitions either by (a) subsampling the data and clustering each subsample, or by (b) randomly initializing the algorithm and generating a clustering for each initialization. In either case, an overall similarity matrix can be computed using a consensus function operating on the individual similarity matrices. A final clustering can then be performed, and performance metrics are provided for evaluation purposes.
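As a rough illustration of how individual partitions can be combined into an overall similarity matrix, the sketch below uses a co-association (pairwise agreement) consensus function followed by a simple thresholded final clustering. This is an assumption chosen for illustration, written in Python rather than Matlab, and not necessarily the consensus function the toolbox implements. The names `co_association` and `consensus_clusters` are hypothetical.

```python
def co_association(partitions):
    """Fraction of partitions in which each pair of points shares a cluster.

    partitions is a list of label lists, one per individual clustering.
    Only label equality is used, so the labels need not match across runs.
    """
    n = len(partitions[0])
    m = len(partitions)
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / m for c in row] for row in counts]


def consensus_clusters(sim, threshold=0.5):
    """Final clustering: connected components of the graph that links
    every pair of points whose similarity exceeds the threshold."""
    n = len(sim)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # depth-first search over strongly co-clustered points
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels
```

For example, the three partitions `[0, 0, 1, 1]`, `[1, 1, 0, 0]` and `[0, 0, 0, 1]` agree on grouping the first two points in every run, so their similarity is 1.0, while the last two points are grouped in two of three runs.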
The end of gating? An introduction to automated analysis of high dimensional cytometry data.
Mair, Florian; Hartmann, Felix J; Mrdjen, Dunja; Tosevski, Vinko; Krieg, Carsten; Becher, Burkhard
2016-01-01
Ever since its invention half a century ago, flow cytometry has been a major tool for single-cell analysis, fueling advances in our understanding of a variety of complex cellular systems, in particular the immune system. The last decade has witnessed significant technical improvements in available cytometry platforms, such that more than 20 parameters can be analyzed on a single-cell level by fluorescence-based flow cytometry. The advent of mass cytometry has pushed this limit up to, currently, 50 parameters. However, traditional analysis approaches for the resulting high-dimensional datasets, such as gating on bivariate dot plots, have proven to be inefficient. Although a variety of novel computational analysis approaches to interpret these datasets are already available, they have not yet made it into the mainstream and remain largely unknown to many immunologists. Therefore, this review aims at providing a practical overview of novel analysis techniques for high-dimensional cytometry data including SPADE, t-SNE, Wanderlust, Citrus, and PhenoGraph, and how these applications can be used advantageously not only for the most complex datasets, but also for standard 14-parameter cytometry datasets.
Tiwari, Pallavi; Rosen, Mark; Reed, Galen; Kurhanewicz, John; Madabhushi, Anant
2009-01-01
The major challenge with classifying high dimensional biomedical data is in identifying the appropriate feature representation to (a) overcome the curse of dimensionality, and (b) facilitate separation between the data classes. Another challenge is to integrate information from two disparate modalities, possibly existing in different dimensional spaces, for improved classification. In this paper, we present a novel data representation, integration and classification scheme, Spectral Embedding based Probabilistic boosting Tree (ScEPTre), which incorporates Spectral Embedding (SE) for data representation and integration and a Probabilistic Boosting Tree classifier for data classification. SE provides an alternate representation of the data by non-linearly transforming high dimensional data into a low dimensional embedding space such that the relative adjacencies between objects are preserved. We demonstrate the utility of ScEPTre to classify and integrate Magnetic Resonance (MR) Spectroscopy (MRS) and Imaging (MRI) data for prostate cancer detection. The area under the receiver operating characteristic curve (AUC) obtained via randomized cross validation on 15 prostate MRI-MRS studies suggests that (a) ScEPTre on MRS significantly outperforms a Haar wavelets based classifier, (b) integration of MRI-MRS via ScEPTre performs significantly better compared to using MRI and MRS alone, and (c) data integration via ScEPTre yields superior classification results compared to combining decisions from individual classifiers (or modalities).
Discrimination and synthesis of recursive quantum states in high-dimensional Hilbert spaces
NASA Astrophysics Data System (ADS)
Simon, David S.; Fitzpatrick, Casey A.; Sergienko, Alexander V.
2015-04-01
We propose an interferometric method for statistically discriminating between nonorthogonal states in high-dimensional Hilbert spaces for use in quantum information processing. The method is illustrated for the case of photon orbital angular momentum (OAM) states. These states belong to pairs of bases that are mutually unbiased on a sequence of two-dimensional subspaces of the full Hilbert space, but the vectors within the same basis are not necessarily orthogonal to each other. Over multiple trials, this method allows distinguishing OAM eigenstates from superpositions of multiple such eigenstates. Variations of the same method are then shown to be capable of preparing and detecting arbitrary linear combinations of states in Hilbert space. One further variation allows the construction of chains of states obeying recurrence relations on the Hilbert space itself, opening a new range of possibilities for more abstract information-coding algorithms to be carried out experimentally in a simple manner. Among other applications, we show that this approach provides a simplified means of switching between pairs of high-dimensional mutually unbiased OAM bases.
Array-representation Integration Factor Method for High-dimensional Systems
Wang, Dongyong; Zhang, Lei; Nie, Qing
2013-01-01
High order spatial derivatives and stiff reactions often introduce severe temporal stability constraints on the time step in numerical methods. The implicit integration factor (IIF) method, which treats diffusion exactly and reaction implicitly, provides excellent stability properties with good efficiency by decoupling the treatment of reactions and diffusions. One major challenge for IIF is the storage and calculation of the potentially dense exponential matrices of the sparse discretization matrices resulting from the linear differential operators. Motivated by a compact representation for IIF (cIIF) for Laplacian operators in two and three dimensions, we introduce an array-representation technique for efficient handling of exponential matrices from a general linear differential operator that may include cross-derivatives and non-constant diffusion coefficients. In this approach, exponentials are only needed for matrices of small size that depend only on the order of derivatives and number of discretization points, independent of the size of spatial dimensions. This method is particularly advantageous for high dimensional systems, and it can be easily incorporated with IIF to preserve the excellent stability of IIF. Implementation and direct simulations of the array-representation compact IIF (AcIIF) on systems such as Fokker-Planck equations in three and four dimensions and chemical master equations, in addition to reaction-diffusion equations, show the efficiency, accuracy, and robustness of the new method. Such array-representation-based methods may have broad applications for simulating other complex systems involving high-dimensional data. PMID:24415797
Unbiased Prediction and Feature Selection in High-Dimensional Survival Regression
Laimighofer, Michael; Krumsiek, Jan; Theis, Fabian J.
2016-01-01
With widespread availability of omics profiling techniques, the analysis and interpretation of high-dimensional omics data, for example, for biomarkers, is becoming an increasingly important part of clinical medicine because such datasets constitute a promising resource for predicting survival outcomes. However, early experience has shown that biomarkers often generalize poorly. Thus, it is crucial that models are not overfitted and give accurate results with new data. In addition, reliable detection of multivariate biomarkers with high predictive power (feature selection) is of particular interest in clinical settings. We present an approach that addresses both aspects in high-dimensional survival models. Within a nested cross-validation (CV), we fit a survival model, evaluate a dataset in an unbiased fashion, and select features with the best predictive power by applying a weighted combination of CV runs. We evaluate our approach using simulated toy data, as well as three breast cancer datasets, to predict the survival of breast cancer patients after treatment. In all datasets, we achieve more reliable estimation of predictive power for unseen cases and better predictive performance compared to the standard CoxLasso model. Taken together, we present a comprehensive and flexible framework for survival models, including performance estimation, final feature selection, and final model construction. The proposed algorithm is implemented in an open source R package (SurvRank) available on CRAN. PMID:26894327
The validation and assessment of machine learning: a game of prediction from high-dimensional data.
Pers, Tune H; Albrechtsen, Anders; Holst, Claus; Sørensen, Thorkild I A; Gerds, Thomas A
2009-08-04
In applied statistics, tools from machine learning are popular for analyzing complex and high-dimensional data. However, few theoretical results are available that could guide the choice of an appropriate machine learning tool in a new application. Initial development of an overall strategy thus often implies that multiple methods are tested and compared on the same set of data. This is particularly difficult in situations that are prone to over-fitting, where the number of subjects is low compared to the number of potential predictors. The article presents a game which provides some grounds for conducting a fair model comparison. Each player selects a modeling strategy for predicting individual response from potential predictors. A strictly proper scoring rule, bootstrap cross-validation, and a set of rules are used to make the results obtained with different strategies comparable. To illustrate the ideas, the game is applied to data from the Nugenob Study, where the aim is to predict the fat oxidation capacity based on conventional factors and high-dimensional metabolomics data. Three players have chosen to use support vector machines, LASSO, and random forests, respectively.
A Joint Modeling Approach for Right Censored High Dimensional Multivariate Longitudinal Data
Jaffa, Miran A.; Gebregziabher, Mulugeta; Jaffa, Ayad A
2015-01-01
Analysis of multivariate longitudinal data becomes complicated when the outcomes are of high dimension and informative right censoring prevails. Here, we propose a likelihood-based approach for high dimensional outcomes wherein we jointly model the censoring process along with the slopes of the multivariate outcomes in the same likelihood function. We used a pseudo-likelihood function to generate parameter estimates for the population slopes and Empirical Bayes estimates for the individual slopes. The proposed approach was applied to jointly model longitudinal measures of blood urea nitrogen, plasma creatinine, and estimated glomerular filtration rate, which are key markers of kidney function, in a cohort of renal transplant patients followed from kidney transplant to kidney failure. Feasibility of the proposed joint model for high dimensional multivariate outcomes was successfully demonstrated and its performance was compared to that of a pairwise bivariate model. Our simulation study results suggested that there was a significant reduction in bias and mean squared errors associated with the joint model compared to the pairwise bivariate model. PMID:25688330
Finite-key analysis for time-energy high-dimensional quantum key distribution
NASA Astrophysics Data System (ADS)
Niu, Murphy Yuezhen; Xu, Feihu; Shapiro, Jeffrey H.; Furrer, Fabian
2016-11-01
Time-energy high-dimensional quantum key distribution (HD-QKD) leverages the high-dimensional nature of time-energy entangled biphotons and the loss tolerance of single-photon detection to achieve long-distance key distribution with high photon information efficiency. To date, the general-attack security of HD-QKD has only been proven in the asymptotic regime, while HD-QKD's finite-key security has only been established for a limited set of attacks. Here we fill this gap by providing a rigorous HD-QKD security proof for general attacks in the finite-key regime. Our proof relies on an entropic uncertainty relation that we derive for time and conjugate-time measurements that use dispersive optics, and our analysis includes an efficient decoy-state protocol in its parameter estimation. We present numerically evaluated secret-key rates illustrating the feasibility of secure and composable HD-QKD over metropolitan-area distances when the system is subjected to the most powerful eavesdropping attack.
A Robust Supervised Variable Selection for Noisy High-Dimensional Data
Kalina, Jan; Schlenker, Anna
2015-01-01
The Minimum Redundancy Maximum Relevance (MRMR) approach to supervised variable selection represents a successful methodology for dimensionality reduction, which is suitable for high-dimensional data observed in two or more different groups. Various available versions of the MRMR approach have been designed to search for variables with the largest relevance for a classification task while controlling for redundancy of the selected set of variables. However, usual relevance and redundancy criteria have the disadvantages of being too sensitive to the presence of outlying measurements and/or being inefficient. We propose a novel approach called Minimum Regularized Redundancy Maximum Robust Relevance (MRRMRR), suitable for noisy high-dimensional data observed in two groups. It combines principles of regularization and robust statistics. Particularly, redundancy is measured by a new regularized version of the coefficient of multiple correlation and relevance is measured by a highly robust correlation coefficient based on the least weighted squares regression with data-adaptive weights. We compare various dimensionality reduction methods on three real data sets. To investigate the influence of noise or outliers on the data, we perform the computations also for data artificially contaminated by severe noise of various forms. The experimental results confirm the robustness of the method with respect to outliers. PMID:26137474
Pang, Herbert; Jung, Sin-Ho
2013-04-01
A variety of prediction methods are used to relate high-dimensional genome data with a clinical outcome using a prediction model. Once a prediction model is developed from a data set, it should be validated using a resampling method or an independent data set. Although the existing prediction methods have been intensively evaluated by many investigators, there has not been a comprehensive study investigating the performance of the validation methods, especially with a survival clinical outcome. Understanding the properties of the various validation methods can allow researchers to perform more powerful validations while controlling for type I error. In addition, sample size calculation strategy based on these validation methods is lacking. We conduct extensive simulations to examine the statistical properties of these validation strategies. In both simulations and a real data example, we have found that 10-fold cross-validation with permutation gave the best power while controlling type I error close to the nominal level. Based on this, we have also developed a sample size calculation method that will be used to design a validation study with a user-chosen combination of prediction. Microarray and genome-wide association studies data are used as illustrations. The power calculation method in this presentation can be used for the design of any biomedical studies involving high-dimensional data and survival outcomes.
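The permutation step of such a validation can be sketched as a Monte Carlo permutation test. In this minimal illustration the statistic is a toy inner product. In the setting of the abstract it would instead be a cross-validated performance measure, which is not reproduced here. The names `permutation_p_value` and `dot`, the default of 199 permutations and the fixed seed are all assumptions of this sketch.

```python
import random


def permutation_p_value(statistic, x, y, n_perm=199, seed=0):
    """Monte Carlo permutation p-value for an association statistic.

    Larger values of statistic(x, y) are taken to mean stronger association.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = statistic(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any link between predictor and outcome
        if statistic(x, y_perm) >= observed:
            hits += 1
    # The add-one correction keeps the p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


def dot(x, y):
    """Toy statistic: inner product of predictor and outcome."""
    return sum(a * b for a, b in zip(x, y))
```

Rejecting when the returned value is at most the nominal level keeps the type I error at or below that level, because under the null hypothesis the observed statistic is exchangeable with the permuted ones.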
Does the Cerebral Cortex Exploit High-Dimensional, Non-linear Dynamics for Information Processing?
Singer, Wolf; Lazar, Andreea
2016-01-01
The discovery of stimulus-induced synchronization in the visual cortex suggested the possibility that the relations among low-level stimulus features are encoded by the temporal relationship between neuronal discharges. In this framework, temporal coherence is considered a signature of perceptual grouping. This insight triggered a large number of experimental studies which sought to investigate the relationship between temporal coordination and cognitive functions. While some core predictions derived from the initial hypothesis were confirmed, these studies also revealed a rich dynamical landscape beyond simple coherence whose role in signal processing is still poorly understood. In this paper, a framework is presented which establishes links between the various manifestations of cortical dynamics by assigning specific coding functions to low-dimensional dynamic features such as synchronized oscillations and phase shifts on the one hand and high-dimensional non-linear, non-stationary dynamics on the other. The data serving as basis for this synthetic approach have been obtained with chronic multisite recordings from the visual cortex of anesthetized cats and from monkeys trained to solve cognitive tasks. It is proposed that the low-dimensional dynamics characterized by synchronized oscillations and large-scale correlations are substates that represent the results of computations performed in the high-dimensional state-space provided by recurrently coupled networks. PMID:27713697
Validi, AbdoulAhad
2014-03-01
This study introduces a non-intrusive approach in the context of low-rank separated representation to construct a surrogate of high-dimensional stochastic functions, e.g., PDEs/ODEs, in order to decrease the computational cost of Markov Chain Monte Carlo simulations in Bayesian inference. The surrogate model is constructed via a regularized alternating least-squares regression with Tikhonov regularization using a roughening matrix computing the gradient of the solution, in conjunction with a perturbation-based error indicator to detect optimal model complexities. The model approximates a vector of a continuous solution at discrete values of a physical variable. The number of random realizations required for a successful approximation depends linearly on the function dimensionality. The computational cost of the model construction is quadratic in the number of random inputs, which potentially tackles the curse of dimensionality in high-dimensional stochastic functions. Furthermore, this vector-valued separated representation-based model, in comparison to the available scalar-valued case, leads to a significant reduction in the cost of approximation by an order of magnitude equal to the vector size. The performance of the method is studied through its application to three numerical examples including a 41-dimensional elliptic PDE and a 21-dimensional cavity flow.
Prediction of Incident Diabetes in the Jackson Heart Study Using High-Dimensional Machine Learning.
Casanova, Ramon; Saldana, Santiago; Simpson, Sean L; Lacy, Mary E; Subauste, Angela R; Blackshear, Chad; Wagenknecht, Lynne; Bertoni, Alain G
2016-01-01
Statistical models to predict incident diabetes are often based on limited variables. Here we pursued two main goals: 1) investigate the relative performance of a machine learning method such as Random Forests (RF) for detecting incident diabetes in a high-dimensional setting defined by a large set of observational data, and 2) uncover potential predictors of diabetes. The Jackson Heart Study collected data at baseline and in two follow-up visits from 5,301 African Americans. We excluded those with baseline diabetes and no follow-up, leaving 3,633 individuals for analyses. Over a mean 8-year follow-up, 584 participants developed diabetes. The full RF model evaluated 93 variables including demographic, anthropometric, blood biomarker, medical history, and echocardiogram data. We also used RF metrics of variable importance to rank variables according to their contribution to diabetes prediction. We implemented other models based on logistic regression and RF where features were preselected. The RF full model performance was similar (AUC = 0.82) to those more parsimonious models. The top-ranked variables according to RF included hemoglobin A1C, fasting plasma glucose, waist circumference, adiponectin, c-reactive protein, triglycerides, leptin, left ventricular mass, high-density lipoprotein cholesterol, and aldosterone. This work shows the potential of RF for incident diabetes prediction while dealing with high-dimensional data.
Pang, Herbert; Jung, Sin-Ho
2013-01-01
A variety of prediction methods are used to relate high-dimensional genome data with a clinical outcome using a prediction model. Once a prediction model is developed from a data set, it should be validated using a resampling method or an independent data set. Although the existing prediction methods have been intensively evaluated by many investigators, there has not been a comprehensive study investigating the performance of the validation methods, especially with a survival clinical outcome. Understanding the properties of the various validation methods can allow researchers to perform more powerful validations while controlling for type I error. In addition, sample size calculation strategy based on these validation methods is lacking. We conduct extensive simulations to examine the statistical properties of these validation strategies. In both simulations and a real data example, we have found that 10-fold cross-validation with permutation gave the best power while controlling type I error close to the nominal level. Based on this, we have also developed a sample size calculation method that will be used to design a validation study with a user-chosen combination of prediction. Microarray and genome-wide association studies data are used as illustrations. The power calculation method in this presentation can be used for the design of any biomedical studies involving high-dimensional data and survival outcomes. PMID:23471879
Multiple Group Testing Procedures for Analysis of High-Dimensional Genomic Data
Ko, Hyoseok; Kim, Kipoong
2016-01-01
In genetic association studies with high-dimensional genomic data, multiple group testing procedures are often required in order to identify disease/trait-related genes or genetic regions, where multiple genetic sites or variants are located within the same gene or genetic region. However, statistical testing procedures based on an individual test suffer from multiple testing issues such as the control of family-wise error rate and dependent tests. Moreover, detecting only a few genes associated with a phenotype outcome among tens of thousands of genes is of main interest in genetic association studies. For this reason, regularization procedures, where a phenotype outcome is regressed on all genomic markers and the regression coefficients are then estimated based on a penalized likelihood, have been considered a good alternative approach to the analysis of high-dimensional genomic data. However, the selection performance of regularization procedures has rarely been compared with that of statistical group testing procedures. In this article, we performed extensive simulation studies where commonly used group testing procedures such as principal component analysis, Hotelling's T² test, and permutation test are compared with group lasso (least absolute shrinkage and selection operator) in terms of true positive selection. We also applied all methods considered in the simulation studies to identify genes associated with ovarian cancer from over 20,000 genetic sites generated from the Illumina Infinium HumanMethylation27K Beadchip. We found a large discrepancy in the selected genes between multiple group testing procedures and group lasso. PMID:28154510
The cross-validated AUC for MCP-logistic regression with high-dimensional data.
Jiang, Dingfeng; Huang, Jian; Zhang, Ying
2013-10-01
We propose a cross-validated area under the receiver operating characteristic (ROC) curve (CV-AUC) criterion for tuning parameter selection for penalized methods in sparse, high-dimensional logistic regression models. We use this criterion in combination with the minimax concave penalty (MCP) method for variable selection. The CV-AUC criterion is specifically designed for optimizing the classification performance for binary outcome data. To implement the proposed approach, we derive an efficient coordinate descent algorithm to compute the MCP-logistic regression solution surface. Simulation studies are conducted to evaluate the finite sample performance of the proposed method and to compare it with existing methods including the Akaike information criterion (AIC), Bayesian information criterion (BIC) and Extended BIC (EBIC). The model selected based on the CV-AUC criterion tends to have a larger predictive AUC and smaller classification error than those with tuning parameters selected using the AIC, BIC or EBIC. We illustrate the application of the MCP-logistic regression with the CV-AUC criterion on three microarray datasets from studies that attempt to identify genes related to cancers. Our simulation studies and data examples demonstrate that the CV-AUC is an attractive method for tuning parameter selection for penalized methods in high-dimensional logistic regression models.
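The selection rule behind a CV-AUC criterion can be sketched as follows, assuming held-out prediction scores have already been collected for every observation under each candidate tuning value. The MCP-logistic fitting step itself is not shown. The names `auc` and `select_by_cv_auc` and the dictionary layout are hypothetical choices for this illustration, not the authors' implementation.

```python
def auc(scores, labels):
    """Empirical AUC: probability that a random positive case is scored
    above a random negative case, counting ties as one half."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def select_by_cv_auc(cv_scores, labels):
    """Pick the tuning value whose held-out scores give the largest AUC.

    cv_scores maps each candidate tuning value to the list of scores
    predicted for every observation while it was in a held-out fold.
    """
    return max(cv_scores, key=lambda lam: auc(cv_scores[lam], labels))
```

Because the AUC depends only on the ranking of the scores, this criterion targets classification performance directly rather than likelihood fit, which is the contrast with AIC and BIC that the abstract draws.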
Distribution of high-dimensional entanglement via an intra-city free-space link
Steinlechner, Fabian; Ecker, Sebastian; Fink, Matthias; Liu, Bo; Bavaresco, Jessica; Huber, Marcus; Scheidl, Thomas; Ursin, Rupert
2017-01-01
Quantum entanglement is a fundamental resource in quantum information processing and its distribution between distant parties is a key challenge in quantum communications. Increasing the dimensionality of entanglement has been shown to improve robustness and channel capacities in secure quantum communications. Here we report on the distribution of genuine high-dimensional entanglement via a 1.2-km-long free-space link across Vienna. We exploit hyperentanglement, that is, simultaneous entanglement in polarization and energy-time bases, to encode quantum information, and observe high-visibility interference for successive correlation measurements in each degree of freedom. These visibilities impose lower bounds on entanglement in each subspace individually and certify four-dimensional entanglement for the hyperentangled system. The high-fidelity transmission of high-dimensional entanglement under real-world atmospheric link conditions represents an important step towards long-distance quantum communications with more complex quantum systems and the implementation of advanced quantum experiments with satellite links. PMID:28737168
Multivariate multidistance tests for high-dimensional low sample size case-control studies.
Marozzi, Marco
2015-04-30
A class of multivariate tests for case-control studies with high-dimensional low sample size data and with complex dependence structure, which are common in medical imaging and molecular biology, is proposed. The tests can be applied when the number of variables is much larger than the number of subjects and when the underlying population distributions are heavy-tailed or skewed. As a motivating application, we consider a case-control study where phase-contrast cinematic cardiovascular magnetic resonance imaging has been used to compare many cardiovascular characteristics of young healthy smokers and young healthy non-smokers. The tests are based on the combination of tests on interpoint distances. It is theoretically proved that the tests are exact, unbiased and consistent. It is shown that the tests are very powerful under normal, heavy-tailed and skewed distributions. The tests can also be applied to case-control studies with high-dimensional low sample size data from other medical imaging techniques (like computed tomography or X-ray radiography), chemometrics and microarray data (proteomics and transcriptomics).
Quantum secret sharing based on modulated high-dimensional time-bin entanglement
Takesue, Hiroki; Inoue, Kyo
2006-07-15
We propose a scheme for quantum secret sharing (QSS) that uses a modulated high-dimensional time-bin entanglement. By modulating the relative phase randomly by {0, π}, a sender with the entanglement source can randomly change the sign of the correlation of the measurement outcomes obtained by two distant recipients. The two recipients must cooperate if they are to obtain the sign of the correlation, which is used as a secret key. We show that our scheme is secure against intercept-and-resend (IR) and beam splitting attacks by an outside eavesdropper thanks to the nonorthogonality of high-dimensional time-bin entangled states. We also show that a cheating attempt based on an IR attack by one of the recipients can be detected by changing the dimension of the time-bin entanglement randomly and inserting two 'vacant' slots between the packets. Then, cheating attempts can be detected by monitoring the count rate in the vacant slots. The proposed scheme has better experimental feasibility than previously proposed entanglement-based QSS schemes.
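The key-sharing logic in this abstract (a hidden sign that only the product of the two recipients' outcomes reveals) can be mimicked with a purely classical toy model. This is a sketch of the bookkeeping only, not a simulation of time-bin entanglement or of the security analysis. The mapping of phase 0 to a positive correlation and phase π to a negative one is an assumption, and `share_round` and `recover_sign` are hypothetical names.

```python
import random


def share_round(rng):
    """One toy round; returns (sign, outcome_a, outcome_b)."""
    sign = rng.choice([+1, -1])   # assumed mapping: phase 0 -> +1, phase pi -> -1
    a = rng.choice([+1, -1])      # recipient A's outcome is uniformly random on its own
    b = sign * a                  # recipient B's outcome is correlated or anticorrelated with A's
    return sign, a, b


def recover_sign(a, b):
    """The recipients recover the sender's sign only by combining outcomes."""
    return a * b
```

In this toy model each recipient's outcome taken alone is a fair coin flip regardless of the sign, which is why neither can read the key bit without the other.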
A Joint Modeling Approach for Right Censored High Dimensional Multivariate Longitudinal Data.
Jaffa, Miran A; Gebregziabher, Mulugeta; Jaffa, Ayad A
2014-08-01
Analysis of multivariate longitudinal data becomes complicated when the outcomes are of high dimension and informative right censoring is prevailing. Here, we propose a likelihood-based approach for high-dimensional outcomes wherein we jointly model the censoring process along with the slopes of the multivariate outcomes in the same likelihood function. We utilized a pseudo-likelihood function to generate parameter estimates for the population slopes and Empirical Bayes estimates for the individual slopes. The proposed approach was applied to jointly model longitudinal measures of blood urea nitrogen, plasma creatinine, and estimated glomerular filtration rate, which are key markers of kidney function, in a cohort of renal transplant patients followed from kidney transplant to kidney failure. Feasibility of the proposed joint model for high-dimensional multivariate outcomes was successfully demonstrated and its performance was compared to that of a pairwise bivariate model. Our simulation study results suggested that there was a significant reduction in bias and mean squared errors associated with the joint model compared to the pairwise bivariate model.
Prediction of Incident Diabetes in the Jackson Heart Study Using High-Dimensional Machine Learning
Casanova, Ramon; Saldana, Santiago; Simpson, Sean L.; Lacy, Mary E.; Subauste, Angela R.; Blackshear, Chad; Wagenknecht, Lynne; Bertoni, Alain G.
2016-01-01
Statistical models to predict incident diabetes are often based on limited variables. Here we pursued two main goals: 1) investigate the relative performance of a machine learning method such as Random Forests (RF) for detecting incident diabetes in a high-dimensional setting defined by a large set of observational data, and 2) uncover potential predictors of diabetes. The Jackson Heart Study collected data at baseline and in two follow-up visits from 5,301 African Americans. We excluded those with baseline diabetes and no follow-up, leaving 3,633 individuals for analyses. Over a mean 8-year follow-up, 584 participants developed diabetes. The full RF model evaluated 93 variables including demographic, anthropometric, blood biomarker, medical history, and echocardiogram data. We also used RF metrics of variable importance to rank variables according to their contribution to diabetes prediction. We implemented other models based on logistic regression and RF where features were preselected. The RF full model performance was similar (AUC = 0.82) to those more parsimonious models. The top-ranked variables according to RF included hemoglobin A1C, fasting plasma glucose, waist circumference, adiponectin, c-reactive protein, triglycerides, leptin, left ventricular mass, high-density lipoprotein cholesterol, and aldosterone. This work shows the potential of RF for incident diabetes prediction while dealing with high-dimensional data. PMID:27727289
NASA Astrophysics Data System (ADS)
Sargsyan, K.; Safta, C.; Ray, J.; Debusschere, B.; Najm, H.; Ricciuto, D. M.; Thornton, P. E.
2012-12-01
Uncertainty quantification capabilities have been boosted considerably by recent advances in associated algorithms and software, as well as increased computational capabilities. As a result, it has become possible to address uncertainties in complex climate models more quantitatively. However, there still remain numerous challenges when dealing with complex climate models. In this work, we highlight and address some of these challenges, using the Community Land Model (CLM) as the main benchmark system for algorithm development. To begin with, climate models are computationally intensive. This necessarily disqualifies pure Monte-Carlo algorithms for uncertainty estimation, since naive Monte-Carlo approaches require too many sampled simulations for reasonable accuracy. In this work, we build computationally inexpensive surrogate model in order to accelerate both forward and inverse UQ methods. We apply Polynomial Chaos (PC) spectral expansions to build surrogate relationships between output quantities and model parameters using as few forward model simulations as possible. Next, climate models typically suffer from the curse of dimensionality. For example, the CLM depends on about 80 input parameters with somewhat uncertain values. Representation of the input-output dependence requires prohibitively many basis functions for spectral expansions. Moreover, to obtain such a representation, one needs to sample an 80-dimensional space, which can at best be sparsely covered. We apply Bayesian compressive sensing (BCS) techniques in order to infer the best basis set for the PC surrogate model. BCS performs particularly well in high-dimensional settings when model simulations are very sparse. Furthermore, many climate models incorporate dependent uncertain parameters. In this context, we apply the Rosenblatt transformation, mapping dependent parameters into a computationally convenient set of independent variables. This allows efficient parameter sampling even in presence of
Online clustering algorithms for radar emitter classification.
Liu, Jun; Lee, Jim P Y; Li, Lingjie; Luo, Zhi-Quan; Wong, K Max
2005-08-01
Radar emitter classification is a special application of data clustering for classifying unknown radar emitters from received radar pulse samples. The main challenges of this task are the high dimensionality of radar pulse samples, small sample group size, and closely located radar pulse clusters. In this paper, two new online clustering algorithms are developed for radar emitter classification: One is model-based using the Minimum Description Length (MDL) criterion and the other is based on competitive learning. Computational complexity is analyzed for each algorithm and then compared. Simulation results show the superior performance of the model-based algorithm over competitive learning in terms of better classification accuracy, flexibility, and stability.
Multiple frame cluster tracking
NASA Astrophysics Data System (ADS)
Gadaleta, Sabino; Klusman, Mike; Poore, Aubrey; Slocumb, Benjamin J.
2002-08-01
Tracking a large number of closely spaced objects is a challenging problem for any tracking system. In missile defense systems, countermeasures in the form of debris, chaff, spent fuel, and balloons can overwhelm tracking systems that track only individual objects. Thus, tracking these groups or clusters of objects, followed by transitions to individual object tracking (if and when individual objects separate from the groups), is a necessary capability for a robust and real-time tracking system. The objectives of this paper are to describe the group tracking problem in the context of multiple frame target tracking and to formulate a general assignment problem for the multiple frame cluster/group tracking problem. The proposed approach forms multiple clustering hypotheses on each frame of data and bases individual frame clustering decisions on the information from multiple frames of data in much the same way that MFA or MHT work for individual object tracking. The formulation of the assignment problem for resolved object tracking and candidate clustering methods for use in multiple frame cluster tracking are briefly reviewed. Then, three different formulations are presented for the combination of multiple clustering hypotheses on each frame of data and the multiple frame assignments of clusters between frames.
Steinwand, Daniel R.; Maddox, Brian; Beckmann, Tim; Hamer, George
2003-01-01
Beowulf clusters can provide a cost-effective way to compute numerical models and process large amounts of remote sensing image data. Usually a Beowulf cluster is designed to accomplish a specific set of processing goals, and processing is very efficient when the problem remains inside the constraints of the original design. There are cases, however, when one might wish to compute a problem that is beyond the capacity of the local Beowulf system. In these cases, spreading the problem to multiple clusters or to other machines on the network may provide a cost-effective solution.
An Efficient Initialization Method for K-Means Clustering of Hyperspectral Data
NASA Astrophysics Data System (ADS)
Alizade Naeini, A.; Jamshidzadeh, A.; Saadatseresht, M.; Homayouni, S.
2014-10-01
K-means is the most frequently used partitional clustering algorithm in the remote sensing community. Unfortunately, due to its gradient descent nature, this algorithm is highly sensitive to the initial placement of cluster centers. This problem worsens for high-dimensional data such as hyperspectral remotely sensed imagery. To tackle this problem, in this paper, the spectral signatures of the endmembers in the image scene are extracted and used as the initial positions of the cluster centers. For this purpose, in the first step, a Neyman-Pearson detection theory based eigen-thresholding method (i.e., the HFC method) is employed to estimate the number of endmembers in the image. Afterwards, the spectral signatures of the endmembers are obtained using the Minimum Volume Enclosing Simplex (MVES) algorithm. Eventually, these spectral signatures are used to initialize the k-means clustering algorithm. The proposed method is implemented on a hyperspectral dataset acquired by the ROSIS sensor with 103 spectral bands over the Pavia University campus, Italy. For comparative evaluation, two other commonly used initialization methods (i.e., Bradley & Fayyad (BF) and Random methods) are implemented and compared. The confusion matrix, overall accuracy and Kappa coefficient are employed to assess the methods' performance. The evaluations demonstrate that the proposed solution outperforms the other initialization methods and can be applied for unsupervised classification of hyperspectral imagery for landcover mapping.
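The sensitivity to initial centers noted in this abstract is easy to reproduce with a minimal k-means sketch that takes the starting centers as an explicit argument. In the paper those centers would be endmember signatures obtained with HFC and MVES; here `init_centers` is simply a hypothetical list of points supplied by the caller, and neither HFC nor MVES is implemented.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, init_centers, n_iter=100):
    """Lloyd's algorithm started from caller-supplied centers."""
    centers = [tuple(c) for c in init_centers]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest current center.
        new_labels = [min(range(len(centers)), key=lambda k: sq_dist(p, centers[k]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the centers are fixed too
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centers[k] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, labels
```

With the four points (0,0), (0,1), (10,0), (10,1), starting from (0,0) and (10,0) separates the left pair from the right pair, whereas starting from (0,0) and (0,1) converges to a split along the other axis, a poorer local optimum of the same objective.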
2013-01-01
Background: There is a large body of knowledge to support the importance of early interventions to improve child health and development. Nonetheless, it is important to identify cost-effective blends of preventive interventions with adequate coverage and feasible delivery modes. The aim of the Children and Parents in Focus trial is to compare two levels of parenting programme intensity and rate of exposure with a control condition, to address the impact and cost-effectiveness of a universally offered evidence-based parenting programme in the Swedish context. Methods/Design: The trial has a cluster randomised controlled design comprising three arms: Universal arm (with access to participation in Triple P - Positive Parenting Program, level 2); Universal Plus arm (with access to participation in Triple P - Positive Parenting Program, level 2 as well as level 3, and level 4 group); and Services as Usual arm. The sampling frame is Uppsala municipality in Sweden. Child health centres consecutively recruit parents of children aged 3 to 5 years before their yearly check-ups (during the years 2013–2017). Outcomes will be measured annually. The primary outcome will be children’s behavioural and emotional problems as rated by three informants: fathers, mothers and preschool teachers. The other outcomes will be parents’ behaviour and parents’ general health. Health economic evaluations will analyse cost-effectiveness of the interventions versus care as usual by comparing the costs and consequences in terms of impact on children’s mental health, parents’ mental health and health-related quality of life. Discussion: This study addresses the need for comprehensive evaluation of the long-term effects, costs and benefits of early parenting interventions embedded within existing systems. In addition, the study will generate population-based data on the mental health and well-being of preschool-aged children in Sweden. Trial registration: ISRCTN: ISRCTN16513449. PMID:24131587
Wang, Zhiping; Chen, Jinyu; Yu, Benli
2017-02-20
We investigate the two-dimensional (2D) and three-dimensional (3D) atom localization behaviors via spontaneously generated coherence in a microwave-driven four-level atomic system. Owing to the space-dependent atom-field interaction, it is found that the detecting probability and precision of 2D and 3D atom localization behaviors can be significantly improved via adjusting the system parameters, the phase, amplitude, and initial population distribution. Interestingly, the atom can be localized in volumes that are substantially smaller than a cubic optical wavelength. Our scheme opens a promising way to achieve high-precision and high-efficiency atom localization, which provides some potential applications in high-dimensional atom nanolithography.
Transient times and periods in the high-dimensional shape-space model for immune systems
NASA Astrophysics Data System (ADS)
Zorzenon dos Santos, Rita Maria
1993-05-01
A simplified version of the cellular automata approximation introduced by De Boer, Segel and Perelson in the shape-space model, to describe the interaction of different types of B cells in the immune system, indicates the existence of a threshold separating the periodic regime from the chaotic one, on high-dimensional finite lattices. We study the behavior of the periods of the limit cycles nearby the transition threshold as well as the behavior of the transient times necessary to attain the attractors in the periodic regime. We find that both become large close to the threshold. We also find that even before the chaotic regime is reached, the system is already trapped in a sort of non-healthy state. Nevertheless the system will never attain it, because the transient times in this region are much larger than the usual average lifetime of the system.
An adaptive fuzzy neural network for MIMO system model approximation in high-dimensional spaces.
Chak, C K; Feng, G; Ma, J
1998-01-01
An adaptive fuzzy system implemented within the framework of a neural network is proposed. The integration of the fuzzy system into a neural network enables the new fuzzy system to have learning and adaptive capabilities. The proposed fuzzy neural network can locate its rules and optimize its membership functions by competitive learning, the Kalman filter algorithm and extended Kalman filter algorithms. A key feature of the new architecture is that a high-dimensional fuzzy system can be implemented with fewer rules than the Takagi-Sugeno fuzzy systems. A number of simulations are presented to demonstrate the performance of the proposed system, including modeling of a nonlinear function, an operator's control of a chemical plant, stock prices and a bioreactor (multioutput dynamical system).
Narrow peaks and high dimensionalities: exploiting the advantages of random sampling.
Kazimierczuk, Krzysztof; Zawadzka, Anna; Koźmiński, Wiktor
2009-04-01
The level of artifacts in spectra obtained by the Multidimensional Fourier Transform has been studied, considering randomly sampled signals of high dimensionality and long evolution times. It has been shown theoretically and experimentally that this level depends on the number of time-domain samples, but not on its relation to the number of points required in the corresponding conventional experiment. This independence of the evolution time domain size (in terms of both dimensionality and evolution time reached) suggests that random sampling should be used to design new techniques with large time domains rather than to accelerate standard experiments. 5D HC(CC-TOCSY)CONH is presented as an example of such an approach. A feature of the Multidimensional Fourier Transform, namely the possibility of calculating spectral values at arbitrarily chosen frequency points, allowed easy examination of the resulting spectrum. We present an example of such an approach, referred to as Sparse Multidimensional Fourier Transform.
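The ability to evaluate the spectrum at any chosen frequency from randomly placed time samples can be shown with a one-dimensional sketch of a direct Fourier sum. This is an illustration under simplifying assumptions (a single dimension, no weighting or relaxation), not the authors' multidimensional implementation, and `sparse_ft` is a hypothetical name.

```python
import cmath


def sparse_ft(times, values, freq):
    """Direct sum S(f) = sum_k s(t_k) * exp(-2*pi*i*f*t_k) over the sampled points.

    The sample times need not lie on a regular grid, and freq can be any
    real number, so only the spectral points of interest need be computed.
    """
    return sum(v * cmath.exp(-2j * cmath.pi * freq * t)
               for t, v in zip(times, values))
```

For a single complex exponential sampled at N random times, the magnitude at the true frequency equals N, while away from it the terms add with scattered phases and the residual (the sampling artifact level) grows only on the order of the square root of N, consistent with the abstract's point that the artifact level is governed by the number of samples.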
Stable High-Dimensional Spatial Optical Solitons and Vortices in an Active Raman Gain Medium
NASA Astrophysics Data System (ADS)
Li, Hui-jun; Duan, Yu-hua; Wen, Wen; Huang, Guoxiang
2015-05-01
We propose a scheme to produce stable high-dimensional spatial optical solitons and vortices in an M-type five-level active Raman gain medium at room temperature. We derive a (2+1)-dimensional [(2+1)D] nonlinear Schrödinger (NLS) equation with a 2D trapping potential, which is contributed by an assisted field. We show that by adjusting the system parameter, the signs of the Kerr nonlinearity and the external potential can be manipulated at will. We then present three types of NLS equation, provide their soliton solutions, and analyze their stabilities. We finally discuss the differences in the soliton solutions between (2+1)D and (3+1)D systems with the same 2D trapping potential.
NASA Astrophysics Data System (ADS)
Sobrino-Coll, N.; Puertas-Centeno, D.; Toranzo, I. V.; Dehesa, J. S.
2017-08-01
In this work we find that not only the Heisenberg-like uncertainty products and the Rényi-entropy-based uncertainty sum have the same first-order values for all the quantum states of the D-dimensional hydrogenic and oscillator-like systems, respectively, in the pseudoclassical (D → ∞) limit but a similar phenomenon also happens for both the Fisher-information-based uncertainty product and the Shannon-entropy-based uncertainty sum, as well as for the Cramér-Rao and Fisher-Shannon complexities. Moreover, we show that the López-Ruiz-Mancini-Calvet (LMC) and LMC-Rényi complexity measures capture the hydrogenic-harmonic difference in the high dimensional limit already at first order.
Multiple imputation for high-dimensional mixed incomplete continuous and binary data.
He, Ren; Belin, Thomas
2014-06-15
It is common in applied research to have large numbers of variables measured on a modest number of cases. Even with low rates of missingness of individual variables, such data sets can have a large number of incomplete cases with a mix of data types. Here, we propose a new joint modeling approach to address the high-dimensional incomplete data with a mix of continuous and binary data. Specifically, we propose a multivariate normal model encompassing both continuous variables and latent variables corresponding to binary variables. We apply a parameter-extended Metropolis–Hastings algorithm to generate the covariance matrix of a mixture of continuous and binary variables. We also introduce prior distribution families for unstructured covariance matrices to reduce the dimension of the parameter space. In several simulation settings, the method is compared with available-case analysis, a rounding method, and a sequential regression method.
DecisionFlow: Visual Analytics for High-Dimensional Temporal Event Sequence Data.
Gotz, David; Stavropoulos, Harry
2014-12-01
Temporal event sequence data is increasingly commonplace, with applications ranging from electronic medical records to financial transactions to social media activity. Previously developed techniques have focused on low-dimensional datasets (e.g., with less than 20 distinct event types). Real-world datasets are often far more complex. This paper describes DecisionFlow, a visual analysis technique designed to support the analysis of high-dimensional temporal event sequence data (e.g., thousands of event types). DecisionFlow combines a scalable and dynamic temporal event data structure with interactive multi-view visualizations and ad hoc statistical analytics. We provide a detailed review of our methods, and present the results from a 12-person user study. The study results demonstrate that DecisionFlow enables the quick and accurate completion of a range of sequence analysis tasks for datasets containing thousands of event types and millions of individual events.
Duchesne, Simon; Mouiha, Abderazzak
2011-01-01
We propose a novel morphological factor estimate from structural MRI for disease state evaluation. We tested this methodology in the context of Alzheimer's disease (AD) with 349 subjects. The method consisted in (a) creating a reference MRI feature eigenspace using intensity and local volume change data from 149 healthy, young subjects; (b) projecting MRI data from 75 probable AD, 76 controls (CTRL), and 49 Mild Cognitive Impairment (MCI) in that space; (c) extracting high-dimensional discriminant functions; (d) calculating a single morphological factor based on various models. We used this methodology in leave-one-out experiments to (1) confirm the superiority of an inverse-squared model over other approaches; (2) obtain accuracy estimates for the discrimination of probable AD from CTRL (90%) and the prediction of conversion of MCI subjects to probable AD (79.4%). PMID:21755033
High-dimensional single-cell analysis reveals the immune signature of narcolepsy.
Hartmann, Felix J; Bernard-Valnet, Raphaël; Quériault, Clémence; Mrdjen, Dunja; Weber, Lukas M; Galli, Edoardo; Krieg, Carsten; Robinson, Mark D; Nguyen, Xuan-Hung; Dauvilliers, Yves; Liblau, Roland S; Becher, Burkhard
2016-11-14
Narcolepsy type 1 is a devastating neurological sleep disorder resulting from the destruction of orexin-producing neurons in the central nervous system (CNS). Despite its striking association with the HLA-DQB1*06:02 allele, the autoimmune etiology of narcolepsy has remained largely hypothetical. Here, we compared peripheral mononucleated cells from narcolepsy patients with HLA-DQB1*06:02-matched healthy controls using high-dimensional mass cytometry in combination with algorithm-guided data analysis. Narcolepsy patients displayed multifaceted immune activation in CD4(+) and CD8(+) T cells dominated by elevated levels of B cell-supporting cytokines. Additionally, T cells from narcolepsy patients showed increased production of the proinflammatory cytokines IL-2 and TNF. Although it remains to be established whether these changes are primary to an autoimmune process in narcolepsy or secondary to orexin deficiency, these findings are indicative of inflammatory processes in the pathogenesis of this enigmatic disease. © 2016 Hartmann et al.
High-dimensional single-cell analysis reveals the immune signature of narcolepsy
Quériault, Clémence; Krieg, Carsten; Nguyen, Xuan-Hung
2016-01-01
Narcolepsy type 1 is a devastating neurological sleep disorder resulting from the destruction of orexin-producing neurons in the central nervous system (CNS). Despite its striking association with the HLA-DQB1*06:02 allele, the autoimmune etiology of narcolepsy has remained largely hypothetical. Here, we compared peripheral mononucleated cells from narcolepsy patients with HLA-DQB1*06:02-matched healthy controls using high-dimensional mass cytometry in combination with algorithm-guided data analysis. Narcolepsy patients displayed multifaceted immune activation in CD4+ and CD8+ T cells dominated by elevated levels of B cell–supporting cytokines. Additionally, T cells from narcolepsy patients showed increased production of the proinflammatory cytokines IL-2 and TNF. Although it remains to be established whether these changes are primary to an autoimmune process in narcolepsy or secondary to orexin deficiency, these findings are indicative of inflammatory processes in the pathogenesis of this enigmatic disease. PMID:27821550
A two-state hysteresis model from high-dimensional friction.
Biswas, Saurabh; Chatterjee, Anindya
2015-07-01
In prior work (Biswas & Chatterjee 2014 Proc. R. Soc. A 470, 20130817 (doi:10.1098/rspa.2013.0817)), we developed a six-state hysteresis model from a high-dimensional frictional system. Here, we use a more intuitively appealing frictional system that resembles one studied earlier by Iwan. The basis functions now have simple analytical description. The number of states required decreases further, from six to the theoretical minimum of two. The number of fitted parameters is reduced by an order of magnitude, to just six. An explicit and faster numerical solution method is developed. Parameter fitting to match different specified hysteresis loops is demonstrated. In summary, a new two-state model of hysteresis is presented that is ready for practical implementation. Essential Matlab code is provided.
A two-state hysteresis model from high-dimensional friction
Biswas, Saurabh; Chatterjee, Anindya
2015-01-01
In prior work (Biswas & Chatterjee 2014 Proc. R. Soc. A 470, 20130817 (doi:10.1098/rspa.2013.0817)), we developed a six-state hysteresis model from a high-dimensional frictional system. Here, we use a more intuitively appealing frictional system that resembles one studied earlier by Iwan. The basis functions now have simple analytical description. The number of states required decreases further, from six to the theoretical minimum of two. The number of fitted parameters is reduced by an order of magnitude, to just six. An explicit and faster numerical solution method is developed. Parameter fitting to match different specified hysteresis loops is demonstrated. In summary, a new two-state model of hysteresis is presented that is ready for practical implementation. Essential Matlab code is provided. PMID:26587279
Beyond the adaptive matched filter: nonlinear detectors for weak signals in high-dimensional clutter
NASA Astrophysics Data System (ADS)
Theiler, James; Foy, Bernard R.; Fraser, Andrew M.
2007-04-01
For known signals that are linearly superimposed on Gaussian backgrounds, the linear adaptive matched filter (AMF) is well-known to be the optimal detector. The AMF has furthermore proved to be remarkably effective in a broad range of circumstances where it is not optimal, and for which the optimal detector is not linear. In these cases, nonlinear detectors are theoretically superior, but direct estimation of nonlinear detectors in high-dimensional spaces often leads to flagrant overfitting and poor out-of-sample performance. Despite this difficulty in the general case, we will describe several situations in which nonlinearity can be effectively combined with the AMF to detect weak signals. This allows improvement over AMF performance while avoiding the full force of dimensionality's curse.
Testing interaction between treatment and high-dimensional covariates in randomized clinical trials.
Callegaro, Andrea; Spiessens, Bart; Dizier, Benjamin; Montoya, Fernando U; van Houwelingen, Hans C
2016-10-20
In this paper, we considered different methods to test the interaction between treatment and a potentially large number (p) of covariates in randomized clinical trials. The simplest approach was to fit univariate (marginal) models and to combine the univariate statistics or p-values (e.g., minimum p-value). Another possibility was to reduce the dimension of the covariates using the principal components (PCs) and to test the interaction between treatment and PCs. Finally, we considered the Goeman global test applied to the high-dimensional interaction matrix, adjusted for the main (treatment and covariates) effects. These tests can be used for personalized medicine to test if a large set of biomarkers can be useful to identify a subset of patients who may be more responsive to treatment. We evaluated the performance of these methods on simulated data and we applied them to data from two early-phase oncology clinical trials.
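As a rough illustration of the simplest approach mentioned in this abstract, the sketch below combines per-covariate interaction p-values through their minimum. The Bonferroni adjustment used to turn the minimum into a global p-value is an assumption made here for simplicity; the abstract does not say how the authors calibrate the minimum p-value (a permutation reference distribution is another common choice). The function name `min_p_global_test` is hypothetical.

```python
def min_p_global_test(p_values):
    """Global p-value from marginal p-values via a Bonferroni-adjusted minimum.

    p_values: one interaction p-value per covariate, each from its own
    univariate (marginal) model fitted elsewhere.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    m = len(p_values)
    # Bonferroni: multiply the smallest p-value by the number of tests, cap at 1.
    return min(1.0, m * min(p_values))
```

The Bonferroni bound is valid under any dependence between covariates but becomes conservative when they are strongly correlated, which is typical of biomarker panels.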
High-dimensional quantum key distribution with the entangled single-photon-added coherent state
NASA Astrophysics Data System (ADS)
Wang, Yang; Bao, Wan-Su; Bao, Hai-Ze; Zhou, Chun; Jiang, Mu-Sheng; Li, Hong-Wei
2017-04-01
High-dimensional quantum key distribution (HD-QKD) can generate more secure bits for one detection event so that it can achieve long distance key distribution with a high secret key capacity. In this Letter, we present a decoy state HD-QKD scheme with the entangled single-photon-added coherent state (ESPACS) source. We present two tight formulas to estimate the single-photon fraction of postselected events and Eve's Holevo information and derive lower bounds on the secret key capacity and the secret key rate of our protocol. We also present finite-key analysis for our protocol by using the Chernoff bound. Our numerical results show that our protocol using one decoy state can perform better than that of previous HD-QKD protocol with the spontaneous parametric down conversion (SPDC) using two decoy states. Moreover, when considering finite resources, the advantage is more obvious.
High-Dimensional Circular Quantum Secret Sharing Using Orbital Angular Momentum
NASA Astrophysics Data System (ADS)
Tang, Dawei; Wang, Tie-jun; Mi, Sichen; Geng, Xiao-Meng; Wang, Chuan
2016-11-01
Quantum secret sharing distributes a secret message securely among multiple parties. Here, exploiting the orbital angular momentum (OAM) state of single photons as the information carrier, we propose a high-dimensional circular quantum secret sharing protocol that greatly increases the channel capacity. In the proposed protocol, the secret message is split into two parts, each encoded on the OAM state of single photons. The security of the protocol is guaranteed by the no-cloning theorem, and the secret messages cannot be recovered unless the two receivers collaborate with each other. Moreover, the proposed protocol could be extended to high-level quantum systems, and enhanced security could be achieved.
Pal, Ranjan; Chelmis, Charalampos; Aman, Saima; Frincu, Marc; Prasanna, Viktor
2015-07-15
The advent of smart meters and advanced communication infrastructures catalyzes numerous smart grid applications such as dynamic demand response, and paves the way to solve challenging research problems in sustainable energy consumption. The space of solution possibilities is restricted primarily by the huge amount of generated data requiring considerable computational resources and efficient algorithms. To overcome this Big Data challenge, data clustering techniques have been proposed. Current approaches, however, do not scale in the face of the “increasing dimensionality” problem where a cluster point is represented by the entire customer consumption time series. To overcome this, we first rethink the way cluster points are created and designed, and then design an efficient online clustering technique for demand response (DR) in order to analyze high-volume, high-dimensional energy consumption time series data at scale, and on the fly. Our online algorithm is randomized in nature, and provides optimal performance guarantees in a computationally efficient manner. Unlike prior work we (i) study the consumption properties of the whole population simultaneously rather than developing individual models for each customer separately, claiming it to be a ‘killer’ approach that breaks the “curse of dimensionality” in online time series clustering, and (ii) provide tight performance guarantees in theory to validate our approach. Our insights are driven by the field of sociology, where collective behavior often emerges as the result of individual patterns and lifestyles.
The role of high-dimensional diffusive search, stabilization, and frustration in protein folding.
Rimratchada, Supreecha; McLeish, Tom C B; Radford, Sheena E; Paci, Emanuele
2014-04-15
Proteins are polymeric molecules with many degrees of conformational freedom whose internal energetic interactions are typically screened to small distances. Therefore, in the high-dimensional conformation space of a protein, the energy landscape is locally relatively flat, in contrast to low-dimensional representations, where, because of the induced entropic contribution to the full free energy, it appears funnel-like. Proteins explore the conformation space by searching these flat subspaces to find a narrow energetic alley that we call a hypergutter and then explore the next, lower-dimensional, subspace. Such a framework provides an effective representation of the energy landscape and folding kinetics that does justice to the essential characteristic of high-dimensionality of the search-space. It also illuminates the important role of nonnative interactions in defining folding pathways. This principle is here illustrated using a coarse-grained model of a family of three-helix bundle proteins whose conformations, once secondary structure has formed, can be defined by six rotational degrees of freedom. Two folding mechanisms are possible, one of which involves an intermediate. The stabilization of intermediate subspaces (or states in low-dimensional projection) in protein folding can either speed up or slow down the folding rate depending on the amount of native and nonnative contacts made in those subspaces. The folding rate increases due to reduced-dimension pathways arising from the mere presence of intermediate states, but decreases if the contacts in the intermediate are very stable and introduce sizeable topological or energetic frustration that needs to be overcome. Remarkably, the hypergutter framework, although depending on just a few physically meaningful parameters, can reproduce all the types of experimentally observed curvature in chevron plots for realizations of this fold.
Otto, Frank
2014-01-07
The multi-layer multi-configuration time-dependent Hartree method (ML-MCTDH) is a highly efficient scheme for studying the dynamics of high-dimensional quantum systems. Its use is greatly facilitated if the Hamiltonian of the system possesses a particular structure through which the multi-dimensional matrix elements can be computed efficiently. In the field of quantum molecular dynamics, the effective interaction between the atoms is often described by potential energy surfaces (PES), and it is necessary to fit such PES into the desired structure. For high-dimensional systems, the current approaches for this fitting process either lead to fits that are too large to be practical, or their accuracy is difficult to predict and control. This article introduces multi-layer Potfit (MLPF), a novel fitting scheme that results in a PES representation in the hierarchical tensor (HT) format. The scheme is based on the hierarchical singular value decomposition, which can yield a near-optimal fit and give strict bounds for the obtained accuracy. Here, a recursive scheme for using the HT-format PES within ML-MCTDH is derived, and theoretical estimates as well as a computational example show that the use of MLPF can reduce the numerical effort for ML-MCTDH by orders of magnitude, compared to the traditionally used POTFIT representation of the PES. Moreover, it is shown that MLPF is especially beneficial for high-accuracy PES representations, and it turns out that MLPF leads to computational savings already for comparatively small systems with just four modes.
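The link between a singular value decomposition and a controllable fit error can be seen in the simplest two-mode case. The sketch below is not MLPF and does not build a hierarchical tensor; it only truncates the SVD of a potential sampled on a two-dimensional grid. The model potential, the grid sizes, and the rank are assumptions chosen for illustration. For a matrix, the Frobenius-norm error of the rank-r truncation equals the square root of the sum of the squared discarded singular values (Eckart-Young), so the accuracy is known before the fit is used.

```python
import numpy as np

# Hypothetical two-mode potential sampled on a product grid.
x = np.linspace(-1.0, 1.0, 40)
y = np.linspace(-1.0, 1.0, 50)
X, Y = np.meshgrid(x, y, indexing="ij")
V = 0.5 * X**2 + 0.5 * Y**2 + 0.3 * X**2 * Y + np.exp(-((X - Y) ** 2))

# Thin SVD: V = U @ diag(s) @ Vt, a sum of products of one-mode functions.
U, s, Vt = np.linalg.svd(V, full_matrices=False)


def truncated_fit(rank):
    """Keep only the `rank` largest singular terms of the expansion."""
    return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]


rank = 3
fit = truncated_fit(rank)
err = np.linalg.norm(V - fit)               # actual Frobenius error
bound = np.sqrt(np.sum(s[rank:] ** 2))      # predicted from discarded values
print(f"rank {rank}: error {err:.3e}, predicted {bound:.3e}")
```

The truncated fit stores rank * (40 + 50) numbers plus the singular values instead of 40 * 50 grid values; the abstract's claim is that applying this idea hierarchically keeps such savings for many modes.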
Technology innovation clusters are geographic concentrations of interconnected companies, universities, and other organizations with a focus on environmental technology. They play a key role in addressing the nation’s pressing environmental problems.
Relation chain based clustering analysis
NASA Astrophysics Data System (ADS)
Zhang, Cheng-ning; Zhao, Ming-yang; Luo, Hai-bo
2011-08-01
Clustering analysis is currently one of the well-developed branches of data mining; it aims to find hidden structures in a multidimensional space called the feature or pattern space. A datum in this space usually takes the form of a vector whose elements represent several specifically selected features. These features are usually chosen for their relevance to the problem at hand. Generally, clustering analysis falls into two divisions: one based on agglomerative clustering methods and the other based on divisive clustering methods. The former is a bottom-up process that initially regards each datum as a singleton cluster, while the latter is a top-down process that initially regards the entire data set as one cluster. According to the collected literature, divisive clustering currently dominates both application and research. Although some well-known divisive clustering methods have been designed and well developed, clustering problems are still far from solved. The k-means algorithm is the original divisive clustering method; it requires some important values to be assigned initially, such as the number of clusters and the initial positions of the cluster prototypes, and these choices may not be reasonable in certain situations. Beyond the initialization problem, the k-means algorithm may also fall into a local optimum, clusters in a rigid way, and is not suitable for non-Gaussian distributions. Seeking a good or natural clustering result in fact originates from one's understanding of the concept of clustering. Thus, confusion about or misunderstanding of the definition of clustering always leads to unsatisfactory clustering results. One should consider the definition deeply and seriously. This paper demonstrates the nature of clustering, gives a way of understanding clustering, discusses the methodology of designing a clustering algorithm, and proposes a new clustering method based on relation chains among 2D patterns. In
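The sensitivity of k-means to its initial prototypes, mentioned in the abstract above, can be reproduced on a tiny example. The sketch below is a plain one-dimensional Lloyd iteration, not the relation-chain method the paper proposes; the data set and both initializations are assumptions chosen so the effect is easy to follow by hand.

```python
def lloyd_1d(points, centers, max_iter=100):
    """Run Lloyd's k-means iteration on 1-D data from the given initial centers."""
    centers = list(centers)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    sse = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, sse


data = [0, 1, 10, 11, 20, 21]  # three well-separated pairs

good_centers, good_sse = lloyd_1d(data, [0, 10, 20])
bad_centers, bad_sse = lloyd_1d(data, [0, 1, 15])

print(good_centers, good_sse)  # [0.5, 10.5, 20.5] 1.5
print(bad_centers, bad_sse)    # [0.0, 1.0, 15.5] 101.0
```

Both runs converge, but the second start places two prototypes inside the first pair and leaves one prototype to cover the remaining four points, which is a stable yet much worse partition.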
Dishion, Thomas J; Ha, Thao; Véronneau, Marie-Hélène
2012-05-01
The authors propose that peer relationships should be included in a life history perspective on adolescent problem behavior. Longitudinal analyses were used to examine deviant peer clustering as the mediating link between attenuated family ties, peer marginalization, and social disadvantage in early adolescence and sexual promiscuity in middle adolescence and childbearing by early adulthood. Specifically, 998 youths, along with their families, were assessed at age 11 years and periodically through age 24 years. Structural equation modeling revealed that the peer-enhanced life history model provided a good fit to the longitudinal data, with deviant peer clustering strongly predicting adolescent sexual promiscuity and other correlated problem behaviors. Sexual promiscuity, as expected, also strongly predicted the number of children by ages 22-24 years. Consistent with a life history perspective, family social disadvantage directly predicted deviant peer clustering and number of children in early adulthood, controlling for all other variables in the model. These data suggest that deviant peer clustering is a core dimension of a fast life history strategy, with strong links to sexual activity and childbearing. The implications of these findings are discussed with respect to the need to integrate an evolutionary-based model of self-organized peer groups in developmental and intervention science.