Querying Patterns in High-Dimensional Heterogenous Datasets
ERIC Educational Resources Information Center
Singh, Vishwakarma
2012-01-01
Recent technological advancements have led to the availability of a plethora of heterogeneous datasets, e.g., images tagged with geo-location and descriptive keywords. An object in these datasets is described by a set of high-dimensional feature vectors. For example, a keyword-tagged image is represented by a color-histogram and a…
A Hybrid Semi-Supervised Anomaly Detection Model for High-Dimensional Data.
Song, Hongchao; Jiang, Zhuqing; Men, Aidong; Yang, Bo
2017-01-01
Anomaly detection, which aims to identify observations that deviate from a nominal sample, is a challenging task for high-dimensional data. Traditional distance-based anomaly detection methods compute the neighborhood distance for each observation and suffer from the curse of dimensionality in high-dimensional space; for example, the distances between any pair of samples become similar and every sample may appear to be an outlier. In this paper, we propose a hybrid semi-supervised anomaly detection model for high-dimensional data that consists of two parts: a deep autoencoder (DAE) and an ensemble k-nearest neighbor graph (K-NNG) based anomaly detector. Benefiting from its ability for nonlinear mapping, the DAE is first trained to learn the intrinsic features of a high-dimensional dataset and to represent the high-dimensional data in a more compact subspace. Several nonparametric KNN-based anomaly detectors are then built from different subsets that are randomly sampled from the whole dataset. The final prediction is made by combining all the anomaly detectors. The performance of the proposed method is evaluated on several real-life datasets, and the results confirm that the proposed hybrid model improves detection accuracy and reduces computational complexity.
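As a rough illustration of the two-stage pipeline described above (compression followed by an ensemble of subset-based k-NN detectors), the following sketch uses scikit-learn with a linear TruncatedSVD standing in for the deep autoencoder; the function name, parameter values and toy data are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: compress the data, then score anomalies with an ensemble of
# k-NN detectors built on random subsets. TruncatedSVD stands in for the DAE.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

def ensemble_knn_scores(X, n_components=10, n_detectors=5, subset_frac=0.5, k=5, seed=0):
    rng = np.random.default_rng(seed)
    Z = TruncatedSVD(n_components=n_components, random_state=seed).fit_transform(X)
    scores = np.zeros(len(Z))
    for _ in range(n_detectors):
        idx = rng.choice(len(Z), size=int(subset_frac * len(Z)), replace=False)
        nn = NearestNeighbors(n_neighbors=k).fit(Z[idx])
        dist, _ = nn.kneighbors(Z)           # distance of every point to the subset
        scores += dist.mean(axis=1)          # larger mean k-NN distance = more anomalous
    return scores / n_detectors

X = np.random.rand(500, 100)                 # toy high-dimensional data
print(ensemble_knn_scores(X)[:5])
```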
Ensemble of sparse classifiers for high-dimensional biological data.
Kim, Sunghan; Scalzo, Fabien; Telesca, Donatello; Hu, Xiao
2015-01-01
Biological data are often high in dimension while the number of samples is small. In such cases, the performance of classification can be improved by reducing the dimension of the data, which is referred to as feature selection. Recently, a novel feature selection method has been proposed that utilises the sparsity of high-dimensional biological data, where a small subset of features accounts for most of the variance of the dataset. In this study we propose a new classification method for high-dimensional biological data, which performs both feature selection and classification within a single framework. Our proposed method utilises a sparse linear solution technique and the bootstrap aggregating algorithm. We tested its performance on four public mass spectrometry cancer datasets and compared it with two conventional classification techniques, Support Vector Machines and Adaptive Boosting. The results demonstrate that our proposed method performs more accurate classification across the various cancer datasets than those conventional classification techniques.
NASA Astrophysics Data System (ADS)
Taylor, M. B.
2009-09-01
The new plotting functionality in version 2.0 of STILTS is described. STILTS is a mature and powerful package for all kinds of table manipulation, and this version adds facilities for generating plots from one or more tables to its existing wide range of non-graphical capabilities. 2- and 3-dimensional scatter plots and 1-dimensional histograms may be generated using highly configurable style parameters. Features include multiple dataset overplotting, variable transparency, 1-, 2- or 3-dimensional symmetric or asymmetric error bars, higher-dimensional visualization using color, and textual point labeling. Vector and bitmapped output formats are supported. The plotting options provide enough flexibility to perform meaningful visualization on datasets from a few points up to tens of millions. Arbitrarily large datasets can be plotted without heavy memory usage.
Can We Train Machine Learning Methods to Outperform the High-dimensional Propensity Score Algorithm?
Karim, Mohammad Ehsanul; Pang, Menglan; Platt, Robert W
2018-03-01
The use of retrospective health care claims datasets is frequently criticized for the lack of complete information on potential confounders. Utilizing patients' health status-related information from claims datasets as surrogates or proxies for mismeasured and unobserved confounders, the high-dimensional propensity score algorithm enables us to reduce bias. Using a previously published cohort study of postmyocardial infarction statin use (1998-2012), we compare the performance of the algorithm with a number of popular machine learning approaches for confounder selection in high-dimensional covariate spaces: random forest, least absolute shrinkage and selection operator, and elastic net. Our results suggest that, when the data analysis is done with epidemiologic principles in mind, machine learning methods perform as well as the high-dimensional propensity score algorithm. Using a plasmode framework that mimicked the empirical data, we also showed that a hybrid of machine learning and high-dimensional propensity score algorithms generally performs slightly better than either alone in terms of mean squared error, when a bias-based analysis is used.
Progeny Clustering: A Method to Identify Biological Phenotypes
Hu, Chenyue W.; Kornblau, Steven M.; Slater, John H.; Qutub, Amina A.
2015-01-01
Estimating the optimal number of clusters is a major challenge in applying cluster analysis to any type of dataset, especially to biomedical datasets, which are high-dimensional and complex. Here, we introduce an improved method, Progeny Clustering, which is stability-based and computationally efficient, to find the ideal number of clusters. The algorithm employs a novel Progeny Sampling method to reconstruct cluster identity, a co-occurrence probability matrix to assess clustering stability, and a set of reference datasets to overcome inherent biases in the algorithm and data space. Our method was shown to be successful and robust when applied to two synthetic datasets (a two-dimensional dataset and a ten-dimensional dataset containing eight dimensions of pure noise), two standard biological datasets (the Iris dataset and the Rat CNS dataset) and two biological datasets (a cell phenotype dataset and an acute myeloid leukemia (AML) reverse phase protein array (RPPA) dataset). Progeny Clustering outperformed some popular clustering evaluation methods on the ten-dimensional synthetic dataset as well as on the cell phenotype dataset, and it was the only method that successfully discovered clinically meaningful patient groupings in the AML RPPA dataset. PMID:26267476
Localized Ambient Solidity Separation Algorithm Based Computer User Segmentation.
Sun, Xiao; Zhang, Tongda; Chai, Yueting; Liu, Yi
2015-01-01
Most popular clustering methods make strong assumptions about the dataset. For example, k-means implicitly assumes that all clusters come from spherical Gaussian distributions which have different means but the same covariance. However, when dealing with datasets that have diverse distribution shapes or high dimensionality, these assumptions may no longer be valid. To overcome this weakness, we propose a new clustering algorithm, the localized ambient solidity separation (LASS) algorithm, which uses a new isolation criterion called centroid distance. Compared with other density-based isolation criteria, our proposed centroid distance isolation criterion addresses the problems caused by high dimensionality and varying density. The experiment on a designed two-dimensional benchmark dataset shows that our proposed LASS algorithm not only inherits the advantage of the original dissimilarity increments clustering method in separating naturally isolated clusters but can also identify clusters that are adjacent, overlapping, or embedded in background noise. Finally, we compared our LASS algorithm with the dissimilarity increments clustering method on a massive computer user dataset with over two million records containing demographic and behavioral information. The results show that the LASS algorithm works extremely well on this computer user dataset and can extract more knowledge from it.
Exploring patterns enriched in a dataset with contrastive principal component analysis.
Abid, Abubakar; Zhang, Martin J; Bagaria, Vivek K; Zou, James
2018-05-30
Visualization and exploration of high-dimensional data is a ubiquitous challenge across disciplines. Widely used techniques such as principal component analysis (PCA) aim to identify dominant trends in one dataset. However, in many settings we have datasets collected under different conditions, e.g., a treatment and a control experiment, and we are interested in visualizing and exploring patterns that are specific to one dataset. This paper proposes a method, contrastive principal component analysis (cPCA), which identifies low-dimensional structures that are enriched in a dataset relative to comparison data. In a wide variety of experiments, we demonstrate that cPCA with a background dataset enables us to visualize dataset-specific patterns missed by PCA and other standard methods. We further provide a geometric interpretation of cPCA and strong mathematical guarantees. An implementation of cPCA is publicly available, and can be used for exploratory data analysis in many applications where PCA is currently used.
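The core computation behind cPCA can be illustrated with a short sketch: take the top eigenvectors of the target covariance minus α times the background covariance. This is a minimal sketch under the assumption of a single, fixed α (the published method additionally explores a range of α values); the function name and toy data are illustrative.

```python
# Hedged sketch of the contrastive PCA computation: directions that maximize
# variance in the target data while suppressing variance in the background.
import numpy as np

def cpca_directions(target, background, alpha=1.0, n_components=2):
    target = target - target.mean(axis=0)
    background = background - background.mean(axis=0)
    c_t = np.cov(target, rowvar=False)
    c_b = np.cov(background, rowvar=False)
    w, v = np.linalg.eigh(c_t - alpha * c_b)     # symmetric matrix -> eigh
    order = np.argsort(w)[::-1][:n_components]   # largest contrastive eigenvalues
    return v[:, order]

rng = np.random.default_rng(0)
target = rng.normal(size=(200, 30))
background = rng.normal(size=(200, 30))
V = cpca_directions(target, background, alpha=2.0)
projected = (target - target.mean(axis=0)) @ V   # 2D view enriched in the target
```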
Li, Jinyan; Fong, Simon; Wong, Raymond K; Millham, Richard; Wong, Kelvin K L
2017-06-28
Due to the high-dimensional characteristics of such datasets, we propose a new method based on the Wolf Search Algorithm (WSA) for optimising the feature selection problem. The proposed approach uses the natural selection strategy described by Charles Darwin; that is, 'It is not the strongest of the species that survives, but the most adaptable'. This means that in the evolution of a swarm, the elitists are motivated to quickly obtain more and better resources. The memory function helps the proposed method avoid repeated searches of the worst positions in order to enhance the effectiveness of the search, while the binary strategy simplifies the feature selection problem into an analogous function optimisation problem. Furthermore, the wrapper strategy combines these strengthened wolves with an extreme learning machine classifier to find a sub-dataset with a reasonable number of features that offers the maximum correctness of global classification models. The experimental results from the six public high-dimensional bioinformatics datasets tested demonstrate that the proposed method can outperform some conventional feature selection methods by up to 29% in classification accuracy, and reduce computational time by up to 99.81% compared with previous WSAs.
A reanalysis dataset of the South China Sea.
Zeng, Xuezhi; Peng, Shiqiu; Li, Zhijin; Qi, Yiquan; Chen, Rongyu
2014-01-01
Ocean reanalysis provides a temporally continuous and spatially gridded four-dimensional estimate of the ocean state for a better understanding of the ocean dynamics and its spatial/temporal variability. Here we present a 19-year (1992-2010) high-resolution ocean reanalysis dataset of the upper ocean in the South China Sea (SCS) produced from an ocean data assimilation system. A wide variety of observations, including in-situ temperature/salinity profiles, ship-measured and satellite-derived sea surface temperatures, and sea surface height anomalies from satellite altimetry, are assimilated into the outputs of an ocean general circulation model using a multi-scale incremental three-dimensional variational data assimilation scheme, yielding a daily high-resolution reanalysis dataset of the SCS. Comparisons between the reanalysis and independent observations support the reliability of the dataset. The presented dataset provides the research community of the SCS an important data source for studying the thermodynamic processes of the ocean circulation and meso-scale features in the SCS, including their spatial and temporal variability.
Manifold Learning by Preserving Distance Orders.
Ataer-Cansizoglu, Esra; Akcakaya, Murat; Orhan, Umut; Erdogmus, Deniz
2014-03-01
Nonlinear dimensionality reduction is essential for the analysis and the interpretation of high dimensional data sets. In this manuscript, we propose a distance order preserving manifold learning algorithm that extends the basic mean-squared error cost function used mainly in multidimensional scaling (MDS)-based methods. We develop a constrained optimization problem by assuming explicit constraints on the order of distances in the low-dimensional space. In this optimization problem, as a generalization of MDS, instead of forcing a linear relationship between the distances in the high-dimensional original and low-dimensional projection space, we learn a non-decreasing relation approximated by radial basis functions. We compare the proposed method with existing manifold learning algorithms using synthetic datasets based on the commonly used residual variance and proposed percentage of violated distance orders metrics. We also perform experiments on a retinal image dataset used in Retinopathy of Prematurity (ROP) diagnosis.
Clavel, Julien; Aristide, Leandro; Morlon, Hélène
2018-06-19
Working with high-dimensional phylogenetic comparative datasets is challenging because likelihood-based multivariate methods suffer from low statistical performance as the number of traits p approaches the number of species n, and because computational complications arise when p exceeds n. Alternative phylogenetic comparative methods have recently been proposed to deal with the large-p, small-n scenario, but their use and performance are limited. Here we develop a penalized likelihood framework to deal with high-dimensional comparative datasets. We propose various penalizations and methods for selecting the intensity of the penalties. We apply this general framework to the estimation of parameters (the evolutionary trait covariance matrix and parameters of the evolutionary model) and to model comparison for the high-dimensional multivariate Brownian (BM), Early-burst (EB), Ornstein-Uhlenbeck (OU) and Pagel's lambda models. We show using simulations that our penalized likelihood approach dramatically improves the estimation of evolutionary trait covariance matrices and model parameters when p approaches n, and allows for their accurate estimation when p equals or exceeds n. In addition, we show that penalized likelihood models can be efficiently compared using the Generalized Information Criterion (GIC). We implement these methods, as well as the related estimation of ancestral states and the computation of phylogenetic PCA, in the R packages RPANDA and mvMORPH. Finally, we illustrate the utility of the proposed framework by evaluating evolutionary model fit, analyzing integration patterns, and reconstructing evolutionary trajectories for a high-dimensional 3-D dataset of brain shape in the New World monkeys. We find clear support for an Early-burst model, suggesting an early diversification of brain morphology during the ecological radiation of the clade. Penalized likelihood offers an efficient way to deal with high-dimensional multivariate comparative data.
CoINcIDE: A framework for discovery of patient subtypes across multiple datasets.
Planey, Catherine R; Gevaert, Olivier
2016-03-09
Patient disease subtypes have the potential to transform personalized medicine. However, many patient subtypes derived from unsupervised clustering analyses on high-dimensional datasets are not replicable across multiple datasets, limiting their clinical utility. We present CoINcIDE, a novel methodological framework for the discovery of patient subtypes across multiple datasets that requires no between-dataset transformations. We also present a high-quality database collection, curatedBreastData, with over 2,500 breast cancer gene expression samples. We use CoINcIDE to discover novel breast and ovarian cancer subtypes with prognostic significance and novel hypothesized ovarian therapeutic targets across multiple datasets. CoINcIDE and curatedBreastData are available as R packages.
Beta Hebbian Learning as a New Method for Exploratory Projection Pursuit.
Quintián, Héctor; Corchado, Emilio
2017-09-01
In this research, a novel family of learning rules called Beta Hebbian Learning (BHL) is thoroughly investigated to extract information from high-dimensional datasets by projecting the data onto low-dimensional (typically two-dimensional) subspaces, improving on existing exploratory methods by providing a clear representation of the data's internal structure. BHL applies a family of learning rules derived from the Probability Density Function (PDF) of the residual based on the beta distribution. This family of rules may be called Hebbian in that all use a simple multiplication of the output of the neural network with some function of the residuals after feedback. The derived learning rules can be linked to an adaptive form of Exploratory Projection Pursuit, and with artificial distributions the networks perform as the theory suggests they should: the use of different learning rules derived from different PDFs allows the identification of "interesting" dimensions (as far from the Gaussian distribution as possible) in high-dimensional datasets. This novel algorithm, BHL, has been tested on seven artificial datasets to study the behavior of the BHL parameters, and was later applied successfully to four real datasets, comparing its results, in terms of performance, with other well-known exploratory and projection models such as Maximum Likelihood Hebbian Learning (MLHL), Locally-Linear Embedding (LLE), Curvilinear Component Analysis (CCA), Isomap and Neural Principal Component Analysis (Neural PCA).
NASA Astrophysics Data System (ADS)
Davoudi, Alireza; Shiry Ghidary, Saeed; Sadatnejad, Khadijeh
2017-06-01
Objective. In this paper, we propose a nonlinear dimensionality reduction algorithm for the manifold of symmetric positive definite (SPD) matrices that considers the geometry of SPD matrices and provides a low-dimensional representation of the manifold with high class discrimination in a supervised or unsupervised manner. Approach. The proposed algorithm tries to preserve the local structure of the data by preserving distances to local means (DPLM) and also provides an implicit projection matrix. DPLM is linear in terms of the number of training samples. Main results. We performed several experiments on the multi-class dataset IIa from BCI competition IV and two other datasets from BCI competition III, namely datasets IIIa and IVa. The results show that our approach, as a dimensionality reduction technique, leads to superior results in comparison with other competitors in the related literature because of its robustness against outliers and the way it preserves the local geometry of the data. Significance. The experiments confirm that the combination of DPLM with filter geodesic minimum distance to mean as the classifier leads to superior performance compared with the state of the art on brain-computer interface competition IV dataset IIa. Also, the statistical analysis shows that our dimensionality reduction method performs significantly better than its competitors.
Li, Yiqing; Wang, Yu; Zi, Yanyang; Zhang, Mingquan
2015-10-21
The various multi-sensor signal features from a diesel engine constitute a complex high-dimensional dataset. The non-linear dimensionality reduction method t-distributed stochastic neighbor embedding (t-SNE) provides an effective way to implement data visualization for complex high-dimensional data. However, irrelevant features can deteriorate the performance of data visualization and thus should be eliminated a priori. This paper proposes a feature subset score based t-SNE (FSS-t-SNE) data visualization method to deal with high-dimensional data collected from multi-sensor signals. In this method, the optimal feature subset is constructed by a feature subset score criterion. Then the high-dimensional data are visualized in a 2-dimensional space. According to the UCI dataset test, FSS-t-SNE can effectively improve the classification accuracy. An experiment was performed with a large power marine diesel engine to validate the proposed method for diesel engine malfunction classification. Multi-sensor signals were collected by a cylinder vibration sensor and a cylinder pressure sensor. Compared with other conventional data visualization methods, the proposed method shows good visualization performance and high classification accuracy in multi-malfunction classification of a diesel engine.
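A minimal sketch of the pipeline shape described above, assuming an ANOVA F-score as a stand-in for the paper's feature subset score criterion and scikit-learn's t-SNE for the embedding; the dataset and the number of retained features are arbitrary choices for illustration.

```python
# Hedged sketch: score features, keep a high-scoring subset, then embed with
# t-SNE for 2-D visualization. The digits data here are a toy stand-in.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.feature_selection import f_classif
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]                   # small subset to keep the demo fast
scores, _ = f_classif(X, y)
scores = np.nan_to_num(scores)            # constant features yield NaN scores
keep = np.argsort(scores)[::-1][:30]      # keep the 30 highest-scoring features
emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(X[:, keep])
print(emb.shape)                          # (n_samples, 2)
```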
Visual exploration of high-dimensional data through subspace analysis and dynamic projections
Liu, S.; Wang, B.; Thiagarajan, J. J.; ...
2015-06-01
Here, we introduce a novel interactive framework for visualizing and exploring high-dimensional datasets based on subspace analysis and dynamic projections. We assume the high-dimensional dataset can be represented by a mixture of low-dimensional linear subspaces with mixed dimensions, and provide a method to reliably estimate the intrinsic dimension and linear basis of each subspace extracted from the subspace clustering. Subsequently, we use these bases to define unique 2D linear projections as viewpoints from which to visualize the data. To understand the relationships among the different projections and to discover hidden patterns, we connect these projections through dynamic projections that create smooth animated transitions between pairs of projections. We introduce the view transition graph, which provides flexible navigation among these projections to facilitate an intuitive exploration. Finally, we provide detailed comparisons with related systems, and use real-world examples to demonstrate the novelty and usability of our proposed framework.
Zhou, Zhen; Wang, Jian-Bao; Zang, Yu-Feng; Pan, Gang
2018-01-01
Classification approaches have been increasingly applied to differentiate patients and normal controls using resting-state functional magnetic resonance imaging data (RS-fMRI). Although most previous classification studies have reported promising accuracy within individual datasets, achieving high levels of accuracy with multiple datasets remains challenging for two main reasons: high dimensionality, and high variability across subjects. We used two independent RS-fMRI datasets (n = 31, 46, respectively) both with eyes closed (EC) and eyes open (EO) conditions. For each dataset, we first reduced the number of features to a small number of brain regions with paired t-tests, using the amplitude of low frequency fluctuation (ALFF) as a metric. Second, we employed a new method for feature extraction, named the PAIR method, examining EC and EO as paired conditions rather than independent conditions. Specifically, for each dataset, we obtained EC minus EO (EC—EO) maps of ALFF from half of subjects (n = 15 for dataset-1, n = 23 for dataset-2) and obtained EO—EC maps from the other half (n = 16 for dataset-1, n = 23 for dataset-2). A support vector machine (SVM) method was used for classification of EC RS-fMRI mapping and EO mapping. The mean classification accuracy of the PAIR method was 91.40% for dataset-1, and 92.75% for dataset-2 in the conventional frequency band of 0.01–0.08 Hz. For cross-dataset validation, we applied the classifier from dataset-1 directly to dataset-2, and vice versa. The mean accuracy of cross-dataset validation was 94.93% for dataset-1 to dataset-2 and 90.32% for dataset-2 to dataset-1 in the 0.01–0.08 Hz range. For the UNPAIR method, classification accuracy was substantially lower (mean 69.89% for dataset-1 and 82.97% for dataset-2), and was much lower for cross-dataset validation (64.69% for dataset-1 to dataset-2 and 64.98% for dataset-2 to dataset-1) in the 0.01–0.08 Hz range. In conclusion, for within-group design studies (e.g., paired conditions or follow-up studies), we recommend the PAIR method for feature extraction. In addition, dimensionality reduction with strong prior knowledge of specific brain regions should also be considered for feature selection in neuroimaging studies. PMID:29375288
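The following loose sketch, on synthetic data, illustrates the flavor of the paired-feature idea: build per-subject EC-EO difference maps, train a linear SVM on one dataset and test it on another. Region counts, subject splits and labels here are placeholders, and the feature construction is simplified relative to the study's ALFF-based procedure.

```python
# Hedged sketch: classify paired EC-EO vs EO-EC difference maps with a linear
# SVM, including a cross-dataset test. All data below are random placeholders.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_regions = 20                                    # stand-in for selected brain regions

def make_paired_features(n_subjects):
    ec = rng.normal(size=(n_subjects, n_regions))
    eo = rng.normal(size=(n_subjects, n_regions))
    X = np.vstack([ec - eo, eo - ec])             # EC-EO maps and EO-EC maps
    y = np.array([1] * n_subjects + [0] * n_subjects)
    return X, y

X1, y1 = make_paired_features(31)                 # stand-in for dataset 1
X2, y2 = make_paired_features(46)                 # stand-in for dataset 2
clf = SVC(kernel="linear").fit(X1, y1)            # train on one dataset
print("cross-dataset accuracy:", clf.score(X2, y2))
```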
Differentially Private Synthesization of Multi-Dimensional Data using Copula Functions
Li, Haoran; Xiong, Li; Jiang, Xiaoqian
2014-01-01
Differential privacy has recently emerged in private statistical data release as one of the strongest privacy guarantees. Most of the existing techniques that generate differentially private histograms or synthetic data only work well for single dimensional or low-dimensional histograms. They become problematic for high dimensional and large domain data due to increased perturbation error and computation complexity. In this paper, we propose DPCopula, a differentially private data synthesization technique using Copula functions for multi-dimensional data. The core of our method is to compute a differentially private copula function from which we can sample synthetic data. Copula functions are used to describe the dependence between multivariate random vectors and allow us to build the multivariate joint distribution using one-dimensional marginal distributions. We present two methods for estimating the parameters of the copula functions with differential privacy: maximum likelihood estimation and Kendall’s τ estimation. We present formal proofs for the privacy guarantee as well as the convergence property of our methods. Extensive experiments using both real datasets and synthetic datasets demonstrate that DPCopula generates highly accurate synthetic multi-dimensional data with significantly better utility than state-of-the-art techniques. PMID:25405241
Parallel Visualization of Large-Scale Aerodynamics Calculations: A Case Study on the Cray T3E
NASA Technical Reports Server (NTRS)
Ma, Kwan-Liu; Crockett, Thomas W.
1999-01-01
This paper reports the performance of a parallel volume rendering algorithm for visualizing a large-scale, unstructured-grid dataset produced by a three-dimensional aerodynamics simulation. This dataset, containing over 18 million tetrahedra, allows us to extend our performance results to a problem which is more than 30 times larger than the one we examined previously. This high resolution dataset also allows us to see fine, three-dimensional features in the flow field. All our tests were performed on the Silicon Graphics Inc. (SGI)/Cray T3E operated by NASA's Goddard Space Flight Center. Using 511 processors, a rendering rate of almost 9 million tetrahedra/second was achieved with a parallel overhead of 26%.
Wang, Xueyi
2012-02-08
The k-nearest neighbors (k-NN) algorithm is a widely used machine learning method that finds nearest neighbors of a test object in a feature space. We present a new exact k-NN algorithm called kMkNN (k-Means for k-Nearest Neighbors) that uses k-means clustering and the triangle inequality to accelerate the search for nearest neighbors in a high-dimensional space. The kMkNN algorithm has two stages. In the buildup stage, instead of using complex tree structures such as metric trees, kd-trees, or ball-trees, kMkNN uses a simple k-means clustering method to preprocess the training dataset. In the searching stage, given a query object, kMkNN finds nearest training objects starting from the nearest cluster to the query object and uses the triangle inequality to reduce the distance calculations. Experiments show that the performance of kMkNN is surprisingly good compared to the traditional k-NN algorithm and tree-based k-NN algorithms such as kd-trees and ball-trees. On a collection of 20 datasets with up to 10^6 records and 10^4 dimensions, kMkNN shows a 2- to 80-fold reduction of distance calculations and a 2- to 60-fold speedup over the traditional k-NN algorithm for 16 datasets. Furthermore, kMkNN performs significantly better than a kd-tree based k-NN algorithm for all datasets and performs better than a ball-tree based k-NN algorithm for most datasets. The results show that kMkNN is effective for searching nearest neighbors in high-dimensional spaces.
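A simplified sketch of the kMkNN idea, reduced to the single-nearest-neighbor case: k-means preprocessing in the buildup stage, then a cluster-ordered scan with triangle-inequality pruning in the search stage. Cluster counts and data sizes are arbitrary; this is not the authors' implementation.

```python
# Hedged sketch: cluster the training set, then answer queries by scanning
# clusters nearest-centroid-first and pruning points whose triangle-inequality
# lower bound already exceeds the current best distance.
import numpy as np
from sklearn.cluster import KMeans

def build(train, n_clusters=10, seed=0):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(train)
    labels = km.labels_
    d_to_centroid = np.linalg.norm(train - km.cluster_centers_[labels], axis=1)
    return km, labels, d_to_centroid

def query(q, train, km, labels, d_to_centroid):
    d_q_centroids = np.linalg.norm(km.cluster_centers_ - q, axis=1)
    best_d, best_i = np.inf, -1
    for c in np.argsort(d_q_centroids):             # nearest cluster first
        for i in np.where(labels == c)[0]:
            # triangle inequality: d(q, x) >= |d(q, centroid) - d(x, centroid)|
            if abs(d_q_centroids[c] - d_to_centroid[i]) >= best_d:
                continue
            d = np.linalg.norm(train[i] - q)
            if d < best_d:
                best_d, best_i = d, i
    return best_i, best_d

train = np.random.rand(2000, 50)
km, labels, d_to_c = build(train)
print(query(np.random.rand(50), train, km, labels, d_to_c))
```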
Unbiased feature selection in learning random forests for high-dimensional data.
Nguyen, Thanh-Tung; Huang, Joshua Zhexue; Nguyen, Thuy Thi
2015-01-01
Random forests (RFs) have been widely used as a powerful classification method. However, with the randomization in both bagging samples and feature selection, the trees in the forest tend to select uninformative features for node splitting. This gives RFs poor accuracy when working with high-dimensional data. In addition, RFs are biased in the feature selection process, where multivalued features are favored. Aiming at debiasing feature selection in RFs, we propose a new RF algorithm, called xRF, to select good features in learning RFs for high-dimensional data. We first remove the uninformative features using p-value assessment, and the subset of unbiased features is then selected based on some statistical measures. This feature subset is then partitioned into two subsets. A feature weighting sampling technique is used to sample features from these two subsets for building trees. This approach enables one to generate more accurate trees while reducing dimensionality and the amount of data needed for learning RFs. An extensive set of experiments has been conducted on 47 high-dimensional real-world datasets including image datasets. The experimental results show that RFs with the proposed approach outperformed the existing random forests in both accuracy and the AUC measure.
Hubbard, B.E.; Crowley, J.K.
2005-01-01
Hyperspectral data coverage from the EO-1 Hyperion sensor was useful for calibrating Advanced Land Imager (ALI) and Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) images of a volcanic terrane area of the Chilean-Bolivian Altiplano. Following calibration, the ALI and ASTER datasets were co-registered and joined to produce a 13-channel reflectance cube spanning the Visible to Short Wave Infrared (0.4-2.4 ??m). Eigen analysis and comparison of the Hyperion data with the ALI + ASTER reflectance data, as well as mapping results using various ALI+ASTER data subsets, provided insights into the information dimensionality of all the data. In particular, high spectral resolution, low signal-to-noise Hyperion data were only marginally better for mineral mapping than the merged 13-channel, low spectral resolution, high signal-to-noise ALI + ASTER dataset. Neither the Hyperion nor the combined ALI + ASTER datasets had sufficient information dimensionality for mapping the diverse range of surface materials exposed on the Altiplano. However, it is possible to optimize the use of the multispectral data for mineral-mapping purposes by careful data subsetting, and by employing other appropriate image-processing strategies.
LEAP: biomarker inference through learning and evaluating association patterns.
Jiang, Xia; Neapolitan, Richard E
2015-03-01
Single nucleotide polymorphism (SNP) high-dimensional datasets are available from Genome Wide Association Studies (GWAS). Such data provide researchers opportunities to investigate the complex genetic basis of diseases. Much of the genetic risk might be due to undiscovered epistatic interactions, which are interactions in which combinations of several genes affect disease. Research aimed at discovering interacting SNPs from GWAS datasets has proceeded in two directions. First, tools were developed to evaluate candidate interactions. Second, algorithms were developed to search over the space of candidate interactions. Another problem when learning interacting SNPs, which has not received much attention, is evaluating how likely it is that the learned SNPs are associated with the disease. A complete system should provide this information as well. We develop such a system. Our system, called LEAP, includes a new heuristic search algorithm for learning interacting SNPs, and a Bayesian network based algorithm for computing the probability of their association. We evaluated the performance of LEAP using 100 1,000-SNP simulated datasets, each of which contains 15 SNPs involved in interactions. When learning interacting SNPs from these datasets, LEAP outperformed seven other methods. Furthermore, only SNPs involved in interactions were found to be probable. We also used LEAP to analyze real Alzheimer's disease and breast cancer GWAS datasets. We obtained interesting and new results from the Alzheimer's dataset, but limited results from the breast cancer dataset. We conclude that our results support LEAP as a useful tool for extracting candidate interacting SNPs from high-dimensional datasets and determining their probability. © 2015 The Authors. Genetic Epidemiology published by Wiley Periodicals, Inc.
Subsampling for dataset optimisation
NASA Astrophysics Data System (ADS)
Ließ, Mareike
2017-04-01
Soil-landscapes have formed by the interaction of soil-forming factors and pedogenic processes. In modelling these landscapes in their pedodiversity and the underlying processes, a representative unbiased dataset is required. This concerns model input as well as output data. However, very often big datasets are available which are highly heterogeneous and were gathered for various purposes, but not to model a particular process or data space. As a first step, the overall data space and/or landscape section to be modelled needs to be identified including considerations regarding scale and resolution. Then the available dataset needs to be optimised via subsampling to well represent this n-dimensional data space. A couple of well-known sampling designs may be adapted to suit this purpose. The overall approach follows three main strategies: (1) the data space may be condensed and de-correlated by a factor analysis to facilitate the subsampling process. (2) Different methods of pattern recognition serve to structure the n-dimensional data space to be modelled into units which then form the basis for the optimisation of an existing dataset through a sensible selection of samples. Along the way, data units for which there is currently insufficient soil data available may be identified. And (3) random samples from the n-dimensional data space may be replaced by similar samples from the available dataset. While being a presupposition to develop data-driven statistical models, this approach may also help to develop universal process models and identify limitations in existing models.
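One way to realize the described subsampling strategy is sketched below, assuming PCA as a stand-in for the factor analysis step and k-means as the pattern recognition step; the function name, cluster count and subsample size are illustrative assumptions.

```python
# Hedged sketch: condense the data space, partition it, and draw a
# proportional subsample from each partition to represent the data space.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def representative_subsample(X, n_keep=200, n_clusters=20, seed=0):
    rng = np.random.default_rng(seed)
    Z = PCA(n_components=min(10, X.shape[1])).fit_transform(X)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(Z)
    chosen = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        take = max(1, int(round(n_keep * len(idx) / len(X))))
        chosen.extend(rng.choice(idx, size=min(take, len(idx)), replace=False))
    return np.array(chosen)

X = np.random.rand(5000, 12)                      # toy heterogeneous covariate set
print(representative_subsample(X).shape)
```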
A Fast SVD-Hidden-nodes based Extreme Learning Machine for Large-Scale Data Analytics.
Deng, Wan-Yu; Bai, Zuo; Huang, Guang-Bin; Zheng, Qing-Hua
2016-05-01
Big dimensional data is a growing trend that is emerging in many real world contexts, extending from web mining, gene expression analysis, and protein-protein interaction to high-frequency financial data. Nowadays, there is a growing consensus that the increasing dimensionality poses impeding effects on the performance of classifiers, which is termed the "peaking phenomenon" in the field of machine intelligence. To address the issue, dimensionality reduction is commonly employed as a preprocessing step on the Big dimensional data before building the classifiers. In this paper, we propose an Extreme Learning Machine (ELM) approach for large-scale data analytics. In contrast to existing approaches, we embed hidden nodes that are designed using singular value decomposition (SVD) into the classical ELM. These SVD nodes in the hidden layer are shown to capture the underlying characteristics of the Big dimensional data well, exhibiting excellent generalization performance. The drawback of using SVD on the entire dataset, however, is the high computational complexity involved. To address this, a fast divide and conquer approximation scheme is introduced to maintain computational tractability on high volume data. The resulting algorithm is labeled here as Fast Singular Value Decomposition-Hidden-nodes based Extreme Learning Machine, or FSVD-H-ELM in short. In FSVD-H-ELM, instead of identifying the SVD hidden nodes directly from the entire dataset, SVD hidden nodes are derived from multiple random subsets of data sampled from the original dataset. Comprehensive experiments and comparisons are conducted to assess FSVD-H-ELM against other state-of-the-art algorithms. The results obtained demonstrate the superior generalization performance and efficiency of the FSVD-H-ELM. Copyright © 2016 Elsevier Ltd. All rights reserved.
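A minimal sketch of an ELM whose hidden-layer weights are taken from the SVD of random data subsets, in the spirit of FSVD-H-ELM; the node count, subset size, tanh activation and pseudoinverse solve are illustrative assumptions rather than the paper's exact configuration.

```python
# Hedged sketch: hidden weights from top right singular vectors of random data
# subsets, followed by a standard ELM least-squares output-weight solve.
import numpy as np

def fit_svd_elm(X, Y, n_hidden=50, n_subsets=5, subset_size=200, seed=0):
    rng = np.random.default_rng(seed)
    per = n_hidden // n_subsets
    W_parts = []
    for _ in range(n_subsets):
        idx = rng.choice(len(X), size=min(subset_size, len(X)), replace=False)
        _, _, vt = np.linalg.svd(X[idx], full_matrices=False)
        W_parts.append(vt[:per].T)                  # top right singular vectors
    W = np.hstack(W_parts)                          # (n_features, n_hidden)
    H = np.tanh(X @ W)                              # hidden layer output
    beta = np.linalg.pinv(H) @ Y                    # least-squares output weights
    return W, beta

def predict(X, W, beta):
    return np.tanh(X @ W) @ beta

X = np.random.rand(1000, 100)
Y = np.eye(3)[np.random.randint(0, 3, 1000)]        # one-hot targets for 3 classes
W, beta = fit_svd_elm(X, Y)
print(predict(X, W, beta).shape)
```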
Statistical Exploration of Electronic Structure of Molecules from Quantum Monte-Carlo Simulations
DOE Office of Scientific and Technical Information (OSTI.GOV)
Prabhat, Mr; Zubarev, Dmitry; Lester, Jr., William A.
In this report, we present results from analysis of Quantum Monte Carlo (QMC) simulation data with the goal of determining the internal structure of the 3N-dimensional phase space of an N-electron molecule. We are interested in mining the simulation data for patterns that might be indicative of bond rearrangement as molecules change electronic states. We examined simulation output that tracks the positions of two coupled electrons in the singlet and triplet states of an H2 molecule. The electrons trace out a trajectory, which was analyzed with a number of statistical techniques. This project was intended to address the following scientific questions: (1) Do high-dimensional phase spaces characterizing the electronic structure of molecules tend to cluster in any natural way? Do we see a change in clustering patterns as we explore different electronic states of the same molecule? (2) Since it is hard to understand the high-dimensional space of trajectories, can we project these trajectories to a lower-dimensional subspace to gain a better understanding of patterns? (3) Do trajectories inherently lie in a lower-dimensional manifold? Can we recover that manifold? After extensive statistical analysis, we are now in a better position to respond to these questions. (1) We definitely see clustering patterns, and differences between the H2 and H2tri datasets. These are revealed by the pamk method in a fairly reliable manner and can potentially be used to distinguish bonded and non-bonded systems and to get insight into the nature of bonding. (2) Projecting to a lower-dimensional subspace (approximately 4-5 dimensions) using PCA or Kernel PCA reveals interesting patterns in the distribution of scalar values, which can be related to existing descriptors of the electronic structure of molecules. Also, these results can be immediately used to develop robust tools for the analysis of noisy data obtained during QMC simulations. (3) All dimensionality reduction and estimation techniques that we tried seem to indicate that one needs 4 or 5 components to account for most of the variance in the data; hence this 5D dataset does not necessarily lie on a well-defined, low-dimensional manifold. In terms of specific clustering techniques, K-means was generally useful in exploring the dataset. The partition around medoids (PAM) technique produced the most definitive results for our data, showing distinctive patterns for both a sample of the complete data and the time series. The gap statistic with the Tibshirani criteria did not provide any distinction across the two datasets. The gap statistic with the DandF criteria, model-based clustering and hierarchical modeling simply failed to run on our datasets. Thankfully, the vanilla PCA technique was successful in handling our entire dataset. PCA revealed some interesting patterns in the scalar value distribution. Kernel PCA techniques (vanilladot, RBF, polynomial) and MDS failed to run on the entire dataset, or even a significant fraction of the dataset, and we resorted to creating an explicit feature map followed by conventional PCA. Clustering using K-means and PAM in the new basis set seems to produce promising results. Understanding the new basis set in the scientific context of the problem is challenging, and we are currently working to further examine and interpret the results.
Segmentation of Unstructured Datasets
NASA Technical Reports Server (NTRS)
Bhat, Smitha
1996-01-01
Datasets generated by computer simulations and experiments in Computational Fluid Dynamics tend to be extremely large and complex. It is difficult to visualize these datasets using standard techniques like Volume Rendering and Ray Casting. Object Segmentation provides a technique to extract and quantify regions of interest within these massive datasets. This thesis explores basic algorithms to extract coherent amorphous regions from two-dimensional and three-dimensional scalar unstructured grids. The techniques are applied to datasets from Computational Fluid Dynamics and from Finite Element Analysis.
Nuclear Potential Clustering As a New Tool to Detect Patterns in High Dimensional Datasets
NASA Astrophysics Data System (ADS)
Tonkova, V.; Paulus, D.; Neeb, H.
2013-02-01
We present a new approach for the clustering of high dimensional data without prior assumptions about the structure of the underlying distribution. The proposed algorithm is based on a concept adapted from nuclear physics. To partition the data, we model the dynamic behaviour of nucleons interacting in an N-dimensional space. An adaptive nuclear potential, comprised of a short-range attractive (strong interaction) and a long-range repulsive term (Coulomb force) is assigned to each data point. By modelling the dynamics, nucleons that are densely distributed in space fuse to build nuclei (clusters) whereas single point clusters repel each other. The formation of clusters is completed when the system reaches the state of minimal potential energy. The data are then grouped according to the particles' final effective potential energy level. The performance of the algorithm is tested with several synthetic datasets showing that the proposed method can robustly identify clusters even when complex configurations are present. Furthermore, quantitative MRI data from 43 multiple sclerosis patients were analyzed, showing a reasonable splitting into subgroups according to the individual patients' disease grade. The good performance of the algorithm on such highly correlated non-spherical datasets, which are typical for MRI derived image features, shows that Nuclear Potential Clustering is a valuable tool for automated data analysis, not only in the MRI domain.
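For concreteness, one plausible pairwise potential consistent with this description combines a Yukawa-type short-range attraction with a Coulomb-type long-range repulsion; the functional form and the constants A, B and a below are illustrative assumptions, not the paper's actual adaptive potential.

```latex
% Illustrative form only: short-range attraction plus long-range repulsion.
V(r) = -A\,\frac{e^{-r/a}}{r} + \frac{B}{r}, \qquad A > B > 0,\; a > 0,
```

where r is the distance between two data points: with A > B the attractive term dominates at short range, while the exponential damping leaves only the repulsive Coulomb-type term at long range.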
Schure, Mark R; Davis, Joe M
2017-11-10
Orthogonality metrics (OMs) for three and higher dimensional separations are proposed as extensions of previously developed OMs, which were used to evaluate the zone utilization of two-dimensional (2D) separations. These OMs include correlation coefficients, dimensionality, information theory metrics and convex-hull metrics. In a number of these cases, lower dimensional subspace metrics exist and can be readily calculated. The metrics are used to interpret previously generated experimental data. The experimental datasets are derived from Gilar's peptide data, now modified to be three dimensional (3D), and a comprehensive 3D chromatogram from Moore and Jorgenson. The Moore and Jorgenson chromatogram, which has 25 identifiable 3D volume elements or peaks, displayed good orthogonality values over all dimensions. However, OMs based on discretization of the 3D space changed substantially with changes in binning parameters. This example highlights the importance in higher dimensions of having an abundant number of retention times as data points, especially for methods that use discretization. The Gilar data, which in a previous study produced 21 2D datasets by the pairing of 7 one-dimensional separations, was reinterpreted to produce 35 3D datasets. These datasets show a number of interesting properties, one of which is that geometric and harmonic means of lower dimensional subspace (i.e., 2D) OMs correlate well with the higher dimensional (i.e., 3D) OMs. The space utilization of the Gilar 3D datasets was ranked using OMs, with the retention times of the datasets having the largest and smallest OMs presented as graphs. A discussion concerning the orthogonality of higher dimensional techniques is given with emphasis on molecular diversity in chromatographic separations. In the information theory work, an inconsistency is found in previous studies of orthogonality using the 2D metric often identified as %O. A new choice of metric is proposed, extended to higher dimensions, characterized by mixes of ordered and random retention times, and applied to the experimental datasets. In 2D, the new metric always equals or exceeds the original one. However, results from both the original and new methods are given. Copyright © 2017 Elsevier B.V. All rights reserved.
Extraction of drainage networks from large terrain datasets using high throughput computing
NASA Astrophysics Data System (ADS)
Gong, Jianya; Xie, Jibo
2009-02-01
Advanced digital photogrammetry and remote sensing technology produces large terrain datasets (LTD). How to process and use these LTD has become a big challenge for GIS users. Extracting drainage networks, which are basic for hydrological applications, from LTD is one of the typical applications of digital terrain analysis (DTA) in geographical information applications. Existing serial drainage algorithms cannot deal with large data volumes in a timely fashion, and few GIS platforms can process LTD beyond the GB size. High throughput computing (HTC), a distributed parallel computing mode, is proposed to improve the efficiency of drainage networks extraction from LTD. Drainage network extraction using HTC involves two key issues: (1) how to decompose the large DEM datasets into independent computing units and (2) how to merge the separate outputs into a final result. A new decomposition method is presented in which the large datasets are partitioned into independent computing units using natural watershed boundaries instead of using regular 1-dimensional (strip-wise) and 2-dimensional (block-wise) decomposition. Because the distribution of drainage networks is strongly related to watershed boundaries, the new decomposition method is more effective and natural. The method to extract natural watershed boundaries was improved by using multi-scale DEMs instead of single-scale DEMs. A HTC environment is employed to test the proposed methods with real datasets.
The semantic representation of prejudice and stereotypes.
Bhatia, Sudeep
2017-07-01
We use a theory of semantic representation to study prejudice and stereotyping. Particularly, we consider large datasets of newspaper articles published in the United States, and apply latent semantic analysis (LSA), a prominent model of human semantic memory, to these datasets to learn representations for common male and female, White, African American, and Latino names. LSA performs a singular value decomposition on word distribution statistics in order to recover word vector representations, and we find that our recovered representations display the types of biases observed in human participants using tasks such as the implicit association test. Importantly, these biases are strongest for vector representations with moderate dimensionality, and weaken or disappear for representations with very high or very low dimensionality. Moderate dimensional LSA models are also the best at learning race, ethnicity, and gender-based categories, suggesting that social category knowledge, acquired through dimensionality reduction on word distribution statistics, can facilitate prejudiced and stereotyped associations. Copyright © 2017 Elsevier B.V. All rights reserved.
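A toy sketch of the LSA pipeline described above: build a word-document count matrix, reduce it with truncated SVD, and compare word vectors with cosine similarity as a crude association measure. The four-document corpus and the chosen dimensionality are placeholders; real analyses of this kind use large newspaper corpora and moderate dimensionality.

```python
# Hedged toy sketch of LSA: SVD on word-document co-occurrence statistics, then
# cosine similarity between word vectors as an association measure.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["emily hired as nurse", "greg hired as engineer",
        "emily teaches children", "greg builds machines"]
vec = CountVectorizer()
X = vec.fit_transform(docs)                       # documents x terms
svd = TruncatedSVD(n_components=2, random_state=0)
doc_vecs = svd.fit_transform(X)
term_vecs = svd.components_.T                     # terms x n_components
vocab = vec.vocabulary_

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

# association of a name with an occupation word in the latent space
print(cosine(term_vecs[vocab["emily"]], term_vecs[vocab["nurse"]]))
```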
NASA Astrophysics Data System (ADS)
Benedetti, Marcello; Realpe-Gómez, John; Perdomo-Ortiz, Alejandro
2018-07-01
Machine learning has been presented as one of the key applications for near-term quantum technologies, given its high commercial value and wide range of applicability. In this work, we introduce the quantum-assisted Helmholtz machine: a hybrid quantum–classical framework with the potential of tackling high-dimensional real-world machine learning datasets on continuous variables. Instead of using quantum computers only to assist deep learning, as previous approaches have suggested, we use deep learning to extract a low-dimensional binary representation of data, suitable for processing on relatively small quantum computers. Then, the quantum hardware and deep learning architecture work together to train an unsupervised generative model. We demonstrate this concept using 1644 quantum bits of a D-Wave 2000Q quantum device to model a sub-sampled version of the MNIST handwritten digit dataset with 16 × 16 continuous-valued pixels. Although we illustrate this concept on a quantum annealer, adaptations to other quantum platforms, such as ion-trap technologies or superconducting gate-model architectures, could be explored within this flexible framework.
Rosa, Maria J; Mehta, Mitul A; Pich, Emilio M; Risterucci, Celine; Zelaya, Fernando; Reinders, Antje A T S; Williams, Steve C R; Dazzan, Paola; Doyle, Orla M; Marquand, Andre F
2015-01-01
An increasing number of neuroimaging studies are based on either combining more than one data modality (inter-modal) or combining more than one measurement from the same modality (intra-modal). To date, most intra-modal studies using multivariate statistics have focused on differences between datasets, for instance relying on classifiers to differentiate between effects in the data. However, to fully characterize these effects, multivariate methods able to measure similarities between datasets are needed. One classical technique for estimating the relationship between two datasets is canonical correlation analysis (CCA). However, in the context of high-dimensional data the application of CCA is extremely challenging. A recent extension of CCA, sparse CCA (SCCA), overcomes this limitation, by regularizing the model parameters while yielding a sparse solution. In this work, we modify SCCA with the aim of facilitating its application to high-dimensional neuroimaging data and finding meaningful multivariate image-to-image correspondences in intra-modal studies. In particular, we show how the optimal subset of variables can be estimated independently and we look at the information encoded in more than one set of SCCA transformations. We illustrate our framework using Arterial Spin Labeling data to investigate multivariate similarities between the effects of two antipsychotic drugs on cerebral blood flow.
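As a point of reference for the approach described above, the sketch below runs ordinary CCA from scikit-learn on two synthetic feature sets to recover correlated components; SCCA additionally regularizes the weights to obtain sparse solutions, which plain CCA does not do, and the data shapes here are arbitrary.

```python
# Hedged sketch: plain CCA between two synthetic "modalities"; the paper's
# sparse CCA adds sparsity penalties on the canonical weights.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 100                                           # subjects
X = rng.normal(size=(n, 50))                      # condition/modality 1 features
Y = 0.3 * X[:, :40] + rng.normal(size=(n, 40))    # correlated condition 2 features
cca = CCA(n_components=2).fit(X, Y)
U, V = cca.transform(X, Y)                        # canonical variates
r = [np.corrcoef(U[:, k], V[:, k])[0, 1] for k in range(2)]
print("canonical correlations:", r)
```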
Shah, Sohil Atul
2017-01-01
Clustering is a fundamental procedure in the analysis of scientific data. It is used ubiquitously across the sciences. Despite decades of research, existing clustering algorithms have limited effectiveness in high dimensions and often require tuning parameters for different domains and datasets. We present a clustering algorithm that achieves high accuracy across multiple domains and scales efficiently to high dimensions and large datasets. The presented algorithm optimizes a smooth continuous objective, which is based on robust statistics and allows heavily mixed clusters to be untangled. The continuous nature of the objective also allows clustering to be integrated as a module in end-to-end feature learning pipelines. We demonstrate this by extending the algorithm to perform joint clustering and dimensionality reduction by efficiently optimizing a continuous global objective. The presented approach is evaluated on large datasets of faces, hand-written digits, objects, newswire articles, sensor readings from the Space Shuttle, and protein expression levels. Our method achieves high accuracy across all datasets, outperforming the best prior algorithm by a factor of 3 in average rank. PMID:28851838
McKenzie, Grant; Janowicz, Krzysztof
2017-01-01
Gaining access to inexpensive, high-resolution, up-to-date, three-dimensional road network data is a top priority beyond research, as such data would fuel applications in industry, governments, and the broader public alike. Road network data are openly available via user-generated content such as OpenStreetMap (OSM) but lack the resolution required for many tasks, e.g., emergency management. More importantly, however, few publicly available data offer information on elevation and slope. For most parts of the world, up-to-date digital elevation products with a resolution of less than 10 meters are a distant dream and, if available, those datasets have to be matched to the road network through an error-prone process. In this paper we present a radically different approach by deriving road network elevation data from massive amounts of in-situ observations extracted from user-contributed data from an online social fitness tracking application. While each individual observation may be of low-quality in terms of resolution and accuracy, taken together they form an accurate, high-resolution, up-to-date, three-dimensional road network that excels where other technologies such as LiDAR fail, e.g., in case of overpasses, overhangs, and so forth. In fact, the 1m spatial resolution dataset created in this research based on 350 million individual 3D location fixes has an RMSE of approximately 3.11m compared to a LiDAR-based ground-truth and can be used to enhance existing road network datasets where individual elevation fixes differ by up to 60m. In contrast, using interpolated data from the National Elevation Dataset (NED) results in 4.75m RMSE compared to the base line. We utilize Linked Data technologies to integrate the proposed high-resolution dataset with OpenStreetMap road geometries without requiring any changes to the OSM data model.
Cattinelli, Isabella; Bolzoni, Elena; Barbieri, Carlo; Mari, Flavio; Martin-Guerrero, José David; Soria-Olivas, Emilio; Martinez-Martinez, José Maria; Gomez-Sanchis, Juan; Amato, Claudia; Stopper, Andrea; Gatti, Emanuele
2012-03-01
The Balanced Scorecard (BSC) is a validated tool to monitor enterprise performance against specific objectives. Through the choice and evaluation of strategic Key Performance Indicators (KPIs), it provides a measure of the company's past outcomes and supports the planning of future managerial strategies. The Fresenius Medical Care (FME) BSC makes use of 30 KPIs for a continuous quality improvement strategy within its dialysis clinics. Each KPI is associated each month with a score that summarizes the clinic's efficiency for that month. Standard statistical methods are currently used to analyze the BSC data and to give a comprehensive view of the corporate improvements to the top management. We herein propose Self-Organizing Maps (SOMs) as an innovative approach to extract information from the FME BSC data and to present it in an easily readable, informative form. A SOM is a computational technique that projects high-dimensional datasets onto a two-dimensional space (map), thus providing a compressed representation. The SOM unsupervised (self-organizing) training procedure results in a map that preserves the similarity relations existing in the original dataset; in this way, the information contained in the high-dimensional space can be more easily visualized and understood. The present work demonstrates the effectiveness of the SOM approach in extracting useful information from the 30-dimensional BSC dataset: indeed, SOMs enabled both highlighting expected relationships between the KPIs and uncovering results not predictable with traditional analyses. Hence we suggest SOMs as a reliable complementary approach to the standard methods for BSC interpretation.
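As a rough illustration of the SOM idea, the sketch below trains a minimal self-organizing map written directly in NumPy and projects synthetic 30-dimensional "KPI" vectors onto a 10 × 10 grid; the data, grid size, and training schedule are assumptions for the example and do not reproduce the FME BSC analysis.

```python
# A minimal self-organizing map in NumPy: project synthetic 30-dimensional
# "KPI" vectors onto a 10x10 grid. Data, grid size and schedule are assumptions.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 30))                      # 500 samples, 30 KPIs

grid_x, grid_y, dim = 10, 10, data.shape[1]
weights = rng.normal(size=(grid_x, grid_y, dim))
# Grid coordinates of every map unit, used by the neighbourhood function.
coords = np.stack(np.meshgrid(np.arange(grid_x), np.arange(grid_y), indexing="ij"), axis=-1)

n_iter, sigma0, lr0 = 2000, 3.0, 0.5
for t in range(n_iter):
    x = data[rng.integers(len(data))]
    # Best-matching unit: grid cell whose weight vector is closest to x.
    dists = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # Linearly decaying learning rate and neighbourhood radius.
    frac = t / n_iter
    sigma, lr = sigma0 * (1 - frac) + 0.5, lr0 * (1 - frac) + 0.01
    grid_dist2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
    h = np.exp(-grid_dist2 / (2 * sigma ** 2))[..., None]   # Gaussian neighbourhood
    weights += lr * h * (x - weights)

# Map each sample to its best-matching unit to inspect the cluster structure.
bmus = [np.unravel_index(np.argmin(np.linalg.norm(weights - x, axis=-1)), (grid_x, grid_y))
        for x in data]
print("first five samples map to units:", bmus[:5])
```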
NASA Astrophysics Data System (ADS)
Doolittle, D. F.; Gharib, J. J.; Mitchell, G. A.
2015-12-01
Detailed photographic imagery and bathymetric maps of the seafloor acquired by deep submergence vehicles such as Autonomous Underwater Vehicles (AUV) and Remotely Operated Vehicles (ROV) are expanding how scientists and the public view and ultimately understand the seafloor and the processes that modify it. Several recently acquired optical and acoustic datasets, collected during ECOGIG (Ecosystem Impacts of Oil and Gas Inputs to the Gulf) and other Gulf of Mexico expeditions using the National Institute for Undersea Science Technology (NIUST) Eagle Ray and Mola Mola AUVs, have been fused with lower resolution data to create unique three-dimensional geovisualizations. Included in these data are multi-scale and multi-resolution visualizations over hydrocarbon seeps and seep-related features. Resolution of the data ranges from tens of millimeters to tens of meters. When multi-resolution data are integrated into a single three-dimensional visual environment, new insights into seafloor and seep processes can be obtained from the intuitive nature of three-dimensional data exploration. We provide examples and demonstrate how multibeam bathymetry, seafloor backscatter data, sub-bottom profiler data, textured photomosaics, and hull-mounted multibeam acoustic midwater imagery are integrated into a series of three-dimensional geovisualizations of actively seeping sites and associated chemosynthetic communities. From these combined and merged datasets, insights on seep community structure, morphology, ecology, fluid migration dynamics, and process geomorphology can be investigated from new spatial perspectives. Such datasets also promote valuable inter-comparisons of sensor resolution and performance.
Accelerated High-Dimensional MR Imaging with Sparse Sampling Using Low-Rank Tensors
He, Jingfei; Liu, Qiegen; Christodoulou, Anthony G.; Ma, Chao; Lam, Fan
2017-01-01
High-dimensional MR imaging often requires long data acquisition time, thereby limiting its practical applications. This paper presents a low-rank tensor based method for accelerated high-dimensional MR imaging using sparse sampling. This method represents high-dimensional images as low-rank tensors (or partially separable functions) and uses this mathematical structure for sparse sampling of the data space and for image reconstruction from highly undersampled data. More specifically, the proposed method acquires two datasets with complementary sampling patterns, one for subspace estimation and the other for image reconstruction; image reconstruction from highly undersampled data is accomplished by fitting the measured data with a sparsity constraint on the core tensor and a group sparsity constraint on the spatial coefficients jointly using the alternating direction method of multipliers. The usefulness of the proposed method is demonstrated in MRI applications; it may also have applications beyond MRI. PMID:27093543
A phenome-wide examination of neural and cognitive function.
Poldrack, R A; Congdon, E; Triplett, W; Gorgolewski, K J; Karlsgodt, K H; Mumford, J A; Sabb, F W; Freimer, N B; London, E D; Cannon, T D; Bilder, R M
2016-12-06
This data descriptor outlines a shared neuroimaging dataset from the UCLA Consortium for Neuropsychiatric Phenomics, which focused on understanding the dimensional structure of memory and cognitive control (response inhibition) functions in both healthy individuals (130 subjects) and individuals with neuropsychiatric disorders including schizophrenia (50 subjects), bipolar disorder (49 subjects), and attention deficit/hyperactivity disorder (43 subjects). The dataset includes an extensive set of task-based fMRI assessments, resting fMRI, structural MRI, and high angular resolution diffusion MRI. The dataset is shared through the OpenfMRI project, and is formatted according to the Brain Imaging Data Structure (BIDS) standard.
Accessing Multi-Dimensional Images and Data Cubes in the Virtual Observatory
NASA Astrophysics Data System (ADS)
Tody, Douglas; Plante, R. L.; Berriman, G. B.; Cresitello-Dittmar, M.; Good, J.; Graham, M.; Greene, G.; Hanisch, R. J.; Jenness, T.; Lazio, J.; Norris, P.; Pevunova, O.; Rots, A. H.
2014-01-01
Telescopes across the spectrum are routinely producing multi-dimensional images and datasets, such as Doppler velocity cubes, polarization datasets, and time-resolved “movies.” Examples of current telescopes producing such multi-dimensional images include the JVLA, ALMA, and the IFU instruments on large optical and near-infrared wavelength telescopes. In the near future, both the LSST and JWST will also produce such multi-dimensional images routinely. High-energy instruments such as Chandra produce event datasets that are also a form of multi-dimensional data, in effect being a very sparse multi-dimensional image. Ensuring that the data sets produced by these telescopes can be both discovered and accessed by the community is essential and is part of the mission of the Virtual Observatory (VO). The Virtual Astronomical Observatory (VAO, http://www.usvao.org/), in conjunction with its international partners in the International Virtual Observatory Alliance (IVOA), has developed a protocol and an initial demonstration service designed for the publication, discovery, and access of arbitrarily large multi-dimensional images. The protocol describing multi-dimensional images is the Simple Image Access Protocol, version 2, which provides the minimal set of metadata required to characterize a multi-dimensional image for its discovery and access. A companion Image Data Model formally defines the semantics and structure of multi-dimensional images independently of how they are serialized, while providing capabilities such as support for sparse data that are essential to deal effectively with large cubes. A prototype data access service has been deployed and tested, using a suite of multi-dimensional images from a variety of telescopes. The prototype has demonstrated the capability to discover and remotely access multi-dimensional data via standard VO protocols. The prototype informs the specification of a protocol that will be submitted to the IVOA for approval, with an operational data cube service to be delivered in mid-2014. An associated user-installable VO data service framework will provide the capabilities required to publish VO-compatible multi-dimensional images or data cubes.
Topic modeling for cluster analysis of large biological and medical datasets
2014-01-01
Background The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyper-dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. Results In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: Salmonella pulsed-field gel electrophoresis (PFGE) dataset, lung cancer dataset, and breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Conclusion Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets. PMID:25350106
Topic modeling for cluster analysis of large biological and medical datasets.
Zhao, Weizhong; Zou, Wen; Chen, James J
2014-01-01
The big data moniker is nowhere better deserved than to describe the ever-increasing prodigiousness and complexity of biological and medical datasets. New methods are needed to generate and test hypotheses, foster biological interpretation, and build validated predictors. Although multivariate techniques such as cluster analysis may allow researchers to identify groups, or clusters, of related variables, the accuracies and effectiveness of traditional clustering methods diminish for large and hyper-dimensional datasets. Topic modeling is an active research field in machine learning and has been mainly used as an analytical tool to structure large textual corpora for data mining. Its ability to reduce high dimensionality to a small number of latent variables makes it suitable as a means for clustering or overcoming clustering difficulties in large biological and medical datasets. In this study, three topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, are proposed and tested on the cluster analysis of three large datasets: Salmonella pulsed-field gel electrophoresis (PFGE) dataset, lung cancer dataset, and breast cancer dataset, which represent various types of large biological or medical datasets. All three methods are shown to improve the effectiveness of clustering results on the three datasets in comparison to traditional methods. A preferable cluster analysis method emerged for each of the three datasets on the basis of replicating known biological truths. Topic modeling could be advantageously applied to the large datasets of biological or medical research. The three proposed topic model-derived clustering methods, highest probable topic assignment, feature selection and feature extraction, yield clustering improvements for the three different data types. Clusters more efficaciously represent truthful groupings and subgroupings in the data than traditional methods, suggesting that topic model-based methods could provide an analytic advancement in the analysis of large biological or medical datasets.
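Two of the strategies described above, highest probable topic assignment and topic-based feature extraction, can be sketched with standard tooling: fit a topic model to a count matrix and either assign each sample to its most probable topic or cluster the topic proportions. The synthetic count matrix below stands in for the PFGE and expression datasets, and scikit-learn's LDA is used as a generic topic model rather than the authors' implementation.

```python
# A hedged sketch of topic-model-based clustering: fit LDA to a count matrix,
# then (a) assign each sample to its most probable topic, or (b) cluster the
# topic proportions. The synthetic counts stand in for real biological data.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 300 samples x 200 features of non-negative counts (e.g., binned band patterns).
counts = rng.poisson(lam=rng.uniform(0.5, 3.0, size=(300, 200)))

lda = LatentDirichletAllocation(n_components=5, random_state=0)
theta = lda.fit_transform(counts)            # samples x topics, rows sum to ~1

# (a) Clustering by highest probable topic assignment.
clusters = theta.argmax(axis=1)
print("cluster sizes:", np.bincount(clusters, minlength=5))

# (b) Topic proportions as extracted features for a standard clustering method.
km_clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(theta)
print("k-means cluster sizes:", np.bincount(km_clusters, minlength=5))
```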
Fast and Adaptive Sparse Precision Matrix Estimation in High Dimensions
Liu, Weidong; Luo, Xi
2014-01-01
This paper proposes a new method for estimating sparse precision matrices in the high dimensional setting. It has been popular to study fast computation and adaptive procedures for this problem. We propose a novel approach, called Sparse Column-wise Inverse Operator, to address these two issues. We analyze an adaptive procedure based on cross validation, and establish its convergence rate under the Frobenius norm. The convergence rates under other matrix norms are also established. This method also enjoys the advantage of fast computation for large-scale problems, via a coordinate descent algorithm. Numerical merits are illustrated using both simulated and real datasets. In particular, it performs favorably on an HIV brain tissue dataset and an ADHD resting-state fMRI dataset. PMID:25750463
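The paper's Sparse Column-wise Inverse Operator is not reproduced here; as a rough, hedged stand-in for sparse precision estimation with data-driven tuning, the sketch below uses scikit-learn's cross-validated graphical lasso on simulated data drawn from a known sparse precision matrix.

```python
# Not the paper's SCIO estimator: a rough stand-in using scikit-learn's
# cross-validated graphical lasso to recover a sparse precision matrix
# from simulated data with a known sparsity pattern.
import numpy as np
from sklearn.datasets import make_sparse_spd_matrix
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(0)
p, n = 30, 200
precision = make_sparse_spd_matrix(p, alpha=0.95, random_state=0)  # true sparse precision
cov = np.linalg.inv(precision)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)

model = GraphicalLassoCV().fit(X)            # regularization chosen by cross-validation
est = model.precision_

support_true = np.abs(precision) > 1e-8
support_est = np.abs(est) > 1e-8
print("estimated nonzeros:", support_est.sum(), "true nonzeros:", support_true.sum())
print("Frobenius error: %.3f" % np.linalg.norm(est - precision, "fro"))
```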
Rapid, semi-automatic fracture and contact mapping for point clouds, images and geophysical data
NASA Astrophysics Data System (ADS)
Thiele, Samuel T.; Grose, Lachlan; Samsu, Anindita; Micklethwaite, Steven; Vollgger, Stefan A.; Cruden, Alexander R.
2017-12-01
The advent of large digital datasets from unmanned aerial vehicle (UAV) and satellite platforms now challenges our ability to extract information across multiple scales in a timely manner, often meaning that the full value of the data is not realised. Here we adapt a least-cost-path solver and specially tailored cost functions to rapidly interpolate structural features between manually defined control points in point cloud and raster datasets. We implement the method in the geographic information system QGIS and the point cloud and mesh processing software CloudCompare. Using these implementations, the method can be applied to a variety of three-dimensional (3-D) and two-dimensional (2-D) datasets, including high-resolution aerial imagery, digital outcrop models, digital elevation models (DEMs) and geophysical grids. We demonstrate the algorithm with four diverse applications in which we extract (1) joint and contact patterns in high-resolution orthophotographs, (2) fracture patterns in a dense 3-D point cloud, (3) earthquake surface ruptures of the Greendale Fault associated with the Mw7.1 Darfield earthquake (New Zealand) from high-resolution light detection and ranging (lidar) data, and (4) oceanic fracture zones from bathymetric data of the North Atlantic. The approach improves the consistency of the interpretation process while retaining expert guidance and achieves significant improvements (35-65 %) in digitisation time compared to traditional methods. Furthermore, it opens up new possibilities for data synthesis and can quantify the agreement between datasets and an interpretation.
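The core of the approach, interpolating a structural trace between user-picked control points as a least-cost path over a cost raster, can be sketched with a small Dijkstra search. The cost raster below is synthetic and the cost function is arbitrary; the paper's tailored, gradient-based cost functions and its QGIS/CloudCompare implementations are not reproduced.

```python
# A hedged sketch of least-cost-path digitising: interpolate a lineament between
# two control points across a synthetic cost raster with a small Dijkstra search.
import heapq
import numpy as np

rng = np.random.default_rng(0)
cost = rng.uniform(1.0, 5.0, size=(60, 80))
cost[28:32, :] = 0.2          # a cheap "fracture-like" corridor the path should follow

def least_cost_path(cost, start, end):
    """Dijkstra over the 8-connected pixel grid; edge weight = mean endpoint cost x step length."""
    rows, cols = cost.shape
    dist = np.full(cost.shape, np.inf)
    prev = {}
    dist[start] = 0.0
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == end:
            break
        if d > dist[node]:
            continue
        r, c = node
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                if dr == dc == 0:
                    continue
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    step = 0.5 * (cost[r, c] + cost[nr, nc]) * np.hypot(dr, dc)
                    nd = d + step
                    if nd < dist[nr, nc]:
                        dist[nr, nc] = nd
                        prev[(nr, nc)] = node
                        heapq.heappush(heap, (nd, (nr, nc)))
    # Walk back from the end point to recover the interpolated trace.
    path, node = [end], end
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

trace = least_cost_path(cost, start=(30, 2), end=(30, 77))
print("path length (pixels):", len(trace))
```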
A global × global test for testing associations between two large sets of variables.
Chaturvedi, Nimisha; de Menezes, Renée X; Goeman, Jelle J
2017-01-01
In high-dimensional omics studies where multiple molecular profiles are obtained for each set of patients, there is often interest in identifying complex multivariate associations, for example, copy number regulated expression levels in a certain pathway or in a genomic region. To detect such associations, we present a novel approach to test for association between two sets of variables. Our approach generalizes the global test, which tests for association between a group of covariates and a single univariate response, to allow a high-dimensional multivariate response. We apply the method to several simulated datasets as well as two publicly available datasets, where we compare the performance of the multivariate global test (G2) with the univariate global test. The method is implemented in R and will be available as part of the globaltest package in R. © 2016 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
Dimensionality reduction of collective motion by principal manifolds
NASA Astrophysics Data System (ADS)
Gajamannage, Kelum; Butail, Sachit; Porfiri, Maurizio; Bollt, Erik M.
2015-01-01
While the existence of low-dimensional embedding manifolds has been shown in patterns of collective motion, the current battery of nonlinear dimensionality reduction methods is not amenable to the analysis of such manifolds. This is mainly due to the necessary spectral decomposition step, which limits control over the mapping from the original high-dimensional space to the embedding space. Here, we propose an alternative approach that demands a two-dimensional embedding which topologically summarizes the high-dimensional data. In this sense, our approach is closely related to the construction of one-dimensional principal curves that minimize orthogonal error to data points subject to smoothness constraints. Specifically, we construct a two-dimensional principal manifold directly in the high-dimensional space using cubic smoothing splines, and define the embedding coordinates in terms of geodesic distances. Thus, the mapping from the high-dimensional data to the manifold is defined in terms of local coordinates. Through representative examples, we show that compared to existing nonlinear dimensionality reduction methods, the principal manifold retains the original structure even in noisy and sparse datasets. The principal manifold finding algorithm is applied to configurations obtained from a dynamical system of multiple agents simulating a complex maneuver called predator mobbing, and the resulting two-dimensional embedding is compared with that of a well-established nonlinear dimensionality reduction method.
Integrative Exploratory Analysis of Two or More Genomic Datasets.
Meng, Chen; Culhane, Aedin
2016-01-01
Exploratory analysis is an essential step in the analysis of high-throughput data. Multivariate approaches such as correspondence analysis (CA), principal component analysis, and multidimensional scaling are widely used in the exploratory analysis of a single dataset. Modern biological studies often assay multiple types of biological molecules (e.g., mRNA, protein, phosphoproteins) on the same set of biological samples, thereby creating multiple different types of omics data or multiassay data. Integrative exploratory analysis of these multiple omics data is required to leverage the potential of multiple omics studies. In this chapter, we describe the application of co-inertia analysis (CIA; for analyzing two datasets) and multiple co-inertia analysis (MCIA; for three or more datasets) to address this problem. These methods are powerful yet simple multivariate approaches that represent samples using a lower number of variables, allowing easier identification of the correlated structure in and between multiple high-dimensional datasets. Graphical representations can be employed for this purpose. In addition, the methods simultaneously project samples and variables (genes, proteins) onto the same lower-dimensional space, so the most variant variables from each dataset can be selected and associated with samples, which can be further used to facilitate biological interpretation and pathway analysis. We applied CIA to explore the concordance between mRNA and protein expression in a panel of 60 tumor cell lines from the National Cancer Institute. In the same 60 cell lines, we used MCIA to perform a cross-platform comparison of mRNA gene expression profiles obtained on four different microarray platforms. Last, as an example of integrative analysis of multiassay or multi-omics data, we analyzed transcriptomic, proteomic, and phosphoproteomic data from induced pluripotent stem (iPS) and embryonic stem (ES) cell lines.
Pairwise gene GO-based measures for biclustering of high-dimensional expression data.
Nepomuceno, Juan A; Troncoso, Alicia; Nepomuceno-Chamorro, Isabel A; Aguilar-Ruiz, Jesús S
2018-01-01
Biclustering algorithms search for groups of genes that share the same behavior under a subset of samples in gene expression data. Nowadays, the biological knowledge available in public repositories can be used to drive these algorithms to find biclusters composed of groups of genes that are functionally coherent. On the other hand, a distance between genes can be defined according to the information stored for them in the Gene Ontology (GO). Gene pairwise GO semantic similarity measures report a value for each pair of genes that establishes their functional similarity. A scatter search-based algorithm that optimizes a merit function integrating GO information is studied in this paper. This merit function uses a term that incorporates the information through a GO measure. The effect of two possible gene pairwise GO measures on the performance of the algorithm is analyzed. Firstly, three well-known yeast datasets with approximately one thousand genes each are studied. Secondly, a group of human datasets related to clinical cancer data is also explored by the algorithm. Most of these data are high-dimensional datasets composed of a huge number of genes. The resultant biclusters reveal groups of genes linked by the same functionality when the search procedure is driven by one of the proposed GO measures. Furthermore, a qualitative biological study of a group of biclusters shows their relevance from a cancer disease perspective. It can be concluded that the integration of biological information improves the performance of the biclustering process. The two different GO measures studied show an improvement in the results obtained for the yeast datasets. However, if datasets are composed of a huge number of genes, only one of them really improves the algorithm performance. This second case constitutes a clear option for exploring datasets that are interesting from a clinical point of view.
Interpretable Categorization of Heterogeneous Time Series Data
NASA Technical Reports Server (NTRS)
Lee, Ritchie; Kochenderfer, Mykel J.; Mengshoel, Ole J.; Silbermann, Joshua
2017-01-01
We analyze data from simulated aircraft encounters to validate and inform the development of a prototype aircraft collision avoidance system. The high-dimensional and heterogeneous time series dataset is analyzed to discover properties of near mid-air collisions (NMACs) and categorize the NMAC encounters. Domain experts use these properties to better organize and understand NMAC occurrences. Existing solutions either are not capable of handling high-dimensional and heterogeneous time series datasets or do not provide explanations that are interpretable by a domain expert. The latter is critical to the acceptance and deployment of safety-critical systems. To address this gap, we propose grammar-based decision trees along with a learning algorithm. Our approach extends decision trees with a grammar framework for classifying heterogeneous time series data. A context-free grammar is used to derive decision expressions that are interpretable, application-specific, and support heterogeneous data types. In addition to classification, we show how grammar-based decision trees can also be used for categorization, which is a combination of clustering and generating interpretable explanations for each cluster. We apply grammar-based decision trees to a simulated aircraft encounter dataset and evaluate the performance of four variants of our learning algorithm. The best algorithm is used to analyze and categorize near mid-air collisions in the aircraft encounter dataset. We describe each discovered category in detail and discuss its relevance to aircraft collision avoidance.
A Bayesian trans-dimensional approach for the fusion of multiple geophysical datasets
NASA Astrophysics Data System (ADS)
JafarGandomi, Arash; Binley, Andrew
2013-09-01
We propose a Bayesian fusion approach to integrate multiple geophysical datasets with different coverage and sensitivity. The fusion strategy is based on the capability of various geophysical methods to provide enough resolution to identify either subsurface material parameters or subsurface structure, or both. We focus on electrical resistivity as the target material parameter and electrical resistivity tomography (ERT), electromagnetic induction (EMI), and ground penetrating radar (GPR) as the set of geophysical methods. However, extending the approach to different sets of geophysical parameters and methods is straightforward. Different geophysical datasets are entered into a trans-dimensional Markov chain Monte Carlo (McMC) search-based joint inversion algorithm. The trans-dimensional property of the McMC algorithm allows dynamic parameterisation of the model space, which in turn helps to avoid bias of the post-inversion results towards a particular model. Given that we are attempting to develop an approach that has practical potential, we discretize the subsurface into an array of one-dimensional earth models. Accordingly, the ERT data that are collected using a two-dimensional acquisition geometry are re-cast as a set of equivalent vertical electric soundings. Different data are inverted either individually or jointly to estimate one-dimensional subsurface models at discrete locations. We use Shannon's information measure to quantify the information obtained from the inversion of different combinations of geophysical datasets. Information from multiple methods is brought together by introducing a joint likelihood function and/or constraining the prior information. A Bayesian maximum entropy approach is used for spatial fusion of the spatially dispersed estimated one-dimensional models and mapping of the target parameter. We illustrate the approach with a synthetic dataset and then apply it to a field dataset. We show that the proposed fusion strategy is successful not only in enhancing the subsurface information but also as a survey design tool to identify the appropriate combination of geophysical tools and to show whether application of an individual method for further investigation of a specific site is beneficial.
Identifying Talent in Youth Sport: A Novel Methodology Using Higher-Dimensional Analysis.
Till, Kevin; Jones, Ben L; Cobley, Stephen; Morley, David; O'Hara, John; Chapman, Chris; Cooke, Carlton; Beggs, Clive B
2016-01-01
Prediction of adult performance from early age talent identification in sport remains difficult. Talent identification research has generally been performed using univariate analysis, which ignores multivariate relationships. To address this issue, this study used a novel higher-dimensional model to orthogonalize multivariate anthropometric and fitness data from junior rugby league players, with the aim of differentiating future career attainment. Anthropometric and fitness data from 257 Under-15 rugby league players was collected. Players were grouped retrospectively according to their future career attainment (i.e., amateur, academy, professional). Players were blindly and randomly divided into an exploratory (n = 165) and validation dataset (n = 92). The exploratory dataset was used to develop and optimize a novel higher-dimensional model, which combined singular value decomposition (SVD) with receiver operating characteristic analysis. Once optimized, the model was tested using the validation dataset. SVD analysis revealed 60 m sprint and agility 505 performance were the most influential characteristics in distinguishing future professional players from amateur and academy players. The exploratory dataset model was able to distinguish between future amateur and professional players with a high degree of accuracy (sensitivity = 85.7%, specificity = 71.1%; p<0.001), although it could not distinguish between future professional and academy players. The validation dataset model was able to distinguish future professionals from the rest with reasonable accuracy (sensitivity = 83.3%, specificity = 63.8%; p = 0.003). Through the use of SVD analysis it was possible to objectively identify criteria to distinguish future career attainment with a sensitivity over 80% using anthropometric and fitness data alone. As such, this suggests that SVD analysis may be a useful analysis tool for research and practice within talent identification.
Identifying Talent in Youth Sport: A Novel Methodology Using Higher-Dimensional Analysis
Till, Kevin; Jones, Ben L.; Cobley, Stephen; Morley, David; O'Hara, John; Chapman, Chris; Cooke, Carlton; Beggs, Clive B.
2016-01-01
Prediction of adult performance from early age talent identification in sport remains difficult. Talent identification research has generally been performed using univariate analysis, which ignores multivariate relationships. To address this issue, this study used a novel higher-dimensional model to orthogonalize multivariate anthropometric and fitness data from junior rugby league players, with the aim of differentiating future career attainment. Anthropometric and fitness data from 257 Under-15 rugby league players was collected. Players were grouped retrospectively according to their future career attainment (i.e., amateur, academy, professional). Players were blindly and randomly divided into an exploratory (n = 165) and validation dataset (n = 92). The exploratory dataset was used to develop and optimize a novel higher-dimensional model, which combined singular value decomposition (SVD) with receiver operating characteristic analysis. Once optimized, the model was tested using the validation dataset. SVD analysis revealed 60 m sprint and agility 505 performance were the most influential characteristics in distinguishing future professional players from amateur and academy players. The exploratory dataset model was able to distinguish between future amateur and professional players with a high degree of accuracy (sensitivity = 85.7%, specificity = 71.1%; p<0.001), although it could not distinguish between future professional and academy players. The validation dataset model was able to distinguish future professionals from the rest with reasonable accuracy (sensitivity = 83.3%, specificity = 63.8%; p = 0.003). Through the use of SVD analysis it was possible to objectively identify criteria to distinguish future career attainment with a sensitivity over 80% using anthropometric and fitness data alone. As such, this suggests that SVD analysis may be a useful analysis tool for research and practice within talent identification. PMID:27224653
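A hedged sketch of the general SVD-plus-ROC recipe is shown below: standardize the measurement matrix, orthogonalize it with an SVD, and evaluate how well a single component separates a binary outcome using a ROC curve and Youden's J. The data and outcome are synthetic stand-ins, and in practice the most discriminative component and cut-off would be chosen on the exploratory dataset, as described above.

```python
# A hedged sketch of the SVD + ROC idea: orthogonalize standardized measurements,
# then test how well one SVD component separates a binary outcome. Synthetic data.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)
n, p = 257, 12
X = rng.normal(size=(n, p))
# Synthetic "future professional" label loosely tied to two of the variables
# (e.g., sprint and agility scores in the study).
y = (0.8 * X[:, 0] - 0.6 * X[:, 1] + rng.normal(scale=0.8, size=n)) > 1.0

Xs = (X - X.mean(axis=0)) / X.std(axis=0)        # standardize before the SVD
U, s, Vt = np.linalg.svd(Xs, full_matrices=False)
scores = U[:, 0] * s[0]                          # projection on the first component
# In a real analysis the component (and its sign) would be selected on the
# exploratory dataset; the first component is used here purely for illustration.

fpr, tpr, thresholds = roc_curve(y, scores)
best = np.argmax(tpr - fpr)                      # Youden's J picks a cut-off
print("AUC: %.3f" % roc_auc_score(y, scores))
print("sensitivity %.2f, specificity %.2f at threshold %.2f"
      % (tpr[best], 1 - fpr[best], thresholds[best]))
```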
NASA Astrophysics Data System (ADS)
Kobor, J. S.; O'Connor, M. D.; Sherwood, M. N.
2013-12-01
Effective floodplain management and restoration requires a detailed understanding of floodplain processes not readily achieved using standard one-dimensional hydraulic modeling approaches. The application of more advanced numerical models is, however, often limited by the relatively high costs of acquiring the high-resolution topographic data needed for model development using traditional surveying methods. The increasing availability of LiDAR data has the potential to significantly reduce these costs and thus facilitate application of multi-dimensional hydraulic models where budget constraints would have otherwise prohibited their use. The accuracy and suitability of LiDAR data for supporting model development can vary widely depending on the resolution of channel and floodplain features, the data collection density, and the degree of vegetation canopy interference, among other factors. More work is needed to develop guidelines for evaluating LiDAR accuracy and determining when and how best the data can be used to support numerical modeling activities. Here we present two recent case studies where LiDAR datasets were used to support floodplain and sediment transport modeling efforts. One LiDAR dataset was collected with a relatively low point density and used to study a small stream channel in coastal Marin County, and a second dataset was collected with a higher point density and applied to a larger stream channel in western Sonoma County. Traditional topographic surveying was performed at both sites, which provided a quantitative means of evaluating the LiDAR accuracy. We found that with the lower point density dataset, the accuracy of the LiDAR varied significantly between the active stream channel and floodplain, whereas the accuracy across the channel/floodplain interface was more uniform with the higher density dataset. Accuracy also varied widely as a function of the density of the riparian vegetation canopy. We found that coupled 1- and 2-dimensional hydraulic models, whereby the active channel is simulated in one dimension and the floodplain in two dimensions, provided the best means of utilizing the LiDAR data to evaluate existing conditions and develop alternative flood hazard mitigation and habitat restoration strategies. Such an approach recognizes the limitations of the LiDAR data within active channel areas with dense riparian cover and is cost-effective in that it allows field survey efforts to focus primarily on characterizing active stream channel areas. The multi-dimensional modeling approach also conforms well to the physical realities of the stream system, whereby in-channel flows can generally be well described as a one-dimensional flow problem and floodplain flows are often characterized by multiple and often poorly understood flowpaths. The multi-dimensional modeling approach has the additional advantages of allowing for accurate simulation of the effects of hydraulic structures using well-tested one-dimensional formulae and minimizing the computational burden of the models by not requiring the small spatial resolutions necessary to resolve the geometries of small stream channels in two dimensions.
NASA Astrophysics Data System (ADS)
Hou, C. Y.; Dattore, R.; Peng, G. S.
2014-12-01
The National Center for Atmospheric Research's Global Climate Four-Dimensional Data Assimilation (CFDDA) Hourly 40km Reanalysis dataset is a dynamically downscaled dataset with high temporal and spatial resolution. The dataset contains three-dimensional hourly analyses in netCDF format for the global atmospheric state from 1985 to 2005 on a 40km horizontal grid (0.4° grid increment) with 28 vertical levels, providing good representation of local forcing and diurnal variation of processes in the planetary boundary layer. This project aimed to make the dataset publicly available, accessible, and usable in order to provide a unique resource to allow and promote studies of new climate characteristics. When the curation project started, it had been five years since the data files were generated. Also, although the Principal Investigator (PI) had generated a user document at the end of the project in 2009, the document had not been maintained. Furthermore, the PI had moved to a new institution, and the remaining team members were reassigned to other projects. These factors made data curation especially challenging in the areas of verifying data quality, harvesting metadata descriptions, and documenting provenance information. As a result, the project's curation process found that: the data curator's skill and knowledge helped make decisions, such as on file format, structure, and workflow documentation, that had a significant, positive impact on the ease of the dataset's management and long-term preservation; the use of data curation tools, such as the Data Curation Profiles Toolkit's guidelines, revealed important information for promoting the data's usability and enhancing preservation planning; and involving data curators during each stage of the data curation life cycle, instead of only at the end, could improve the efficiency of the curation process. Overall, the project showed that proper resources invested in the curation process would give datasets the best chance to fulfill their potential to help with new climate pattern discovery.
NASA Astrophysics Data System (ADS)
Castonguay, Alexandre; Lefebvre, Joël; Pouliot, Philippe; Lesage, Frédéric
2018-01-01
An automated serial histology setup combining optical coherence tomography (OCT) imaging with vibratome sectioning was used to image eight wild type mouse brains. The datasets comprised thousands of volumetric tiles resolved at a voxel size of (4.9×4.9×6.5) μm3, which were stitched back together to give a three-dimensional map of the brain from which a template OCT brain was obtained. To assess deformation caused by tissue sectioning, reconstruction algorithms, and fixation, the OCT datasets were compared to both in vivo and ex vivo magnetic resonance imaging (MRI). The OCT brain template yielded a highly detailed map of brain structure, with high contrast in white matter fiber bundles, and closely resembled the in vivo MRI template. Brain labeling using the Allen brain framework showed little variation in regional brain volume among imaging modalities, with no statistical differences. The high correspondence between the OCT template brain and its in vivo counterpart demonstrates the potential of whole brain histology to validate in vivo imaging.
Sparse models for correlative and integrative analysis of imaging and genetic data
Lin, Dongdong; Cao, Hongbao; Calhoun, Vince D.
2014-01-01
The development of advanced medical imaging technologies and high-throughput genomic measurements has enhanced our ability to understand their interplay as well as their relationship with human behavior by integrating these two types of datasets. However, the high dimensionality and heterogeneity of these datasets presents a challenge to conventional statistical methods; there is a high demand for the development of both correlative and integrative analysis approaches. Here, we review our recent work on developing sparse representation based approaches to address this challenge. We show how sparse models are applied to the correlation and integration of imaging and genetic data for biomarker identification. We present examples on how these approaches are used for the detection of risk genes and classification of complex diseases such as schizophrenia. Finally, we discuss future directions on the integration of multiple imaging and genomic datasets including their interactions such as epistasis. PMID:25218561
Tensor-driven extraction of developmental features from varying paediatric EEG datasets.
Kinney-Lang, Eli; Spyrou, Loukianos; Ebied, Ahmed; Chin, Richard Fm; Escudero, Javier
2018-05-21
Constant changes in developing children's brains can pose a challenge for EEG-dependent technologies. Advancing signal processing methods to identify developmental differences in paediatric populations could help improve the function and usability of such technologies. Taking advantage of the multi-dimensional structure of EEG data through tensor analysis may offer a framework for extracting relevant developmental features of paediatric datasets. A proof of concept is demonstrated through identifying latent developmental features in resting-state EEG. Approach. Three paediatric datasets (n = 50, 17, 44) were analyzed using a two-step constrained parallel factor (PARAFAC) tensor decomposition. Subject age was used as a proxy measure of development. Classification used support vector machines (SVM) to test whether the PARAFAC-identified features could predict subject age. The results were cross-validated within each dataset. Classification analysis was complemented by visualization of the high-dimensional feature structures using t-distributed Stochastic Neighbour Embedding (t-SNE) maps. Main Results. Development-related features were successfully identified for the developmental conditions of each dataset. SVM classification showed the identified features could predict subject age at a level significantly above chance for both healthy and impaired populations. t-SNE maps revealed that suitable tensor factorization was key to extracting the developmental features. Significance. The described methods are a promising tool for identifying latent developmental features occurring throughout childhood EEG. © 2018 IOP Publishing Ltd.
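The two-step idea, tensor factorization followed by classification of the subject-mode factors, can be sketched as below. The EEG-like tensor is synthetic, the rank and age grouping are arbitrary, and tensorly's parafac is assumed to return a (weights, factor matrices) pair as in recent versions of that library; the paper's constrained two-step PARAFAC procedure is not reproduced.

```python
# A hedged sketch: CP/PARAFAC decomposition of a channels x frequencies x subjects
# tensor, then an SVM predicting an age group from the subject-mode factors.
# Data are synthetic; tensorly's parafac is assumed to return (weights, factors).
import numpy as np
import tensorly as tl
from tensorly.decomposition import parafac
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_channels, n_freqs, n_subjects = 16, 20, 50
ages = rng.integers(3, 12, size=n_subjects)              # proxy development measure

# Build a rank-structured tensor whose subject mode carries an age-related trend.
spatial = rng.normal(size=(n_channels, 2))
spectral = rng.normal(size=(n_freqs, 2))
subject = np.column_stack([ages / 12.0, rng.normal(size=n_subjects)])
tensor = np.einsum("ir,jr,kr->ijk", spatial, spectral, subject)
tensor += 0.1 * rng.normal(size=tensor.shape)

weights, factors = parafac(tl.tensor(tensor), rank=2)    # (weights, factor matrices)
subject_features = np.asarray(factors[2])                # subjects x rank

# Classify younger vs. older subjects from the extracted subject-mode features.
labels = (ages >= 8).astype(int)
acc = cross_val_score(SVC(kernel="linear"), subject_features, labels, cv=5)
print("cross-validated accuracy: %.2f" % acc.mean())
```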
Challenges in Extracting Information From Large Hydrogeophysical-monitoring Datasets
NASA Astrophysics Data System (ADS)
Day-Lewis, F. D.; Slater, L. D.; Johnson, T.
2012-12-01
Over the last decade, new automated geophysical data-acquisition systems have enabled collection of increasingly large and information-rich geophysical datasets. Concurrent advances in field instrumentation, web services, and high-performance computing have made real-time processing, inversion, and visualization of large three-dimensional tomographic datasets practical. Geophysical-monitoring datasets have provided high-resolution insights into diverse hydrologic processes including groundwater/surface-water exchange, infiltration, solute transport, and bioremediation. Despite the high information content of such datasets, extraction of quantitative or diagnostic hydrologic information is challenging. Visual inspection and interpretation for specific hydrologic processes is difficult for datasets that are large, complex, and (or) affected by forcings (e.g., seasonal variations) unrelated to the target hydrologic process. New strategies are needed to identify salient features in spatially distributed time-series data and to relate temporal changes in geophysical properties to hydrologic processes of interest while effectively filtering unrelated changes. Here, we review recent work using time-series and digital-signal-processing approaches in hydrogeophysics. Examples include applications of cross-correlation, spectral, and time-frequency (e.g., wavelet and Stockwell transforms) approaches to (1) identify salient features in large geophysical time series; (2) examine correlation or coherence between geophysical and hydrologic signals, even in the presence of non-stationarity; and (3) condense large datasets while preserving information of interest. Examples demonstrate analysis of large time-lapse electrical tomography and fiber-optic temperature datasets to extract information about groundwater/surface-water exchange and contaminant transport.
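As a small illustration of the cross-correlation strategy mentioned above, the sketch below estimates the lag between a synthetic hydrologic forcing signal and a delayed, noisy response; the signals and the 12-hour lag are invented for the example.

```python
# A small illustration of lag estimation by normalized cross-correlation between a
# synthetic hydrologic forcing signal and a delayed, noisy "geophysical" response.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(0, 30 * 24)                        # hourly samples over 30 days
stage = np.sin(2 * np.pi * t / 24) + 0.5 * np.sin(2 * np.pi * t / (24 * 7))
lag_true = 12                                    # hours; invented for the example
response = np.roll(stage, lag_true) + 0.3 * rng.normal(size=t.size)

def xcorr(a, b, max_lag):
    """Normalized cross-correlation of b against a over non-negative lags."""
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    lags = np.arange(0, max_lag + 1)
    cc = np.array([np.mean(a[:len(a) - k] * b[k:]) for k in lags])
    return lags, cc

lags, cc = xcorr(stage, response, max_lag=48)
print("estimated lag (hours):", lags[np.argmax(cc)])
```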
Hypergraph Based Feature Selection Technique for Medical Diagnosis.
Somu, Nivethitha; Raman, M R Gauthama; Kirthivasan, Kannan; Sriram, V S Shankar
2016-11-01
The impact of the internet and information systems across various domains has resulted in the substantial generation of multidimensional datasets. The use of data mining and knowledge discovery techniques to extract the original information contained in multidimensional datasets plays a significant role in exploiting the complete benefit they provide. The presence of a large number of features in high-dimensional datasets incurs high computational cost in terms of computing power and time. Hence, feature selection techniques have been commonly used to build robust machine learning models by selecting a subset of relevant features that projects the maximal information content of the original dataset. In this paper, a novel Rough Set based K-Helly feature selection technique (RSKHT), which hybridizes Rough Set Theory (RST) and the K-Helly property of hypergraph representation, was designed to identify the optimal feature subset or reduct for medical diagnostic applications. Experiments carried out using medical datasets from the UCI repository prove the dominance of RSKHT over other feature selection techniques with respect to reduct size, classification accuracy, and time complexity. The performance of RSKHT was validated using the WEKA tool, showing that RSKHT is computationally attractive and flexible over massive datasets.
Integrative prescreening in analysis of multiple cancer genomic studies
2012-01-01
Background In high throughput cancer genomic studies, results from the analysis of single datasets often suffer from a lack of reproducibility because of small sample sizes. Integrative analysis can effectively pool and analyze multiple datasets and provides a cost effective way to improve reproducibility. In integrative analysis, simultaneously analyzing all genes profiled may incur high computational cost. A computationally affordable remedy is prescreening, which fits marginal models, can be conducted in a parallel manner, and has low computational cost. Results An integrative prescreening approach is developed for the analysis of multiple cancer genomic datasets. Simulation shows that the proposed integrative prescreening has better performance than alternatives, particularly including prescreening with individual datasets, an intensity approach and meta-analysis. We also analyze multiple microarray gene profiling studies on liver and pancreatic cancers using the proposed approach. Conclusions The proposed integrative prescreening provides an effective way to reduce the dimensionality in cancer genomic studies. It can be coupled with existing analysis methods to identify cancer markers. PMID:22799431
GODIVA2: interactive visualization of environmental data on the Web.
Blower, J D; Haines, K; Santokhee, A; Liu, C L
2009-03-13
GODIVA2 is a dynamic website that provides visual access to several terabytes of physically distributed, four-dimensional environmental data. It allows users to explore large datasets interactively without the need to install new software or download and understand complex data. Through the use of open international standards, GODIVA2 maintains a high level of interoperability with third-party systems, allowing diverse datasets to be mutually compared. Scientists can use the system to search for features in large datasets and to diagnose the output from numerical simulations and data processing algorithms. Data providers around Europe have adopted GODIVA2 as an INSPIRE-compliant dynamic quick-view system for providing visual access to their data.
Derivation of an artificial gene to improve classification accuracy upon gene selection.
Seo, Minseok; Oh, Sejong
2012-02-01
Classification analysis has been developed continuously since 1936. This research field has advanced as a result of the development of classifiers such as KNN, ANN, and SVM, as well as through data preprocessing areas. Feature (gene) selection is required for very high-dimensional data such as microarrays before classification work. The goal of feature selection is to choose a subset of informative features that reduces processing time and provides higher classification accuracy. In this study, we devised a method of artificial gene making (AGM) for microarray data to improve classification accuracy. Our artificial gene was derived from a whole microarray dataset and combined with the result of gene selection for classification analysis. We experimentally confirmed a clear improvement in classification accuracy after inserting the artificial gene. Our artificial gene worked well for popular feature (gene) selection algorithms and classifiers. The proposed approach can be applied to any type of high-dimensional dataset. Copyright © 2011 Elsevier Ltd. All rights reserved.
Dashtban, M; Balafar, Mohammadali
2017-03-01
Gene selection is a demanding task for microarray data analysis. The diverse complexity of different cancers makes this issue still challenging. In this study, a novel evolutionary method based on genetic algorithms and artificial intelligence is proposed to identify predictive genes for cancer classification. A filter method was first applied to reduce the dimensionality of the feature space, followed by employing an integer-coded genetic algorithm with dynamic-length genotype, intelligent parameter settings, and modified operators. The algorithmic behaviors, including convergence trends, mutation and crossover rate changes, and running time, were studied, conceptually discussed, and shown to be coherent with literature findings. Two well-known filter methods, Laplacian and Fisher score, were examined considering similarities, the quality of selected genes, and their influence on the evolutionary approach. Several statistical tests concerning the choice of classifier, choice of dataset, and choice of filter method were performed, and they revealed significant differences between the performance of different classifiers and filter methods over the datasets. The proposed method was benchmarked on five popular high-dimensional cancer datasets; for each, the top explored genes were reported. Comparing the experimental results with several state-of-the-art methods revealed that the proposed method outperforms previous methods on the DLBCL dataset. Copyright © 2017 Elsevier Inc. All rights reserved.
Joint principal trend analysis for longitudinal high-dimensional data.
Zhang, Yuping; Ouyang, Zhengqing
2018-06-01
We consider a research scenario motivated by integrating multiple sources of information for better knowledge discovery in diverse dynamic biological processes. Given two longitudinal high-dimensional datasets for a group of subjects, we want to extract shared latent trends and identify relevant features. To solve this problem, we present a new statistical method named as joint principal trend analysis (JPTA). We demonstrate the utility of JPTA through simulations and applications to gene expression data of the mammalian cell cycle and longitudinal transcriptional profiling data in response to influenza viral infections. © 2017, The International Biometric Society.
Mapping High Dimensional Sparse Customer Requirements into Product Configurations
NASA Astrophysics Data System (ADS)
Jiao, Yao; Yang, Yu; Zhang, Hongshan
2017-10-01
Mapping customer requirements into product configurations is a crucial step in product design, yet customers express their needs ambiguously and locally due to a lack of domain knowledge. The data mining of customer requirements can therefore yield fragmentary information with high-dimensional sparsity, exposing the mapping procedure to uncertainty and complexity. Expert judgment is widely applied against this background, since it imposes no formal requirements on systematic or structured data; however, there are concerns about its repeatability and bias. In this study, an integrated method combining an adjusted Locally Linear Embedding (LLE) with a Naïve Bayes (NB) classifier is proposed to map high-dimensional sparse customer requirements to product configurations. The integrated method adjusts classical LLE to preprocess the high-dimensional sparse dataset so as to satisfy the prerequisites of NB for classifying different customer requirements into corresponding product configurations. Compared with expert judgment, the adjusted LLE with NB performs much better in a real-world tablet PC design case, both in accuracy and robustness.
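A hedged sketch of the pipeline is shown below, using scikit-learn's standard locally linear embedding (not the paper's adjusted variant) followed by a Gaussian Naïve Bayes classifier; the sparse requirement vectors and the four configuration labels are synthetic assumptions.

```python
# A hedged sketch: standard LLE (not the paper's adjusted LLE) compresses sparse
# high-dimensional requirement vectors, then naive Bayes maps them to configuration
# labels. Data and label structure are synthetic placeholders.
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 300, 100
configs = rng.integers(0, 4, size=n)               # 4 hypothetical product configurations
X = np.zeros((n, p))
for i, c in enumerate(configs):
    active = rng.choice(p, size=8, replace=False)  # sparse requirements per customer
    X[i, active] = 1.0
    X[i, c * 10:(c * 10) + 5] += 1.0               # weak configuration-specific signal

embedding = LocallyLinearEmbedding(n_neighbors=12, n_components=5, random_state=0)
Z = embedding.fit_transform(X)                     # low-dimensional representation

scores = cross_val_score(GaussianNB(), Z, configs, cv=5)
print("cross-validated accuracy: %.2f" % scores.mean())
```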
Deep neural networks for texture classification-A theoretical analysis.
Basu, Saikat; Mukhopadhyay, Supratik; Karki, Manohar; DiBiano, Robert; Ganguly, Sangram; Nemani, Ramakrishna; Gayaka, Shreekant
2018-01-01
We investigate the use of Deep Neural Networks for the classification of image datasets where texture features are important for generating class-conditional discriminative representations. To this end, we first derive the size of the feature space for some standard textural features extracted from the input dataset and then use the theory of Vapnik-Chervonenkis dimension to show that hand-crafted feature extraction creates low-dimensional representations which help in reducing the overall excess error rate. As a corollary to this analysis, we derive for the first time upper bounds on the VC dimension of Convolutional Neural Networks as well as Dropout and Dropconnect networks, and the relation between the excess error rates of Dropout and Dropconnect networks. The concept of intrinsic dimension is used to validate the intuition that texture-based datasets are inherently higher dimensional as compared to handwritten digits or other object recognition datasets and hence more difficult for neural networks to shatter. We then derive the mean distance from the centroid to the nearest and farthest sampling points in an n-dimensional manifold and show that the Relative Contrast of the sample data vanishes as the dimensionality of the underlying vector space tends to infinity. Copyright © 2017 Elsevier Ltd. All rights reserved.
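The relative-contrast claim in the final sentence can be checked numerically: for i.i.d. uniform data, the gap between the farthest and nearest neighbour distances shrinks relative to the nearest distance as the dimension grows. A minimal sketch:

```python
# A small numerical illustration of vanishing relative contrast: as the dimension
# of i.i.d. uniform data grows, (max distance - min distance) / min distance shrinks.
import numpy as np

rng = np.random.default_rng(0)
n = 500
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(n, d))
    q = rng.uniform(size=d)                       # a query point
    dists = np.linalg.norm(X - q, axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:4d}: relative contrast = {contrast:.3f}")
```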
Hierarchical Discriminant Analysis.
Lu, Di; Ding, Chuntao; Xu, Jinliang; Wang, Shangguang
2018-01-18
The Internet of Things (IoT) generates large amounts of high-dimensional intelligent sensor data. Processing high-dimensional data (e.g., for data visualization and classification) is difficult, so excellent subspace learning algorithms are required to learn a latent subspace that preserves the intrinsic structure of the high-dimensional data and abandons the least useful information for subsequent processing. In this context, many subspace learning algorithms have been presented. However, in the process of transforming high-dimensional data into a low-dimensional space, the large difference between the sum of inter-class distances and the sum of intra-class distances for distinct data may cause a bias problem, in which the impact of the intra-class distance is overwhelmed. To address this problem, we propose a novel algorithm called Hierarchical Discriminant Analysis (HDA). It first minimizes the sum of intra-class distances and then maximizes the sum of inter-class distances. The proposed method balances the bias from the inter-class and intra-class terms to achieve better performance. Extensive experiments conducted on several benchmark face datasets reveal that HDA obtains better performance than other dimensionality reduction algorithms.
Using Graph Indices for the Analysis and Comparison of Chemical Datasets.
Fourches, Denis; Tropsha, Alexander
2013-10-01
In cheminformatics, compounds are represented as points in a multidimensional space of chemical descriptors. When all pairs of points found within a certain distance threshold in the original high-dimensional chemistry space are connected by distance-labeled edges, the resulting data structure can be defined as a Dataset Graph (DG). We show that, similarly to the conventional description of organic molecules, many graph indices can be computed for DGs as well. We demonstrate that chemical datasets can be effectively characterized and compared by computing simple graph indices such as the average vertex degree or the Randic connectivity index. This approach is used to characterize and quantify the similarity between different datasets or subsets of the same dataset (e.g., training, test, and external validation sets used in QSAR modeling). The freely available ADDAGRA program has been implemented to build and visualize DGs. The approach proposed and discussed in this report could be further explored and utilized for different cheminformatics applications such as dataset diversification by acquiring external compounds, dataset processing prior to QSAR modeling, or (dis)similarity modeling of multiple datasets studied in chemical genomics applications. Copyright © 2013 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
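The sketch below builds such a Dataset Graph from a distance threshold and computes the two indices named in the abstract (average vertex degree and Randic connectivity index) with NetworkX. The descriptor matrix is a random placeholder rather than real chemical descriptors, and the 5% threshold is illustrative.

```python
# Sketch of a Dataset Graph (DG): compounds become vertices, and an edge joins
# two compounds whose descriptor-space distance falls below a threshold.
# Average vertex degree and the Randic connectivity index are then computed.
# Descriptors here are random placeholders, not real chemical descriptors.
import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.random((100, 32))                 # 100 compounds x 32 descriptors
D = squareform(pdist(X))
threshold = np.percentile(D[D > 0], 5)    # connect the closest 5% of pairs

G = nx.Graph()
G.add_nodes_from(range(len(X)))
for i in range(len(X)):
    for j in range(i + 1, len(X)):
        if D[i, j] <= threshold:
            G.add_edge(i, j, weight=D[i, j])

avg_degree = 2 * G.number_of_edges() / G.number_of_nodes()
randic = sum(1.0 / np.sqrt(G.degree[u] * G.degree[v]) for u, v in G.edges)
print(f"average vertex degree: {avg_degree:.2f}, Randic index: {randic:.2f}")
```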
Information Gain Based Dimensionality Selection for Classifying Text Documents
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dumidu Wijayasekara; Milos Manic; Miles McQueen
2013-06-01
Selecting the optimal dimensions for various knowledge extraction applications is an essential component of data mining. Dimensionality selection techniques are utilized in classification applications to increase the classification accuracy and reduce the computational complexity. In text classification, where the dimensionality of the dataset is extremely high, dimensionality selection is even more important. This paper presents a novel, genetic algorithm based methodology for dimensionality selection in text mining applications that utilizes information gain. The presented methodology uses the information gain of each dimension to change the mutation probability of chromosomes dynamically. Since the information gain is calculated a priori, the computational complexity is not affected. The presented method was tested on a specific text classification problem and compared with conventional genetic algorithm based dimensionality selection. The results show an improvement of 3% in the true positives and 1.6% in the true negatives over conventional dimensionality selection methods.
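The abstract does not specify the exact weighting scheme, so the following is only an illustrative variant of the idea: a feature-selection genetic algorithm in which each dimension's mutation probability is scaled by its (precomputed) information gain, here approximated by mutual information.

```python
# Illustrative sketch (not the paper's exact scheme): a genetic algorithm for
# feature selection in which each feature's mutation probability is scaled by
# its information gain, so informative features are more likely to be kept.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=60, n_informative=8, random_state=0)

gain = mutual_info_classif(X, y, random_state=0)
gain = gain / gain.max()                       # normalise to [0, 1]
# High-gain features mutate less often when "on" and more often when "off".
p_mut_on = 0.02 + 0.18 * (1.0 - gain)
p_mut_off = 0.02 + 0.18 * gain

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    return cross_val_score(GaussianNB(), X[:, mask], y, cv=3).mean()

pop = rng.random((20, X.shape[1])) < 0.3       # initial population of feature masks
for generation in range(15):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]    # keep the best half
    children = []
    for _ in range(10):
        a, b = parents[rng.integers(0, 10, size=2)]
        child = np.where(rng.random(X.shape[1]) < 0.5, a, b)   # uniform crossover
        flip = np.where(child, rng.random(X.shape[1]) < p_mut_on,
                               rng.random(X.shape[1]) < p_mut_off)
        children.append(child ^ flip)          # gain-weighted mutation
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected features:", np.flatnonzero(best))
```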
Statistical Projections for Multi-resolution, Multi-dimensional Visual Data Exploration and Analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hoa T. Nguyen; Stone, Daithi; E. Wes Bethel
2016-01-01
An ongoing challenge in visual exploration and analysis of large, multi-dimensional datasets is how to present useful, concise information to a user for specific visualization tasks. Typical approaches to this problem have proposed either reduced-resolution versions of the data, or projections of the data, or both. These approaches still have limitations, such as high computational cost or susceptibility to error. In this work, we explore the use of a statistical metric as the basis for both projections and reduced-resolution versions of data, with a particular focus on preserving one key trait of the data, namely variation. We use two different case studies to explore this idea: one that uses a synthetic dataset, and another that uses a large ensemble collection produced by an atmospheric modeling code to study long-term changes in global precipitation. The primary finding of our work is that, in terms of preserving the variation signal inherent in the data, a statistical measure preserves this key characteristic more faithfully across both multi-dimensional projections and multi-resolution representations than a methodology based upon averaging.
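The contrast between averaging and a variation-preserving statistic can be shown with a toy block-reduction. The snippet below is only an illustration of that contrast, not the paper's workflow: each block of a synthetic field is summarised once by the mean and once by the standard deviation.

```python
# Illustration of the core idea (not the paper's implementation): when producing
# a reduced-resolution version of a field, summarising each block by a variation
# statistic (here the standard deviation) preserves the variation signal that
# plain block-averaging smooths away.
import numpy as np

rng = np.random.default_rng(0)
field = rng.normal(size=(512, 512)) * np.linspace(0.1, 2.0, 512)  # variance grows left to right

def block_reduce(a, k, stat):
    h, w = a.shape[0] // k, a.shape[1] // k
    return stat(a[:h * k, :w * k].reshape(h, k, w, k), axis=(1, 3))

mean_view = block_reduce(field, 32, np.mean)   # hides where the field is most variable
std_view = block_reduce(field, 32, np.std)     # makes the variation gradient explicit
print("column-wise spread of mean view:", mean_view.std(axis=0).round(2))
print("column-wise level of std view:  ", std_view.mean(axis=0).round(2))
```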
Feature weight estimation for gene selection: a local hyperlinear learning approach
2014-01-01
Background Modeling high-dimensional data involving thousands of variables is particularly important for gene expression profiling experiments; nevertheless, it remains a challenging task. One of the challenges is to implement an effective method for selecting a small set of relevant genes buried in high-dimensional irrelevant noise. RELIEF is a popular and widely used approach for feature selection owing to its low computational cost and high accuracy. However, RELIEF-based methods suffer from instability, especially in the presence of noisy and/or high-dimensional outliers. Results We propose an innovative feature weighting algorithm, called LHR, to select informative genes from highly noisy data. LHR is based on RELIEF for feature weighting using classical margin maximization. The key idea of LHR is to estimate the feature weights through local approximation rather than the global measurement typically used in existing methods. The weights obtained by our method are very robust against the degradation caused by noisy features, even in very high dimensions. To demonstrate the performance of our method, extensive classification experiments have been carried out on both synthetic and real microarray benchmark datasets by combining the proposed technique with standard classifiers, including the support vector machine (SVM), k-nearest neighbor (KNN), hyperplane k-nearest neighbor (HKNN), linear discriminant analysis (LDA) and naive Bayes (NB). Conclusion Experiments on both synthetic and real-world datasets demonstrate the superior performance of the proposed feature selection method combined with supervised learning in three aspects: 1) high classification accuracy, 2) excellent robustness to noise and 3) good stability across various classification algorithms. PMID:24625071
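For readers unfamiliar with the RELIEF family, the sketch below shows the classical RELIEF weighting that LHR builds on: each feature's weight grows with its margin between the nearest "miss" (other class) and nearest "hit" (same class). The local hyperlinear approximation specific to LHR is not reproduced.

```python
# Sketch of the classical RELIEF weighting that LHR builds on. LHR's local
# hyperplane approximation is not reproduced here; this is the standard
# version on synthetic data with Manhattan distances.
import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)
X = (X - X.min(0)) / (X.max(0) - X.min(0))     # scale features to [0, 1]

weights = np.zeros(X.shape[1])
for i in range(len(X)):
    diff = np.abs(X - X[i])
    dist = diff.sum(axis=1)
    dist[i] = np.inf
    same, other = (y == y[i]), (y != y[i])
    hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class sample
    miss = np.argmin(np.where(other, dist, np.inf))  # nearest other-class sample
    weights += diff[miss] - diff[hit]                # reward separation, penalise spread

weights /= len(X)
print("top-ranked features:", np.argsort(weights)[::-1][:5])
```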
Ekeberg, Tomas
2015-05-26
This dataset contains the diffraction patterns that were used for the first three-dimensional reconstruction of a virus using FEL data. The sample was the giant mimivirus particle, which is one of the largest known viruses with a diameter of 450 nm. The dataset consists of the 198 diffraction patterns that were used in the analysis.
VizieR Online Data Catalog: Outliers and similarity in APOGEE (Reis+, 2018)
NASA Astrophysics Data System (ADS)
Reis, I.; Poznanski, D.; Baron, D.; Zasowski, G.; Shahaf, S.
2017-11-01
t-SNE is a dimensionality reduction algorithm that is particularly well suited for the visualization of high-dimensional datasets. We use t-SNE to visualize our distance matrix. A priori, these distances could define a space with almost as many dimensions as objects, i.e., tens of thousands of dimensions. Obviously, since many stars are quite similar, and their spectra are defined by a few physical parameters, the minimal spanning space might be smaller. By using t-SNE we can examine the structure of our sample projected into 2D. We use our distance matrix as input to the t-SNE algorithm and in return get a 2D map of the objects in our dataset. For each star in a sample of 183232 APOGEE stars, the APOGEE IDs of the 99 stars with the most similar spectra (according to the method described in the paper) are given, ordered by similarity. (3 data files).
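A precomputed distance matrix can be fed to t-SNE directly, as sketched below with scikit-learn. The distances here come from random vectors standing in for spectra; with real data the matrix would hold the catalog's spectral distances.

```python
# Sketch of projecting a precomputed distance matrix to a 2D map with t-SNE.
# The "spectra" below are random placeholders, not APOGEE data.
import numpy as np
from sklearn.manifold import TSNE
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
spectra = rng.random((500, 300))                  # placeholder spectra
D = squareform(pdist(spectra))                    # precomputed distance matrix

tsne = TSNE(n_components=2, metric="precomputed", init="random",
            perplexity=30, random_state=0)
xy = tsne.fit_transform(D)                        # 2D coordinates, one row per object
print(xy.shape)
```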
Kireeva, N; Baskin, I I; Gaspar, H A; Horvath, D; Marcou, G; Varnek, A
2012-04-01
Here, the utility of Generative Topographic Maps (GTM) for data visualization, structure-activity modeling and database comparison is evaluated using subsets of the Directory of Useful Decoys (DUD). Unlike other popular dimensionality reduction approaches such as Principal Component Analysis, Sammon Mapping or Self-Organizing Maps, the great advantage of GTMs is that they provide data probability distribution functions (PDF), both in the high-dimensional space defined by molecular descriptors and in the 2D latent space. PDFs for the molecules of different activity classes were successfully used to build classification models in the framework of the Bayesian approach. Because PDFs are represented by a mixture of Gaussian functions, the Bhattacharyya kernel has been proposed as a measure of the overlap of datasets, which leads to an elegant method for the global comparison of chemical libraries. Copyright © 2012 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.
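The building block behind such overlap measures is the Bhattacharyya coefficient between Gaussians, which has a closed form. The sketch below implements only this pairwise-Gaussian coefficient; how the paper combines the components of two GTM mixtures is not reproduced.

```python
# Closed-form Bhattacharyya coefficient between two multivariate Gaussians,
# the basic building block for comparing Gaussian-mixture PDFs. The paper's
# mixture-level combination is not reproduced here.
import numpy as np

def bhattacharyya_gaussians(mu1, cov1, mu2, cov2):
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return np.exp(-(term1 + term2))          # 1.0 for identical Gaussians, -> 0 when far apart

mu1, mu2 = np.zeros(2), np.array([1.0, 0.5])
cov1, cov2 = np.eye(2), np.diag([0.5, 2.0])
print(bhattacharyya_gaussians(mu1, cov1, mu2, cov2))
print(bhattacharyya_gaussians(mu1, cov1, mu1, cov1))  # identical -> 1.0
```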
Parsimonious description for predicting high-dimensional dynamics
Hirata, Yoshito; Takeuchi, Tomoya; Horai, Shunsuke; Suzuki, Hideyuki; Aihara, Kazuyuki
2015-01-01
When we observe a system, we often cannot observe all of its variables and may have only a limited set of measurements. Under such circumstances, delay coordinates, vectors made of successive measurements, are useful for reconstructing the state of the whole system. Although the method of delay coordinates is theoretically supported for high-dimensional dynamical systems, in practice there is a limitation because the calculation for higher-dimensional delay coordinates becomes more expensive. Here, we propose a parsimonious description of virtually infinite-dimensional delay coordinates by evaluating their distances with exponentially decaying weights. This description enables us to predict future values of the measurements faster, because we can reuse the calculated distances, and more accurately, because the description naturally reduces the bias of classical delay coordinates toward the stable directions. We demonstrate the proposed method with toy models of the atmosphere and real datasets related to renewable energy. PMID:26510518
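Under the exponential weighting described above, squared distances between the (virtually infinite-dimensional) delay vectors satisfy a one-step recursion, which is what makes reuse of previously computed distances possible. The sketch below assumes weights lam**k on lag k; it is an illustration of that recursion, not the paper's full prediction scheme.

```python
# Sketch (under the stated weighting assumption) of distances between
# exponentially weighted delay coordinates: with weights lam**k on lag k,
# D2[t, s] = (x[t] - x[s])**2 + lam * D2[t-1, s-1],
# so previously computed distances are reused instead of rebuilding
# ever-longer delay vectors.
import numpy as np

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=400))        # a scalar time series
lam = 0.9                                  # decay factor of the weights
n = len(x)

D2 = np.zeros((n, n))
for t in range(n):
    for s in range(n):
        base = (x[t] - x[s]) ** 2
        D2[t, s] = base if min(t, s) == 0 else base + lam * D2[t - 1, s - 1]

# D2[t, s] now approximates the squared distance between the weighted delay
# vectors ending at times t and s.
print(D2[100, 250], D2[250, 100])
```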
Choi, Ji Yeh; Hwang, Heungsun; Yamamoto, Michio; Jung, Kwanghee; Woodward, Todd S
2017-06-01
Functional principal component analysis (FPCA) and functional multiple-set canonical correlation analysis (FMCCA) are data reduction techniques for functional data that are collected in the form of smooth curves or functions over a continuum such as time or space. In FPCA, low-dimensional components are extracted from a single functional dataset such that they explain the most variance of the dataset, whereas in FMCCA, low-dimensional components are obtained from each of multiple functional datasets in such a way that the associations among the components are maximized across the different sets. In this paper, we propose a unified approach to FPCA and FMCCA. The proposed approach subsumes both techniques as special cases. Furthermore, it permits a compromise between the techniques, such that components are obtained from each set of functional data to maximize their associations across different datasets, while accounting for the variance of the data well. We propose a single optimization criterion for the proposed approach, and develop an alternating regularized least squares algorithm to minimize the criterion in combination with basis function approximations to functions. We conduct a simulation study to investigate the performance of the proposed approach based on synthetic data. We also apply the approach for the analysis of multiple-subject functional magnetic resonance imaging data to obtain low-dimensional components of blood-oxygen level-dependent signal changes of the brain over time, which are highly correlated across the subjects as well as representative of the data. The extracted components are used to identify networks of neural activity that are commonly activated across the subjects while carrying out a working memory task.
Lessons learned in the analysis of high-dimensional data in vaccinomics
Oberg, Ann L.; McKinney, Brett A.; Schaid, Daniel J.; Pankratz, V. Shane; Kennedy, Richard B.; Poland, Gregory A.
2015-01-01
The field of vaccinology is increasingly moving toward the generation, analysis, and modeling of extremely large and complex high-dimensional datasets. We have used data such as these in the development and advancement of the field of vaccinomics to enable prediction of vaccine responses and to develop new vaccine candidates. However, the application of systems biology to what has been termed “big data,” or “high-dimensional data,” is not without significant challenges—chief among them a paucity of gold standard analysis and modeling paradigms with which to interpret the data. In this article, we relate some of the lessons we have learned over the last decade of working with high-dimensional, high-throughput data as applied to the field of vaccinomics. The value of such efforts, however, is ultimately to better understand the immune mechanisms by which protective and non-protective responses to vaccines are generated, and to use this information to support a personalized vaccinology approach in creating better, and safer, vaccines for the public health. PMID:25957070
Lessons learned in the analysis of high-dimensional data in vaccinomics.
Oberg, Ann L; McKinney, Brett A; Schaid, Daniel J; Pankratz, V Shane; Kennedy, Richard B; Poland, Gregory A
2015-09-29
The field of vaccinology is increasingly moving toward the generation, analysis, and modeling of extremely large and complex high-dimensional datasets. We have used data such as these in the development and advancement of the field of vaccinomics to enable prediction of vaccine responses and to develop new vaccine candidates. However, the application of systems biology to what has been termed "big data," or "high-dimensional data," is not without significant challenges-chief among them a paucity of gold standard analysis and modeling paradigms with which to interpret the data. In this article, we relate some of the lessons we have learned over the last decade of working with high-dimensional, high-throughput data as applied to the field of vaccinomics. The value of such efforts, however, is ultimately to better understand the immune mechanisms by which protective and non-protective responses to vaccines are generated, and to use this information to support a personalized vaccinology approach in creating better, and safer, vaccines for the public health. Copyright © 2015 Elsevier Ltd. All rights reserved.
Dashti, Ali; Komarov, Ivan; D'Souza, Roshan M
2013-01-01
This paper presents an implementation of brute-force exact k-Nearest Neighbor Graph (k-NNG) construction for ultra-large, high-dimensional data clouds. The proposed method uses Graphics Processing Units (GPUs) and is scalable with multiple levels of parallelism (between nodes of a cluster, between different GPUs on a single node, and within a GPU). The method is applicable to homogeneous computing clusters with a varying number of nodes and GPUs per node. We achieve a 6-fold speedup in data processing as compared with an optimized method running on a cluster of CPUs and bring a hitherto impossible k-NNG generation for a dataset of twenty million images with 15 k dimensionality into the realm of practical possibility.
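The brute-force k-NNG idea can be sketched on a CPU with batched distance computations, as below. The multi-node, multi-GPU parallelism of the paper is not reproduced; this only shows the exact, exhaustive construction on a small random dataset.

```python
# CPU sketch of brute-force exact k-NNG construction with batching over query
# rows (the paper's GPU multi-level parallelism is not reproduced).
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((5000, 64)).astype(np.float32)
k, batch = 10, 512
sq_norms = (X ** 2).sum(axis=1)

neighbors = np.empty((len(X), k), dtype=np.int64)
for start in range(0, len(X), batch):
    Q = X[start:start + batch]
    # squared Euclidean distances via the expansion |q|^2 - 2 q.x + |x|^2
    d2 = (Q ** 2).sum(axis=1)[:, None] - 2.0 * Q @ X.T + sq_norms[None, :]
    np.fill_diagonal(d2[:, start:start + batch], np.inf)   # exclude self-matches
    neighbors[start:start + batch] = np.argpartition(d2, k, axis=1)[:, :k]

print(neighbors.shape)   # (n, k): edge list of the exact k-NN graph
```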
Blazing Signature Filter: a library for fast pairwise similarity comparisons
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lee, Joon-Yong; Fujimoto, Grant M.; Wilson, Ryan
Identifying similarities between datasets is a fundamental task in data mining and has become an integral part of modern scientific investigation. Whether the task is to identify co-expressed genes in large-scale expression surveys or to predict combinations of gene knockouts which would elicit a similar phenotype, the underlying computational task is often a multi-dimensional similarity test. As datasets continue to grow, improvements to the efficiency, sensitivity or specificity of such computation will have broad impacts, as they allow scientists to more completely explore the wealth of scientific data. A significant practical drawback of large-scale data mining is that the vast majority of pairwise comparisons are unlikely to be relevant, meaning that they do not share a signature of interest. It is therefore essential to identify these unproductive comparisons as rapidly as possible and exclude them from more time-intensive similarity calculations. The Blazing Signature Filter (BSF) is a highly efficient pairwise similarity algorithm which enables extensive data mining within a reasonable amount of time. The algorithm transforms datasets into binary metrics, allowing it to utilize computationally efficient bit operators and provide a coarse measure of similarity. As a result, the BSF can scale to high dimensionality and rapidly filter out unproductive pairwise comparisons. Two bioinformatics applications of the tool are presented to demonstrate the ability to scale to billions of pairwise comparisons and the usefulness of this approach.
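The bit-level filtering idea can be sketched in a few lines: binarise each profile, pack the bits, and use bitwise AND plus popcounts as a cheap coarse similarity to rule out unproductive pairs before any expensive calculation. This is only an illustration of the principle, not the BSF library itself, and the threshold is arbitrary.

```python
# Sketch of the core signature-filter idea (not the BSF library itself):
# binarise each profile, pack the bits, and use bitwise AND + popcount as a
# coarse similarity to discard unproductive comparisons early.
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 4096))
signatures = np.packbits(data > 0, axis=1)          # one bit per feature

def shared_bits(a, b):
    # popcount of the bitwise AND = number of features "on" in both profiles
    return int(np.unpackbits(a & b).sum())

min_shared = 1050                                   # arbitrary filter threshold
candidates = [(i, j) for i in range(50) for j in range(i + 1, 50)
              if shared_bits(signatures[i], signatures[j]) >= min_shared]
print(f"{len(candidates)} candidate pairs passed the coarse filter")
```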
Three-dimensional scanning transmission electron microscopy of biological specimens.
de Jonge, Niels; Sougrat, Rachid; Northan, Brian M; Pennycook, Stephen J
2010-02-01
A three-dimensional (3D) reconstruction of the cytoskeleton and a clathrin-coated pit in mammalian cells has been achieved from a focal-series of images recorded in an aberration-corrected scanning transmission electron microscope (STEM). The specimen was a metallic replica of the biological structure comprising Pt nanoparticles 2-3 nm in diameter, with a high stability under electron beam radiation. The 3D dataset was processed by an automated deconvolution procedure. The lateral resolution was 1.1 nm, set by pixel size. Particles differing by only 10 nm in vertical position were identified as separate objects with greater than 20% dip in contrast between them. We refer to this value as the axial resolution of the deconvolution or reconstruction, the ability to recognize two objects, which were unresolved in the original dataset. The resolution of the reconstruction is comparable to that achieved by tilt-series transmission electron microscopy. However, the focal-series method does not require mechanical tilting and is therefore much faster. 3D STEM images were also recorded of the Golgi ribbon in conventional thin sections containing 3T3 cells with a comparable axial resolution in the deconvolved dataset.
Three-Dimensional Scanning Transmission Electron Microscopy of Biological Specimens
de Jonge, Niels; Sougrat, Rachid; Northan, Brian M.; Pennycook, Stephen J.
2010-01-01
A three-dimensional (3D) reconstruction of the cytoskeleton and a clathrin-coated pit in mammalian cells has been achieved from a focal-series of images recorded in an aberration-corrected scanning transmission electron microscope (STEM). The specimen was a metallic replica of the biological structure comprising Pt nanoparticles 2–3 nm in diameter, with a high stability under electron beam radiation. The 3D dataset was processed by an automated deconvolution procedure. The lateral resolution was 1.1 nm, set by pixel size. Particles differing by only 10 nm in vertical position were identified as separate objects with greater than 20% dip in contrast between them. We refer to this value as the axial resolution of the deconvolution or reconstruction, the ability to recognize two objects, which were unresolved in the original dataset. The resolution of the reconstruction is comparable to that achieved by tilt-series transmission electron microscopy. However, the focal-series method does not require mechanical tilting and is therefore much faster. 3D STEM images were also recorded of the Golgi ribbon in conventional thin sections containing 3T3 cells with a comparable axial resolution in the deconvolved dataset. PMID:20082729
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ruebel, Oliver
2009-11-20
Knowledge discovery from large and complex collections of today's scientific datasets is a challenging task. With the ability to measure and simulate more processes at increasingly finer spatial and temporal scales, the growing number of data dimensions and data objects presents tremendous challenges for data analysis and for effective data exploration methods and tools. Researchers are overwhelmed with data, and standard tools are often insufficient to enable effective data analysis and knowledge discovery. The main objective of this thesis is to provide important new capabilities to accelerate scientific knowledge discovery from large, complex, and multivariate scientific data. The research covered in this thesis addresses these scientific challenges using a combination of scientific visualization, information visualization, automated data analysis, and other enabling technologies, such as efficient data management. The effectiveness of the proposed analysis methods is demonstrated via applications in two distinct scientific research fields, namely developmental biology and high-energy physics. Advances in microscopy, image analysis, and embryo registration enable for the first time measurement of gene expression at cellular resolution for entire organisms. Analysis of high-dimensional spatial gene expression datasets is a challenging task. By integrating data clustering and visualization, analysis of complex, time-varying, spatial gene expression patterns and their formation becomes possible. The analysis framework, MATLAB, and the visualization have been integrated, making advanced analysis tools accessible to biologists and enabling bioinformatics researchers to directly integrate their analysis with the visualization. Laser wakefield particle accelerators (LWFAs) promise to be a new compact source of high-energy particles and radiation, with wide applications ranging from medicine to physics. To gain insight into the complex physical processes of particle acceleration, physicists model LWFAs computationally. The datasets produced by LWFA simulations are (i) extremely large, (ii) of varying spatial and temporal resolution, (iii) heterogeneous, and (iv) high-dimensional, making analysis and knowledge discovery from complex LWFA simulation data a challenging task. To address these challenges this thesis describes the integration of the visualization system VisIt and the state-of-the-art index/query system FastBit, enabling interactive visual exploration of extremely large three-dimensional particle datasets. Researchers are especially interested in beams of high-energy particles formed during the course of a simulation. This thesis describes novel methods for automatic detection and analysis of particle beams enabling a more accurate and efficient data analysis process. By integrating these automated analysis methods with visualization, this research enables more accurate, efficient, and effective analysis of LWFA simulation data than previously possible.
Three-Dimensional Anisotropic Acoustic and Elastic Full-Waveform Seismic Inversion
NASA Astrophysics Data System (ADS)
Warner, M.; Morgan, J. V.
2013-12-01
Three-dimensional full-waveform inversion is a high-resolution, high-fidelity, quantitative, seismic imaging technique that has advanced rapidly within the oil and gas industry. The method involves the iterative improvement of a starting model using a series of local linearized updates to solve the full non-linear inversion problem. During the inversion, forward modeling employs the full two-way three-dimensional heterogeneous anisotropic acoustic or elastic wave equation to predict the observed raw field data, wiggle-for-wiggle, trace-by-trace. The method is computationally demanding; it is highly parallelized, and runs on large multi-core multi-node clusters. Here, we demonstrate what can be achieved by applying this newly practical technique to several high-density 3D seismic datasets that were acquired to image four contrasting sedimentary targets: a gas cloud above an oil reservoir, a radially faulted dome, buried fluvial channels, and collapse structures overlying an evaporite sequence. We show that the resulting anisotropic p-wave velocity models match in situ measurements in deep boreholes, reproduce detailed structure observed independently on high-resolution seismic reflection sections, accurately predict the raw seismic data, simplify and sharpen reverse-time-migrated reflection images of deeper horizons, and flatten Kirchhoff-migrated common-image gathers. We also show that full-elastic 3D full-waveform inversion of pure pressure data can generate a reasonable shear-wave velocity model for one of these datasets. For two of the four datasets, the inclusion of significant transversely isotropic anisotropy with a vertical axis of symmetry was necessary in order to fit the kinematics of the field data properly. For the faulted dome, the full-waveform-inversion p-wave velocity model recovers the detailed structure of every fault that can be seen on coincident seismic reflection data. Some of the individual faults represent high-velocity zones, some represent low-velocity zones, some have more-complex internal structure, and some are visible merely as offsets between two regions with contrasting velocity. Although this has not yet been demonstrated quantitatively for this dataset, it seems likely that at least some of this fine structure in the recovered velocity model is related to the detailed lithology, strain history and fluid properties within the individual faults. We have here applied this technique to seismic data that were acquired by the extractive industries, however this inversion scheme is immediately scalable and applicable to a much wider range of problems given sufficient quality and density of observed data. Potential targets range from shallow magma chambers beneath active volcanoes, through whole-crustal sections across plate boundaries, to regional and whole-Earth models.
Using Betweenness Centrality to Identify Manifold Shortcuts
Cukierski, William J.; Foran, David J.
2010-01-01
High-dimensional data presents a challenge to tasks of pattern recognition and machine learning. Dimensionality reduction (DR) methods remove the unwanted variance and make these tasks tractable. Several nonlinear DR methods, such as the well-known ISOMAP algorithm, rely on a neighborhood graph to compute geodesic distances between data points. These graphs can contain unwanted edges which connect disparate regions of one or more manifolds. This topological sensitivity is well known [1], [2], [3], yet handling high-dimensional, noisy data in the absence of a priori manifold knowledge remains an open and difficult problem. This work introduces a divisive, edge-removal method based on graph betweenness centrality which can robustly identify manifold-shortcutting edges. The problem of graph construction in high dimension is discussed and the proposed algorithm is fit into the ISOMAP workflow. ROC analysis is performed and the performance is tested on synthetic and real datasets. PMID:20607142
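The general idea can be sketched as follows: build a k-NN graph, score edges by betweenness centrality, and prune the highest-scoring edges, which are the likely shortcuts bridging separate manifold regions. The 99th-percentile cutoff below is illustrative, not the paper's criterion.

```python
# Sketch of betweenness-based shortcut removal before ISOMAP-style geodesics.
# The pruning threshold here is illustrative, not the paper's criterion.
import numpy as np
import networkx as nx
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
t = rng.uniform(0, 3 * np.pi, 600)
X = np.c_[t * np.cos(t), 10 * rng.random(600), t * np.sin(t)]   # noisy roll-like data

A = kneighbors_graph(X, n_neighbors=8, mode="distance")
G = nx.from_scipy_sparse_array(A)

ebc = nx.edge_betweenness_centrality(G, weight="weight")
cutoff = np.quantile(list(ebc.values()), 0.99)
shortcuts = [e for e, score in ebc.items() if score > cutoff]
G.remove_edges_from(shortcuts)
print(f"removed {len(shortcuts)} suspected shortcut edges")
# Geodesic distances for ISOMAP would now be computed on the pruned graph G.
```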
Algamal, Z Y; Lee, M H
2017-01-01
A high-dimensional quantitative structure-activity relationship (QSAR) classification model typically contains a large number of irrelevant and redundant descriptors. In this paper, a new descriptor selection approach for QSAR classification model estimation is proposed by adding a new weight inside the L1-norm. The experimental results of classifying the anti-hepatitis C virus activity of thiourea derivatives demonstrate that the proposed descriptor selection method performs effectively and competitively compared with other existing penalized methods in terms of classification performance on both the training and the testing datasets. Moreover, it is noteworthy that the results obtained in terms of the stability test and applicability domain provide a robust QSAR classification model. It is evident from the results that the developed QSAR classification model could conceivably be employed for further high-dimensional QSAR classification studies.
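The paper's specific weight design is not given in the abstract. As a hedged illustration of a weighted L1 penalty in general, the sketch below uses the standard adaptive-lasso-style trick: per-descriptor weights are folded into a rescaled design matrix so that an ordinary L1-penalised logistic regression effectively applies a weighted penalty.

```python
# Hedged sketch (not the paper's method): a weighted L1 penalty via rescaling.
# Columns are divided by weights w_j, so the ordinary L1 penalty on the rescaled
# problem corresponds to sum_j w_j * |beta_j| on the original scale.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=100, n_informative=10, random_state=0)

ridge = LogisticRegression(penalty="l2", C=1.0, max_iter=2000).fit(X, y)
w = 1.0 / (np.abs(ridge.coef_.ravel()) + 1e-6)      # pilot-estimate-based weights

lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
lasso.fit(X / w, y)                                  # rescale columns by 1/w
beta = lasso.coef_.ravel() / w                       # map back to the original scale
print("selected descriptors:", np.flatnonzero(beta))
```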
2012-01-01
Background Dimensionality reduction (DR) enables the construction of a lower dimensional space (embedding) from a higher dimensional feature space while preserving object-class discriminability. However several popular DR approaches suffer from sensitivity to choice of parameters and/or presence of noise in the data. In this paper, we present a novel DR technique known as consensus embedding that aims to overcome these problems by generating and combining multiple low-dimensional embeddings, hence exploiting the variance among them in a manner similar to ensemble classifier schemes such as Bagging. We demonstrate theoretical properties of consensus embedding which show that it will result in a single stable embedding solution that preserves information more accurately as compared to any individual embedding (generated via DR schemes such as Principal Component Analysis, Graph Embedding, or Locally Linear Embedding). Intelligent sub-sampling (via mean-shift) and code parallelization are utilized to provide for an efficient implementation of the scheme. Results Applications of consensus embedding are shown in the context of classification and clustering as applied to: (1) image partitioning of white matter and gray matter on 10 different synthetic brain MRI images corrupted with 18 different combinations of noise and bias field inhomogeneity, (2) classification of 4 high-dimensional gene-expression datasets, (3) cancer detection (at a pixel-level) on 16 image slices obtained from 2 different high-resolution prostate MRI datasets. In over 200 different experiments concerning classification and segmentation of biomedical data, consensus embedding was found to consistently outperform both linear and non-linear DR methods within all applications considered. Conclusions We have presented a novel framework termed consensus embedding which leverages ensemble classification theory within dimensionality reduction, allowing for application to a wide range of high-dimensional biomedical data classification and segmentation problems. Our generalizable framework allows for improved representation and classification in the context of both imaging and non-imaging data. The algorithm offers a promising solution to problems that currently plague DR methods, and may allow for extension to other areas of biomedical data analysis. PMID:22316103
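A much-simplified illustration of the consensus idea follows: generate several weak embeddings (here from random feature subsets and PCA), average their pairwise-distance matrices, and re-embed the consensus distances with MDS. The paper's full scheme, including mean-shift sub-sampling, parallelization, and the use of multiple DR methods, is not reproduced.

```python
# Simplified illustration of consensus embedding (not the paper's full scheme):
# several weak embeddings are combined through their pairwise-distance matrices,
# and the averaged distances are re-embedded with MDS.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 500))                     # placeholder high-dimensional data

consensus = np.zeros((len(X), len(X)))
n_embeddings = 20
for _ in range(n_embeddings):
    cols = rng.choice(X.shape[1], size=100, replace=False)
    Z = PCA(n_components=5).fit_transform(X[:, cols])
    consensus += squareform(pdist(Z))               # distances in this weak embedding
consensus /= n_embeddings

embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(consensus)
print(embedding.shape)
```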
Four-dimensional MRI of renal function in the developing mouse.
Xie, Luke; Subashi, Ergys; Qi, Yi; Knepper, Mark A; Johnson, G Allan
2014-09-01
The major roles of filtration, metabolism and high blood flow make the kidney highly vulnerable to drug-induced toxicity and other renal injuries. A method to follow kidney function is essential for the early screening of toxicity and malformations. In this study, we acquired high spatiotemporal resolution (four dimensional) datasets of normal mice to follow changes in kidney structure and function during development. The data were acquired with dynamic contrast-enhanced MRI (via keyhole imaging) and a cryogenic surface coil, allowing us to obtain a full three-dimensional image (isotropic resolution, 125 microns) every 7.7 s over a 50-min scan. This time course permitted the demonstration of both contrast enhancement and clearance. Functional changes were measured over a 17-week course (at 3, 5, 7, 9, 13 and 17 weeks). The time dimension of the MRI dataset was processed to produce unique image contrasts to segment the four regions of the kidney: cortex (CO), outer stripe (OS) of the outer medulla (OM), inner stripe (IS) of the OM and inner medulla (IM). Local volumes, time-to-peak (TTP) values and decay constants (DC) were measured in each renal region. These metrics increased significantly with age, with the exception of DC values in the IS and OS. These data will serve as a foundation for studies of normal renal physiology and future studies of renal diseases that require early detection and intervention. Copyright © 2014 John Wiley & Sons, Ltd.
Numericware i: Identical by State Matrix Calculator
Kim, Bongsong; Beavis, William D
2017-01-01
We introduce software, Numericware i, to compute an identical-by-state (IBS) matrix from genotypic data. Calculating an IBS matrix for a large dataset requires large computer memory and lengthy processing time. Numericware i addresses these challenges with 2 algorithmic methods: multithreading and forward chopping. The multithreading allows computational routines to run concurrently on multiple central processing unit (CPU) processors. The forward chopping addresses memory limitations by dividing a dataset into appropriately sized subsets. Numericware i allows calculation of the IBS matrix for a large genotypic dataset using a laptop or a desktop computer. For comparison with different software, we calculated genetic relationship matrices using Numericware i, SPAGeDi, and TASSEL with the same genotypic dataset. Numericware i calculates IBS coefficients between 0 and 2, whereas SPAGeDi and TASSEL produce different ranges of values including negative values. The Pearson correlation coefficient between the matrices from Numericware i and TASSEL was high at .9972, whereas SPAGeDi showed low correlation with Numericware i (.0505) and TASSEL (.0587). With a high-dimensional dataset of 500 entities by 10 000 000 SNPs, Numericware i spent 382 minutes using 19 CPU threads and 64 GB memory by dividing the dataset into 3 pieces, whereas SPAGeDi and TASSEL failed with the same dataset. Numericware i is freely available for Windows and Linux under CC-BY 4.0 license at https://figshare.com/s/f100f33a8857131eb2db. PMID:28469375
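A common definition of the pairwise IBS coefficient on genotypes coded 0/1/2 is sketched below; it yields values in [0, 2] as described in the abstract. Numericware i's exact formula, multithreading, and forward chopping are not reproduced; this is a plain vectorised version on a small random dataset.

```python
# Sketch of a pairwise IBS matrix on genotypes coded 0/1/2 (copies of one allele):
# IBS(i, j) = 2 - mean_k |g_ik - g_jk|, which lies in [0, 2].
# Numericware i's multithreading and forward chopping are not reproduced.
import numpy as np

rng = np.random.default_rng(0)
G = rng.integers(0, 3, size=(50, 1000))              # 50 individuals x 1000 SNPs

n = G.shape[0]
ibs = np.empty((n, n))
for i in range(n):
    ibs[i] = 2.0 - np.abs(G - G[i]).mean(axis=1)     # one row of the IBS matrix at a time

print(ibs[0, 0], ibs[:3, :3].round(3))               # diagonal entries equal 2
```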
Jones, Nia W; Raine-Fenning, Nick J; Mousa, Hatem A; Bradley, Eileen; Bugg, George J
2011-03-01
Three-dimensional (3-D) power Doppler angiography (3-D-PDA) allows visualisation of Doppler signals within the placenta and their quantification is possible by the generation of vascular indices by the 4-D View software programme. This study aimed to investigate intra- and interobserver reproducibility of 3-D-PDA analysis of stored datasets at varying gestations with the ultimate goal being to develop a tool for predicting placental dysfunction. Women with an uncomplicated, viable singleton pregnancy were scanned at 12, 16 or 20 weeks gestational age groups. 3-D-PDA datasets acquired of the whole placenta were analysed using the VOCAL software processing tool. Each volume was analysed by three observers twice in the A plane. Intra- and interobserver reliability was assessed by intraclass correlation coefficients (ICCs) and Bland Altman plots. At each gestational age group, 20 low risk women were scanned resulting in 60 datasets in total. The ICC demonstrated a high level of measurement reliability at each gestation with intraobserver values >0.90 and interobserver values of >0.6 for the vascular indices. Bland Altman plots also showed high levels of agreement. Systematic bias was seen at 20 weeks in the vascular indices obtained by different observers. This study demonstrates that 3-D-PDA data can be measured reliably by different observers from stored datasets up to 18 weeks gestation. Measurements become less reliable as gestation advances with bias between observers evident at 20 weeks. Copyright © 2011 World Federation for Ultrasound in Medicine & Biology. Published by Elsevier Inc. All rights reserved.
A model-based 3D phase unwrapping algorithm using Gegenbauer polynomials.
Langley, Jason; Zhao, Qun
2009-09-07
The application of a two-dimensional (2D) phase unwrapping algorithm to a three-dimensional (3D) phase map may result in an unwrapped phase map that is discontinuous in the direction normal to the unwrapped plane. This work investigates the problem of phase unwrapping for 3D phase maps. The phase map is modeled as a product of three one-dimensional Gegenbauer polynomials. The orthogonality of Gegenbauer polynomials and their derivatives on the interval [-1, 1] are exploited to calculate the expansion coefficients. The algorithm was implemented using two well-known Gegenbauer polynomials: Chebyshev polynomials of the first kind and Legendre polynomials. Both implementations of the phase unwrapping algorithm were tested on 3D datasets acquired from a magnetic resonance imaging (MRI) scanner. The first dataset was acquired from a homogeneous spherical phantom. The second dataset was acquired using the same spherical phantom but magnetic field inhomogeneities were introduced by an external coil placed adjacent to the phantom, which provided an additional burden to the phase unwrapping algorithm. Then Gaussian noise was added to generate a low signal-to-noise ratio dataset. The third dataset was acquired from the brain of a human volunteer. The results showed that Chebyshev implementation and the Legendre implementation of the phase unwrapping algorithm give similar results on the 3D datasets. Both implementations of the phase unwrapping algorithm compare well to PRELUDE 3D, 3D phase unwrapping software well recognized for functional MRI.
de Dumast, Priscille; Mirabel, Clément; Cevidanes, Lucia; Ruellas, Antonio; Yatabe, Marilia; Ioshida, Marcos; Ribera, Nina Tubau; Michoud, Loic; Gomes, Liliane; Huang, Chao; Zhu, Hongtu; Muniz, Luciana; Shoukri, Brandon; Paniagua, Beatriz; Styner, Martin; Pieper, Steve; Budin, Francois; Vimort, Jean-Baptiste; Pascal, Laura; Prieto, Juan Carlos
2018-07-01
The purpose of this study is to describe the methodological innovations of a web-based system for storage, integration and computation of biomedical data, using a training imaging dataset to remotely compute a deep neural network classifier of temporomandibular joint osteoarthritis (TMJOA). This study imaging dataset consisted of three-dimensional (3D) surface meshes of mandibular condyles constructed from cone beam computed tomography (CBCT) scans. The training dataset consisted of 259 condyles, 105 from control subjects and 154 from patients with diagnosis of TMJ OA. For the image analysis classification, 34 right and left condyles from 17 patients (39.9 ± 11.7 years), who experienced signs and symptoms of the disease for less than 5 years, were included as the testing dataset. For the integrative statistical model of clinical, biological and imaging markers, the sample consisted of the same 17 test OA subjects and 17 age and sex matched control subjects (39.4 ± 15.4 years), who did not show any sign or symptom of OA. For these 34 subjects, a standardized clinical questionnaire, blood and saliva samples were also collected. The technological methodologies in this study include a deep neural network classifier of 3D condylar morphology (ShapeVariationAnalyzer, SVA), and a flexible web-based system for data storage, computation and integration (DSCI) of high dimensional imaging, clinical, and biological data. The DSCI system trained and tested the neural network, indicating 5 stages of structural degenerative changes in condylar morphology in the TMJ with 91% close agreement between the clinician consensus and the SVA classifier. The DSCI remotely ran with a novel application of a statistical analysis, the Multivariate Functional Shape Data Analysis, that computed high dimensional correlations between shape 3D coordinates, clinical pain levels and levels of biological markers, and then graphically displayed the computation results. The findings of this study demonstrate a comprehensive phenotypic characterization of TMJ health and disease at clinical, imaging and biological levels, using novel flexible and versatile open-source tools for a web-based system that provides advanced shape statistical analysis and a neural network based classification of temporomandibular joint osteoarthritis. Published by Elsevier Ltd.
Mathew, B; Schmitz, A; Muñoz-Descalzo, S; Ansari, N; Pampaloni, F; Stelzer, E H K; Fischer, S C
2015-06-08
Due to the large amount of data produced by advanced microscopy, automated image analysis is crucial in modern biology. Most applications require reliable cell nuclei segmentation. However, in many biological specimens cell nuclei are densely packed and appear to touch one another in the images. Therefore, a major difficulty of three-dimensional cell nuclei segmentation is the decomposition of cell nuclei that apparently touch each other. Current methods are highly adapted to a certain biological specimen or a specific microscope. They do not ensure similarly accurate segmentation performance, i.e. their robustness for different datasets is not guaranteed. Hence, these methods require elaborate adjustments to each dataset. We present an advanced three-dimensional cell nuclei segmentation algorithm that is accurate and robust. Our approach combines local adaptive pre-processing with decomposition based on Lines-of-Sight (LoS) to separate apparently touching cell nuclei into approximately convex parts. We demonstrate the superior performance of our algorithm using data from different specimens recorded with different microscopes. The three-dimensional images were recorded with confocal and light sheet-based fluorescence microscopes. The specimens are an early mouse embryo and two different cellular spheroids. We compared the segmentation accuracy of our algorithm with ground truth data for the test images and results from state-of-the-art methods. The analysis shows that our method is accurate throughout all test datasets (mean F-measure: 91%) whereas the other methods each failed for at least one dataset (F-measure≤69%). Furthermore, nuclei volume measurements are improved for LoS decomposition. The state-of-the-art methods required laborious adjustments of parameter values to achieve these results. Our LoS algorithm did not require parameter value adjustments. The accurate performance was achieved with one fixed set of parameter values. We developed a novel and fully automated three-dimensional cell nuclei segmentation method incorporating LoS decomposition. LoS are easily accessible features that ensure correct splitting of apparently touching cell nuclei independent of their shape, size or intensity. Our method showed superior performance compared to state-of-the-art methods, performing accurately for a variety of test images. Hence, our LoS approach can be readily applied to quantitative evaluation in drug testing, developmental and cell biology.
Discriminant projective non-negative matrix factorization.
Guan, Naiyang; Zhang, Xiang; Luo, Zhigang; Tao, Dacheng; Yang, Xuejun
2013-01-01
Projective non-negative matrix factorization (PNMF) projects high-dimensional non-negative examples X onto a lower-dimensional subspace spanned by a non-negative basis W and considers W^T X as their coefficients, i.e., X ≈ WW^T X. Since PNMF learns the natural parts-based representation W of X, it has been widely used in many fields such as pattern recognition and computer vision. However, PNMF does not perform well in classification tasks because it completely ignores the label information of the dataset. This paper proposes a Discriminant PNMF method (DPNMF) to overcome this deficiency. In particular, DPNMF applies Fisher's criterion to PNMF in order to utilize the label information. Similar to PNMF, DPNMF learns a single non-negative basis matrix and needs less computational burden than NMF. In contrast to PNMF, DPNMF maximizes the distance between the centers of any two classes of examples while minimizing the distance between any two examples of the same class in the lower-dimensional subspace, and thus has more discriminant power. We develop a multiplicative update rule to solve DPNMF and prove its convergence. Experimental results on four popular face image datasets confirm its effectiveness compared with the representative NMF and PNMF algorithms.
Discriminant Projective Non-Negative Matrix Factorization
Guan, Naiyang; Zhang, Xiang; Luo, Zhigang; Tao, Dacheng; Yang, Xuejun
2013-01-01
Projective non-negative matrix factorization (PNMF) projects high-dimensional non-negative examples X onto a lower-dimensional subspace spanned by a non-negative basis W and considers W^T X as their coefficients, i.e., X ≈ WW^T X. Since PNMF learns the natural parts-based representation W of X, it has been widely used in many fields such as pattern recognition and computer vision. However, PNMF does not perform well in classification tasks because it completely ignores the label information of the dataset. This paper proposes a Discriminant PNMF method (DPNMF) to overcome this deficiency. In particular, DPNMF applies Fisher's criterion to PNMF in order to utilize the label information. Similar to PNMF, DPNMF learns a single non-negative basis matrix and needs less computational burden than NMF. In contrast to PNMF, DPNMF maximizes the distance between the centers of any two classes of examples while minimizing the distance between any two examples of the same class in the lower-dimensional subspace, and thus has more discriminant power. We develop a multiplicative update rule to solve DPNMF and prove its convergence. Experimental results on four popular face image datasets confirm its effectiveness compared with the representative NMF and PNMF algorithms. PMID:24376680
NASA Astrophysics Data System (ADS)
Madricardo, Fantina; Foglini, Federica; Kruss, Aleksandra; Ferrarin, Christian; Pizzeghello, Nicola Marco; Murri, Chiara; Rossi, Monica; Bajo, Marco; Bellafiore, Debora; Campiani, Elisabetta; Fogarin, Stefano; Grande, Valentina; Janowski, Lukasz; Keppel, Erica; Leidi, Elisa; Lorenzetti, Giuliano; Maicu, Francesco; Maselli, Vittorio; Mercorella, Alessandra; Montereale Gavazzi, Giacomo; Minuzzo, Tiziano; Pellegrini, Claudio; Petrizzo, Antonio; Prampolini, Mariacristina; Remia, Alessandro; Rizzetto, Federica; Rovere, Marzia; Sarretta, Alessandro; Sigovini, Marco; Sinapi, Luigi; Umgiesser, Georg; Trincardi, Fabio
2017-09-01
Tidal channels are crucial for the functioning of wetlands, though their morphological properties, which are relevant for seafloor habitats and flow, have been understudied so far. Here, we release a dataset composed of Digital Terrain Models (DTMs) extracted from a total of 2,500 linear kilometres of high-resolution multibeam echosounder (MBES) data collected in 2013 covering the entire network of tidal channels and inlets of the Venice Lagoon, Italy. The dataset comprises also the backscatter (BS) data, which reflect the acoustic properties of the seafloor, and the tidal current fields simulated by means of a high-resolution three-dimensional unstructured hydrodynamic model. The DTMs and the current fields help define how morphological and benthic properties of tidal channels are affected by the action of currents. These data are of potential broad interest not only to geomorphologists, oceanographers and ecologists studying the morphology, hydrodynamics, sediment transport and benthic habitats of tidal environments, but also to coastal engineers and stakeholders for cost-effective monitoring and sustainable management of this peculiar shallow coastal system.
Madricardo, Fantina; Foglini, Federica; Kruss, Aleksandra; Ferrarin, Christian; Pizzeghello, Nicola Marco; Murri, Chiara; Rossi, Monica; Bajo, Marco; Bellafiore, Debora; Campiani, Elisabetta; Fogarin, Stefano; Grande, Valentina; Janowski, Lukasz; Keppel, Erica; Leidi, Elisa; Lorenzetti, Giuliano; Maicu, Francesco; Maselli, Vittorio; Mercorella, Alessandra; Montereale Gavazzi, Giacomo; Minuzzo, Tiziano; Pellegrini, Claudio; Petrizzo, Antonio; Prampolini, Mariacristina; Remia, Alessandro; Rizzetto, Federica; Rovere, Marzia; Sarretta, Alessandro; Sigovini, Marco; Sinapi, Luigi; Umgiesser, Georg; Trincardi, Fabio
2017-01-01
Tidal channels are crucial for the functioning of wetlands, though their morphological properties, which are relevant for seafloor habitats and flow, have been understudied so far. Here, we release a dataset composed of Digital Terrain Models (DTMs) extracted from a total of 2,500 linear kilometres of high-resolution multibeam echosounder (MBES) data collected in 2013 covering the entire network of tidal channels and inlets of the Venice Lagoon, Italy. The dataset comprises also the backscatter (BS) data, which reflect the acoustic properties of the seafloor, and the tidal current fields simulated by means of a high-resolution three-dimensional unstructured hydrodynamic model. The DTMs and the current fields help define how morphological and benthic properties of tidal channels are affected by the action of currents. These data are of potential broad interest not only to geomorphologists, oceanographers and ecologists studying the morphology, hydrodynamics, sediment transport and benthic habitats of tidal environments, but also to coastal engineers and stakeholders for cost-effective monitoring and sustainable management of this peculiar shallow coastal system. PMID:28872636
Madricardo, Fantina; Foglini, Federica; Kruss, Aleksandra; Ferrarin, Christian; Pizzeghello, Nicola Marco; Murri, Chiara; Rossi, Monica; Bajo, Marco; Bellafiore, Debora; Campiani, Elisabetta; Fogarin, Stefano; Grande, Valentina; Janowski, Lukasz; Keppel, Erica; Leidi, Elisa; Lorenzetti, Giuliano; Maicu, Francesco; Maselli, Vittorio; Mercorella, Alessandra; Montereale Gavazzi, Giacomo; Minuzzo, Tiziano; Pellegrini, Claudio; Petrizzo, Antonio; Prampolini, Mariacristina; Remia, Alessandro; Rizzetto, Federica; Rovere, Marzia; Sarretta, Alessandro; Sigovini, Marco; Sinapi, Luigi; Umgiesser, Georg; Trincardi, Fabio
2017-09-05
Tidal channels are crucial for the functioning of wetlands, though their morphological properties, which are relevant for seafloor habitats and flow, have been understudied so far. Here, we release a dataset composed of Digital Terrain Models (DTMs) extracted from a total of 2,500 linear kilometres of high-resolution multibeam echosounder (MBES) data collected in 2013 covering the entire network of tidal channels and inlets of the Venice Lagoon, Italy. The dataset comprises also the backscatter (BS) data, which reflect the acoustic properties of the seafloor, and the tidal current fields simulated by means of a high-resolution three-dimensional unstructured hydrodynamic model. The DTMs and the current fields help define how morphological and benthic properties of tidal channels are affected by the action of currents. These data are of potential broad interest not only to geomorphologists, oceanographers and ecologists studying the morphology, hydrodynamics, sediment transport and benthic habitats of tidal environments, but also to coastal engineers and stakeholders for cost-effective monitoring and sustainable management of this peculiar shallow coastal system.
Advanced Multidimensional Separations in Mass Spectrometry: Navigating the Big Data Deluge
May, Jody C.; McLean, John A.
2017-01-01
Hybrid analytical instrumentation constructed around mass spectrometry (MS) are becoming preferred techniques for addressing many grand challenges in science and medicine. From the omics sciences to drug discovery and synthetic biology, multidimensional separations based on MS provide the high peak capacity and high measurement throughput necessary to obtain large-scale measurements which are used to infer systems-level information. In this review, we describe multidimensional MS configurations as technologies which are big data drivers and discuss some new and emerging strategies for mining information from large-scale datasets. A discussion is included on the information content which can be obtained from individual dimensions, as well as the unique information which can be derived by comparing different levels of data. Finally, we discuss some emerging data visualization strategies which seek to make highly dimensional datasets both accessible and comprehensible. PMID:27306312
Big Data Transforms Discovery-Utilization Therapeutics Continuum.
Waldman, S A; Terzic, A
2016-03-01
Enabling omic technologies adopt a holistic view to produce unprecedented insights into the molecular underpinnings of health and disease, in part, by generating massive high-dimensional biological data. Leveraging these systems-level insights as an engine driving the healthcare evolution is maximized through integration with medical, demographic, and environmental datasets from individuals to populations. Big data analytics has accordingly emerged to add value to the technical aspects of storage, transfer, and analysis required for merging vast arrays of omic-, clinical-, and eco-datasets. In turn, this new field at the interface of biology, medicine, and information science is systematically transforming modern therapeutics across discovery, development, regulation, and utilization. © 2015 ASCPT.
Fast, Exact Bootstrap Principal Component Analysis for p > 1 million
Fisher, Aaron; Caffo, Brian; Schwartz, Brian; Zipunnikov, Vadim
2015-01-01
Many have suggested a bootstrap procedure for estimating the sampling variability of principal component analysis (PCA) results. However, when the number of measurements per subject (p) is much larger than the number of subjects (n), calculating and storing the leading principal components from each bootstrap sample can be computationally infeasible. To address this, we outline methods for fast, exact calculation of bootstrap principal components, eigenvalues, and scores. Our methods leverage the fact that all bootstrap samples occupy the same n-dimensional subspace as the original sample. As a result, all bootstrap principal components are limited to the same n-dimensional subspace and can be efficiently represented by their low dimensional coordinates in that subspace. Several uncertainty metrics can be computed solely based on the bootstrap distribution of these low dimensional coordinates, without calculating or storing the p-dimensional bootstrap components. Fast bootstrap PCA is applied to a dataset of sleep electroencephalogram recordings (p = 900, n = 392), and to a dataset of brain magnetic resonance images (MRIs) (p ≈ 3 million, n = 352). For the MRI dataset, our method allows for standard errors for the first 3 principal components based on 1000 bootstrap samples to be calculated on a standard laptop in 47 minutes, as opposed to approximately 4 days with standard methods. PMID:27616801
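The key computational idea can be sketched as follows: compute the SVD of the centred p-by-n data matrix once; every bootstrap sample of subjects lies in the same n-dimensional column space, so each bootstrap PCA reduces to an SVD of a small n-by-n coordinate matrix that can be mapped back through the fixed basis. The sketch below illustrates this on random data with a simple variance-explained summary; it is not the authors' code.

```python
# Sketch of fast bootstrap PCA: the expensive p x n SVD is done once, and each
# bootstrap PCA becomes an SVD of a small n x n coordinate matrix (mapped back
# through the fixed basis U when full-space components are needed).
import numpy as np

rng = np.random.default_rng(0)
p, n = 100_000, 50
X = rng.normal(size=(p, n))                       # columns are subjects
X = X - X.mean(axis=1, keepdims=True)

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # U: p x n, computed only once
A = s[:, None] * Vt                               # n x n coordinates of the subjects

n_boot, k = 200, 3
boot_scores = np.empty((n_boot, k))               # variance explained by first k PCs
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)              # resample subjects with replacement
    Ab = A[:, idx]
    Ab = Ab - Ab.mean(axis=1, keepdims=True)
    sb = np.linalg.svd(Ab, compute_uv=False)      # n x n problem, cheap
    boot_scores[b] = (sb[:k] ** 2) / (sb ** 2).sum()
    # full-space bootstrap components would be U @ np.linalg.svd(Ab)[0]

print("bootstrap SE of variance explained:", boot_scores.std(axis=0).round(4))
```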
Online 3D Ear Recognition by Combining Global and Local Features.
Liu, Yahui; Zhang, Bob; Lu, Guangming; Zhang, David
2016-01-01
The three-dimensional shape of the ear has been proven to be a stable candidate for biometric authentication because of its desirable properties such as universality, uniqueness, and permanence. In this paper, a special laser scanner designed for online three-dimensional ear acquisition was described. Based on the dataset collected by our scanner, two novel feature classes were defined from a three-dimensional ear image: the global feature class (empty centers and angles) and local feature class (points, lines, and areas). These features are extracted and combined in an optimal way for three-dimensional ear recognition. Using a large dataset consisting of 2,000 samples, the experimental results illustrate the effectiveness of fusing global and local features, obtaining an equal error rate of 2.2%.
Online 3D Ear Recognition by Combining Global and Local Features
Liu, Yahui; Zhang, Bob; Lu, Guangming; Zhang, David
2016-01-01
The three-dimensional shape of the ear has been proven to be a stable candidate for biometric authentication because of its desirable properties such as universality, uniqueness, and permanence. In this paper, a special laser scanner designed for online three-dimensional ear acquisition was described. Based on the dataset collected by our scanner, two novel feature classes were defined from a three-dimensional ear image: the global feature class (empty centers and angles) and local feature class (points, lines, and areas). These features are extracted and combined in an optimal way for three-dimensional ear recognition. Using a large dataset consisting of 2,000 samples, the experimental results illustrate the effectiveness of fusing global and local features, obtaining an equal error rate of 2.2%. PMID:27935955
Improved dense trajectories for action recognition based on random projection and Fisher vectors
NASA Astrophysics Data System (ADS)
Ai, Shihui; Lu, Tongwei; Xiong, Yudian
2018-03-01
As an important application of intelligent monitoring systems, action recognition in video has become a very important research area of computer vision. In order to improve the accuracy of action recognition in video with improved dense trajectories, an advanced encoding method is introduced that combines Fisher Vectors with Random Projection. The method reduces the trajectory descriptors by projecting the high-dimensional trajectory descriptors into a low-dimensional subspace via Random Projection, based on defining and analyzing a Gaussian mixture model. A GMM-FV hybrid model is introduced to encode the trajectory feature vectors and reduce their dimension; the computational complexity is lowered because Random Projection shrinks the Fisher coding vectors. Finally, a linear SVM classifier is used to predict labels. We tested the algorithm on the UCF101 dataset and the KTH dataset. Compared with several existing algorithms, the results show that the method not only reduces the computational complexity but also improves the accuracy of action recognition.
Chaaraoui, Alexandros Andre; Flórez-Revuelta, Francisco
2014-01-01
This paper presents a novel silhouette-based feature for vision-based human action recognition, which relies on the contour of the silhouette and a radial scheme. Its low dimensionality and ease of extraction make it well suited to real-time scenarios. This feature is used in a learning algorithm that, by means of model fusion of multiple camera streams, builds a bag of key poses, which serves as a dictionary of known poses and allows converting the training sequences into sequences of key poses. These are then used to perform action recognition by means of a sequence matching algorithm. Experimentation on three different datasets returns high and stable recognition rates. To the best of our knowledge, this paper presents the highest results so far on the MuHAVi-MAS dataset. Real-time suitability is confirmed, since the method easily performs above video frequency. Therefore, the related requirements that applications such as ambient-assisted living services impose are successfully fulfilled.
Large-scale machine learning and evaluation platform for real-time traffic surveillance
NASA Astrophysics Data System (ADS)
Eichel, Justin A.; Mishra, Akshaya; Miller, Nicholas; Jankovic, Nicholas; Thomas, Mohan A.; Abbott, Tyler; Swanson, Douglas; Keller, Joel
2016-09-01
In traffic engineering, vehicle detectors are trained on limited datasets, resulting in poor accuracy when deployed in real-world surveillance applications. Annotating large-scale high-quality datasets is challenging. Typically, these datasets have limited diversity; they do not reflect the real-world operating environment. There is a need for a large-scale, cloud-based positive and negative mining process and a large-scale learning and evaluation system for the application of automatic traffic measurements and classification. The proposed positive and negative mining process addresses the quality of crowd-sourced ground truth data through machine learning review and human feedback mechanisms. The proposed learning and evaluation system uses a distributed cloud computing framework to handle data-scaling issues associated with large numbers of samples and a high-dimensional feature space. The system is trained using AdaBoost on 1,000,000 Haar-like features extracted from 70,000 annotated video frames. The trained real-time vehicle detector achieves an accuracy of at least 95% half of the time and about 78% 19/20 of the time when tested on ~7,500,000 video frames. At the end of 2016, the dataset is expected to have over 1 billion annotated video frames.
Neural Network Machine Learning and Dimension Reduction for Data Visualization
NASA Technical Reports Server (NTRS)
Liles, Charles A.
2014-01-01
Neural network machine learning in computer science is a continuously developing field of study. Although neural network models have been developed which can accurately predict a numeric value or nominal classification, a general purpose method for constructing neural network architecture has yet to be developed. Computer scientists are often forced to rely on a trial-and-error process of developing and improving accurate neural network models. In many cases, models are constructed from a large number of input parameters. Determining which input parameters have the greatest impact on the prediction of the model is often difficult, especially when the number of input variables is very high. This challenge is often labeled the "curse of dimensionality" in scientific fields. However, techniques exist for reducing the dimensionality of problems to just two dimensions. Once a problem's dimensions have been mapped to two dimensions, it can be easily plotted and understood by humans. The ability to visualize a multi-dimensional dataset can provide a means of identifying which input variables have the highest effect on determining a nominal or numeric output. Identifying these variables can provide a better means of training neural network models; models can be more easily and quickly trained using only input variables which appear to affect the outcome variable. The purpose of this project is to explore varying means of training neural networks and to utilize dimensional reduction for visualizing and understanding complex datasets.
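The abstract does not name a specific dimension-reduction technique; as one common example, a two-dimensional PCA projection can be plotted and colored by the output variable to see which inputs drive it. This is an illustrative sketch on a stand-in dataset, not part of the project described.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine        # stand-in for a many-input-parameter dataset
from sklearn.decomposition import PCA

X, y = load_wine(return_X_y=True)
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize inputs before projecting

coords = PCA(n_components=2).fit_transform(X)  # map the high-dimensional inputs to 2D
plt.scatter(coords[:, 0], coords[:, 1], c=y, cmap="viridis", s=20)
plt.xlabel("PC 1"); plt.ylabel("PC 2")
plt.title("High-dimensional inputs mapped to two dimensions")
plt.show()
```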
SHARE: system design and case studies for statistical health information release
Gardner, James; Xiong, Li; Xiao, Yonghui; Gao, Jingjing; Post, Andrew R; Jiang, Xiaoqian; Ohno-Machado, Lucila
2013-01-01
Objectives We present SHARE, a new system for statistical health information release with differential privacy. We present two case studies that evaluate the software on real medical datasets and demonstrate the feasibility and utility of applying the differential privacy framework on biomedical data. Materials and Methods SHARE releases statistical information in electronic health records with differential privacy, a strong privacy framework for statistical data release. It includes a number of state-of-the-art methods for releasing multidimensional histograms and longitudinal patterns. We performed a variety of experiments on two real datasets, the surveillance, epidemiology and end results (SEER) breast cancer dataset and the Emory electronic medical record (EeMR) dataset, to demonstrate the feasibility and utility of SHARE. Results Experimental results indicate that SHARE can deal with heterogeneous data present in medical data, and that the released statistics are useful. The Kullback–Leibler divergence between the released multidimensional histograms and the original data distribution is below 0.5 and 0.01 for seven-dimensional and three-dimensional data cubes generated from the SEER dataset, respectively. The relative error for longitudinal pattern queries on the EeMR dataset varies between 0 and 0.3. While the results are promising, they also suggest that challenges remain in applying statistical data release using the differential privacy framework for higher dimensional data. Conclusions SHARE is one of the first systems to provide a mechanism for custodians to release differentially private aggregate statistics for a variety of use cases in the medical domain. This proof-of-concept system is intended to be applied to large-scale medical data warehouses. PMID:23059729
Chen, Chien-Chang; Juan, Hung-Hui; Tsai, Meng-Yuan; Lu, Henry Horng-Shing
2018-01-11
By introducing methods of machine learning into density functional theory, we made a detour for the construction of the most probable density function, which can be estimated by learning relevant features from the system of interest. Using the properties of the universal functional, the vital core of density functional theory, the most probable cluster numbers and the corresponding cluster boundaries in a system under study can be determined simultaneously and automatically, and the plausibility of this approach rests on the Hohenberg-Kohn theorems. To validate the method and demonstrate practical applications, interdisciplinary problems ranging from physical to biological systems were examined. The amalgamation of uncharged atomic clusters validated the unsupervised search for the cluster numbers, and the corresponding cluster boundaries were likewise exhibited. Highly accurate clustering results on Fisher's iris dataset showed the feasibility and flexibility of the proposed scheme. Brain tumor detection from low-dimensional magnetic resonance imaging datasets and segmentation of high-dimensional neural network imagery in the Brainbow system were also used to inspect the practicality of the method. The experimental results exhibit a successful connection between the physical theory and machine learning methods and will benefit clinical diagnosis.
Yang, Y X; Teo, S-K; Van Reeth, E; Tan, C H; Tham, I W K; Poh, C L
2015-08-01
Accurate visualization of lung motion is important in many clinical applications, such as radiotherapy of lung cancer. Advancement in imaging modalities [e.g., computed tomography (CT) and MRI] has allowed dynamic imaging of lung and lung tumor motion. However, each imaging modality has its advantages and disadvantages. The study presented in this paper aims at generating synthetic 4D-CT dataset for lung cancer patients by combining both continuous three-dimensional (3D) motion captured by 4D-MRI and the high spatial resolution captured by CT using the authors' proposed approach. A novel hybrid approach based on deformable image registration (DIR) and finite element method simulation was developed to fuse a static 3D-CT volume (acquired under breath-hold) and the 3D motion information extracted from 4D-MRI dataset, creating a synthetic 4D-CT dataset. The study focuses on imaging of lung and lung tumor. Comparing the synthetic 4D-CT dataset with the acquired 4D-CT dataset of six lung cancer patients based on 420 landmarks, accurate results (average error <2 mm) were achieved using the authors' proposed approach. Their hybrid approach achieved a 40% error reduction (based on landmarks assessment) over using only DIR techniques. The synthetic 4D-CT dataset generated has high spatial resolution, has excellent lung details, and is able to show movement of lung and lung tumor over multiple breathing cycles.
Clusternomics: Integrative context-dependent clustering for heterogeneous datasets.
Gabasova, Evelina; Reid, John; Wernisch, Lorenz
2017-10-01
Integrative clustering is used to identify groups of samples by jointly analysing multiple datasets describing the same set of biological samples, such as gene expression, copy number, methylation etc. Most existing algorithms for integrative clustering assume that there is a shared consistent set of clusters across all datasets, and most of the data samples follow this structure. However in practice, the structure across heterogeneous datasets can be more varied, with clusters being joined in some datasets and separated in others. In this paper, we present a probabilistic clustering method to identify groups across datasets that do not share the same cluster structure. The proposed algorithm, Clusternomics, identifies groups of samples that share their global behaviour across heterogeneous datasets. The algorithm models clusters on the level of individual datasets, while also extracting global structure that arises from the local cluster assignments. Clusters on both the local and the global level are modelled using a hierarchical Dirichlet mixture model to identify structure on both levels. We evaluated the model both on simulated and on real-world datasets. The simulated data exemplifies datasets with varying degrees of common structure. In such a setting Clusternomics outperforms existing algorithms for integrative and consensus clustering. In a real-world application, we used the algorithm for cancer subtyping, identifying subtypes of cancer from heterogeneous datasets. We applied the algorithm to TCGA breast cancer dataset, integrating gene expression, miRNA expression, DNA methylation and proteomics. The algorithm extracted clinically meaningful clusters with significantly different survival probabilities. We also evaluated the algorithm on lung and kidney cancer TCGA datasets with high dimensionality, again showing clinically significant results and scalability of the algorithm.
Robust hashing with local models for approximate similarity search.
Song, Jingkuan; Yang, Yi; Li, Xuelong; Huang, Zi; Yang, Yang
2014-07-01
Similarity search plays an important role in many applications involving high-dimensional data. Due to the known dimensionality curse, the performance of most existing indexing structures degrades quickly as the feature dimensionality increases. Hashing methods, such as locality sensitive hashing (LSH) and its variants, have been widely used to achieve fast approximate similarity search by trading search quality for efficiency. However, most existing hashing methods make use of randomized algorithms to generate hash codes without considering the specific structural information in the data. In this paper, we propose a novel hashing method, namely, robust hashing with local models (RHLM), which learns a set of robust hash functions to map the high-dimensional data points into binary hash codes by effectively utilizing local structural information. In RHLM, for each individual data point in the training dataset, a local hashing model is learned and used to predict the hash codes of its neighboring data points. The local models from all the data points are globally aligned so that an optimal hash code can be assigned to each data point. After obtaining the hash codes of all the training data points, we design a robust method by employing l2,1 -norm minimization on the loss function to learn effective hash functions, which are then used to map each database point into its hash code. Given a query data point, the search process first maps it into the query hash code by the hash functions and then explores the buckets, which have similar hash codes to the query hash code. Extensive experimental results conducted on real-life datasets show that the proposed RHLM outperforms the state-of-the-art methods in terms of search quality and efficiency.
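For context, the randomized hashing baseline that RHLM is contrasted with (sign random projections, a common LSH scheme for cosine similarity) can be sketched as follows. This is a generic illustration under assumed parameters, not the RHLM method itself.

```python
import numpy as np

class SignRandomProjectionLSH:
    """Generic LSH baseline: hash points by the signs of random projections."""

    def __init__(self, dim, n_bits=32, seed=0):
        rng = np.random.default_rng(seed)
        self.planes = rng.standard_normal((dim, n_bits))   # random hyperplanes

    def hash(self, X):
        return (X @ self.planes > 0).astype(np.uint8)      # (n, n_bits) binary codes

    def query(self, codes_db, code_q, k=10):
        # rank database points by Hamming distance between binary codes
        dist = (codes_db != code_q).sum(axis=1)
        return np.argsort(dist)[:k]

# usage sketch with placeholder data
X_db = np.random.rand(10000, 128)                 # high-dimensional database points
x_q = np.random.rand(128)                         # query point
lsh = SignRandomProjectionLSH(dim=128, n_bits=64)
codes = lsh.hash(X_db)
neighbors = lsh.query(codes, lsh.hash(x_q[None, :])[0], k=10)
```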
Belciug, Smaranda; Gorunescu, Florin
2018-06-08
Methods based on microarrays (MA), mass spectrometry (MS), and machine learning (ML) algorithms have evolved rapidly in recent years, allowing for early detection of several types of cancer. A pitfall of these approaches, however, is the overfitting of data due to the large number of attributes and small number of instances, a phenomenon known as the 'curse of dimensionality'. A potentially fruitful idea to avoid this drawback is to develop algorithms that combine fast computation with a filtering module for the attributes. The goal of this paper is to propose a statistical strategy to initiate the hidden nodes of a single-hidden layer feedforward neural network (SLFN) by using both the knowledge embedded in the data and a filtering mechanism for attribute relevance. To demonstrate its feasibility, the proposed model has been tested on five publicly available high-dimensional datasets: breast, lung, colon, and ovarian cancer regarding gene expression and proteomic spectra provided by cDNA arrays, DNA microarray, and MS. The novel algorithm, called adaptive SLFN (aSLFN), has been compared with four major classification algorithms: traditional ELM, radial basis function network (RBF), single-hidden layer feedforward neural network trained by the backpropagation algorithm (BP-SLFN), and support vector machine (SVM). Experimental results showed that the classification performance of aSLFN is competitive with the comparison models. Copyright © 2018. Published by Elsevier Inc.
Functional CAR models for large spatially correlated functional datasets.
Zhang, Lin; Baladandayuthapani, Veerabhadran; Zhu, Hongxiao; Baggerly, Keith A; Majewski, Tadeusz; Czerniak, Bogdan A; Morris, Jeffrey S
2016-01-01
We develop a functional conditional autoregressive (CAR) model for spatially correlated data for which functions are collected on areal units of a lattice. Our model performs functional response regression while accounting for spatial correlations with potentially nonseparable and nonstationary covariance structure, in both the space and functional domains. We show theoretically that our construction leads to a CAR model at each functional location, with spatial covariance parameters varying and borrowing strength across the functional domain. Using basis transformation strategies, the nonseparable spatial-functional model is computationally scalable to enormous functional datasets, generalizable to different basis functions, and can be used on functions defined on higher dimensional domains such as images. Through simulation studies, we demonstrate that accounting for the spatial correlation in our modeling leads to improved functional regression performance. Applied to a high-throughput spatially correlated copy number dataset, the model identifies genetic markers not identified by comparable methods that ignore spatial correlations.
3D Feature Extraction for Unstructured Grids
NASA Technical Reports Server (NTRS)
Silver, Deborah
1996-01-01
Visualization techniques provide tools that help scientists identify observed phenomena in scientific simulation. To be useful, these tools must allow the user to extract regions, classify and visualize them, abstract them for simplified representations, and track their evolution. Object Segmentation provides a technique to extract and quantify regions of interest within these massive datasets. This article explores basic algorithms to extract coherent amorphous regions from two-dimensional and three-dimensional scalar unstructured grids. The techniques are applied to datasets from Computational Fluid Dynamics and those from Finite Element Analysis.
Gene selection for microarray data classification via subspace learning and manifold regularization.
Tang, Chang; Cao, Lijuan; Zheng, Xiao; Wang, Minhui
2017-12-19
With the rapid development of DNA microarray technology, a large amount of genomic data has been generated. Classification of these microarray data is a challenging task since gene expression data often contain thousands of genes but only a small number of samples. In this paper, an effective gene selection method is proposed to select the best subset of genes for microarray data, with irrelevant and redundant genes removed. Compared with the original data, the selected gene subset can benefit the classification task. We formulate the gene selection task as a manifold regularized subspace learning problem. In detail, a projection matrix is used to project the original high dimensional microarray data into a lower dimensional subspace, with the constraint that the original genes can be well represented by the selected genes. Meanwhile, the local manifold structure of the original data is preserved by a Laplacian graph regularization term on the low-dimensional data space. The projection matrix can serve as an importance indicator of different genes. An iterative update algorithm is developed for solving the problem. Experimental results on six publicly available microarray datasets and one clinical dataset demonstrate that the proposed method performs better when compared with other state-of-the-art methods in terms of microarray data classification.
Bromuri, Stefano; Zufferey, Damien; Hennebert, Jean; Schumacher, Michael
2014-10-01
This research is motivated by the issue of classifying illnesses of chronically ill patients for decision support in clinical settings. Our main objective is to propose multi-label classification of multivariate time series contained in medical records of chronically ill patients, by means of quantization methods, such as bag of words (BoW), and multi-label classification algorithms. Our second objective is to compare supervised dimensionality reduction techniques to state-of-the-art multi-label classification algorithms. The hypothesis is that kernel methods and locality preserving projections make such algorithms good candidates to study multi-label medical time series. We combine BoW and supervised dimensionality reduction algorithms to perform multi-label classification on health records of chronically ill patients. The considered algorithms are compared with state-of-the-art multi-label classifiers in two real world datasets. Portavita dataset contains 525 diabetes type 2 (DT2) patients, with co-morbidities of DT2 such as hypertension, dyslipidemia, and microvascular or macrovascular issues. MIMIC II dataset contains 2635 patients affected by thyroid disease, diabetes mellitus, lipoid metabolism disease, fluid electrolyte disease, hypertensive disease, thrombosis, hypotension, chronic obstructive pulmonary disease (COPD), liver disease and kidney disease. The algorithms are evaluated using multi-label evaluation metrics such as hamming loss, one error, coverage, ranking loss, and average precision. Non-linear dimensionality reduction approaches behave well on medical time series quantized using the BoW algorithm, with results comparable to state-of-the-art multi-label classification algorithms. Chaining the projected features has a positive impact on the performance of the algorithm with respect to pure binary relevance approaches. The evaluation highlights the feasibility of representing medical health records using the BoW for multi-label classification tasks. The study also highlights that dimensionality reduction algorithms based on kernel methods, locality preserving projections or both are good candidates to deal with multi-label classification tasks in medical time series with many missing values and high label density. Copyright © 2014 Elsevier Inc. All rights reserved.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Van Benthem, Mark H.
2016-05-04
This software is employed for 3D visualization of X-ray diffraction (XRD) data, with functionality for slicing, reorienting, isolating and plotting of 2D color contour maps and 3D renderings of large datasets. The program makes use of the multidimensionality of textured XRD data, where diffracted intensity is not constant over a given set of angular positions (as dictated by the three defined dimensional angles of phi, chi, and two-theta). Datasets are rendered in 3D with intensity as a scalar, represented by a rainbow color scale. A GUI interface and scrolling tools, along with interactive functions via the mouse, allow for fast manipulation of these large datasets so as to perform detailed analysis of diffraction results with full dimensionality of the diffraction space.
Gregoretti, Francesco; Cesarini, Elisa; Lanzuolo, Chiara; Oliva, Gennaro; Antonelli, Laura
2016-01-01
The large amount of data generated in biological experiments that rely on advanced microscopy can be handled only with automated image analysis. Most analyses require a reliable cell image segmentation eventually capable of detecting subcellular structures. We present an automatic segmentation method to detect Polycomb group (PcG) protein areas isolated from nuclei regions in high-resolution fluorescent cell image stacks. It combines two segmentation algorithms that use an active contour model and a classification technique, serving as a tool to better understand the subcellular three-dimensional distribution of PcG proteins in live cell image sequences. We obtained accurate results throughout several cell image datasets, coming from different cell types and corresponding to different fluorescent labels, without requiring elaborate adjustments to each dataset.
Igloo-Plot: a tool for visualization of multidimensional datasets.
Kuntal, Bhusan K; Ghosh, Tarini Shankar; Mande, Sharmila S
2014-01-01
Advances in science and technology have resulted in an exponential growth of multivariate (or multi-dimensional) datasets which are being generated from various research areas especially in the domain of biological sciences. Visualization and analysis of such data (with the objective of uncovering the hidden patterns therein) is an important and challenging task. We present a tool, called Igloo-Plot, for efficient visualization of multidimensional datasets. The tool addresses some of the key limitations of contemporary multivariate visualization and analysis tools. The visualization layout, not only facilitates an easy identification of clusters of data-points having similar feature compositions, but also the 'marker features' specific to each of these clusters. The applicability of the various functionalities implemented herein is demonstrated using several well studied multi-dimensional datasets. Igloo-Plot is expected to be a valuable resource for researchers working in multivariate data mining studies. Igloo-Plot is available for download from: http://metagenomics.atc.tcs.com/IglooPlot/. Copyright © 2014 Elsevier Inc. All rights reserved.
Glover, Jason; Man, Tsz-Kwong; Barkauskas, Donald A; Hall, David; Tello, Tanya; Sullivan, Mary Beth; Gorlick, Richard; Janeway, Katherine; Grier, Holcombe; Lau, Ching; Toretsky, Jeffrey A; Borinstein, Scott C; Khanna, Chand; Fan, Timothy M
2017-01-01
The prospective banking of osteosarcoma tissue samples to promote research endeavors has been realized through the establishment of a nationally centralized biospecimen repository, the Children's Oncology Group (COG) biospecimen bank located at the Biopathology Center (BPC)/Nationwide Children's Hospital in Columbus, Ohio. Although the physical inventory of osteosarcoma biospecimens is substantive (>15,000 sample specimens), the nature of these resources remains exhaustible. Despite judicious allocation of these high-value biospecimens for conducting sarcoma-related research, a deeper understanding of osteosarcoma biology, in particular metastases, remains unrealized. In addition, the identification and development of novel diagnostics and effective therapeutics remain elusive. The QuadW-COG Childhood Sarcoma Biostatistics and Annotation Office (CSBAO) has developed the High Dimensional Data (HDD) platform to complement the existing physical inventory and to promote in silico hypothesis testing in sarcoma biology. The HDD is a relational biologic database derived from matched osteosarcoma biospecimens in which diverse experimental readouts have been generated and digitally deposited. As proof of concept, we demonstrate that the HDD platform can be utilized to address previously unrealized biologic questions through the systematic juxtaposition of diverse datasets derived from shared biospecimens. The continued population of the HDD platform with high-value, high-throughput and mineable datasets provides a shared and reusable resource for researchers, both experimentalists and bioinformatics investigators, to propose and answer questions in silico that advance our understanding of osteosarcoma biology.
Saini, Harsh; Lal, Sunil Pranit; Naidu, Vimal Vikash; Pickering, Vincel Wince; Singh, Gurmeet; Tsunoda, Tatsuhiko; Sharma, Alok
2016-12-05
High dimensional feature space generally degrades classification in several applications. In this paper, we propose a strategy called gene masking, in which non-contributing dimensions are heuristically removed from the data to improve classification accuracy. Gene masking is implemented via a binary encoded genetic algorithm that can be integrated seamlessly with classifiers during the training phase of classification to perform feature selection. It can also be used to discriminate between features that contribute most to the classification, thereby, allowing researchers to isolate features that may have special significance. This technique was applied on publicly available datasets whereby it substantially reduced the number of features used for classification while maintaining high accuracies. The proposed technique can be extremely useful in feature selection as it heuristically removes non-contributing features to improve the performance of classifiers.
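The gene-masking idea described above (a binary chromosome switches features on or off, and a classifier's validation accuracy serves as the fitness) can be sketched with a small genetic algorithm. This is a minimal, generic sketch under assumed settings (classifier, population size, mutation rate), not the authors' implementation.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def gene_masking(X, y, pop=20, gens=30, mut=0.01, seed=0):
    """Evolve a binary mask over features; fitness = cross-validated accuracy."""
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    masks = rng.integers(0, 2, size=(pop, n_feat)).astype(bool)

    def fitness(mask):
        if not mask.any():
            return 0.0
        return cross_val_score(LinearSVC(dual=False), X[:, mask], y, cv=3).mean()

    for _ in range(gens):
        scores = np.array([fitness(m) for m in masks])
        order = np.argsort(scores)[::-1]
        parents = masks[order[: pop // 2]]             # keep the better half
        children = []
        for _ in range(pop - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_feat)              # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_feat) < mut            # mutation: flip a few bits
            children.append(child ^ flip)
        masks = np.vstack([parents, np.array(children)])
    scores = np.array([fitness(m) for m in masks])
    return masks[int(np.argmax(scores))]               # best feature mask found

# usage sketch: best = gene_masking(X_train, y_train); clf.fit(X_train[:, best], y_train)
```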
Protein sequence comparison based on K-string dictionary.
Yu, Chenglong; He, Rong L; Yau, Stephen S-T
2013-10-25
The current K-string-based protein sequence comparisons require large amounts of computer memory because the dimension of the protein vector representation grows exponentially with K. In this paper, we propose a novel concept, the "K-string dictionary", to solve this high-dimensional problem. It allows us to use a much lower dimensional K-string-based frequency or probability vector to represent a protein, and thus significantly reduce the computer memory requirements for their implementation. Furthermore, based on this new concept, we use Singular Value Decomposition to analyze real protein datasets, and the improved protein vector representation allows us to obtain accurate gene trees. © 2013.
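A rough sketch of the K-string dictionary idea: index only the K-strings actually observed in the dataset (rather than all 20**K possibilities), build frequency vectors over that dictionary, and reduce them further with SVD. The function names, the choice of K, and the toy sequences are assumptions, not the authors' code.

```python
import numpy as np
from collections import Counter

def kstring_dictionary(sequences, K=3):
    """Collect only the K-strings that occur, instead of all 20**K possibilities."""
    dictionary = sorted({s[i:i + K] for s in sequences for i in range(len(s) - K + 1)})
    index = {kmer: j for j, kmer in enumerate(dictionary)}
    return dictionary, index

def frequency_vectors(sequences, index, K=3):
    V = np.zeros((len(sequences), len(index)))
    for i, seq in enumerate(sequences):
        counts = Counter(seq[j:j + K] for j in range(len(seq) - K + 1))
        for kmer, c in counts.items():
            V[i, index[kmer]] = c
        V[i] /= max(V[i].sum(), 1)                     # frequencies, not raw counts
    return V

seqs = ["MKTAYIAKQR", "MKTAYLAKQR", "GAVLIMCFYW"]      # toy protein sequences
dictionary, index = kstring_dictionary(seqs, K=3)
V = frequency_vectors(seqs, index, K=3)
U, s, Vt = np.linalg.svd(V - V.mean(axis=0), full_matrices=False)
reduced = U[:, :2] * s[:2]                             # compact low-dimensional representation
# pairwise distances on `reduced` could then feed a tree-building method
```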
Ding, Jiarui; Condon, Anne; Shah, Sohrab P
2018-05-21
Single-cell RNA-sequencing has great potential to discover cell types, identify cell states, trace development lineages, and reconstruct the spatial organization of cells. However, dimension reduction to interpret structure in single-cell sequencing data remains a challenge. Existing algorithms are either not able to uncover the clustering structures in the data or lose global information such as groups of clusters that are close to each other. We present a robust statistical model, scvis, to capture and visualize the low-dimensional structures in single-cell gene expression data. Simulation results demonstrate that low-dimensional representations learned by scvis preserve both the local and global neighbor structures in the data. In addition, scvis is robust to the number of data points and learns a probabilistic parametric mapping function to add new data points to an existing embedding. We then use scvis to analyze four single-cell RNA-sequencing datasets, exemplifying interpretable two-dimensional representations of the high-dimensional single-cell RNA-sequencing data.
Sonnaert, Maarten; Kerckhofs, Greet; Papantoniou, Ioannis; Van Vlierberghe, Sandra; Boterberg, Veerle; Dubruel, Peter; Luyten, Frank P; Schrooten, Jan; Geris, Liesbet
2015-01-01
To progress the fields of tissue engineering (TE) and regenerative medicine, development of quantitative methods for non-invasive three dimensional characterization of engineered constructs (i.e. cells/tissue combined with scaffolds) becomes essential. In this study, we have defined the most optimal staining conditions for contrast-enhanced nanofocus computed tomography for three dimensional visualization and quantitative analysis of in vitro engineered neo-tissue (i.e. extracellular matrix containing cells) in perfusion bioreactor-developed Ti6Al4V constructs. A fractional factorial 'design of experiments' approach was used to elucidate the influence of the staining time and concentration of two contrast agents (Hexabrix and phosphotungstic acid) and the neo-tissue volume on the image contrast and dataset quality. Additionally, the neo-tissue shrinkage that was induced by phosphotungstic acid staining was quantified to determine the operating window within which this contrast agent can be accurately applied. For Hexabrix the staining concentration was the main parameter influencing image contrast and dataset quality. Using phosphotungstic acid the staining concentration had a significant influence on the image contrast while both staining concentration and neo-tissue volume had an influence on the dataset quality. The use of high concentrations of phosphotungstic acid did however introduce significant shrinkage of the neo-tissue indicating that, despite sub-optimal image contrast, low concentrations of this staining agent should be used to enable quantitative analysis. To conclude, design of experiments allowed us to define the most optimal staining conditions for contrast-enhanced nanofocus computed tomography to be used as a routine screening tool of neo-tissue formation in Ti6Al4V constructs, transforming it into a robust three dimensional quality control methodology.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yang, Y. X.; Van Reeth, E.; Poh, C. L., E-mail: clpoh@ntu.edu.sg
2015-08-15
Purpose: Accurate visualization of lung motion is important in many clinical applications, such as radiotherapy of lung cancer. Advancement in imaging modalities [e.g., computed tomography (CT) and MRI] has allowed dynamic imaging of lung and lung tumor motion. However, each imaging modality has its advantages and disadvantages. The study presented in this paper aims at generating synthetic 4D-CT dataset for lung cancer patients by combining both continuous three-dimensional (3D) motion captured by 4D-MRI and the high spatial resolution captured by CT using the authors' proposed approach. Methods: A novel hybrid approach based on deformable image registration (DIR) and finite element method simulation was developed to fuse a static 3D-CT volume (acquired under breath-hold) and the 3D motion information extracted from 4D-MRI dataset, creating a synthetic 4D-CT dataset. Results: The study focuses on imaging of lung and lung tumor. Comparing the synthetic 4D-CT dataset with the acquired 4D-CT dataset of six lung cancer patients based on 420 landmarks, accurate results (average error <2 mm) were achieved using the authors' proposed approach. Their hybrid approach achieved a 40% error reduction (based on landmarks assessment) over using only DIR techniques. Conclusions: The synthetic 4D-CT dataset generated has high spatial resolution, has excellent lung details, and is able to show movement of lung and lung tumor over multiple breathing cycles.
Atlas-guided cluster analysis of large tractography datasets.
Ros, Christian; Güllmar, Daniel; Stenzel, Martin; Mentzel, Hans-Joachim; Reichenbach, Jürgen Rainer
2013-01-01
Diffusion Tensor Imaging (DTI) and fiber tractography are important tools to map the cerebral white matter microstructure in vivo and to model the underlying axonal pathways in the brain with three-dimensional fiber tracts. As the fast and consistent extraction of anatomically correct fiber bundles for multiple datasets is still challenging, we present a novel atlas-guided clustering framework for exploratory data analysis of large tractography datasets. The framework uses a hierarchical cluster analysis approach that exploits the inherent redundancy in large datasets to time-efficiently group fiber tracts. Structural information of a white matter atlas can be incorporated into the clustering to achieve an anatomically correct and reproducible grouping of fiber tracts. This approach facilitates not only the identification of the bundles corresponding to the classes of the atlas; it also enables the extraction of bundles that are not present in the atlas. The new technique was applied to cluster datasets of 46 healthy subjects. Prospects of automatic and anatomically correct as well as reproducible clustering are explored. Reconstructed clusters were well separated and showed good correspondence to anatomical bundles. Using the atlas-guided cluster approach, we observed consistent results across subjects with high reproducibility. In order to investigate the outlier elimination performance of the clustering algorithm, scenarios with varying amounts of noise were simulated and clustered with three different outlier elimination strategies. By exploiting the multithreading capabilities of modern multiprocessor systems in combination with novel algorithms, our toolkit clusters large datasets in a couple of minutes. Experiments were conducted to investigate the achievable speedup and to demonstrate the high performance of the clustering framework in a multiprocessing environment.
Weitz, Jochen; Deppe, Herbert; Stopp, Sebastian; Lueth, Tim; Mueller, Steffen; Hohlweg-Majert, Bettina
2011-12-01
The aim of this study is to evaluate the accuracy of surgical template-aided implant placement produced by rapid prototyping using a DICOM dataset from cone beam computed tomography (CBCT). On the basis of CBCT scans (Sirona® Galileos), a total of ten models were produced using a rapid-prototyping three-dimensional printer. On the same patients, impressions were performed to compare the fitting accuracy of both methods. From the models made by impression, templates were produced, and their accuracy was compared with and analyzed against the rapid-prototyping models. Whereas templates made by the conventional procedure had excellent accuracy, the fitting accuracy of those produced from DICOM datasets was not sufficient. Deviations ranged between 2.0 and 3.5 mm, and after modification of the models between 1.4 and 3.1 mm. The findings of this study suggest that the accuracy of the low-dose Sirona Galileos® DICOM dataset shows a high deviation, which is not usable for accurate surgical transfer, for example in implant surgery.
Krüger, Angela V; Jelier, Rob; Dzyubachyk, Oleh; Zimmerman, Timo; Meijering, Erik; Lehner, Ben
2015-02-15
Chromatin regulators are widely expressed proteins with diverse roles in gene expression, nuclear organization, cell cycle regulation, pluripotency, physiology and development, and are frequently mutated in human diseases such as cancer. Their inhibition often results in pleiotropic effects that are difficult to study using conventional approaches. We have developed a semi-automated nuclear tracking algorithm to quantify the divisions, movements and positions of all nuclei during the early development of Caenorhabditis elegans and have used it to systematically study the effects of inhibiting chromatin regulators. The resulting high dimensional datasets revealed that inhibition of multiple regulators, including F55A3.3 (encoding FACT subunit SUPT16H), lin-53 (RBBP4/7), rba-1 (RBBP4/7), set-16 (MLL2/3), hda-1 (HDAC1/2), swsn-7 (ARID2), and let-526 (ARID1A/1B) affected cell cycle progression and caused chromosome segregation defects. In contrast, inhibition of cir-1 (CIR1) accelerated cell division timing in specific cells of the AB lineage. The inhibition of RNA polymerase II also accelerated these division timings, suggesting that normal gene expression is required to delay cell cycle progression in multiple lineages in the early embryo. Quantitative analyses of the dataset suggested the existence of at least two functionally distinct SWI/SNF chromatin remodeling complex activities in the early embryo, and identified a redundant requirement for the egl-27 and lin-40 MTA orthologs in the development of endoderm and mesoderm lineages. Moreover, our dataset also revealed a characteristic rearrangement of chromatin to the nuclear periphery upon the inhibition of multiple general regulators of gene expression. Our systematic, comprehensive and quantitative datasets illustrate the power of single cell-resolution quantitative tracking and high dimensional phenotyping to investigate gene function. Furthermore, the results provide an overview of the functions of essential chromatin regulators during the early development of an animal. Copyright © 2014 Elsevier Inc. All rights reserved.
JOINT AND INDIVIDUAL VARIATION EXPLAINED (JIVE) FOR INTEGRATED ANALYSIS OF MULTIPLE DATA TYPES.
Lock, Eric F; Hoadley, Katherine A; Marron, J S; Nobel, Andrew B
2013-03-01
Research in several fields now requires the analysis of datasets in which multiple high-dimensional types of data are available for a common set of objects. In particular, The Cancer Genome Atlas (TCGA) includes data from several diverse genomic technologies on the same cancerous tumor samples. In this paper we introduce Joint and Individual Variation Explained (JIVE), a general decomposition of variation for the integrated analysis of such datasets. The decomposition consists of three terms: a low-rank approximation capturing joint variation across data types, low-rank approximations for structured variation individual to each data type, and residual noise. JIVE quantifies the amount of joint variation between data types, reduces the dimensionality of the data, and provides new directions for the visual exploration of joint and individual structure. The proposed method represents an extension of Principal Component Analysis and has clear advantages over popular two-block methods such as Canonical Correlation Analysis and Partial Least Squares. A JIVE analysis of gene expression and miRNA data on Glioblastoma Multiforme tumor samples reveals gene-miRNA associations and provides better characterization of tumor types.
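As a compact restatement of the decomposition described in this abstract (not the full JIVE estimation procedure), each data type is modeled as joint variation plus individual variation plus noise:

```latex
% Hedged summary of the JIVE model for k data types X_1,...,X_k on a common set of samples
X_i = J_i + A_i + E_i, \qquad i = 1,\dots,k,
% J_i: low-rank joint variation shared across data types
% A_i: low-rank variation individual to data type i
% E_i: residual noise
```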
Cai, Jia; Tang, Yi
2018-02-01
Canonical correlation analysis (CCA) is a powerful statistical tool for detecting the linear relationship between two sets of multivariate variables. Its kernel generalization, kernel CCA, has been proposed to describe nonlinear relationships between the two sets of variables. Although kernel CCA can achieve dimensionality reduction for high-dimensional feature selection problems, it can also suffer from over-fitting. In this paper, we consider a new kernel CCA algorithm via the randomized Kaczmarz method. The main contributions of the paper are: (1) a new kernel CCA algorithm is developed, (2) theoretical convergence of the proposed algorithm is addressed by means of the scaled condition number, (3) a lower bound on the minimum number of iterations is presented. We test on both a synthetic dataset and several real-world datasets in cross-language document retrieval and content-based image retrieval to demonstrate the effectiveness of the proposed algorithm. Numerical results imply the performance and efficiency of the new algorithm, which is competitive with several state-of-the-art kernel CCA methods. Copyright © 2017 Elsevier Ltd. All rights reserved.
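For orientation only, standard linear CCA (the method being generalized here) can be run in a few lines with scikit-learn. This sketch is not the authors' randomized Kaczmarz kernel CCA; the synthetic two-view data and the component count are assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
latent = rng.standard_normal((200, 2))                  # shared structure between two views
X = latent @ rng.standard_normal((2, 20)) + 0.1 * rng.standard_normal((200, 20))
Y = latent @ rng.standard_normal((2, 30)) + 0.1 * rng.standard_normal((200, 30))

cca = CCA(n_components=2).fit(X, Y)
X_c, Y_c = cca.transform(X, Y)
# canonical correlations: correlation between paired projected components
corrs = [np.corrcoef(X_c[:, i], Y_c[:, i])[0, 1] for i in range(2)]
print(corrs)   # close to 1 for the two shared latent directions
```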
Dazard, Jean-Eudes; Rao, J. Sunil
2010-01-01
The search for structures in real datasets e.g. in the form of bumps, components, classes or clusters is important as these often reveal underlying phenomena leading to scientific discoveries. One of these tasks, known as bump hunting, is to locate domains of a multidimensional input space where the target function assumes local maxima without pre-specifying their total number. A number of related methods already exist, yet are challenged in the context of high dimensional data. We introduce a novel supervised and multivariate bump hunting strategy for exploring modes or classes of a target function of many continuous variables. This addresses the issues of correlation, interpretability, and high-dimensionality (p ≫ n case), while making minimal assumptions. The method is based upon a divide and conquer strategy, combining a tree-based method, a dimension reduction technique, and the Patient Rule Induction Method (PRIM). Important to this task, we show how to estimate the PRIM meta-parameters. Using accuracy evaluation procedures such as cross-validation and ROC analysis, we show empirically how the method outperforms a naive PRIM as well as competitive non-parametric supervised and unsupervised methods in the problem of class discovery. The method has practical application especially in the case of noisy high-throughput data. It is applied to a class discovery problem in a colon cancer micro-array dataset aimed at identifying tumor subtypes in the metastatic stage. Supplemental Materials are available online. PMID:22399839
BioSig3D: High Content Screening of Three-Dimensional Cell Culture Models
Bilgin, Cemal Cagatay; Fontenay, Gerald; Cheng, Qingsu; Chang, Hang; Han, Ju; Parvin, Bahram
2016-01-01
BioSig3D is a computational platform for high-content screening of three-dimensional (3D) cell culture models that are imaged in full 3D volume. It provides an end-to-end solution for designing high content screening assays, based on colony organization that is derived from segmentation of nuclei in each colony. BioSig3D also enables visualization of raw and processed 3D volumetric data for quality control, and integrates advanced bioinformatics analysis. The system consists of multiple computational and annotation modules that are coupled together with a strong use of controlled vocabularies to reduce ambiguities between different users. It is a web-based system that allows users to: design an experiment by defining experimental variables, upload a large set of volumetric images into the system, analyze and visualize the dataset, and either display computed indices as a heatmap, or phenotypic subtypes for heterogeneity analysis, or download computed indices for statistical analysis or integrative biology. BioSig3D has been used to profile baseline colony formations with two experiments: (i) morphogenesis of a panel of human mammary epithelial cell lines (HMEC), and (ii) heterogeneity in colony formation using an immortalized non-transformed cell line. These experiments reveal intrinsic growth properties of well-characterized cell lines that are routinely used for biological studies. BioSig3D is being released with seed datasets and video-based documentation. PMID:26978075
CyTOF workflow: differential discovery in high-throughput high-dimensional cytometry datasets
Nowicka, Malgorzata; Krieg, Carsten; Weber, Lukas M.; Hartmann, Felix J.; Guglietta, Silvia; Becher, Burkhard; Levesque, Mitchell P.; Robinson, Mark D.
2017-01-01
High dimensional mass and flow cytometry (HDCyto) experiments have become a method of choice for high throughput interrogation and characterization of cell populations. Here, we present an R-based pipeline for differential analyses of HDCyto data, largely based on Bioconductor packages. We computationally define cell populations using FlowSOM clustering, and facilitate an optional but reproducible strategy for manual merging of algorithm-generated clusters. Our workflow offers different analysis paths, including association of cell type abundance with a phenotype or changes in signaling markers within specific subpopulations, or differential analyses of aggregated signals. Importantly, the differential analyses we show are based on regression frameworks where the HDCyto data is the response; thus, we are able to model arbitrary experimental designs, such as those with batch effects, paired designs and so on. In particular, we apply generalized linear mixed models to analyses of cell population abundance or cell-population-specific analyses of signaling markers, allowing overdispersion in cell count or aggregated signals across samples to be appropriately modeled. To support the formal statistical analyses, we encourage exploratory data analysis at every step, including quality control (e.g. multi-dimensional scaling plots), reporting of clustering results (dimensionality reduction, heatmaps with dendrograms) and differential analyses (e.g. plots of aggregated signals). PMID:28663787
Systems and methods for displaying data in split dimension levels
Stolte, Chris; Hanrahan, Patrick
2015-07-28
Systems and methods for displaying data in split dimension levels are disclosed. In some implementations, a method includes: at a computer, obtaining a dimensional hierarchy associated with a dataset, wherein the dimensional hierarchy includes at least one dimension and a sub-dimension of the at least one dimension; and populating information representing data included in the dataset into a visual table having a first axis and a second axis, wherein the first axis corresponds to the at least one dimension and the second axis corresponds to the sub-dimension of the at least one dimension.
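The idea of laying out a dimension and its sub-dimension along one axis of a visual table resembles a hierarchical pivot table; a rough pandas analogue is shown below. This is only an illustrative analogy with made-up column names, not the patented system.

```python
import pandas as pd

# toy dataset with a dimension ("region") and its sub-dimension ("city")
sales = pd.DataFrame({
    "region": ["East", "East", "East", "West", "West"],
    "city":   ["NYC", "NYC", "Boston", "LA", "Seattle"],
    "year":   [2014, 2015, 2015, 2014, 2015],
    "amount": [120, 135, 80, 95, 150],
})

# first axis: the dimension split into its sub-dimension levels; second axis: another field
table = sales.pivot_table(index=["region", "city"], columns="year",
                          values="amount", aggfunc="sum")
print(table)
```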
GATE: software for the analysis and visualization of high-dimensional time series expression data.
MacArthur, Ben D; Lachmann, Alexander; Lemischka, Ihor R; Ma'ayan, Avi
2010-01-01
We present Grid Analysis of Time series Expression (GATE), an integrated computational software platform for the analysis and visualization of high-dimensional biomolecular time series. GATE uses a correlation-based clustering algorithm to arrange molecular time series on a two-dimensional hexagonal array and dynamically colors individual hexagons according to the expression level of the molecular component to which they are assigned, to create animated movies of systems-level molecular regulatory dynamics. In order to infer potential regulatory control mechanisms from patterns of correlation, GATE also allows interactive interrogation of movies against a wide variety of prior knowledge datasets. GATE movies can be paused and are interactive, allowing users to reconstruct networks and perform functional enrichment analyses. Movies created with GATE can be saved in Flash format and can be inserted directly into PDF manuscript files as interactive figures. GATE is available for download and is free for academic use from http://amp.pharm.mssm.edu/maayan-lab/gate.htm
NASA Astrophysics Data System (ADS)
Shahiri, Amirah Mohamed; Husain, Wahidah; Rashid, Nur'Aini Abd
2017-10-01
Huge amounts of data in educational datasets can make it difficult to produce quality data. Recently, data mining approaches have been increasingly used by educational data mining researchers to analyze data patterns. However, many research studies have concentrated on selecting suitable learning algorithms instead of performing a feature selection process. As a result, these data suffer from computational complexity and require longer computation time for classification. The main objective of this research is to provide an overview of feature selection techniques that have been used to analyze the most significant features. This research then proposes a framework to improve the quality of the students' dataset. The proposed framework uses filter- and wrapper-based techniques to support the prediction process in a future study.
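A filter step followed by a wrapper step, as outlined above, can be prototyped with scikit-learn. The estimator, the feature counts, and the synthetic dataset below are placeholders, not the proposed framework itself.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# synthetic stand-in for a student dataset with many attributes
X, y = make_classification(n_samples=500, n_features=60, n_informative=8, random_state=0)

pipe = Pipeline([
    ("filter", SelectKBest(f_classif, k=30)),               # filter: univariate relevance scores
    ("wrapper", RFE(LogisticRegression(max_iter=1000),       # wrapper: recursive feature elimination
                    n_features_to_select=8)),
    ("clf", LogisticRegression(max_iter=1000)),
])
print(cross_val_score(pipe, X, y, cv=5).mean())              # accuracy with the selected features
```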
Embedded sparse representation of fMRI data via group-wise dictionary optimization
NASA Astrophysics Data System (ADS)
Zhu, Dajiang; Lin, Binbin; Faskowitz, Joshua; Ye, Jieping; Thompson, Paul M.
2016-03-01
Sparse learning enables dimension reduction and efficient modeling of high dimensional signals and images, but it may need to be tailored to best suit specific applications and datasets. Here we used sparse learning to efficiently represent functional magnetic resonance imaging (fMRI) data from the human brain. We propose a novel embedded sparse representation (ESR), to identify the most consistent dictionary atoms across different brain datasets via an iterative group-wise dictionary optimization procedure. In this framework, we introduced additional criteria to make the learned dictionary atoms more consistent across different subjects. We successfully identified four common dictionary atoms that follow the external task stimuli with very high accuracy. After projecting the corresponding coefficient vectors back into the 3-D brain volume space, the spatial patterns are also consistent with traditional fMRI analysis results. Our framework reveals common features of brain activation in a population, as a new, efficient fMRI analysis method.
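As background for the sparse-coding step described above (not the group-wise ESR optimization itself), a generic dictionary-learning decomposition of a signal matrix can be obtained with scikit-learn. The matrix sizes here are arbitrary placeholders rather than real fMRI dimensions.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# rows = time points, columns = signals; placeholder for real fMRI data
signals = np.random.rand(400, 100)

dl = MiniBatchDictionaryLearning(n_components=20, alpha=1.0, random_state=0)
codes = dl.fit_transform(signals)        # sparse coefficients (400 x 20)
atoms = dl.components_                   # learned dictionary atoms (20 x 100)

# reconstruction error gives a quick check of how well the atoms represent the data
err = np.linalg.norm(signals - codes @ atoms) / np.linalg.norm(signals)
print(err)
```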
A linear-RBF multikernel SVM to classify big text corpora.
Romero, R; Iglesias, E L; Borrajo, L
2015-01-01
Support vector machine (SVM) is a powerful technique for classification. However, SVM is not suitable for classification of large datasets or text corpora, because the training complexity of SVMs is highly dependent on the input size. Recent developments in the literature on the SVM and other kernel methods emphasize the need to consider multiple kernels or parameterizations of kernels because they provide greater flexibility. This paper presents a multikernel SVM to manage highly dimensional data, providing an automatic parameterization with low computational cost and improving results against SVMs parameterized under a brute-force search. The model consists of spreading the dataset into cohesive term slices (clusters) to construct a defined structure (multikernel). The new approach is tested on different text corpora. Experimental results show that the new classifier has good accuracy compared with the classic SVM, while the training is significantly faster than several other SVM classifiers.
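A minimal illustration of combining a linear and an RBF kernel into one precomputed kernel for an SVM is given below. The mixing weight, the RBF width, and the synthetic data are assumptions, and this is not the clustering-based multikernel construction the paper proposes.

```python
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=300, n_informative=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def mixed_kernel(A, B, beta=0.5, gamma=1e-3):
    # convex combination of a linear kernel and an RBF kernel
    return beta * linear_kernel(A, B) + (1 - beta) * rbf_kernel(A, B, gamma=gamma)

svm = SVC(kernel="precomputed").fit(mixed_kernel(X_tr, X_tr), y_tr)
print(svm.score(mixed_kernel(X_te, X_tr), y_te))   # test kernel: rows = test, cols = train
```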
DOE Office of Scientific and Technical Information (OSTI.GOV)
Van de Velde, Joris, E-mail: joris.vandevelde@ugent.be; Department of Radiotherapy, Ghent University, Ghent; Audenaert, Emmanuel
Purpose: To develop contouring guidelines for the brachial plexus (BP) using anatomically validated cadaver datasets. Magnetic resonance imaging (MRI) and computed tomography (CT) were used to obtain detailed visualizations of the BP region, with the goal of achieving maximal inclusion of the actual BP in a small contoured volume while also accommodating for anatomic variations. Methods and Materials: CT and MRI were obtained for 8 cadavers positioned for intensity modulated radiation therapy. 3-dimensional reconstructions of soft tissue (from MRI) and bone (from CT) were combined to create 8 separate enhanced CT project files. Dissection of the corresponding cadavers anatomically validated the reconstructions created. Seven enhanced CT project files were then automatically fitted, separately in different regions, to obtain a single dataset of superimposed BP regions that incorporated anatomic variations. From this dataset, improved BP contouring guidelines were developed. These guidelines were then applied to the 7 original CT project files and also to 1 additional file, left out from the superimposing procedure. The percentage of BP inclusion was compared with the published guidelines. Results: The anatomic validation procedure showed a high level of conformity for the BP regions examined between the 3-dimensional reconstructions generated and the dissected counterparts. Accurate and detailed BP contouring guidelines were developed, which provided corresponding guidance for each level in a clinical dataset. An average margin of 4.7 mm around the anatomically validated BP contour is sufficient to accommodate for anatomic variations. Using the new guidelines, 100% inclusion of the BP was achieved, compared with a mean inclusion of 37.75% when published guidelines were applied. Conclusion: Improved guidelines for BP delineation were developed using combined MRI and CT imaging with validation by anatomic dissection.
A trace ratio maximization approach to multiple kernel-based dimensionality reduction.
Jiang, Wenhao; Chung, Fu-lai
2014-01-01
Most dimensionality reduction techniques are based on one metric or one kernel, hence it is necessary to select an appropriate kernel for kernel-based dimensionality reduction. Multiple kernel learning for dimensionality reduction (MKL-DR) has been recently proposed to learn a kernel from a set of base kernels which are seen as different descriptions of data. As MKL-DR does not involve regularization, it might be ill-posed under some conditions and consequently its applications are hindered. This paper proposes a multiple kernel learning framework for dimensionality reduction based on regularized trace ratio, termed as MKL-TR. Our method aims at learning a transformation into a space of lower dimension and a corresponding kernel from the given base kernels among which some may not be suitable for the given data. The solutions for the proposed framework can be found based on trace ratio maximization. The experimental results demonstrate its effectiveness in benchmark datasets, which include text, image and sound datasets, for supervised, unsupervised as well as semi-supervised settings. Copyright © 2013 Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Shi, Guoliang; Peng, Xing; Huangfu, Yanqi; Wang, Wei; Xu, Jiao; Tian, Yingze; Feng, Yinchang; Ivey, Cesunica E.; Russell, Armistead G.
2017-07-01
Source apportionment technologies are used to understand the impacts of important sources on particulate matter (PM) air quality, and are widely used for both scientific studies and air quality management. Generally, receptor models apportion speciated PM data from a single sampling site. With the development of large-scale monitoring networks, PM speciation is observed at multiple sites in an urban area. For these situations, the models should account for three factors, or dimensions, of the PM: the chemical species concentrations, the sampling periods and the sampling site information, suggesting the potential power of a three-dimensional source apportionment approach. However, the ordinary three-dimensional Parallel Factor Analysis (PARAFAC) model does not always work well in real environmental situations for multi-site receptor datasets. In this work, a new three-way receptor model, called the "multi-site three-way factor analysis" (multi-site WFA3) model, is proposed to deal with multi-site receptor datasets. Synthetic datasets were developed and introduced into the new model to test its performance. Average absolute errors (AAE, between estimated and true contributions) for the extracted sources were all less than 50%. Additionally, three-dimensional ambient datasets from a Chinese mega-city, Chengdu, were analyzed using the new model to assess its application. Four factors were extracted by the multi-site WFA3 model: secondary sources had the highest contributions (64.73 and 56.24 μg/m3), followed by vehicular exhaust (30.13 and 33.60 μg/m3), crustal dust (26.12 and 29.99 μg/m3) and coal combustion (10.73 and 14.83 μg/m3). The model was also compared to PMF, with general agreement, though PMF suggested a lower crustal contribution.
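For orientation, the sketch below factors a synthetic species x time x site tensor with an ordinary PARAFAC (CP) decomposition via alternating least squares; this is the baseline the abstract contrasts against, not the proposed multi-site WFA3 model, and the tensor sizes and rank are illustrative assumptions.

```python
# Minimal alternating-least-squares sketch of an ordinary PARAFAC (CP)
# decomposition of a species x time x site tensor.
import numpy as np

def khatri_rao(A, B):
    """Column-wise Khatri-Rao product."""
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, A.shape[1])

def parafac_als(X, rank, n_iter=200):
    """Factor a 3-way array X into loading matrices (A, B, C) of a given rank."""
    I, J, K = X.shape
    rng = np.random.default_rng(0)
    A, B, C = (rng.random((n, rank)) for n in (I, J, K))
    X0 = X.reshape(I, J * K)                    # mode-1 unfolding
    X1 = X.transpose(1, 0, 2).reshape(J, I * K) # mode-2 unfolding
    X2 = X.transpose(2, 0, 1).reshape(K, I * J) # mode-3 unfolding
    for _ in range(n_iter):
        A = X0 @ np.linalg.pinv(khatri_rao(B, C).T)
        B = X1 @ np.linalg.pinv(khatri_rao(A, C).T)
        C = X2 @ np.linalg.pinv(khatri_rao(A, B).T)
    return A, B, C

# Toy usage: 15 species x 30 sampling periods x 2 sites, 3 latent sources.
rng = np.random.default_rng(1)
A0, B0, C0 = rng.random((15, 3)), rng.random((30, 3)), rng.random((2, 3))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0) + 0.01 * rng.normal(size=(15, 30, 2))
A, B, C = parafac_als(X, rank=3)
print("relative reconstruction error:",
      np.linalg.norm(X - np.einsum('ir,jr,kr->ijk', A, B, C)) / np.linalg.norm(X))
```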
An Evaluation of Database Solutions to Spatial Object Association
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kumar, V S; Kurc, T; Saltz, J
2008-06-24
Object association is a common problem encountered in many applications. Spatial object association, also referred to as crossmatch of spatial datasets, is the problem of identifying and comparing objects in two datasets based on their positions in a common spatial coordinate system; one of the datasets may correspond to a catalog of objects observed over time in a multi-dimensional domain, while the other dataset may consist of objects observed in a snapshot of the domain at a single time point. Using database management systems to solve the object association problem provides portability across different platforms and greater flexibility. Increasing dataset sizes in today's applications, however, have made object association a data- and compute-intensive problem that requires targeted optimizations for efficient execution. In this work, we investigate how database-based crossmatch algorithms can be deployed on different database system architectures and evaluate the deployments to understand the impact of architectural choices on crossmatch performance and the associated trade-offs. We investigate the execution of two crossmatch algorithms on (1) a parallel database system with active disk style processing capabilities, (2) a high-throughput network database (MySQL Cluster), and (3) shared-nothing databases with replication. We have conducted our study in the context of a large-scale astronomy application with real use-case scenarios.
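To make the crossmatch operation itself concrete, here is a small in-memory analogue using a k-d tree; it does not attempt to reproduce any of the database deployments evaluated in the study, and the catalog sizes, units and match radius are illustrative assumptions.

```python
# In-memory sketch of positional crossmatch between two object catalogs using a
# k-d tree: each snapshot detection is matched to its nearest catalog object
# within a fixed tolerance.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
catalog = rng.uniform(0, 100, size=(50_000, 2))                        # reference positions (x, y)
snapshot = catalog[:5_000] + rng.normal(scale=0.01, size=(5_000, 2))   # perturbed detections

tree = cKDTree(catalog)
radius = 0.05                                        # match tolerance in the same units
dist, idx = tree.query(snapshot, k=1, distance_upper_bound=radius)

matched = dist < radius                              # unmatched entries come back with dist == inf
print(f"matched {matched.sum()} of {len(snapshot)} snapshot objects")
```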
Dimensionality reduction in epidemic spreading models
NASA Astrophysics Data System (ADS)
Frasca, M.; Rizzo, A.; Gallo, L.; Fortuna, L.; Porfiri, M.
2015-09-01
Complex dynamical systems often exhibit collective dynamics that are well described by a reduced set of key variables in a low-dimensional space. Such a low-dimensional description offers a privileged perspective for understanding the system behavior across temporal and spatial scales. In this work, we propose a data-driven approach to establish low-dimensional representations of large epidemic datasets using a dimensionality reduction algorithm based on isometric feature mapping (ISOMAP). We demonstrate our approach on synthetic data for epidemic spreading in a population of mobile individuals. We find that ISOMAP is successful in embedding high-dimensional data into a low-dimensional manifold whose topological features are associated with the epidemic outbreak. Across a range of simulation parameters and model instances, we observe that epidemic outbreaks are embedded into a family of closed curves in a three-dimensional space, in which neighboring points pertain to instants that are close in time. The orientation of each curve is unique to a specific outbreak, and the coordinates correlate with the number of infected individuals. A low-dimensional description of epidemic spreading is expected to improve our understanding of the role of individual response on the outbreak dynamics, inform the selection of meaningful global observables, and, possibly, aid in the design of control and quarantine procedures.
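A minimal ISOMAP embedding step, as used in this approach, can be sketched with scikit-learn as below; the synthetic high-dimensional trajectory stands in for the epidemic simulations only to show the mechanics, and the neighborhood size and embedding dimension are illustrative assumptions.

```python
# Minimal sketch of embedding high-dimensional samples into a 3-D manifold with
# ISOMAP (scikit-learn).
import numpy as np
from sklearn.manifold import Isomap

rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 400)
# Each row is one time point of a high-dimensional observation (e.g., infection
# counts across many spatial patches) lying near a closed 1-D curve.
X = np.column_stack([np.sin(t + p) for p in rng.uniform(0, 2 * np.pi, 50)])
X += 0.05 * rng.normal(size=X.shape)

embedding = Isomap(n_neighbors=10, n_components=3).fit_transform(X)
print(embedding.shape)   # (400, 3): three coordinates per time point
```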
NASA Astrophysics Data System (ADS)
Prasanna, V.
2016-06-01
The warm (cold) phase of El Niño (La Niña) and its impact on the All India Summer Monsoon Rainfall (AISMR) relationship is explored for the past 100 years. The 103-year (1901-2003) data from the twentieth century reanalysis (20CR) and other major reanalysis datasets for the southwest monsoon season (JJAS) are utilized to examine the simultaneous influence of the El Niño Southern Oscillation (ENSO) on AISMR. Wet and dry monsoon years associated with ENSO(+) (El Niño), ENSO(-) (La Niña) and non-ENSO (neutral) events are discussed in detail using observed rainfall and the three-dimensional 20CR dataset. The dry and wet years associated with ENSO and non-ENSO periods show significant differences in the spatial pattern of rainfall and the associated three-dimensional atmospheric composites, and the 20CR dataset captures these anomalies quite well. During wet (dry) years, the rainfall is high (low), i.e., 10% above (below) the long-term mean, and such wet or dry conditions occur during both ENSO and non-ENSO phases. The non-ENSO dry and wet composites are also examined in detail to understand where the anomalous winds come from, unlike in the ENSO case. The moisture transport is coherent with the changes in the spatial pattern of AISMR and the large-scale features in the 20CR dataset. The recent 50-year trend (1951-2000) is also analyzed from various available observational and reanalysis datasets to assess the influence of Indo-Pacific SST and moist processes on the South Asian summer monsoon rainfall trend. Apart from the Indo-Pacific sea surface temperatures (SST), the moisture convergence and moisture transport among India (IND), the equatorial Indian Ocean (IOC) and the tropical western Pacific (WNP) are also important in modifying the wet or dry cycles over India. The mutual interaction among IOC, WNP and IND on seasonal timescales is significant in modifying wet and dry cycles over the Indian region and the seasonal anomalies.
Tao, Chenyang; Nichols, Thomas E.; Hua, Xue; Ching, Christopher R.K.; Rolls, Edmund T.; Thompson, Paul M.; Feng, Jianfeng
2017-01-01
We propose a generalized reduced rank latent factor regression model (GRRLF) for the analysis of tensor field responses and high-dimensional covariates. The model is motivated by the need in imaging-genetic studies to identify genetic variants that are associated with brain imaging phenotypes, often in the form of high-dimensional tensor fields. GRRLF identifies the effective dimensionality of the data from its structure, and then jointly performs dimension reduction of the covariates, dynamic identification of latent factors, and nonparametric estimation of both covariate and latent response fields. After accounting for the latent and covariate effects, GRRLF performs a nonparametric test on the remaining factor of interest. GRRLF provides a better factorization of the signals compared with common solutions, and is less susceptible to overfitting because it exploits the effective dimensionality. The generality and flexibility of GRRLF also allow various statistical models to be handled in a unified framework, and solutions can be computed efficiently. Within the field of neuroimaging, it improves the sensitivity for weak signals and is a promising alternative to existing approaches. The operation of the framework is demonstrated with both synthetic datasets and a real-world neuroimaging example in which the effects of a set of genes on the structure of the brain at the voxel level were measured, and the results compared favorably with those from existing approaches. PMID:27666385
Hand, foot and mouth disease: spatiotemporal transmission and climate.
Wang, Jin-feng; Guo, Yan-Sha; Christakos, George; Yang, Wei-Zhong; Liao, Yi-Lan; Li, Zhong-Jie; Li, Xiao-Zhou; Lai, Sheng-Jie; Chen, Hong-Yan
2011-04-05
Hand-Foot-Mouth Disease (HFMD) is the most common infectious disease in China, with a total incidence of around 500,000-1,000,000 cases per year. The composite space-time disease variation is the result of underlying attribute mechanisms that could provide clues about the physiologic and demographic determinants of disease transmission and also guide the appropriate allocation of medical resources to control the disease. HFMD cases were aggregated into 1456 counties over a period of 11 months. Climate attributes suspected to influence HFMD were recorded monthly at 674 stations throughout the country and subsequently interpolated into 1456 × 11 space-time cells (matching the number of HFMD cases) using the Bayesian Maximum Entropy (BME) method, while taking into consideration the relevant uncertainty sources. The dimensionalities of the two datasets, together with the integrated dataset combining them, are very high when the topologies of the space-time relationships between cells are taken into account. Using a self-organizing map (SOM) algorithm, the dataset dimensionality was effectively reduced to 2 dimensions while the spatiotemporal attribute structure was maintained. Sixteen types of spatiotemporal HFMD transmission were identified, and 3-4 high spatial incidence clusters of the HFMD types were found throughout China, which fall largely within the scope of the monthly climate (precipitation) types. HFMD propagates in a composite space-time domain rather than showing a purely spatial or purely temporal variation. There is a clear relationship between HFMD occurrence and climate. HFMD cases are geographically clustered and closely linked to the monthly precipitation types of the region; the occurrence of the former depends on the latter.
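The dimensionality reduction step can be illustrated with a generic self-organizing map, sketched below with plain NumPy; this is not the authors' configuration, and the 4x4 grid, learning-rate schedule and synthetic county-by-month input are illustrative assumptions.

```python
# Minimal self-organizing map (SOM) sketch: project high-dimensional space-time
# attribute vectors onto a small 2-D grid of prototype nodes.
import numpy as np

def train_som(X, grid=(4, 4), n_iter=2000, lr0=0.5, sigma0=1.5, seed=0):
    rng = np.random.default_rng(seed)
    nodes = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])], float)
    W = rng.normal(size=(len(nodes), X.shape[1]))           # one prototype per node
    for t in range(n_iter):
        x = X[rng.integers(len(X))]
        bmu = np.argmin(((W - x) ** 2).sum(1))               # best-matching unit
        frac = t / n_iter
        lr, sigma = lr0 * (1 - frac), sigma0 * (1 - frac) + 1e-3
        h = np.exp(-((nodes - nodes[bmu]) ** 2).sum(1) / (2 * sigma ** 2))
        W += lr * h[:, None] * (x - W)                       # pull the neighborhood toward x
    return W, nodes

rng = np.random.default_rng(1)
X = rng.normal(size=(1456, 11))        # e.g., one 11-month incidence profile per county
W, nodes = train_som(X)
labels = np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(-1), axis=1)
print("counties per SOM node:", np.bincount(labels, minlength=len(nodes)))
```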
NASA Astrophysics Data System (ADS)
Shibahara, A.; Tsukamoto, H.; Kazahaya, K.; Morikawa, N.; Takahashi, M.; Takahashi, H.; Yasuhara, M.; Ohwada, M.; Oyama, Y.; Inamura, A.; Handa, H.; Nakama, J.
2008-12-01
Kobe city is located on the northern side of the Osaka sedimentary basin, Japan, which contains 1,000-2,000 m of Quaternary sediments. After the Hanshin-Awaji Earthquake (January 17, 1995), a number of geological and geophysical surveys were conducted in this region. A high-temperature groundwater anomaly accompanied by high Cl concentrations was then detected along fault systems in the area. In addition, dissolved He in groundwater showed a nearly upper-mantle-like 3He/4He ratio, although there has been no Quaternary volcanic activity in the region. Some recent studies have assumed that these groundwater profiles are related to geological structure, because faults and joints can act as pathways for groundwater flow, and mantle-derived water can upwell through the fault system to the ground surface. To verify these hypotheses, we established a 3D geological and hydrological model of the Osaka sedimentary basin. Our primary goal is to analyze the spatial relationship between geological structure and groundwater profiles. In the study region, many geological and hydrological datasets, such as boring log data, seismic profiles, and groundwater chemical profiles, have been reported. We converted these datasets to meshed data in a GIS and plotted them in three-dimensional space to visualize their spatial distribution. Furthermore, we projected the seismic profiling data into three-dimensional space and calculated distances between faults and sampling points using Visual Basic for Applications (VBA) scripts. All 3D models are converted into VRML format and can be used as a versatile dataset on a personal computer. This research project has been conducted under a research contract with the Japan Nuclear Energy Safety Organization (JNES).
Majumdar, Subhabrata; Basak, Subhash C
2018-04-26
Proper validation is an important aspect of QSAR modelling. External validation is one of the widely used validation methods in QSAR where the model is built on a subset of the data and validated on the rest of the samples. However, its effectiveness for datasets with a small number of samples but large number of predictors remains suspect. Calculating hundreds or thousands of molecular descriptors using currently available software has become the norm in QSAR research, owing to computational advances in the past few decades. Thus, for n chemical compounds and p descriptors calculated for each molecule, the typical chemometric dataset today has high value of p but small n (i.e. n < p). Motivated by the evidence of inadequacies of external validation in estimating the true predictive capability of a statistical model in recent literature, this paper performs an extensive and comparative study of this method with several other validation techniques. We compared four validation methods: leave-one-out, K-fold, external and multi-split validation, using statistical models built using the LASSO regression, which simultaneously performs variable selection and modelling. We used 300 simulated datasets and one real dataset of 95 congeneric amine mutagens for this evaluation. External validation metrics have high variation among different random splits of the data, hence are not recommended for predictive QSAR models. LOO has the overall best performance among all validation methods applied in our scenario. Results from external validation are too unstable for the datasets we analyzed. Based on our findings, we recommend using the LOO procedure for validating QSAR predictive models built on high-dimensional small-sample data. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
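The validation comparison described above can be mimicked on synthetic n < p data with scikit-learn, as sketched below; this reproduces neither the paper's simulation design nor its multi-split scheme, and the sample sizes, LASSO penalty and split fraction are illustrative assumptions.

```python
# Hedged sketch comparing leave-one-out, K-fold, and a single external split for a
# LASSO model on synthetic small-sample, high-dimensional data.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import LeaveOneOut, KFold, train_test_split, cross_val_predict
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n, p = 95, 300                               # n < p setting
X = rng.normal(size=(n, p))
beta = np.zeros(p); beta[:10] = 1.0          # only 10 informative descriptors
y = X @ beta + 0.5 * rng.normal(size=n)

model = Lasso(alpha=0.1, max_iter=10_000)

q2_loo = r2_score(y, cross_val_predict(model, X, y, cv=LeaveOneOut()))
q2_kfold = r2_score(y, cross_val_predict(model, X, y, cv=KFold(5, shuffle=True, random_state=0)))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
q2_ext = r2_score(y_te, model.fit(X_tr, y_tr).predict(X_te))

print(f"LOO Q2={q2_loo:.3f}  5-fold Q2={q2_kfold:.3f}  external Q2={q2_ext:.3f}")
```

Re-running the external split with different random seeds gives a feel for the instability of external validation that the abstract reports.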
Xray: N-dimensional, labeled arrays for analyzing physical datasets in Python
NASA Astrophysics Data System (ADS)
Hoyer, S.
2015-12-01
Efficient analysis of geophysical datasets requires tools that both preserve and utilize metadata, and that transparently scale to process large datasets. Xray is such a tool, in the form of an open source Python library for analyzing the labeled, multi-dimensional array (tensor) datasets that are ubiquitous in the Earth sciences. Xray's approach pairs Python data structures based on the data model of the netCDF file format with the proven design and user interface of pandas, the popular Python data analysis library for labeled tabular data. On top of the NumPy array, xray adds labeled dimensions (e.g., "time") and coordinate values (e.g., "2015-04-10"), which it uses to enable a host of label-powered operations: selection, aggregation, alignment, broadcasting, split-apply-combine, interoperability with pandas and serialization to netCDF/HDF5. Many of these operations are enabled by xray's tight integration with pandas. Finally, to allow for easy parallelism and to enable its labeled data operations to scale to datasets that do not fit into memory, xray integrates with the parallel processing library dask.
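A minimal usage sketch of the labeled-array model described above is given below; the xray package is now maintained under the name xarray, and the dimension names, coordinate values and file name here are illustrative assumptions rather than examples from the abstract.

```python
# Label-based selection and aggregation with xarray (formerly xray).
import numpy as np
import pandas as pd
import xarray as xr

temps = xr.DataArray(
    np.random.rand(3, 4, 5),
    dims=("time", "lat", "lon"),
    coords={"time": pd.date_range("2015-04-10", periods=3),
            "lat": [10.0, 20.0, 30.0, 40.0],
            "lon": np.linspace(-120, -80, 5)},
    name="temperature",
)

# Operations refer to dimension names, not positional axes.
april_10 = temps.sel(time="2015-04-10")   # select by coordinate label
zonal_mean = temps.mean(dim="lon")        # aggregate over a named dimension
print(april_10.shape, zonal_mean.dims)

# Serialization to netCDF (requires a netCDF backend such as scipy or netCDF4).
temps.to_dataset().to_netcdf("temps.nc")
```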
Atlas-Guided Cluster Analysis of Large Tractography Datasets
Ros, Christian; Güllmar, Daniel; Stenzel, Martin; Mentzel, Hans-Joachim; Reichenbach, Jürgen Rainer
2013-01-01
Diffusion Tensor Imaging (DTI) and fiber tractography are important tools to map the cerebral white matter microstructure in vivo and to model the underlying axonal pathways in the brain with three-dimensional fiber tracts. As the fast and consistent extraction of anatomically correct fiber bundles for multiple datasets is still challenging, we present a novel atlas-guided clustering framework for exploratory data analysis of large tractography datasets. The framework uses an hierarchical cluster analysis approach that exploits the inherent redundancy in large datasets to time-efficiently group fiber tracts. Structural information of a white matter atlas can be incorporated into the clustering to achieve an anatomically correct and reproducible grouping of fiber tracts. This approach facilitates not only the identification of the bundles corresponding to the classes of the atlas; it also enables the extraction of bundles that are not present in the atlas. The new technique was applied to cluster datasets of 46 healthy subjects. Prospects of automatic and anatomically correct as well as reproducible clustering are explored. Reconstructed clusters were well separated and showed good correspondence to anatomical bundles. Using the atlas-guided cluster approach, we observed consistent results across subjects with high reproducibility. In order to investigate the outlier elimination performance of the clustering algorithm, scenarios with varying amounts of noise were simulated and clustered with three different outlier elimination strategies. By exploiting the multithreading capabilities of modern multiprocessor systems in combination with novel algorithms, our toolkit clusters large datasets in a couple of minutes. Experiments were conducted to investigate the achievable speedup and to demonstrate the high performance of the clustering framework in a multiprocessing environment. PMID:24386292
Li, Yushuang; Yang, Jiasheng; Zhang, Yi
2016-01-01
In this paper, we have proposed a novel alignment-free method for comparing the similarity of protein sequences. We first encode a protein sequence into a 440-dimensional feature vector consisting of a 400-dimensional Pseudo-Markov transition probability vector among the 20 amino acids, a 20-dimensional content ratio vector, and a 20-dimensional position ratio vector of the amino acids in the sequence. By evaluating the Euclidean distances among the representing vectors, we compare the similarity of protein sequences. We then apply this method to the ND5 dataset, consisting of the ND5 protein sequences of 9 species, and the F10 and G11 datasets, representing two of the xylanase-containing glycoside hydrolase families, i.e., families 10 and 11. As a result, our method achieves a correlation coefficient of 0.962 with the canonical protein sequence aligner ClustalW on the ND5 dataset, much higher than those of 5 other popular alignment-free methods. In addition, we successfully separate the xylanase sequences of the F10 family and the G11 family and illustrate that the F10 family is more heat stable than the G11 family, consistent with previous studies. Moreover, we prove mathematically an identity equation involving the Pseudo-Markov transition probability vector and the amino acid content ratio vector. PMID:27918587
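The 440-dimensional descriptor can be sketched directly from the abstract's description, as below; the exact normalization of the transition and position ratios is an assumption and may differ from the authors' definitions, and the example sequences are made up.

```python
# Sketch of the 440-dimensional sequence descriptor: 400 pseudo-Markov transition
# frequencies among the 20 amino acids, 20 content ratios, and 20 position ratios.
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
IDX = {a: i for i, a in enumerate(AA)}

def encode(seq):
    seq = [c for c in seq.upper() if c in IDX]
    n = len(seq)
    trans = np.zeros((20, 20))
    for a, b in zip(seq[:-1], seq[1:]):
        trans[IDX[a], IDX[b]] += 1
    trans /= max(n - 1, 1)                   # transition probabilities (400 dims)
    content = np.zeros(20)
    position = np.zeros(20)
    for pos, a in enumerate(seq, start=1):
        content[IDX[a]] += 1
        position[IDX[a]] += pos
    content /= n                             # content ratios (20 dims)
    position /= n * (n + 1) / 2              # position ratios (20 dims, assumed normalization)
    return np.concatenate([trans.ravel(), content, position])   # 440 dims total

def distance(seq1, seq2):
    """Euclidean distance between two encoded sequences."""
    return np.linalg.norm(encode(seq1) - encode(seq2))

print(distance("MKVLAAGICGLLALAYS", "MKVLATGICGLLSLAYS"))
```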
Graph Based Models for Unsupervised High Dimensional Data Clustering and Network Analysis
2015-01-01
The graph-based clustering algorithms proposed in this work significantly improve time efficiency for large-scale datasets. An incremental reseeding strategy is also proposed, with applications including plume detection in hyperspectral video data.
Sonification Prototype for Space Physics
NASA Astrophysics Data System (ADS)
Candey, R. M.; Schertenleib, A. M.; Diaz Merced, W. L.
2005-12-01
As an alternative and adjunct to visual displays, auditory exploration of data via sonification (data-controlled sound) and audification (audible playback of data samples) is promising for complex or rapidly and temporally changing visualizations, for data exploration of large datasets (particularly multi-dimensional datasets), and for exploring datasets in frequency rather than spatial dimensions (see also the International Conferences on Auditory Display).
Single-shot diffraction data from the Mimivirus particle using an X-ray free-electron laser.
Ekeberg, Tomas; Svenda, Martin; Seibert, M Marvin; Abergel, Chantal; Maia, Filipe R N C; Seltzer, Virginie; DePonte, Daniel P; Aquila, Andrew; Andreasson, Jakob; Iwan, Bianca; Jönsson, Olof; Westphal, Daniel; Odić, Duško; Andersson, Inger; Barty, Anton; Liang, Meng; Martin, Andrew V; Gumprecht, Lars; Fleckenstein, Holger; Bajt, Saša; Barthelmess, Miriam; Coppola, Nicola; Claverie, Jean-Michel; Loh, N Duane; Bostedt, Christoph; Bozek, John D; Krzywinski, Jacek; Messerschmidt, Marc; Bogan, Michael J; Hampton, Christina Y; Sierra, Raymond G; Frank, Matthias; Shoeman, Robert L; Lomb, Lukas; Foucar, Lutz; Epp, Sascha W; Rolles, Daniel; Rudenko, Artem; Hartmann, Robert; Hartmann, Andreas; Kimmel, Nils; Holl, Peter; Weidenspointner, Georg; Rudek, Benedikt; Erk, Benjamin; Kassemeyer, Stephan; Schlichting, Ilme; Strüder, Lothar; Ullrich, Joachim; Schmidt, Carlo; Krasniqi, Faton; Hauser, Günter; Reich, Christian; Soltau, Heike; Schorb, Sebastian; Hirsemann, Helmut; Wunderer, Cornelia; Graafsma, Heinz; Chapman, Henry; Hajdu, Janos
2016-08-01
Free-electron lasers (FELs) hold the potential to revolutionize structural biology by producing X-ray pulses short enough to outrun radiation damage, thus allowing imaging of biological samples without the limitations it imposes. Accordingly, a major part of the scientific case for the first FELs was three-dimensional (3D) reconstruction of non-crystalline biological objects. In a recent publication we demonstrated the first 3D reconstruction of a biological object from an X-ray FEL using this technique. The sample was the giant Mimivirus, one of the largest known viruses, with a diameter of 450 nm. Here we present the dataset used for this successful reconstruction. Data-analysis methods for single-particle imaging at FELs are undergoing heavy development, but data collection relies on very limited time available through a highly competitive proposal process. This dataset provides experimental data to the entire community and could boost algorithm development and provide a benchmark dataset for new algorithms.
Basak, Subhash C; Majumdar, Subhabrata
2015-01-01
Variation in high-dimensional data is often caused by a few latent factors, and hence dimension reduction or variable selection techniques are useful for extracting information from the data. In this paper we consider two such recent methods: interrelated two-way clustering and envelope models. We couple these methods with traditional statistical procedures such as ridge regression and linear discriminant analysis, and apply them to two datasets that have more predictors than samples (i.e., the n < p scenario) and several types of molecular descriptors. One of these datasets consists of a congeneric group of amines, while the other contains a much more diverse collection of compounds. The difference in prediction results between these two datasets for both methods supports the hypothesis that for a congeneric set of compounds, descriptors of a certain type are enough to provide good QSAR models, but as the dataset grows more diverse, including a variety of descriptors can improve model quality considerably.
The coming paradigm shift: A transition from manual to automated microscopy.
Farahani, Navid; Monteith, Corey E
2016-01-01
The field of pathology has used light microscopy (LM) extensively since the mid-19th century for the examination of histological tissue preparations. This technology has remained the foremost tool of pathologists even as other fields have changed greatly in recent years through new technologies. However, as new microscopy techniques are perfected and made available, this reliance on standard LM will likely begin to change. Advanced imaging involving both diffraction-limited and subdiffraction techniques is bringing nondestructive, high-resolution, molecular-level imaging to pathology. Some of these technologies can produce three-dimensional (3D) datasets from sampled tissues. In addition, block-face/tissue-sectioning techniques are already providing automated, large-scale 3D datasets of whole specimens. These datasets allow pathologists to see an entire sample with all of its spatial information intact, and furthermore allow image analysis such as detection, segmentation, and classification, which are impossible in standard LM. It is likely that these technologies herald a major paradigm shift in the field of pathology.
Peek, N; Holmes, J H; Sun, J
2014-08-15
To review technical and methodological challenges for big data research in biomedicine and health. We discuss sources of big datasets, survey infrastructures for big data storage and big data processing, and describe the main challenges that arise when analyzing big data. The life and biomedical sciences are massively contributing to the big data revolution through secondary use of data that were collected during routine care and through new data sources such as social media. Efficient processing of big datasets is typically achieved by distributing computation over a cluster of computers. Data analysts should be aware of pitfalls related to big data such as bias in routine care data and the risk of false-positive findings in high-dimensional datasets. The major challenge for the near future is to transform analytical methods that are used in the biomedical and health domain, to fit the distributed storage and processing model that is required to handle big data, while ensuring confidentiality of the data being analyzed.
Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geostatistical Datasets.
Datta, Abhirup; Banerjee, Sudipto; Finley, Andrew O; Gelfand, Alan E
2016-01-01
Spatial process models for analyzing geostatistical data entail computations that become prohibitive as the number of spatial locations becomes large. This article develops a class of highly scalable nearest-neighbor Gaussian process (NNGP) models to provide fully model-based inference for large geostatistical datasets. We establish that the NNGP is a well-defined spatial process providing legitimate finite-dimensional Gaussian densities with sparse precision matrices. We embed the NNGP as a sparsity-inducing prior within a rich hierarchical modeling framework and outline how computationally efficient Markov chain Monte Carlo (MCMC) algorithms can be executed without storing or decomposing large matrices. The floating point operations (flops) per iteration of this algorithm are linear in the number of spatial locations, thereby rendering substantial scalability. We illustrate the computational and inferential benefits of the NNGP over competing methods using simulation studies and also analyze forest biomass from a massive U.S. Forest Inventory dataset at a scale that precludes alternative dimension-reducing methods. Supplementary materials for this article are available online.
Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geostatistical Datasets
Datta, Abhirup; Banerjee, Sudipto; Finley, Andrew O.; Gelfand, Alan E.
2018-01-01
Spatial process models for analyzing geostatistical data entail computations that become prohibitive as the number of spatial locations becomes large. This article develops a class of highly scalable nearest-neighbor Gaussian process (NNGP) models to provide fully model-based inference for large geostatistical datasets. We establish that the NNGP is a well-defined spatial process providing legitimate finite-dimensional Gaussian densities with sparse precision matrices. We embed the NNGP as a sparsity-inducing prior within a rich hierarchical modeling framework and outline how computationally efficient Markov chain Monte Carlo (MCMC) algorithms can be executed without storing or decomposing large matrices. The floating point operations (flops) per iteration of this algorithm are linear in the number of spatial locations, thereby rendering substantial scalability. We illustrate the computational and inferential benefits of the NNGP over competing methods using simulation studies and also analyze forest biomass from a massive U.S. Forest Inventory dataset at a scale that precludes alternative dimension-reducing methods. Supplementary materials for this article are available online. PMID:29720777
Single-shot diffraction data from the Mimivirus particle using an X-ray free-electron laser
NASA Astrophysics Data System (ADS)
Ekeberg, Tomas; Svenda, Martin; Seibert, M. Marvin; Abergel, Chantal; Maia, Filipe R. N. C.; Seltzer, Virginie; Deponte, Daniel P.; Aquila, Andrew; Andreasson, Jakob; Iwan, Bianca; Jönsson, Olof; Westphal, Daniel; Odić, Duško; Andersson, Inger; Barty, Anton; Liang, Meng; Martin, Andrew V.; Gumprecht, Lars; Fleckenstein, Holger; Bajt, Saša; Barthelmess, Miriam; Coppola, Nicola; Claverie, Jean-Michel; Loh, N. Duane; Bostedt, Christoph; Bozek, John D.; Krzywinski, Jacek; Messerschmidt, Marc; Bogan, Michael J.; Hampton, Christina Y.; Sierra, Raymond G.; Frank, Matthias; Shoeman, Robert L.; Lomb, Lukas; Foucar, Lutz; Epp, Sascha W.; Rolles, Daniel; Rudenko, Artem; Hartmann, Robert; Hartmann, Andreas; Kimmel, Nils; Holl, Peter; Weidenspointner, Georg; Rudek, Benedikt; Erk, Benjamin; Kassemeyer, Stephan; Schlichting, Ilme; Strüder, Lothar; Ullrich, Joachim; Schmidt, Carlo; Krasniqi, Faton; Hauser, Günter; Reich, Christian; Soltau, Heike; Schorb, Sebastian; Hirsemann, Helmut; Wunderer, Cornelia; Graafsma, Heinz; Chapman, Henry; Hajdu, Janos
2016-08-01
Free-electron lasers (FELs) hold the potential to revolutionize structural biology by producing X-ray pulses short enough to outrun radiation damage, thus allowing imaging of biological samples without the limitations it imposes. Accordingly, a major part of the scientific case for the first FELs was three-dimensional (3D) reconstruction of non-crystalline biological objects. In a recent publication we demonstrated the first 3D reconstruction of a biological object from an X-ray FEL using this technique. The sample was the giant Mimivirus, one of the largest known viruses, with a diameter of 450 nm. Here we present the dataset used for this successful reconstruction. Data-analysis methods for single-particle imaging at FELs are undergoing heavy development, but data collection relies on very limited time available through a highly competitive proposal process. This dataset provides experimental data to the entire community and could boost algorithm development and provide a benchmark dataset for new algorithms.
OMERO and Bio-Formats 5: flexible access to large bioimaging datasets at scale
NASA Astrophysics Data System (ADS)
Moore, Josh; Linkert, Melissa; Blackburn, Colin; Carroll, Mark; Ferguson, Richard K.; Flynn, Helen; Gillen, Kenneth; Leigh, Roger; Li, Simon; Lindner, Dominik; Moore, William J.; Patterson, Andrew J.; Pindelski, Blazej; Ramalingam, Balaji; Rozbicki, Emil; Tarkowska, Aleksandra; Walczysko, Petr; Allan, Chris; Burel, Jean-Marie; Swedlow, Jason
2015-03-01
The Open Microscopy Environment (OME) has built and released, under open source licenses, Bio-Formats, a Java-based proprietary file format conversion tool, and OMERO, an enterprise data management platform. In this report, we describe new versions of Bio-Formats and OMERO that are specifically designed to support the large, multi-gigabyte or terabyte scale datasets that are routinely collected across most domains of biological and biomedical research. Bio-Formats reads image data directly from native proprietary formats, bypassing the need for conversion into a standard format. It implements the concept of a file set, a container that defines the contents of multi-dimensional data comprised of many files. OMERO uses Bio-Formats to read files natively, and provides a flexible access mechanism that supports several different storage and access strategies. These new capabilities of OMERO and Bio-Formats make them especially useful in imaging applications like digital pathology, high content screening and light sheet microscopy that routinely create large datasets that must be managed and analyzed.
Glover, Jason; Man, Tsz-Kwong; Barkauskas, Donald A.; Hall, David; Tello, Tanya; Sullivan, Mary Beth; Gorlick, Richard; Janeway, Katherine; Grier, Holcombe; Lau, Ching; Toretsky, Jeffrey A.; Borinstein, Scott C.; Khanna, Chand
2017-01-01
The prospective banking of osteosarcoma tissue samples to promote research endeavors has been realized through the establishment of a nationally centralized biospecimen repository, the Children’s Oncology Group (COG) biospecimen bank located at the Biopathology Center (BPC)/Nationwide Children’s Hospital in Columbus, Ohio. Although the physical inventory of osteosarcoma biospecimens is substantive (>15,000 sample specimens), the nature of these resources remains exhaustible. Despite judicious allocation of these high-value biospecimens for conducting sarcoma-related research, a deeper understanding of osteosarcoma biology, in particular metastasis, remains unrealized. In addition, the identification and development of novel diagnostics and effective therapeutics remain elusive. The QuadW-COG Childhood Sarcoma Biostatistics and Annotation Office (CSBAO) has developed the High Dimensional Data (HDD) platform to complement the existing physical inventory and to promote in silico hypothesis testing in sarcoma biology. The HDD is a relational biologic database derived from matched osteosarcoma biospecimens in which diverse experimental readouts have been generated and digitally deposited. As proof of concept, we demonstrate that the HDD platform can be used to address previously unrealized biologic questions through the systematic juxtaposition of diverse datasets derived from shared biospecimens. The continued population of the HDD platform with high-value, high-throughput and mineable datasets provides a shared and reusable resource for researchers, both experimentalists and bioinformatics investigators, to propose and answer questions in silico that advance our understanding of osteosarcoma biology. PMID:28732082
Correlation Dimension Estimates of Global and Local Temperature Data.
NASA Astrophysics Data System (ADS)
Wang, Qiang
1995-11-01
The author has attempted to detect the presence of low-dimensional deterministic chaos in temperature data by estimating the correlation dimension with the Hill estimate recently developed by Mikosch and Wang. There is no convincing evidence of low dimensionality in either the global dataset (Southern Hemisphere monthly average temperatures from 1858 to 1984) or the local temperature dataset (daily minimums at Auckland, New Zealand). Any apparent reduction in the dimension estimates appears to be due largely, if not entirely, to the effects of statistical bias; neither series, however, is a purely random stochastic process. The dimension of the climatic attractor may be significantly larger than 10.
Spectral methods in machine learning and new strategies for very large datasets
Belabbas, Mohamed-Ali; Wolfe, Patrick J.
2009-01-01
Spectral methods are of fundamental importance in statistics and machine learning, because they underlie algorithms from classical principal components analysis to more recent approaches that exploit manifold structure. In most cases, the core technical problem can be reduced to computing a low-rank approximation to a positive-definite kernel. For the growing number of applications dealing with very large or high-dimensional datasets, however, the optimal approximation afforded by an exact spectral decomposition is too costly, because its complexity scales as the cube of either the number of training examples or their dimensionality. Motivated by such applications, we present here 2 new algorithms for the approximation of positive-semidefinite kernels, together with error bounds that improve on results in the literature. We approach this problem by seeking to determine, in an efficient manner, the most informative subset of our data relative to the kernel approximation task at hand. This leads to two new strategies based on the Nyström method that are directly applicable to massive datasets. The first of these—based on sampling—leads to a randomized algorithm whereupon the kernel induces a probability distribution on its set of partitions, whereas the latter approach—based on sorting—provides for the selection of a partition in a deterministic way. We detail their numerical implementation and provide simulation results for a variety of representative problems in statistical data analysis, each of which demonstrates the improved performance of our approach relative to existing methods. PMID:19129490
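The shared machinery behind both strategies, the Nyström approximation itself, can be sketched as below; plain uniform landmark sampling is used here only for illustration and is not one of the sampling or sorting strategies contributed by the paper, and the kernel bandwidth and sizes are assumptions.

```python
# Minimal Nystrom sketch: approximate a positive-semidefinite RBF kernel matrix
# from a uniformly sampled landmark subset.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 20))
m = 100                                        # number of landmark columns

idx = rng.choice(len(X), size=m, replace=False)
C = rbf_kernel(X, X[idx], gamma=0.05)          # n x m cross-kernel
W = rbf_kernel(X[idx], X[idx], gamma=0.05)     # m x m landmark kernel
K_approx = C @ np.linalg.pinv(W) @ C.T         # rank-m Nystrom approximation

K = rbf_kernel(X, X, gamma=0.05)               # exact kernel, for comparison only
rel_err = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
print(f"relative Frobenius error with {m} landmarks: {rel_err:.4f}")
```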
Yang, Yang; Saleemi, Imran; Shah, Mubarak
2013-07-01
This paper proposes a novel representation of articulated human actions, gestures, and facial expressions. The main goals of the proposed approach are: 1) to enable recognition using very few examples, i.e., one-shot or k-shot learning, and 2) meaningful organization of unlabeled datasets by unsupervised clustering. The proposed representation is obtained by automatically discovering high-level subactions or motion primitives through hierarchical clustering of observed optical flow in a four-dimensional spatial and motion flow space. The completely unsupervised method, in contrast to state-of-the-art representations like bag of video words, provides a meaningful representation conducive to visual interpretation and textual labeling. Each primitive action depicts an atomic subaction, like directional motion of a limb or torso, and is represented by a mixture of four-dimensional Gaussian distributions. For one-shot and k-shot learning, the sequence of primitive labels discovered in a test video is labeled using KL divergence, and can then be represented as a string and matched against similar strings of training videos. The same sequence can also be collapsed into a histogram of primitives or be used to learn a Hidden Markov model to represent classes. We have performed extensive experiments on recognition by one-shot and k-shot learning as well as unsupervised action clustering on six human action and gesture datasets, a composite dataset, and a database of facial expressions. These experiments confirm the validity and discriminative nature of the proposed representation.
Integrative sparse principal component analysis of gene expression data.
Liu, Mengque; Fan, Xinyan; Fang, Kuangnan; Zhang, Qingzhao; Ma, Shuangge
2017-12-01
In the analysis of gene expression data, dimension reduction techniques have been extensively adopted. The most popular one is perhaps the PCA (principal component analysis). To generate more reliable and more interpretable results, the SPCA (sparse PCA) technique has been developed. With the "small sample size, high dimensionality" characteristic of gene expression data, the analysis results generated from a single dataset are often unsatisfactory. Under contexts other than dimension reduction, integrative analysis techniques, which jointly analyze the raw data of multiple independent datasets, have been developed and shown to outperform "classic" meta-analysis, other multi-dataset techniques, and single-dataset analysis. In this study, we conduct integrative analysis by developing the iSPCA (integrative SPCA) method. iSPCA achieves the selection and estimation of sparse loadings using a group penalty. To take advantage of the similarity across datasets and generate more accurate results, we further impose contrasted penalties. Different penalties are proposed to accommodate different data conditions. Extensive simulations show that iSPCA outperforms the alternatives under a wide spectrum of settings. The analysis of breast cancer and pancreatic cancer data further shows iSPCA's satisfactory performance. © 2017 WILEY PERIODICALS, INC.
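For context, the single-dataset SPCA baseline that iSPCA extends can be sketched with scikit-learn as below; the integrative group and contrasted penalties that define iSPCA are not reproduced, and the synthetic data shape and penalty value are assumptions.

```python
# Single-dataset sparse PCA sketch (scikit-learn): recover sparse loadings from
# synthetic "small n, large p" data.
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
n, p = 60, 500                                   # small sample, high dimension
loadings = np.zeros((2, p)); loadings[0, :20] = 1.0; loadings[1, 20:40] = 1.0
scores = rng.normal(size=(n, 2))
X = scores @ loadings + 0.3 * rng.normal(size=(n, p))

spca = SparsePCA(n_components=2, alpha=1.0, random_state=0)
spca.fit(X - X.mean(axis=0))                     # center before factorization

nonzero = (spca.components_ != 0).sum(axis=1)
print("nonzero loadings per component:", nonzero)   # ideally concentrated near the true 20
```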
Benchmark datasets for 3D MALDI- and DESI-imaging mass spectrometry.
Oetjen, Janina; Veselkov, Kirill; Watrous, Jeramie; McKenzie, James S; Becker, Michael; Hauberg-Lotte, Lena; Kobarg, Jan Hendrik; Strittmatter, Nicole; Mróz, Anna K; Hoffmann, Franziska; Trede, Dennis; Palmer, Andrew; Schiffler, Stefan; Steinhorst, Klaus; Aichler, Michaela; Goldin, Robert; Guntinas-Lichius, Orlando; von Eggeling, Ferdinand; Thiele, Herbert; Maedler, Kathrin; Walch, Axel; Maass, Peter; Dorrestein, Pieter C; Takats, Zoltan; Alexandrov, Theodore
2015-01-01
Three-dimensional (3D) imaging mass spectrometry (MS) is an analytical chemistry technique for the 3D molecular analysis of a tissue specimen, entire organ, or microbial colonies on an agar plate. 3D-imaging MS has unique advantages over existing 3D imaging techniques, offers novel perspectives for understanding the spatial organization of biological processes, and has growing potential to be introduced into routine use in both biology and medicine. Owing to the sheer quantity of data generated, the visualization, analysis, and interpretation of 3D imaging MS data remain a significant challenge. Bioinformatics research in this field is hampered by the lack of publicly available benchmark datasets needed to evaluate and compare algorithms. High-quality 3D imaging MS datasets from different biological systems were acquired at several labs, supplied with overview images and scripts demonstrating how to read them, and deposited into MetaboLights, an open repository for metabolomics data. 3D imaging MS data were collected from five samples using two types of 3D imaging MS. 3D matrix-assisted laser desorption/ionization (MALDI) imaging MS data were collected from murine pancreas, murine kidney, human oral squamous cell carcinoma, and interacting microbial colonies cultured in Petri dishes. 3D desorption electrospray ionization (DESI) imaging MS data were collected from a human colorectal adenocarcinoma. With the aim of stimulating computational research in the field of computational 3D imaging MS, selected high-quality 3D imaging MS datasets are provided that can be used by algorithm developers as benchmark datasets.
Yang, Jie; McArdle, Conor; Daniels, Stephen
2014-01-01
A new data dimension-reduction method, called Internal Information Redundancy Reduction (IIRR), is proposed for application to Optical Emission Spectroscopy (OES) datasets obtained from industrial plasma processes. In a semiconductor manufacturing environment, for example, real-time spectral emission data are potentially very useful for inferring information about critical process parameters such as wafer etch rates; however, the relationship between the spectral sensor data gathered over the duration of an etching process step and the target process output parameters is complex. OES sensor data have high dimensionality (fine wavelength resolution is required in spectral emission measurements in order to capture data on all chemical species involved in plasma reactions) and full spectrum samples are taken at frequent time points, so that dynamic process changes can be captured. To maximise the utility of the gathered dataset, it is essential that information redundancy is minimised, with the important requirement that the resulting reduced dataset remains in a form that is amenable to direct interpretation of the physical process. To meet this requirement and to achieve a high reduction in dimension with little information loss, the IIRR method proposed in this paper operates directly in the original variable space, identifying peak wavelength emissions and the correlative relationships between them. A new statistic, the Mean Determination Ratio (MDR), is proposed to quantify the information loss after dimension reduction, and the effectiveness of IIRR is demonstrated using an actual semiconductor manufacturing dataset. As an example of the application of IIRR in process monitoring/control, we also show how etch rates can be accurately predicted from IIRR dimension-reduced spectral data. PMID:24451453
McTwo: a two-step feature selection algorithm based on maximal information coefficient.
Ge, Ruiquan; Zhou, Manli; Luo, Youxi; Meng, Qinghan; Mai, Guoqin; Ma, Dongli; Wang, Guoqing; Zhou, Fengfeng
2016-03-23
High-throughput bio-OMIC technologies are producing high-dimensional data from bio-samples at an ever-increasing rate, whereas the number of training samples in a traditional experiment remains small due to various difficulties. This "large p, small n" paradigm in the area of biomedical "big data" may be at least partly addressed by feature selection algorithms, which select only features significantly associated with phenotypes. Feature selection is an NP-hard problem. Due to the exponentially increasing time required to find the globally optimal solution, all existing feature selection algorithms employ heuristic rules to find locally optimal solutions, and their solutions achieve different performances on different datasets. This work describes a feature selection algorithm based on a recently published correlation measure, the Maximal Information Coefficient (MIC). The proposed algorithm, McTwo, aims to select features that are associated with phenotypes and independent of each other, while achieving high classification performance with the nearest neighbor algorithm. Based on a comparative study of 17 datasets, McTwo performs about as well as or better than existing algorithms, with significantly reduced numbers of selected features. The features selected by McTwo also appear to have particular biomedical relevance to the phenotypes according to the literature. McTwo selects a feature subset with very good classification performance as well as a small feature number, and so may represent a complementary feature selection algorithm for high-dimensional biomedical datasets.
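A rough, hedged stand-in for the filtering idea is sketched below: scikit-learn's mutual information estimator replaces MIC (which would normally come from a dedicated library), and a plain top-k filter replaces McTwo's redundancy-aware two-step search; the data, feature counts and k are all illustrative assumptions.

```python
# Filter-style feature selection sketch on synthetic "large p, small n" data,
# scored by mutual information and evaluated with a nearest-neighbor classifier
# (McTwo likewise evaluates candidate subsets with nearest neighbor).
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 80, 2000
X = rng.normal(size=(n, p))
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)   # only 3 informative features

mi = mutual_info_classif(X, y, random_state=0)
top = np.argsort(mi)[::-1][:5]                      # keep the 5 highest-scoring features

knn = KNeighborsClassifier(n_neighbors=1)
acc_all = cross_val_score(knn, X, y, cv=5).mean()
acc_sel = cross_val_score(knn, X[:, top], y, cv=5).mean()
print(f"selected features: {top}, accuracy all={acc_all:.2f} vs selected={acc_sel:.2f}")
```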
RchyOptimyx: Cellular Hierarchy Optimization for Flow Cytometry
Aghaeepour, Nima; Jalali, Adrin; O’Neill, Kieran; Chattopadhyay, Pratip K.; Roederer, Mario; Hoos, Holger H.; Brinkman, Ryan R.
2013-01-01
Analysis of high-dimensional flow cytometry datasets can reveal novel cell populations with poorly understood biology. Following discovery, characterization of these populations in terms of the critical markers involved is an important step, as this can help to both better understand the biology of these populations and aid in designing simpler marker panels to identify them on simpler instruments and with fewer reagents (i.e., in resource poor or highly regulated clinical settings). However, current tools to design panels based on the biological characteristics of the target cell populations work exclusively based on technical parameters (e.g., instrument configurations, spectral overlap, and reagent availability). To address this shortcoming, we developed RchyOptimyx (cellular hieraRCHY OPTIMization), a computational tool that constructs cellular hierarchies by combining automated gating with dynamic programming and graph theory to provide the best gating strategies to identify a target population to a desired level of purity or correlation with a clinical outcome, using the simplest possible marker panels. RchyOptimyx can assess and graphically present the trade-offs between marker choice and population specificity in high-dimensional flow or mass cytometry datasets. We present three proof-of-concept use cases for RchyOptimyx that involve 1) designing a panel of surface markers for identification of rare populations that are primarily characterized using their intracellular signature; 2) simplifying the gating strategy for identification of a target cell population; 3) identification of a non-redundant marker set to identify a target cell population. PMID:23044634
Estimating Mixture of Gaussian Processes by Kernel Smoothing
Huang, Mian; Li, Runze; Wang, Hansheng; Yao, Weixin
2014-01-01
When the functional data are not homogeneous, e.g., there exist multiple classes of functional curves in the dataset, traditional estimation methods may fail. In this paper, we propose a new estimation procedure for the Mixture of Gaussian Processes, to incorporate both functional and inhomogeneous properties of the data. Our method can be viewed as a natural extension of high-dimensional normal mixtures. However, the key difference is that smoothed structures are imposed for both the mean and covariance functions. The model is shown to be identifiable, and can be estimated efficiently by a combination of the ideas from EM algorithm, kernel regression, and functional principal component analysis. Our methodology is empirically justified by Monte Carlo simulations and illustrated by an analysis of a supermarket dataset. PMID:24976675
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dayman, Ken J; Ade, Brian J; Weber, Charles F
High-dimensional, nonlinear function estimation using large datasets is a current area of interest in the machine learning community, and applications may be found throughout the analytical sciences, where ever-growing datasets are making more information available to the analyst. In this paper, we leverage the existing relevance vector machine, a sparse Bayesian version of the well-studied support vector machine, and expand the method to include integrated feature selection and automatic function shaping. These innovations produce an algorithm that is able to distinguish variables that are useful for making predictions of a response from variables that are unrelated or confusing. We test the technology using synthetic data, conduct initial performance studies, and develop a model capable of making position-independent predictions of the core-averaged burnup using a single specimen drawn randomly from a nuclear reactor core.
Categorical dimensions of human odor descriptor space revealed by non-negative matrix factorization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chennubhotla, Chakra; Castro, Jason
2013-01-01
In contrast to most other sensory modalities, the basic perceptual dimensions of olfaction remain unclear. Here, we use non-negative matrix factorization (NMF) - a dimensionality reduction technique - to uncover structure in a panel of odor profiles, with each odor defined as a point in multi-dimensional descriptor space. The properties of NMF are favorable for the analysis of such lexical and perceptual data, and lead to a high-dimensional account of odor space. We further provide evidence that odor dimensions apply categorically. That is, odor space is not occupied homogenously, but rather in a discrete and intrinsically clustered manner. We discuss the potential implications of these results for the neural coding of odors, as well as for developing classifiers on larger datasets that may be useful for predicting perceptual qualities from chemical structures.
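The factorization step itself can be sketched with scikit-learn on a synthetic nonnegative odor-by-descriptor matrix, as below; the descriptor panel, the number of components and the "dominance" check are illustrative assumptions and do not reproduce the study's analysis.

```python
# Minimal NMF sketch on a synthetic nonnegative odor-by-descriptor matrix.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
n_odors, n_descriptors, k_true = 150, 140, 10
W_true = rng.gamma(2.0, 1.0, size=(n_odors, k_true))
H_true = rng.gamma(2.0, 1.0, size=(k_true, n_descriptors))
profiles = W_true @ H_true + rng.gamma(1.0, 0.1, size=(n_odors, n_descriptors))

model = NMF(n_components=10, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(profiles)     # odor weights on each latent dimension
H = model.components_                 # descriptor loadings of each dimension

# Crude check of "categorical" usage: how concentrated each odor's weight is
# on its single strongest dimension.
dominance = W.max(axis=1) / W.sum(axis=1)
print("mean dominance of the top dimension per odor:", dominance.mean().round(2))
```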
Tensor Train Neighborhood Preserving Embedding
NASA Astrophysics Data System (ADS)
Wang, Wenqi; Aggarwal, Vaneet; Aeron, Shuchin
2018-05-01
In this paper, we propose a Tensor Train Neighborhood Preserving Embedding (TTNPE) to embed multi-dimensional tensor data into a low-dimensional tensor subspace. Novel approaches to solve the optimization problem in TTNPE are proposed. For this embedding, we evaluate the trade-off gained among classification, computation, and dimensionality reduction (storage) for supervised learning. It is shown that, compared to state-of-the-art tensor embedding methods, TTNPE achieves a superior trade-off among classification, computation, and dimensionality reduction on the MNIST handwritten digits and Weizmann face datasets.
Li, Jia; Xia, Changqun; Chen, Xiaowu
2017-10-12
Image-based salient object detection (SOD) has been extensively studied in past decades. However, video-based SOD is much less explored due to the lack of large-scale video datasets within which salient objects are unambiguously defined and annotated. Toward this end, this paper proposes a video-based SOD dataset that consists of 200 videos. In constructing the dataset, we manually annotate all objects and regions over 7,650 uniformly sampled keyframes and collect the eye-tracking data of 23 subjects who free-view all videos. From the user data, we find that salient objects in a video can be defined as objects that consistently pop out throughout the video, and objects with such attributes can be unambiguously annotated by combining manually annotated object/region masks with eye-tracking data of multiple subjects. To the best of our knowledge, it is currently the largest dataset for video-based salient object detection. Based on this dataset, this paper proposes an unsupervised baseline approach for video-based SOD using saliency-guided stacked autoencoders. In the proposed approach, multiple spatiotemporal saliency cues are first extracted at the pixel, superpixel and object levels. With these saliency cues, stacked autoencoders are constructed in an unsupervised manner that automatically infers a saliency score for each pixel by progressively encoding the high-dimensional saliency cues gathered from the pixel and its spatiotemporal neighbors. In experiments, the proposed unsupervised approach is compared with 31 state-of-the-art models on the proposed dataset and outperforms 30 of them, including 19 image-based classic (unsupervised or non-deep learning) models, six image-based deep learning models, and five video-based unsupervised models. Moreover, benchmarking results show that the proposed dataset is very challenging and has the potential to boost the development of video-based SOD.
Park, Seong Ho; Han, Kyunghwa
2018-03-01
The use of artificial intelligence in medicine is currently an issue of great interest, especially with regard to the diagnostic or predictive analysis of medical images. Adoption of an artificial intelligence tool in clinical practice requires careful confirmation of its clinical utility. Herein, the authors explain key methodology points involved in a clinical evaluation of artificial intelligence technology for use in medicine, especially high-dimensional or overparameterized diagnostic or predictive models in which artificial deep neural networks are used, mainly from the standpoints of clinical epidemiology and biostatistics. First, statistical methods for assessing the discrimination and calibration performances of a diagnostic or predictive model are summarized. Next, the effects of disease manifestation spectrum and disease prevalence on performance results are explained. This is followed by a discussion of the difference between evaluating performance with internal versus external datasets, the importance of using an adequate external dataset obtained from a well-defined clinical cohort to avoid overestimating clinical performance as a result of overfitting in high-dimensional or overparameterized classification models and spectrum bias, and the essentials for achieving a more robust clinical evaluation. Finally, the authors review the role of clinical trials and observational outcome studies in the ultimate clinical verification of diagnostic or predictive artificial intelligence tools through patient outcomes, beyond performance metrics, and how to design such studies. © RSNA, 2018.
Cannistraci, Carlo Vittorio; Ravasi, Timothy; Montevecchi, Franco Maria; Ideker, Trey; Alessio, Massimo
2010-09-15
Nonlinear small datasets, which are characterized by low numbers of samples and very high numbers of measures, occur frequently in computational biology and pose problems for their investigation. Unsupervised hybrid two-phase (H2P) procedures, specifically dimension reduction (DR) coupled with clustering, provide valuable assistance, not only for unsupervised data classification, but also for visualization of the patterns hidden in high-dimensional feature space. 'Minimum Curvilinearity' (MC) is a principle that, for small datasets, suggests approximating curvilinear sample distances in the feature space by pairwise distances over their minimum spanning tree (MST), and thus avoids the introduction of any tuning parameter. MC is used to design two novel forms of nonlinear machine learning (NML): Minimum Curvilinear embedding (MCE) for DR, and Minimum Curvilinear affinity propagation (MCAP) for clustering. Compared with several other unsupervised and supervised algorithms, MCE and MCAP, whether individually or combined in H2P, overcome the limits of classical approaches. High performance was attained in the visualization and classification of (i) pain patients (proteomic measurements) in peripheral neuropathy and (ii) human organ tissues (genomic transcription factor measurements) on the basis of their embryological origin. MC provides a valuable framework to estimate nonlinear distances in small datasets. Its extension to large datasets is prefigured for novel NMLs. Classification of neuropathic pain by proteomic profiles offers new insights for future molecular and systems biology characterization of pain. Improvements in tissue embryological classification refine results obtained in an earlier study, and suggest a possible reinterpretation of skin attribution as mesodermal. https://sites.google.com/site/carlovittoriocannistraci/home.
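The Minimum Curvilinearity distance itself is easy to sketch: pairwise Euclidean distances are replaced by path lengths over the samples' minimum spanning tree, and the resulting distance matrix can then be embedded. In the sketch below, classical metric MDS stands in for the centring choices of MCE, and the synthetic 50-dimensional noisy curve is an illustrative assumption.

```python
# Sketch of Minimum Curvilinearity distances: shortest-path lengths over the
# minimum spanning tree of the samples, followed by a low-dimensional embedding.
import numpy as np
from scipy.spatial.distance import squareform, pdist
from scipy.sparse.csgraph import minimum_spanning_tree, shortest_path
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
# Small nonlinear dataset: few samples, many features (noisy curve in 50-D).
t = np.sort(rng.uniform(0, 3 * np.pi, 40))
X = np.column_stack([np.cos(t), np.sin(t)]) @ rng.normal(size=(2, 50))
X += 0.05 * rng.normal(size=X.shape)

D = squareform(pdist(X))                       # Euclidean distances
mst = minimum_spanning_tree(D)                 # sparse MST over the samples
mc_dist = shortest_path(mst, directed=False)   # curvilinear (over-MST) distances

embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(mc_dist)
print(embedding.shape)                         # (40, 2) low-dimensional coordinates
```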
NASA Astrophysics Data System (ADS)
Shiangjen, Kanokwatt; Chaijaruwanich, Jeerayut; Srisujjalertwaja, Wijak; Unachak, Prakarn; Somhom, Samerkae
2018-02-01
This article presents an efficient heuristic placement algorithm, namely, a bidirectional heuristic placement, for solving the two-dimensional rectangular knapsack packing problem. The heuristic demonstrates ways to maximize space utilization by fitting the appropriate rectangle from both sides of the wall of the current residual space layer by layer. The iterative local search along with a shift strategy is developed and applied to the heuristic to balance the exploitation and exploration tasks in the solution space without the tuning of any parameters. The experimental results on many scales of packing problems show that this approach can produce high-quality solutions for most of the benchmark datasets, especially for large-scale problems, within a reasonable duration of computational time.
shinyheatmap: Ultra fast low memory heatmap web interface for big data genomics.
Khomtchouk, Bohdan B; Hennessy, James R; Wahlestedt, Claes
2017-01-01
Transcriptomics, metabolomics, metagenomics, and other various next-generation sequencing (-omics) fields are known for their production of large datasets, especially across single-cell sequencing studies. Visualizing such big data has posed technical challenges in biology, both in terms of available computational resources as well as programming acumen. Since heatmaps are used to depict high-dimensional numerical data as a colored grid of cells, efficiency and speed have often proven to be critical considerations in the process of successfully converting data into graphics. For example, rendering interactive heatmaps from large input datasets (e.g., 100k+ rows) has been computationally infeasible on both desktop computers and web browsers. In addition to memory requirements, programming skills and knowledge have frequently been barriers-to-entry for creating highly customizable heatmaps. We propose shinyheatmap: an advanced user-friendly heatmap software suite capable of efficiently creating highly customizable static and interactive biological heatmaps in a web browser. shinyheatmap is a low memory footprint program, making it particularly well-suited for the interactive visualization of extremely large datasets that cannot typically be computed in-memory due to size restrictions. Also, shinyheatmap features a built-in high performance web plug-in, fastheatmap, for rapidly plotting interactive heatmaps of datasets as large as 10^5-10^7 rows within seconds, effectively shattering previous performance benchmarks of heatmap rendering speed. shinyheatmap is hosted online as a freely available web server with an intuitive graphical user interface: http://shinyheatmap.com. The methods are implemented in R, and are available as part of the shinyheatmap project at: https://github.com/Bohdan-Khomtchouk/shinyheatmap. Users can access fastheatmap directly from within the shinyheatmap web interface, and all source code has been made publicly available on Github: https://github.com/Bohdan-Khomtchouk/fastheatmap.
NASA Astrophysics Data System (ADS)
Ndaw, Joseph D.; Faye, Andre; Maïga, Amadou S.
2017-05-01
Artificial neural networks (ANN)-based models are efficient ways of source localisation. However very large training sets are needed to precisely estimate two-dimensional Direction of arrival (2D-DOA) with ANN models. In this paper we present a fast artificial neural network approach for 2D-DOA estimation with reduced training set sizes. We exploit the symmetry properties of Uniform Circular Arrays (UCA) to build two different datasets for elevation and azimuth angles. Learning Vector Quantisation (LVQ) neural networks are then sequentially trained on each dataset to separately estimate elevation and azimuth angles. A multilevel training process is applied to further reduce the training set sizes.
Kavakiotis, Ioannis; Samaras, Patroklos; Triantafyllidis, Alexandros; Vlahavas, Ioannis
2017-11-01
Single Nucleotide Polymorphisms (SNPs) are, nowadays, becoming the marker of choice for biological analyses involving a wide range of applications with great medical, biological, economic and environmental interest. Classification tasks, i.e. the assignment of individuals to groups of origin based on their (multi-locus) genotypes, are performed in many fields such as forensic investigations, discrimination between wild and/or farmed populations and others. These tasks should be performed with a small number of loci, for computational as well as biological reasons. Thus, feature selection should precede classification tasks, especially for Single Nucleotide Polymorphism (SNP) datasets, where the number of features can amount to hundreds of thousands or millions. In this paper, we present a novel data mining approach, called FIFS - Frequent Item Feature Selection, based on the use of frequent items for selection of the most informative markers from population genomic data. It is a modular method, consisting of two main components. The first one identifies the most frequent and unique genotypes for each sampled population. The second one selects the most appropriate among them, in order to create the informative SNP subsets to be returned. The proposed method (FIFS) was tested on a real dataset, which comprised a comprehensive coverage of pig breed types present in Britain. This dataset consisted of 446 individuals divided in 14 sub-populations, genotyped at 59,436 SNPs. Our method outperforms the state-of-the-art and baseline methods in every case. More specifically, our method surpassed the assignment accuracy threshold of 95% needing only half the number of SNPs selected by other methods (FIFS: 28 SNPs; Delta: 70 SNPs; pairwise FST: 70 SNPs; In: 100 SNPs). CONCLUSION: Our approach successfully deals with the problem of informative marker selection in high dimensional genomic datasets. It offers better results compared to existing approaches and can aid biologists in selecting the most informative markers with maximum discrimination power for optimization of cost-effective panels with applications related to e.g. species identification, wildlife management, and forensics. Copyright © 2017 Elsevier Ltd. All rights reserved.
Dutt-Mazumder, Aviroop; Button, Chris; Robins, Anthony; Bartlett, Roger
2011-12-01
Recent studies have explored the organization of player movements in team sports using a range of statistical tools. However, the factors that best explain the performance of association football teams remain elusive. Arguably, this is due to the high-dimensional behavioural outputs that illustrate the complex, evolving configurations typical of team games. According to dynamical system analysts, movement patterns in team sports exhibit nonlinear self-organizing features. Nonlinear processing tools (i.e. Artificial Neural Networks; ANNs) are becoming increasingly popular to investigate the coordination of participants in sports competitions. ANNs are well suited to describing high-dimensional data sets with nonlinear attributes; however, limited information concerning the processes required to apply ANNs exists. This review investigates the relative value of various ANN learning approaches used in sports performance analysis of team sports focusing on potential applications for association football. Sixty-two research sources were summarized and reviewed from electronic literature search engines such as SPORTDiscus, Google Scholar, IEEE Xplore, Scirus, ScienceDirect and Elsevier. Typical ANN learning algorithms can be adapted to perform pattern recognition and pattern classification. Particularly, dimensionality reduction by a Kohonen feature map (KFM) can compress chaotic high-dimensional datasets into low-dimensional relevant information. Such information would be useful for developing effective training drills that should enhance self-organizing coordination among players. We conclude that ANN-based qualitative analysis is a promising approach to understand the dynamical attributes of association football players.
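To make the KFM compression step concrete, the following is an illustrative sketch only (not taken from the review): it uses the third-party minisom package and synthetic placeholder data standing in for player-position frames.

```python
# Compressing high-dimensional match frames with a Kohonen feature map (SOM).
# Requires the third-party package 'minisom' (pip install minisom).
import numpy as np
from minisom import MiniSom

frames = np.random.rand(5000, 44)          # hypothetical: 5000 frames x 22 players x (x, y)
som = MiniSom(10, 10, input_len=44, sigma=1.5, learning_rate=0.5, random_seed=0)
som.random_weights_init(frames)
som.train_random(frames, num_iteration=10000)

# Each frame is summarised by its best-matching unit on the 10x10 map,
# reducing 44 dimensions to a 2-D grid coordinate for pattern analysis.
bmus = np.array([som.winner(f) for f in frames])
print(bmus[:5])
```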
SciSpark's SRDD : A Scientific Resilient Distributed Dataset for Multidimensional Data
NASA Astrophysics Data System (ADS)
Palamuttam, R. S.; Wilson, B. D.; Mogrovejo, R. M.; Whitehall, K. D.; Mattmann, C. A.; McGibbney, L. J.; Ramirez, P.
2015-12-01
Remote sensing data and climate model output are multi-dimensional arrays of massive sizes locked away in heterogeneous file formats (HDF5/4, NetCDF 3/4) and metadata models (HDF-EOS, CF) making it difficult to perform multi-stage, iterative science processing since each stage requires writing and reading data to and from disk. We have developed SciSpark, a robust Big Data framework, that extends Apache™ Spark for scaling scientific computations. Apache Spark improves the map-reduce implementation in Apache™ Hadoop for parallel computing on a cluster, by emphasizing in-memory computation, "spilling" to disk only as needed, and relying on lazy evaluation. Central to Spark is the Resilient Distributed Dataset (RDD), an in-memory distributed data structure that extends the functional paradigm provided by the Scala programming language. However, RDDs are ideal for tabular or unstructured data, and not for highly dimensional data. The SciSpark project introduces the Scientific Resilient Distributed Dataset (sRDD), a distributed-computing array structure which supports iterative scientific algorithms for multidimensional data. SciSpark processes data stored in NetCDF and HDF files by partitioning them across time or space and distributing the partitions among a cluster of compute nodes. We show usability and extensibility of SciSpark by implementing distributed algorithms for geospatial operations on large collections of multi-dimensional grids. In particular we address the problem of scaling an automated method for finding Mesoscale Convective Complexes. SciSpark provides a tensor interface to support the pluggability of different matrix libraries. We evaluate performance of the various matrix libraries in distributed pipelines, such as Nd4j™ and Breeze™. We detail the architecture and design of SciSpark, our efforts to integrate climate science algorithms, parallel ingest and partitioning (sharding) of A-Train satellite observations from model grids. These solutions are encompassed in SciSpark, an open-source software framework for distributed computing on scientific data.
GPU-powered Shotgun Stochastic Search for Dirichlet process mixtures of Gaussian Graphical Models
Mukherjee, Chiranjit; Rodriguez, Abel
2016-01-01
Gaussian graphical models are popular for modeling high-dimensional multivariate data with sparse conditional dependencies. A mixture of Gaussian graphical models extends this model to the more realistic scenario where observations come from a heterogenous population composed of a small number of homogeneous sub-groups. In this paper we present a novel stochastic search algorithm for finding the posterior mode of high-dimensional Dirichlet process mixtures of decomposable Gaussian graphical models. Further, we investigate how to harness the massive thread-parallelization capabilities of graphical processing units to accelerate computation. The computational advantages of our algorithms are demonstrated with various simulated data examples in which we compare our stochastic search with a Markov chain Monte Carlo algorithm in moderate dimensional data examples. These experiments show that our stochastic search largely outperforms the Markov chain Monte Carlo algorithm in terms of computing-times and in terms of the quality of the posterior mode discovered. Finally, we analyze a gene expression dataset in which Markov chain Monte Carlo algorithms are too slow to be practically useful. PMID:28626348
Flocks, James
2006-01-01
Scientific knowledge from the past century is commonly represented by two-dimensional figures and graphs, as presented in manuscripts and maps. Using today's computer technology, this information can be extracted and projected into three- and four-dimensional perspectives. Computer models can be applied to datasets to provide additional insight into complex spatial and temporal systems. This process can be demonstrated by applying digitizing and modeling techniques to valuable information within widely used publications. The seminal paper by D. Frazier, published in 1967, identified 16 separate delta lobes formed by the Mississippi River during the past 6,000 yrs. The paper includes stratigraphic descriptions through geologic cross-sections, and provides distribution and chronologies of the delta lobes. The data from Frazier's publication are extensively referenced in the literature. Additional information can be extracted from the data through computer modeling. Digitizing and geo-rectifying Frazier's geologic cross-sections produce a three-dimensional perspective of the delta lobes. Adding the chronological data included in the report provides the fourth-dimension of the delta cycles, which can be visualized through computer-generated animation. Supplemental information can be added to the model, such as post-abandonment subsidence of the delta-lobe surface. Analyzing the regional, net surface-elevation balance between delta progradations and land subsidence is computationally intensive. By visualizing this process during the past 4,500 yrs through multi-dimensional animation, the importance of sediment compaction in influencing both the shape and direction of subsequent delta progradations becomes apparent. Visualization enhances a classic dataset, and can be further refined using additional data, as well as provide a guide for identifying future areas of study.
Pataky, Todd C; Vanrenterghem, Jos; Robinson, Mark A
2016-06-14
A false positive is the mistake of inferring an effect when none exists, and although α controls the false positive (Type I error) rate in classical hypothesis testing, a given α value is accurate only if the underlying model of randomness appropriately reflects experimentally observed variance. Hypotheses pertaining to one-dimensional (1D) (e.g. time-varying) biomechanical trajectories are most often tested using a traditional zero-dimensional (0D) Gaussian model of randomness, but variance in these datasets is clearly 1D. The purpose of this study was to determine the likelihood that analyzing smooth 1D data with a 0D model of variance will produce false positives. We first used random field theory (RFT) to predict the probability of false positives in 0D analyses. We then validated RFT predictions via numerical simulations of smooth Gaussian 1D trajectories. Results showed that, across a range of public kinematic, force/moment and EMG datasets, the median false positive rate was 0.382 and not the assumed α=0.05, even for a simple two-sample t test involving N=10 trajectories per group. The median false positive rate for experiments involving three-component vector trajectories was p=0.764. This rate increased to p=0.945 for two three-component vector trajectories, and to p=0.999 for six three-component vectors. This implies that experiments involving vector trajectories have a high probability of yielding 0D statistical significance when there is, in fact, no 1D effect. Either (a) explicit a priori identification of 0D variables or (b) adoption of 1D methods can more tightly control α. Copyright © 2016 Elsevier Ltd. All rights reserved.
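A rough re-creation of the kind of simulation described above (the smoothness level, sample sizes, and trajectory length are assumptions): smooth one-dimensional Gaussian trajectories are analysed point-wise with 0D two-sample t tests, counting any experiment with at least one significant point as a false positive.

```python
# Simulating the inflation of the 0D false positive rate on smooth 1D data.
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_experiments, n_per_group, n_points, smooth_sd = 1000, 10, 101, 10
false_positives = 0
for _ in range(n_experiments):
    a = gaussian_filter1d(rng.standard_normal((n_per_group, n_points)), smooth_sd, axis=1)
    b = gaussian_filter1d(rng.standard_normal((n_per_group, n_points)), smooth_sd, axis=1)
    _, p = ttest_ind(a, b, axis=0)         # a 0D test at every time point
    false_positives += np.any(p < 0.05)    # "an effect" is declared if any point is significant
print(false_positives / n_experiments)     # substantially above the nominal alpha = 0.05
```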
Music Signal Processing Using Vector Product Neural Networks
NASA Astrophysics Data System (ADS)
Fan, Z. C.; Chan, T. S.; Yang, Y. H.; Jang, J. S. R.
2017-05-01
We propose a novel neural network model for music signal processing using vector product neurons and dimensionality transformations. Here, the inputs are first mapped from real values into three-dimensional vectors then fed into a three-dimensional vector product neural network where the inputs, outputs, and weights are all three-dimensional values. Next, the final outputs are mapped back to the reals. Two methods for dimensionality transformation are proposed, one via context windows and the other via spectral coloring. Experimental results on the iKala dataset for blind singing voice separation confirm the efficacy of our model.
Hearing Nano-Structures: A Case Study in Timbral Sonification
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schedel, M.; Yager, K.
2012-06-18
We explore the sonification of x-ray scattering data, which are two-dimensional arrays of intensity whose meaning is obscure and non-intuitive. Direct mapping of the experimental data into sound is found to produce timbral sonifications that, while sacrificing conventional aesthetic appeal, provide a rich auditory landscape for exploration. We discuss the optimization of sonification variables, and speculate on potential real-world applications. We have presented a case study of sonifying x-ray scattering data. Direct mapping of the two-dimensional intensity values of a scattering dataset into the two-dimensional matrix of a sonogram is a natural and information-preserving operation that creates rich sounds. Our work supports the notion that many problems in understanding rather abstract scientific datasets can be ameliorated by adding the auditory modality of sonification. We further emphasize that sonification need not be limited to time-series data: any data matrix is amenable. Timbral sonification is less obviously aesthetic than tonal sonifications, which generate melody, harmony, or rhythm. However, these musical sonifications necessarily sacrifice information content for beauty. Timbral sonification is useful because the entire dataset is represented. Non-musicians can understand the data through the overall color of the sound; audio experts can extract more detailed insight by studying all the features of the sound.
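One possible direct mapping can be sketched as follows; this is an assumption-laden illustration, not the authors' mapping: the 2D intensity array is treated as a magnitude spectrogram and inverted to audio under a zero-phase assumption, with arbitrary sampling rate and window length.

```python
# Treating a 2D intensity matrix as a sonogram and rendering it to audio.
import numpy as np
from scipy.signal import istft
from scipy.io import wavfile

intensity = np.random.rand(257, 400)             # hypothetical detector image (frequency bins x frames)
spectrogram = intensity.astype(np.complex128)    # zero-phase complex spectrogram
fs = 22050                                       # assumed audio sampling rate
_, audio = istft(spectrogram, fs=fs, nperseg=512)
audio /= np.max(np.abs(audio)) + 1e-12           # normalise to [-1, 1]
wavfile.write("sonification.wav", fs, audio.astype(np.float32))
```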
Zeng, Rongping; Petrick, Nicholas; Gavrielides, Marios A; Myers, Kyle J
2011-10-07
Multi-slice computed tomography (MSCT) scanners have become popular volumetric imaging tools. Deterministic and random properties of the resulting CT scans have been studied in the literature. Due to the large number of voxels in the three-dimensional (3D) volumetric dataset, full characterization of the noise covariance in MSCT scans is difficult to tackle. However, as usage of such datasets for quantitative disease diagnosis grows, so does the importance of understanding the noise properties because of their effect on the accuracy of the clinical outcome. The goal of this work is to study noise covariance in the helical MSCT volumetric dataset. We explore possible approximations to the noise covariance matrix with reduced degrees of freedom, including voxel-based variance, one-dimensional (1D) correlation, two-dimensional (2D) in-plane correlation and the noise power spectrum (NPS). We further examine the effect of various noise covariance models on the accuracy of a prewhitening matched filter nodule size estimation strategy. Our simulation results suggest that the 1D longitudinal, 2D in-plane and NPS prewhitening approaches can improve the performance of nodule size estimation algorithms. When taking into account computational costs in determining noise characterizations, the NPS model may be the most efficient approximation to the MSCT noise covariance matrix.
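The NPS approximation mentioned above can be estimated from repeated noise-only regions; the following generic sketch (not the authors' code, with ROI size, pixel spacing, and data as placeholders) shows the ensemble-average |FFT|^2 estimate.

```python
# Estimating a 2D in-plane noise power spectrum (NPS) from repeated noise ROIs.
import numpy as np

rois = np.random.standard_normal((100, 64, 64))   # hypothetical noise-only ROIs (realisations x 64 x 64)
pixel_size = 0.7                                  # mm, assumed in-plane spacing
rois = rois - rois.mean(axis=0)                   # remove the deterministic (mean) component
dft = np.fft.fft2(rois, axes=(-2, -1))
nps = (np.abs(dft) ** 2).mean(axis=0)             # ensemble average of |FFT|^2
nps *= (pixel_size ** 2) / (64 * 64)              # common normalisation to area units
nps = np.fft.fftshift(nps)
print(nps.shape)                                  # 2D NPS usable for prewhitening
```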
Teshima, Tara Lynn; Patel, Vaibhav; Mainprize, James G; Edwards, Glenn; Antonyshyn, Oleh M
2015-07-01
The utilization of three-dimensional modeling technology in craniomaxillofacial surgery has grown exponentially during the last decade. Future development, however, is hindered by the lack of a normative three-dimensional anatomic dataset and a statistical mean three-dimensional virtual model. The purpose of this study is to develop and validate a protocol to generate a statistical three-dimensional virtual model based on a normative dataset of adult skulls. Two hundred adult skull CT images were reviewed. The average three-dimensional skull was computed by processing each CT image in the series using thin-plate spline geometric morphometric protocol. Our statistical average three-dimensional skull was validated by reconstructing patient-specific topography in cranial defects. The experiment was repeated 4 times. In each case, computer-generated cranioplasties were compared directly to the original intact skull. The errors describing the difference between the prediction and the original were calculated. A normative database of 33 adult human skulls was collected. Using 21 anthropometric landmark points, a protocol for three-dimensional skull landmarking and data reduction was developed and a statistical average three-dimensional skull was generated. Our results show the root mean square error (RMSE) for restoration of a known defect using the native best match skull, our statistical average skull, and worst match skull was 0.58, 0.74, and 4.4 mm, respectively. The ability to statistically average craniofacial surface topography will be a valuable instrument for deriving missing anatomy in complex craniofacial defects and deficiencies as well as in evaluating morphologic results of surgery.
Cluster ensemble based on Random Forests for genetic data.
Alhusain, Luluah; Hafez, Alaaeldin M
2017-01-01
Clustering plays a crucial role in several application domains, such as bioinformatics. In bioinformatics, clustering has been extensively used as an approach for detecting interesting patterns in genetic data. One application is population structure analysis, which aims to group individuals into subpopulations based on shared genetic variations, such as single nucleotide polymorphisms. Advances in DNA sequencing technology have facilitated the obtainment of genetic datasets with exceptional sizes. Genetic data usually contain hundreds of thousands of genetic markers genotyped for thousands of individuals, making an efficient means for handling such data desirable. Random Forests (RFs) has emerged as an efficient algorithm capable of handling high-dimensional data. RFs provides a proximity measure that can capture different levels of co-occurring relationships between variables. RFs has been widely considered a supervised learning method, although it can be converted into an unsupervised learning method. Therefore, RF-derived proximity measure combined with a clustering technique may be well suited for determining the underlying structure of unlabeled data. This paper proposes RFcluE, a cluster ensemble approach for determining the underlying structure of genetic data based on RFs. The approach comprises a cluster ensemble framework to combine multiple runs of RF clustering. Experiments were conducted on a high-dimensional, real genetic dataset to evaluate the proposed approach. The experiments included an examination of the impact of parameter changes, comparing RFcluE performance against other clustering methods, and an assessment of the relationship between the diversity and quality of the ensemble and its effect on RFcluE performance. This paper proposes RFcluE, a cluster ensemble approach based on RF clustering to address the problem of population structure analysis and demonstrate the effectiveness of the approach. The paper also illustrates that applying a cluster ensemble approach, combining multiple RF clusterings, produces more robust and higher-quality results as a consequence of feeding the ensemble with diverse views of high-dimensional genetic data obtained through bagging and random subspace, the two key features of the RF algorithm.
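The following is a sketch of the generic "unsupervised random forest" proximity idea only, not the RFcluE ensemble itself; the SNP matrix, permutation-based synthetic contrast, and cluster count are assumptions.

```python
# RF proximity from a real-vs-synthetic contrast, followed by hierarchical clustering.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(300, 2000)).astype(float)    # hypothetical SNP genotypes (0/1/2)
X_syn = np.column_stack([rng.permutation(col) for col in X.T])  # column-permuted synthetic data
data = np.vstack([X, X_syn])
labels = np.r_[np.zeros(len(X)), np.ones(len(X_syn))]

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(data, labels)
leaves = rf.apply(X)                                      # (n_samples, n_trees) leaf indices
prox = np.mean(leaves[:, None, :] == leaves[None, :, :], axis=2)  # share-a-leaf frequency
dist = squareform(1.0 - prox, checks=False)               # condensed distance for clustering
clusters = fcluster(linkage(dist, method="average"), t=5, criterion="maxclust")
print(np.bincount(clusters))
```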
Canessa, Andrea; Gibaldi, Agostino; Chessa, Manuela; Fato, Marco; Solari, Fabio; Sabatini, Silvio P.
2017-01-01
Binocular stereopsis is the ability of a visual system, belonging to a live being or a machine, to interpret the different visual information deriving from two eyes/cameras for depth perception. From this perspective, the ground-truth information about three-dimensional visual space, which is hardly available, is an ideal tool both for evaluating human performance and for benchmarking machine vision algorithms. In the present work, we implemented a rendering methodology in which the camera pose mimics realistic eye pose for a fixating observer, thus including convergent eye geometry and cyclotorsion. The virtual environment we developed relies on highly accurate 3D virtual models, and its full controllability allows us to obtain the stereoscopic pairs together with the ground-truth depth and camera pose information. We thus created a stereoscopic dataset: GENUA PESTO—GENoa hUman Active fixation database: PEripersonal space STereoscopic images and grOund truth disparity. The dataset aims to provide a unified framework useful for a number of problems relevant to human and computer vision, from scene exploration and eye movement studies to 3D scene reconstruction. PMID:28350382
Wong, Gerard; Leckie, Christopher; Kowalczyk, Adam
2012-01-15
Feature selection is a key concept in machine learning for microarray datasets, where features represented by probesets are typically several orders of magnitude larger than the available sample size. Computational tractability is a key challenge for feature selection algorithms in handling very high-dimensional datasets beyond a hundred thousand features, such as in datasets produced on single nucleotide polymorphism microarrays. In this article, we present a novel feature set reduction approach that enables scalable feature selection on datasets with hundreds of thousands of features and beyond. Our approach enables more efficient handling of higher resolution datasets to achieve better disease subtype classification of samples for potentially more accurate diagnosis and prognosis, which allows clinicians to make more informed decisions in regards to patient treatment options. We applied our feature set reduction approach to several publicly available cancer single nucleotide polymorphism (SNP) array datasets and evaluated its performance in terms of its multiclass predictive classification accuracy over different cancer subtypes, its speedup in execution as well as its scalability with respect to sample size and array resolution. Feature Set Reduction (FSR) was able to reduce the dimensions of an SNP array dataset by more than two orders of magnitude while achieving at least equal, and in most cases superior predictive classification performance over that achieved on features selected by existing feature selection methods alone. An examination of the biological relevance of frequently selected features from FSR-reduced feature sets revealed strong enrichment in association with cancer. FSR was implemented in MATLAB R2010b and is available at http://ww2.cs.mu.oz.au/~gwong/FSR.
NASA Astrophysics Data System (ADS)
Cao, Faxian; Yang, Zhijing; Ren, Jinchang; Ling, Wing-Kuen; Zhao, Huimin; Marshall, Stephen
2017-12-01
Although the sparse multinomial logistic regression (SMLR) has provided a useful tool for sparse classification, it suffers from inefficacy in dealing with high dimensional features and manually set initial regressor values. This has significantly constrained its applications for hyperspectral image (HSI) classification. In order to tackle these two drawbacks, an extreme sparse multinomial logistic regression (ESMLR) is proposed for effective classification of HSI. First, the HSI dataset is projected to a new feature space with randomly generated weight and bias. Second, an optimization model is established by the Lagrange multiplier method and the dual principle to automatically determine a good initial regressor for SMLR via minimizing the training error and the regressor value. Furthermore, the extended multi-attribute profiles (EMAPs) are utilized for extracting both the spectral and spatial features. A combinational linear multiple features learning (MFL) method is proposed to further enhance the features extracted by ESMLR and EMAPs. Finally, the logistic regression via the variable splitting and the augmented Lagrangian (LORSAL) is adopted in the proposed framework for reducing the computational time. Experiments are conducted on two well-known HSI datasets, namely the Indian Pines dataset and the Pavia University dataset, which have shown the fast and robust performance of the proposed ESMLR framework.
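As a conceptual stand-in for the random feature projection step described above (not the paper's ESMLR/LORSAL implementation; dimensions, the nonlinearity, and the plain multinomial logistic regression are assumptions), the spectra are pushed through a random weight/bias mapping and then classified.

```python
# Random-projection features followed by multinomial logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.random((1000, 200))                  # hypothetical HSI pixels x spectral bands
y = rng.integers(0, 9, size=1000)            # hypothetical land-cover class labels

n_hidden = 500
W = rng.standard_normal((200, n_hidden))     # randomly generated weights
b = rng.standard_normal(n_hidden)            # randomly generated bias
H = np.tanh(X @ W + b)                       # projected feature space

clf = LogisticRegression(max_iter=1000).fit(H, y)
print(clf.score(H, y))
```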
Enhancing GIS Capabilities for High Resolution Earth Science Grids
NASA Astrophysics Data System (ADS)
Koziol, B. W.; Oehmke, R.; Li, P.; O'Kuinghttons, R.; Theurich, G.; DeLuca, C.
2017-12-01
Applications for high performance GIS will continue to increase as Earth system models pursue more realistic representations of Earth system processes. Finer spatial resolution model input and output, unstructured or irregular modeling grids, data assimilation, and regional coordinate systems present novel challenges for GIS frameworks operating in the Earth system modeling domain. This presentation provides an overview of two GIS-driven applications that combine high performance software with big geospatial datasets to produce value-added tools for the modeling and geoscientific community. First, a large-scale interpolation experiment using National Hydrography Dataset (NHD) catchments, a high resolution rectilinear CONUS grid, and the Earth System Modeling Framework's (ESMF) conservative interpolation capability will be described. ESMF is a parallel, high-performance software toolkit that provides capabilities (e.g. interpolation) for building and coupling Earth science applications. ESMF is developed primarily by the NOAA Environmental Software Infrastructure and Interoperability (NESII) group. The purpose of this experiment was to test and demonstrate the utility of high performance scientific software in traditional GIS domains. Special attention will be paid to the nuanced requirements for dealing with high resolution, unstructured grids in scientific data formats. Second, a chunked interpolation application using ESMF and OpenClimateGIS (OCGIS) will demonstrate how spatial subsetting can virtually remove computing resource ceilings for very high spatial resolution interpolation operations. OCGIS is a NESII-developed Python software package designed for the geospatial manipulation of high-dimensional scientific datasets. An overview of the data processing workflow, why a chunked approach is required, and how the application could be adapted to meet operational requirements will be discussed here. In addition, we'll provide a general overview of OCGIS's parallel subsetting capabilities including challenges in the design and implementation of a scientific data subsetter.
Aggregated Indexing of Biomedical Time Series Data
Woodbridge, Jonathan; Mortazavi, Bobak; Sarrafzadeh, Majid; Bui, Alex A.T.
2016-01-01
Remote and wearable medical sensing has the potential to create very large and high dimensional datasets. Medical time series databases must be able to efficiently store, index, and mine these datasets to enable medical professionals to effectively analyze data collected from their patients. Conventional high dimensional indexing methods are a two stage process. First, a superset of the true matches is efficiently extracted from the database. Second, supersets are pruned by comparing each of their objects to the query object and rejecting any objects falling outside a predetermined radius. This pruning stage heavily dominates the computational complexity of most conventional search algorithms. Therefore, indexing algorithms can be significantly improved by reducing the amount of pruning. This paper presents an online algorithm to aggregate biomedical time series data to significantly reduce the search space (index size) without compromising the quality of search results. This algorithm is built on the observation that biomedical time series signals are composed of cyclical and often similar patterns. This algorithm takes in a stream of segments and groups them to highly concentrated collections. Locality Sensitive Hashing (LSH) is used to reduce the overall complexity of the algorithm, allowing it to run online. The output of this aggregation is used to populate an index. The proposed algorithm yields logarithmic growth of the index (with respect to the total number of objects) while keeping sensitivity and specificity simultaneously above 98%. Both memory and runtime complexities of time series search are improved when using aggregated indexes. In addition, data mining tasks, such as clustering, exhibit runtimes that are orders of magnitudes faster when run on aggregated indexes. PMID:27617298
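A minimal random-projection LSH sketch of the grouping idea follows; the segment length, number of hyperplanes, and bucket handling are assumptions rather than the paper's parameters.

```python
# Grouping similar fixed-length signal segments with sign-of-projection LSH.
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
segments = rng.standard_normal((10000, 128))        # hypothetical fixed-length segments
hyperplanes = rng.standard_normal((16, 128))         # 16 random projections -> 16-bit hash

codes = (segments @ hyperplanes.T > 0).astype(int)   # sign of each projection
buckets = defaultdict(list)
for i, bits in enumerate(codes):
    buckets[bits.tobytes()].append(i)                 # similar segments tend to collide

# Each bucket can be summarised by a single representative before indexing.
print(len(buckets), "buckets for", len(segments), "segments")
```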
NASA Astrophysics Data System (ADS)
Adjorlolo, Clement; Cho, Moses A.; Mutanga, Onisimo; Ismail, Riyad
2012-01-01
Hyperspectral remote-sensing approaches are suitable for detection of the differences in three-carbon (C3) and four-carbon (C4) grass species phenology and composition. However, the application of hyperspectral sensors to vegetation has been hampered by high-dimensionality, spectral redundancy, and multicollinearity problems. In this experiment, resampling of hyperspectral data to wider wavelength intervals, around a few band-centers, sensitive to the biophysical and biochemical properties of C3 or C4 grass species is proposed. The approach accounts for an inherent property of vegetation spectral response: the asymmetrical nature of the inter-band correlations between a waveband and its shorter- and longer-wavelength neighbors. It involves constructing a curve of weighting threshold of correlation (Pearson's r) between a chosen band-center and its neighbors, as a function of wavelength. In addition, data were resampled to match several multispectral sensors (ASTER, GeoEye-1, IKONOS, QuickBird, RapidEye, SPOT 5, and WorldView-2 satellites) for comparison with the proposed method. The resulting datasets were analyzed, using the random forest algorithm. The proposed resampling method achieved improved classification accuracy (κ=0.82), compared to the resampled multispectral datasets (κ=0.78, 0.65, 0.62, 0.59, 0.65, 0.62, 0.76, respectively). Overall, results from this study demonstrated that spectral resolutions for C3 and C4 grasses can be optimized and controlled for high dimensionality and multicollinearity problems, yet yielding high classification accuracies. The findings also provide a sound basis for programming wavebands for future sensors.
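A hedged sketch of the band-aggregation idea follows; the band centres, fixed correlation threshold, and simple averaging are placeholders, whereas the study derives a wavelength-dependent threshold curve per centre.

```python
# Aggregating narrow bands around chosen band-centres while correlation stays high.
import numpy as np

rng = np.random.default_rng(0)
spectra = rng.random((500, 300))           # hypothetical samples x narrow bands
centers = [50, 120, 210]                   # hypothetical band-centres of interest
r_threshold = 0.9                          # assumed correlation cut-off

features = []
for c in centers:
    keep = [c]
    for j in range(c + 1, spectra.shape[1]):        # extend towards longer wavelengths
        if np.corrcoef(spectra[:, c], spectra[:, j])[0, 1] < r_threshold:
            break
        keep.append(j)
    for j in range(c - 1, -1, -1):                   # extend towards shorter wavelengths
        if np.corrcoef(spectra[:, c], spectra[:, j])[0, 1] < r_threshold:
            break
        keep.append(j)
    features.append(spectra[:, sorted(keep)].mean(axis=1))  # one wide band per centre

X_resampled = np.column_stack(features)    # 3 aggregated bands instead of 300 narrow ones
print(X_resampled.shape)
```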
Network embedding-based representation learning for single cell RNA-seq data.
Li, Xiangyu; Chen, Weizheng; Chen, Yang; Zhang, Xuegong; Gu, Jin; Zhang, Michael Q
2017-11-02
Single cell RNA-seq (scRNA-seq) techniques can reveal valuable insights of cell-to-cell heterogeneities. Projection of high-dimensional data into a low-dimensional subspace is a powerful strategy in general for mining such big data. However, scRNA-seq suffers from higher noise and lower coverage than traditional bulk RNA-seq, hence bringing in new computational difficulties. One major challenge is how to deal with the frequent drop-out events. The events, usually caused by the stochastic burst effect in gene transcription and the technical failure of RNA transcript capture, often render traditional dimension reduction methods work inefficiently. To overcome this problem, we have developed a novel Single Cell Representation Learning (SCRL) method based on network embedding. This method can efficiently implement data-driven non-linear projection and incorporate prior biological knowledge (such as pathway information) to learn more meaningful low-dimensional representations for both cells and genes. Benchmark results show that SCRL outperforms other dimensional reduction methods on several recent scRNA-seq datasets. © The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research.
Reducing 4D CT artifacts using optimized sorting based on anatomic similarity.
Johnston, Eric; Diehn, Maximilian; Murphy, James D; Loo, Billy W; Maxim, Peter G
2011-05-01
Four-dimensional (4D) computed tomography (CT) has been widely used as a tool to characterize respiratory motion in radiotherapy. The two most commonly used 4D CT algorithms sort images by the associated respiratory phase or displacement into a predefined number of bins, and are prone to image artifacts at transitions between bed positions. The purpose of this work is to demonstrate a method of reducing motion artifacts in 4D CT by incorporating anatomic similarity into phase or displacement based sorting protocols. Ten patient datasets were retrospectively sorted using both the displacement and phase based sorting algorithms. Conventional sorting methods allow selection of only the nearest-neighbor image in time or displacement within each bin. In our method, for each bed position either the displacement or the phase defines the center of a bin range about which several candidate images are selected. The two dimensional correlation coefficients between slices bordering the interface between adjacent couch positions are then calculated for all candidate pairings. Two slices have a high correlation if they are anatomically similar. Candidates from each bin are then selected to maximize the slice correlation over the entire data set using Dijkstra's shortest path algorithm. To assess the reduction of artifacts, two thoracic radiation oncologists independently compared the resorted 4D datasets pairwise with conventionally sorted datasets, blinded to the sorting method, to choose which had the least motion artifacts. Agreement between reviewers was evaluated using the weighted kappa score. Anatomically based image selection resulted in 4D CT datasets with significantly reduced motion artifacts with both displacement (P = 0.0063) and phase sorting (P = 0.00022). There was good agreement between the two reviewers, with complete agreement 34 times and complete disagreement 6 times. Optimized sorting using anatomic similarity significantly reduces 4D CT motion artifacts compared to conventional phase or displacement based sorting. This improved sorting algorithm is a straightforward extension of the two most common 4D CT sorting algorithms.
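A simplified sketch of the candidate-selection step (synthetic slices, toy sizes; the clinical binning and image I/O are omitted): one candidate per couch position is chosen by a shortest path whose edge cost is one minus the 2D correlation of the slices bordering adjacent positions.

```python
# Selecting anatomically consistent candidates per couch position via Dijkstra.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
n_positions, n_candidates = 6, 4
# candidates[p, k] is the border slice of candidate image k at couch position p
candidates = rng.random((n_positions, n_candidates, 64, 64))

def corr2(a, b):
    """Two-dimensional correlation coefficient between two slices."""
    return np.corrcoef(a.ravel(), b.ravel())[0, 1]

G = nx.DiGraph()
for k in range(n_candidates):
    G.add_edge("src", (0, k), weight=0.0)
    G.add_edge((n_positions - 1, k), "dst", weight=0.0)
for p in range(n_positions - 1):
    for i in range(n_candidates):
        for j in range(n_candidates):
            cost = 1.0 - corr2(candidates[p, i], candidates[p + 1, j])
            G.add_edge((p, i), (p + 1, j), weight=cost)

path = nx.dijkstra_path(G, "src", "dst", weight="weight")
print(path[1:-1])   # chosen (position, candidate) pairs forming the sorted 4D volume
```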
REM-3D Reference Datasets: Reconciling large and diverse compilations of travel-time observations
NASA Astrophysics Data System (ADS)
Moulik, P.; Lekic, V.; Romanowicz, B. A.
2017-12-01
A three-dimensional Reference Earth model (REM-3D) should ideally represent the consensus view of long-wavelength heterogeneity in the Earth's mantle through the joint modeling of large and diverse seismological datasets. This requires reconciliation of datasets obtained using various methodologies and identification of consistent features. The goal of REM-3D datasets is to provide a quality-controlled and comprehensive set of seismic observations that would not only enable construction of REM-3D, but also allow identification of outliers and assist in more detailed studies of heterogeneity. The community response to data solicitation has been enthusiastic with several groups across the world contributing recent measurements of normal modes, (fundamental mode and overtone) surface waves, and body waves. We present results from ongoing work with body and surface wave datasets analyzed in consultation with a Reference Dataset Working Group. We have formulated procedures for reconciling travel-time datasets that include: (1) quality control for salvaging missing metadata; (2) identification of and reasons for discrepant measurements; (3) homogenization of coverage through the construction of summary rays; and (4) inversions of structure at various wavelengths to evaluate inter-dataset consistency. In consultation with the Reference Dataset Working Group, we retrieved the station and earthquake metadata in several legacy compilations and codified several guidelines that would facilitate easy storage and reproducibility. We find strong agreement between the dispersion measurements of fundamental-mode Rayleigh waves, particularly when made using supervised techniques. The agreement deteriorates substantially in surface-wave overtones, for which discrepancies vary with frequency and overtone number. A half-cycle band of discrepancies is attributed to reversed instrument polarities at a limited number of stations, which are not reflected in the instrument response history. By assessing inter-dataset consistency across similar paths, we quantify travel-time measurement errors for both surface and body waves. Finally, we discuss challenges associated with combining high-frequency (~1 Hz) and long-period (10-20 s) body-wave measurements into the REM-3D reference dataset.
2015-01-01
Background Investigations into novel biomarkers using omics techniques generate large amounts of data. Due to their size and numbers of attributes, these data are suitable for analysis with machine learning methods. A key component of typical machine learning pipelines for omics data is feature selection, which is used to reduce the raw high-dimensional data into a tractable number of features. Feature selection needs to balance the objective of using as few features as possible, while maintaining high predictive power. This balance is crucial when the goal of data analysis is the identification of highly accurate but small panels of biomarkers with potential clinical utility. In this paper we propose a heuristic for the selection of very small feature subsets, via an iterative feature elimination process that is guided by rule-based machine learning, called RGIFE (Rule-guided Iterative Feature Elimination). We use this heuristic to identify putative biomarkers of osteoarthritis (OA), articular cartilage degradation and synovial inflammation, using both proteomic and transcriptomic datasets. Results and discussion Our RGIFE heuristic increased the classification accuracies achieved for all datasets when no feature selection is used, and performed well in a comparison with other feature selection methods. Using this method the datasets were reduced to a smaller number of genes or proteins, including those known to be relevant to OA, cartilage degradation and joint inflammation. The results have shown the RGIFE feature reduction method to be suitable for analysing both proteomic and transcriptomics data. Methods that generate large ‘omics’ datasets are increasingly being used in the area of rheumatology. Conclusions Feature reduction methods are advantageous for the analysis of omics data in the field of rheumatology, as the applications of such techniques are likely to result in improvements in diagnosis, treatment and drug discovery. PMID:25923811
LIME: 3D visualisation and interpretation of virtual geoscience models
NASA Astrophysics Data System (ADS)
Buckley, Simon; Ringdal, Kari; Dolva, Benjamin; Naumann, Nicole; Kurz, Tobias
2017-04-01
Three-dimensional and photorealistic acquisition of surface topography, using methods such as laser scanning and photogrammetry, has become widespread across the geosciences over the last decade. With recent innovations in photogrammetric processing software, robust and automated data capture hardware, and novel sensor platforms, including unmanned aerial vehicles, obtaining 3D representations of exposed topography has never been easier. In addition to 3D datasets, fusion of surface geometry with imaging sensors, such as multi/hyperspectral, thermal and ground-based InSAR, and geophysical methods, create novel and highly visual datasets that provide a fundamental spatial framework to address open geoscience research questions. Although data capture and processing routines are becoming well-established and widely reported in the scientific literature, challenges remain related to the analysis, co-visualisation and presentation of 3D photorealistic models, especially for new users (e.g. students and scientists new to geomatics methods). Interpretation and measurement is essential for quantitative analysis of 3D datasets, and qualitative methods are valuable for presentation purposes, for planning and in education. Motivated by this background, the current contribution presents LIME, a lightweight and high performance 3D software for interpreting and co-visualising 3D models and related image data in geoscience applications. The software focuses on novel data integration and visualisation of 3D topography with image sources such as hyperspectral imagery, logs and interpretation panels, geophysical datasets and georeferenced maps and images. High quality visual output can be generated for dissemination purposes, to aid researchers with communication of their research results. The background of the software is described and case studies from outcrop geology, in hyperspectral mineral mapping and geophysical-geospatial data integration are used to showcase the novel methods developed.
AstroBlend: An astrophysical visualization package for Blender
NASA Astrophysics Data System (ADS)
Naiman, J. P.
2016-04-01
The rapid growth in scale and complexity of both computational and observational astrophysics over the past decade necessitates efficient and intuitive methods for examining and visualizing large datasets. Here, I present AstroBlend, an open-source Python library for use within the three dimensional modeling software, Blender. While Blender has been a popular open-source software among animators and visual effects artists, in recent years it has also become a tool for visualizing astrophysical datasets. AstroBlend combines the three dimensional capabilities of Blender with the analysis tools of the widely used astrophysical toolset, yt, to afford both computational and observational astrophysicists the ability to simultaneously analyze their data and create informative and appealing visualizations. The introduction of this package includes a description of features, work flow, and various example visualizations. A website - www.astroblend.com - has been developed which includes tutorials, and a gallery of example images and movies, along with links to downloadable data, three dimensional artistic models, and various other resources.
High-Quality T2-Weighted 4-Dimensional Magnetic Resonance Imaging for Radiation Therapy Applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Du, Dongsu; Caruthers, Shelton D.; Glide-Hurst, Carri
2015-06-01
Purpose: The purpose of this study was to improve triggering efficiency of the prospective respiratory amplitude-triggered 4-dimensional magnetic resonance imaging (4DMRI) method and to develop a 4DMRI imaging protocol that could offer T2 weighting for better tumor visualization, good spatial coverage and spatial resolution, and respiratory motion sampling within a reasonable amount of time for radiation therapy applications. Methods and Materials: The respiratory state splitting (RSS) and multi-shot acquisition (MSA) methods were analytically compared and validated in a simulation study by using the respiratory signals from 10 healthy human subjects. The RSS method was more effective in improving triggering efficiency. It was implemented in prospective respiratory amplitude-triggered 4DMRI. 4DMRI image datasets were acquired from 5 healthy human subjects. Liver motion was estimated using the acquired 4DMRI image datasets. Results: The simulation study showed the RSS method was more effective for improving triggering efficiency than the MSA method. The average reductions in 4DMRI acquisition times were 36% and 10% for the RSS and MSA methods, respectively. The human subject study showed that T2-weighted 4DMRI with 10 respiratory states, 60 slices at a spatial resolution of 1.5 × 1.5 × 3.0 mm³ could be acquired in 9 to 18 minutes, depending on the individual's breath pattern. Based on the acquired 4DMRI image datasets, the ranges of peak-to-peak liver displacements among 5 human subjects were 9.0 to 12.9 mm, 2.5 to 3.9 mm, and 0.5 to 2.3 mm in superior-inferior, anterior-posterior, and left-right directions, respectively. Conclusions: We demonstrated that with the RSS method, it was feasible to acquire high-quality T2-weighted 4DMRI within a reasonable amount of time for radiation therapy applications.
Zhang, Yifan; Gao, Xunzhang; Peng, Xuan; Ye, Jiaqi; Li, Xiang
2018-05-16
The High Resolution Range Profile (HRRP) recognition has attracted great concern in the field of Radar Automatic Target Recognition (RATR). However, traditional HRRP recognition methods failed to model high dimensional sequential data efficiently and have a poor anti-noise ability. To deal with these problems, a novel stochastic neural network model named Attention-based Recurrent Temporal Restricted Boltzmann Machine (ARTRBM) is proposed in this paper. RTRBM is utilized to extract discriminative features and the attention mechanism is adopted to select major features. RTRBM is efficient to model high dimensional HRRP sequences because it can extract the information of temporal and spatial correlation between adjacent HRRPs. The attention mechanism is used in sequential data recognition tasks including machine translation and relation classification, which makes the model pay more attention to the major features of recognition. Therefore, the combination of RTRBM and the attention mechanism makes our model effective for extracting more internal related features and choose the important parts of the extracted features. Additionally, the model performs well with the noise corrupted HRRP data. Experimental results on the Moving and Stationary Target Acquisition and Recognition (MSTAR) dataset show that our proposed model outperforms other traditional methods, which indicates that ARTRBM extracts, selects, and utilizes the correlation information between adjacent HRRPs effectively and is suitable for high dimensional data or noise corrupted data.
Interactive 4D Visualization of Sediment Transport Models
NASA Astrophysics Data System (ADS)
Butkiewicz, T.; Englert, C. M.
2013-12-01
Coastal sediment transport models simulate the effects that waves, currents, and tides have on near-shore bathymetry and features such as beaches and barrier islands. Understanding these dynamic processes is integral to the study of coastline stability, beach erosion, and environmental contamination. Furthermore, analyzing the results of these simulations is a critical task in the design, placement, and engineering of coastal structures such as seawalls, jetties, support pilings for wind turbines, etc. Despite the importance of these models, there is a lack of available visualization software that allows users to explore and perform analysis on these datasets in an intuitive and effective manner. Existing visualization interfaces for these datasets often present only one variable at a time, using two dimensional plan or cross-sectional views. These visual restrictions limit the ability to observe the contents in the proper overall context, both in spatial and multi-dimensional terms. To improve upon these limitations, we use 3D rendering and particle system based illustration techniques to show water column/flow data across all depths simultaneously. We can also encode multiple variables across different perceptual channels (color, texture, motion, etc.) to enrich surfaces with multi-dimensional information. Interactive tools are provided, which can be used to explore the dataset and find regions-of-interest for further investigation. Our visualization package provides an intuitive 4D (3D, time-varying) visualization of sediment transport model output. In addition, we are also integrating real world observations with the simulated data to support analysis of the impact from major sediment transport events. In particular, we have been focusing on the effects of Superstorm Sandy on the Redbird Artificial Reef Site, offshore of Delaware Bay. Based on our pre- and post-storm high-resolution sonar surveys, there has been significant scour and bedform migration around the sunken subway cars and other vessels present at the Redbird site. Due to the extensive surveying and historical data availability in the area, the site is highly attractive for comparing hindcasted sediment transport simulations to our observations of actual changes. This work has the potential to strengthen the accuracy of sediment transport modeling, as well as help predict and prepare for future changes due to similar extreme sediment transport events. Our visualization shows a simple sediment transport model with tidal flows causing significant erosion (red) and deposition (blue).
Zhao, Mingbo; Zhang, Zhao; Chow, Tommy W S; Li, Bing
2014-07-01
Dealing with high-dimensional data has always been a major problem in research of pattern recognition and machine learning, and Linear Discriminant Analysis (LDA) is one of the most popular methods for dimension reduction. However, it only uses labeled samples while neglecting unlabeled samples, which are abundant and can be easily obtained in the real world. In this paper, we propose a new dimension reduction method, called "SL-LDA", by using unlabeled samples to enhance the performance of LDA. The new method first propagates label information from the labeled set to the unlabeled set via a label propagation process, where the predicted labels of unlabeled samples, called "soft labels", can be obtained. It then incorporates the soft labels into the construction of scatter matrixes to find a transformed matrix for dimension reduction. In this way, the proposed method can preserve more discriminative information, which is preferable when solving the classification problem. We further propose an efficient approach for solving SL-LDA under a least squares framework, and a flexible method of SL-LDA (FSL-LDA) to better cope with datasets sampled from a nonlinear manifold. Extensive simulations are carried out on several datasets, and the results show the effectiveness of the proposed method. Copyright © 2014 Elsevier Ltd. All rights reserved.
Application of Template Matching for Improving Classification of Urban Railroad Point Clouds
Arastounia, Mostafa; Oude Elberink, Sander
2016-01-01
This study develops an integrated data-driven and model-driven approach (template matching) that clusters the urban railroad point clouds into three classes of rail track, contact cable, and catenary cable. The employed dataset covers 630 m of the Dutch urban railroad corridors in which there are four rail tracks, two contact cables, and two catenary cables. The dataset includes only geometrical information (three dimensional (3D) coordinates of the points) with no intensity data and no RGB data. The obtained results indicate that all objects of interest are successfully classified at the object level with no false positives and no false negatives. The results also show that an average 97.3% precision and an average 97.7% accuracy at the point cloud level are achieved. The high precision and high accuracy of the rail track classification (both greater than 96%) at the point cloud level stems from the great impact of the employed template matching method on excluding the false positives. The cables also achieve quite high average precision (96.8%) and accuracy (98.4%) due to their high sampling and isolated position in the railroad corridor. PMID:27973452
Wen, Ying; Hou, Lili; He, Lianghua; Peterson, Bradley S; Xu, Dongrong
2015-05-01
Spatial normalization plays a key role in voxel-based analyses of brain images. We propose a highly accurate algorithm for high-dimensional spatial normalization of brain images based on the technique of symmetric optical flow. We first construct a three-dimensional optical flow model based on the assumptions of intensity consistency and gradient-of-intensity consistency, under a constraint of discontinuity-preserving spatio-temporal smoothness. Then, an efficient inverse consistency optical flow is proposed with aims of higher registration accuracy, where the flow is naturally symmetric. By employing a hierarchical strategy ranging from coarse to fine scales of resolution and a method of Euler-Lagrange numerical analysis, our algorithm is capable of registering brain image data. Experiments using both simulated and real datasets demonstrated that the accuracy of our algorithm is not only better than that of those traditional optical flow algorithms, but also comparable to other registration methods used extensively in the medical imaging community. Moreover, our registration algorithm is fully automated, requiring a very limited number of parameters and no manual intervention. Copyright © 2015 Elsevier Inc. All rights reserved.
Using high-resolution displays for high-resolution cardiac data.
Goodyer, Christopher; Hodrien, John; Wood, Jason; Kohl, Peter; Brodlie, Ken
2009-07-13
The ability to perform fast, accurate, high-resolution visualization is fundamental to improving our understanding of anatomical data. As the volumes of data increase from improvements in scanning technology, the methods applied to visualization must evolve. In this paper, we address the interactive display of data from high-resolution magnetic resonance imaging scanning of a rabbit heart and subsequent histological imaging. We describe a visualization environment involving a display wall built from tiled liquid crystal display panels and associated software, which provides an interactive and intuitive user interface. The oView software is an OpenGL application that is written for the VR Juggler environment. This environment abstracts displays and devices away from the application itself, aiding portability between different systems, from desktop PCs to multi-tiled display walls. Portability between display walls has been demonstrated through its use on walls at the universities of both Leeds and Oxford. We discuss important factors to be considered for interactive two-dimensional display of large three-dimensional datasets, including the use of intuitive input devices and level-of-detail aspects.
Fast and Accurate Support Vector Machines on Large Scale Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Vishnu, Abhinav; Narasimhan, Jayenthi; Holder, Larry
Support Vector Machines (SVM) is a supervised Machine Learning and Data Mining (MLDM) algorithm, which has become ubiquitous largely due to its high accuracy and obliviousness to dimensionality. The objective of SVM is to find an optimal boundary --- also known as a hyperplane --- which separates the samples (examples in a dataset) of different classes by a maximum margin. Usually, very few samples contribute to the definition of the boundary. However, existing parallel algorithms use the entire dataset for finding the boundary, which is sub-optimal for performance reasons. In this paper, we propose a novel distributed memory algorithm to eliminate the samples which do not contribute to the boundary definition in SVM. We propose several heuristics, which range from early (aggressive) to late (conservative) elimination of the samples, such that the overall time for generating the boundary is reduced considerably. In a few cases, a sample may be eliminated (shrunk) pre-emptively --- potentially resulting in an incorrect boundary. We propose a scalable approach to synchronize the necessary data structures such that the proposed algorithm maintains its accuracy. We consider the necessary trade-offs of single/multiple synchronization using in-depth time-space complexity analysis. We implement the proposed algorithm using MPI and compare it with libsvm --- the de facto sequential SVM software --- which we enhance with OpenMP for multi-core/many-core parallelism. Our proposed approach shows excellent efficiency using up to 4096 processes on several large datasets such as the UCI HIGGS Boson dataset and the Offending URL dataset.
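As a rough illustration of the sample-elimination idea (not the distributed MPI algorithm described above), the sketch below repeatedly trains a small SVM and discards samples that sit far outside the current margin, since such samples rarely end up as support vectors. Labels are assumed to be coded as -1/+1, and the keep_margin threshold and number of rounds are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.svm import SVC

def shrink_and_train(X, y, rounds=3, keep_margin=2.0):
    """Conservative ('late') elimination sketch; labels y assumed in {-1, +1}."""
    idx = np.arange(len(y))
    clf = None
    for _ in range(rounds):
        clf = SVC(kernel='linear', C=1.0).fit(X[idx], y[idx])
        # Signed functional margin; samples far on the correct side of the
        # boundary are unlikely to become support vectors, so drop them.
        margin = y[idx] * clf.decision_function(X[idx])
        keep = margin < keep_margin
        if keep.sum() < 2 or keep.all():
            break
        idx = idx[keep]
    return clf
```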
NASA Astrophysics Data System (ADS)
Jiang, Li; Xuan, Jianping; Shi, Tielin
2013-12-01
Generally, the vibration signals of faulty machinery are non-stationary and nonlinear under complicated operating conditions. Therefore, it is a big challenge for machinery fault diagnosis to extract optimal features for improving classification accuracy. This paper proposes semi-supervised kernel Marginal Fisher analysis (SSKMFA) for feature extraction, which can discover the intrinsic manifold structure of dataset, and simultaneously consider the intra-class compactness and the inter-class separability. Based on SSKMFA, a novel approach to fault diagnosis is put forward and applied to fault recognition of rolling bearings. SSKMFA directly extracts the low-dimensional characteristics from the raw high-dimensional vibration signals, by exploiting the inherent manifold structure of both labeled and unlabeled samples. Subsequently, the optimal low-dimensional features are fed into the simplest K-nearest neighbor (KNN) classifier to recognize different fault categories and severities of bearings. The experimental results demonstrate that the proposed approach improves the fault recognition performance and outperforms the other four feature extraction methods.
Metal Oxide Gas Sensor Drift Compensation Using a Two-Dimensional Classifier Ensemble
Liu, Hang; Chu, Renzhi; Tang, Zhenan
2015-01-01
Sensor drift is the most challenging problem in gas sensing at present. We propose a novel two-dimensional classifier ensemble strategy to solve the gas discrimination problem, regardless of the gas concentration, with high accuracy over extended periods of time. This strategy is appropriate for multi-class classifiers that consist of combinations of pairwise classifiers, such as support vector machines. We compare the performance of the strategy with those of competing methods in an experiment based on a public dataset that was compiled over a period of three years. The experimental results demonstrate that the two-dimensional ensemble outperforms the other methods considered. Furthermore, we propose a pre-aging process inspired by that applied to the sensors to improve the stability of the classifier ensemble. The experimental results demonstrate that the weight of each multi-class classifier model in the ensemble remains fairly static before and after the addition of new classifier models to the ensemble, when a pre-aging procedure is applied. PMID:25942640
Application Perspective of 2D+SCALE Dimension
NASA Astrophysics Data System (ADS)
Karim, H.; Rahman, A. Abdul
2016-09-01
Different applications or users need different abstractions of spatial models, dimensionalities and specifications of their datasets due to variations in the required analysis and output. Various approaches, data models and data structures are now available to support most current application models in Geographic Information Systems (GIS). One focus of the GIS multi-dimensional research community is the implementation of a scale dimension with spatial datasets to suit various scale-dependent application needs. In this paper, 2D spatial datasets in which scale is treated as the third dimension are addressed as 2D+scale (or 3D-scale) datasets. Various data structures, data models, approaches, schemas, and formats have been proposed to support a variety of applications and dimensionalities in 3D topology; however, only a few of them consider scale as their targeted dimension. Where the scale dimension is concerned, the implementation approach can be either multi-scale or vario-scale (with any available data structure and format), depending on application requirements (topology, semantics and function). This paper discusses current and potential new applications that could be built upon the 3D-scale dimension approach. Previous and current work on the scale dimension, the requirements to be preserved for any given application, implementation issues and future potential applications form the major discussion of this paper.
ERIC Educational Resources Information Center
Sander, Ian M.; McGoldrick, Matthew T.; Helms, My N.; Betts, Aislinn; van Avermaete, Anthony; Owers, Elizabeth; Doney, Evan; Liepert, Taimi; Niebur, Glen; Liepert, Douglas; Leevy, W. Matthew
2017-01-01
Advances in three-dimensional (3D) printing allow for digital files to be turned into a "printed" physical product. For example, complex anatomical models derived from clinical or pre-clinical X-ray computed tomography (CT) data of patients or research specimens can be constructed using various printable materials. Although 3D printing…
A framework for optimal kernel-based manifold embedding of medical image data.
Zimmer, Veronika A; Lekadir, Karim; Hoogendoorn, Corné; Frangi, Alejandro F; Piella, Gemma
2015-04-01
Kernel-based dimensionality reduction is a widely used technique in medical image analysis. To fully unravel the underlying nonlinear manifold, the selection of an adequate kernel function and of its free parameters is critical. In practice, however, the kernel function is generally chosen as Gaussian or polynomial, and such standard kernels might not always be optimal for a given image dataset or application. In this paper, we present a study on the effect of the kernel functions in nonlinear manifold embedding of medical image data. To this end, we first carry out a literature review on existing advanced kernels developed in the statistics, machine learning, and signal processing communities. In addition, we implement kernel-based formulations of well-known nonlinear dimensionality reduction techniques such as Isomap and Locally Linear Embedding, thus obtaining a unified framework for manifold embedding using kernels. Subsequently, we present a method to automatically choose a kernel function and its associated parameters from a pool of kernel candidates, with the aim of generating optimal manifold embeddings. Furthermore, we show how the calculated selection measures can be extended to take into account the spatial relationships in images, or used to combine several kernels to further improve the embedding results. Experiments are then carried out on various synthetic and phantom datasets for numerical assessment of the methods. Furthermore, the workflow is applied to real data that include brain manifolds and multispectral images to demonstrate the importance of the kernel selection in the analysis of high-dimensional medical images. Copyright © 2014 Elsevier Ltd. All rights reserved.
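A minimal sketch of the kernel-selection step is given below, assuming scikit-learn's KernelPCA as the embedding engine and a simple k-nearest-neighbour preservation score as a stand-in for the paper's selection measures; the candidate kernel pool and its parameters are illustrative only.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.neighbors import NearestNeighbors

def neighbourhood_preservation(X, Z, k=10):
    # Fraction of each point's k nearest neighbours preserved by the embedding Z.
    nn_x = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(return_distance=False)
    nn_z = NearestNeighbors(n_neighbors=k).fit(Z).kneighbors(return_distance=False)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(nn_x, nn_z)]))

def pick_kernel(X, n_components=2):
    candidates = [('rbf', dict(gamma=0.1)), ('rbf', dict(gamma=1.0)),
                  ('poly', dict(degree=3)), ('cosine', dict())]
    best = None
    for name, params in candidates:
        Z = KernelPCA(n_components=n_components, kernel=name, **params).fit_transform(X)
        score = neighbourhood_preservation(X, Z)
        if best is None or score > best[0]:
            best = (score, name, params)
    return best  # (score, kernel name, kernel parameters)
```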
Big Data Transforms Discovery-Utilization Therapeutics Continuum
Waldman, SA; Terzic, A
2015-01-01
Enabling omic technologies adopt a holistic view to produce unprecedented insights into the molecular underpinnings of health and disease, in part, by generating massive high-dimensional biological data. Leveraging these systems-level insights as an engine driving the healthcare evolution is maximized through integration with medical, demographic, and environmental datasets from individuals to populations. Big data analytics has accordingly emerged to add value to the technical aspects of storage, transfer, and analysis required for merging vast arrays of omic-, clinical- and eco-datasets. In turn, this new field at the interface of biology, medicine, and information science is systematically transforming modern therapeutics across discovery, development, regulation, and utilization. “…a man's discourse was like to a rich Persian carpet, the beautiful figures and patterns of which can be shown only by spreading and extending it out; when it is contracted and folded up, they are obscured and lost” Themistocles quoted by Plutarch AD 46 – AD 120 PMID:26888297
NASA Technical Reports Server (NTRS)
Kaplan, Michael L.; Lin, Yuh-Lang
2004-01-01
During the research project, sounding datasets were generated for the regions surrounding 9 major airports, including Dallas, TX, Boston, MA, New York, NY, Chicago, IL, St. Louis, MO, Atlanta, GA, Miami, FL, San Francisco, CA, and Los Angeles, CA. The numerical simulation of winter and summer environments during which no instrument flight rule impact was occurring at these 9 terminals was performed using the most contemporary version of the Terminal Area PBL Prediction System (TAPPS) model, nested from 36 km to 6 km to 1 km horizontal resolution with very detailed vertical resolution in the planetary boundary layer. The soundings from the 1 km model were archived at 30 minute intervals over a 24 hour period, and the vertical dependent variables as well as derived quantities, i.e., 3-dimensional wind components, temperatures, pressures, mixing ratios, turbulence kinetic energy and eddy dissipation rates, were then interpolated to 5 m vertical resolution up to 1000 m above ground level. After partial validation against field experiment datasets for Dallas as well as larger scale and much coarser resolution observations at the other 8 airports, these sounding datasets were sent to NASA for use in the Virtual Air Space and Modeling program. These datasets are intended to provide representative airport weather environments for diagnosing the response of simulated wake vortices to realistic atmospheric conditions. The virtual datasets are based on large scale observed atmospheric initial conditions that are dynamically interpolated in space and time. The 1 km nested-grid simulated datasets provide a relatively coarse and highly smoothed representation of airport-environment meteorological conditions, and details concerning the airport surface forcing are virtually absent. Nevertheless, the simulated fields were compared to the observed background atmospheric processes and were found to accurately replicate the flows surrounding the airports wherever coarse verification data or airport-scale datasets were available.
3D Tracking Based Augmented Reality for Cultural Heritage Data Management
NASA Astrophysics Data System (ADS)
Battini, C.; Landi, G.
2015-02-01
The development of contactless documentation techniques is allowing researchers to collect high volumes of three-dimensional data in a short time and with high levels of accuracy. The digitalisation of cultural heritage opens up the possibility of using image processing and analysis, and computer graphics techniques, to preserve this heritage for future generations, augmenting it with additional information or with new possibilities for its enjoyment and use. The collection of precise datasets about the status of cultural heritage is crucial for its interpretation and conservation, and during restoration processes. The application of digital-imaging solutions for feature extraction, image data analysis, and three-dimensional reconstruction of ancient artworks allows the creation of multidimensional models that can incorporate information coming from heterogeneous datasets, research results and historical sources. Real objects can be scanned and reconstructed virtually, with high levels of data accuracy and resolution. Real-time visualisation software and hardware is rapidly evolving, and complex three-dimensional models can be interactively visualised and explored in applications developed for mobile devices. This paper will show how a 3D reconstruction of an object, with multiple layers of information, can be stored and visualised through a mobile application that allows interaction with the physical object for its study and analysis, using 3D Tracking based Augmented Reality techniques.
The cross-validated AUC for MCP-logistic regression with high-dimensional data.
Jiang, Dingfeng; Huang, Jian; Zhang, Ying
2013-10-01
We propose a cross-validated area under the receiver operating characteristic (ROC) curve (CV-AUC) criterion for tuning parameter selection for penalized methods in sparse, high-dimensional logistic regression models. We use this criterion in combination with the minimax concave penalty (MCP) method for variable selection. The CV-AUC criterion is specifically designed for optimizing the classification performance for binary outcome data. To implement the proposed approach, we derive an efficient coordinate descent algorithm to compute the MCP-logistic regression solution surface. Simulation studies are conducted to evaluate the finite sample performance of the proposed method and to compare it with existing methods, including the Akaike information criterion (AIC), the Bayesian information criterion (BIC) and the extended BIC (EBIC). The model selected based on the CV-AUC criterion tends to have a larger predictive AUC and smaller classification error than those with tuning parameters selected using the AIC, BIC or EBIC. We illustrate the application of the MCP-logistic regression with the CV-AUC criterion on three microarray datasets from studies that attempt to identify genes related to cancers. Our simulation studies and data examples demonstrate that the CV-AUC is an attractive method for tuning parameter selection for penalized methods in high-dimensional logistic regression models.
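The tuning criterion itself is easy to sketch. Because the MCP penalty is not available in scikit-learn, the example below uses an L1-penalized logistic regression purely as a stand-in to show how a penalty level would be chosen by maximizing cross-validated AUC; the grid of C values and the fold count are arbitrary.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def select_penalty_by_cv_auc(X, y, C_grid=None, folds=5):
    if C_grid is None:
        C_grid = np.logspace(-2, 2, 20)
    cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=0)
    aucs = []
    for C in C_grid:
        model = LogisticRegression(penalty='l1', solver='liblinear', C=C)
        # Average cross-validated AUC for this penalty level.
        aucs.append(cross_val_score(model, X, y, cv=cv, scoring='roc_auc').mean())
    best = int(np.argmax(aucs))
    return C_grid[best], aucs[best]
```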
TESTING HIGH-DIMENSIONAL COVARIANCE MATRICES, WITH APPLICATION TO DETECTING SCHIZOPHRENIA RISK GENES
Zhu, Lingxue; Lei, Jing; Devlin, Bernie; Roeder, Kathryn
2017-01-01
Scientists routinely compare gene expression levels in cases versus controls in part to determine genes associated with a disease. Similarly, detecting case-control differences in co-expression among genes can be critical to understanding complex human diseases; however statistical methods have been limited by the high dimensional nature of this problem. In this paper, we construct a sparse-Leading-Eigenvalue-Driven (sLED) test for comparing two high-dimensional covariance matrices. By focusing on the spectrum of the differential matrix, sLED provides a novel perspective that accommodates what we assume to be common, namely sparse and weak signals in gene expression data, and it is closely related with Sparse Principal Component Analysis. We prove that sLED achieves full power asymptotically under mild assumptions, and simulation studies verify that it outperforms other existing procedures under many biologically plausible scenarios. Applying sLED to the largest gene-expression dataset obtained from post-mortem brain tissue from Schizophrenia patients and controls, we provide a novel list of genes implicated in Schizophrenia and reveal intriguing patterns in gene co-expression change for Schizophrenia subjects. We also illustrate that sLED can be generalized to compare other gene-gene “relationship” matrices that are of practical interest, such as the weighted adjacency matrices. PMID:29081874
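A heavily simplified analogue of the test statistic can be written down directly: take the leading eigenvalue (in absolute value) of the difference between the two sample covariance matrices and calibrate it by permuting group labels. The sketch below omits the sparsity-targeted machinery that gives sLED its power against sparse alternatives; variable names are illustrative.

```python
import numpy as np

def leading_eig_stat(X1, X2):
    # Largest eigenvalue (in absolute value) of the covariance difference.
    D = np.cov(X1, rowvar=False) - np.cov(X2, rowvar=False)
    return float(np.max(np.abs(np.linalg.eigvalsh(D))))

def permutation_pvalue(X1, X2, n_perm=200, seed=0):
    rng = np.random.default_rng(seed)
    observed = leading_eig_stat(X1, X2)
    pooled = np.vstack([X1, X2])
    n1 = len(X1)
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        if leading_eig_stat(pooled[perm[:n1]], pooled[perm[n1:]]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)   # permutation p-value with +1 correction
```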
Zhu, Lingxue; Lei, Jing; Devlin, Bernie; Roeder, Kathryn
2017-09-01
Scientists routinely compare gene expression levels in cases versus controls in part to determine genes associated with a disease. Similarly, detecting case-control differences in co-expression among genes can be critical to understanding complex human diseases; however statistical methods have been limited by the high dimensional nature of this problem. In this paper, we construct a sparse-Leading-Eigenvalue-Driven (sLED) test for comparing two high-dimensional covariance matrices. By focusing on the spectrum of the differential matrix, sLED provides a novel perspective that accommodates what we assume to be common, namely sparse and weak signals in gene expression data, and it is closely related with Sparse Principal Component Analysis. We prove that sLED achieves full power asymptotically under mild assumptions, and simulation studies verify that it outperforms other existing procedures under many biologically plausible scenarios. Applying sLED to the largest gene-expression dataset obtained from post-mortem brain tissue from Schizophrenia patients and controls, we provide a novel list of genes implicated in Schizophrenia and reveal intriguing patterns in gene co-expression change for Schizophrenia subjects. We also illustrate that sLED can be generalized to compare other gene-gene "relationship" matrices that are of practical interest, such as the weighted adjacency matrices.
Spectral embedding finds meaningful (relevant) structure in image and microarray data
Higgs, Brandon W; Weller, Jennifer; Solka, Jeffrey L
2006-01-01
Background Accurate methods for extraction of meaningful patterns in high dimensional data have become increasingly important with the recent generation of data types containing measurements across thousands of variables. Principal components analysis (PCA) is a linear dimensionality reduction (DR) method that is unsupervised in that it relies only on the data; projections are calculated in Euclidean or a similar linear space and do not use tuning parameters for optimizing the fit to the data. However, relationships within sets of nonlinear data types, such as biological networks or images, are frequently mis-rendered into a low dimensional space by linear methods. Nonlinear methods, in contrast, attempt to model important aspects of the underlying data structure, often requiring parameter(s) fitting to the data type of interest. In many cases, the optimal parameter values vary when different classification algorithms are applied on the same rendered subspace, making the results of such methods highly dependent upon the type of classifier implemented. Results We present the results of applying the spectral method of Lafon, a nonlinear DR method based on the weighted graph Laplacian, that minimizes the requirements for such parameter optimization for two biological data types. We demonstrate that it is successful in determining implicit ordering of brain slice image data and in classifying separate species in microarray data, as compared to two conventional linear methods and three nonlinear methods (one of which is an alternative spectral method). This spectral implementation is shown to provide more meaningful information, by preserving important relationships, than the methods of DR presented for comparison. Tuning parameter fitting is simple and is a general, rather than data type or experiment specific approach, for the two datasets analyzed here. Tuning parameter optimization is minimized in the DR step to each subsequent classification method, enabling the possibility of valid cross-experiment comparisons. Conclusion Results from the spectral method presented here exhibit the desirable properties of preserving meaningful nonlinear relationships in lower dimensional space and requiring minimal parameter fitting, providing a useful algorithm for purposes of visualization and classification across diverse datasets, a common challenge in systems biology. PMID:16483359
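For readers who want to try a graph-Laplacian embedding on their own data, the sketch below uses scikit-learn's SpectralEmbedding followed by a k-nearest-neighbour classifier; it is a generic stand-in, not the Lafon implementation evaluated above, and the neighbourhood size and embedding dimension are illustrative.

```python
from sklearn.manifold import SpectralEmbedding
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def spectral_then_knn(X, y, n_components=3, n_neighbors=10):
    # Unsupervised graph-Laplacian embedding; fitting it on all samples before
    # cross-validating the classifier is a common (slightly optimistic) shortcut.
    Z = SpectralEmbedding(n_components=n_components,
                          affinity='nearest_neighbors',
                          n_neighbors=n_neighbors).fit_transform(X)
    return cross_val_score(KNeighborsClassifier(n_neighbors=5), Z, y, cv=5).mean()
```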
Quality Assessments of Long-Term Quantitative Proteomic Analysis of Breast Cancer Xenograft Tissues
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhou, Jian-Ying; Chen, Lijun; Zhang, Bai
The identification of protein biomarkers requires large-scale analysis of human specimens to achieve statistical significance. In this study, we evaluated the long-term reproducibility of an iTRAQ (isobaric tags for relative and absolute quantification) based quantitative proteomics strategy using one channel for universal normalization across all samples. A total of 307 liquid chromatography tandem mass spectrometric (LC-MS/MS) analyses were completed, generating 107 one-dimensional (1D) LC-MS/MS datasets and 8 offline two-dimensional (2D) LC-MS/MS datasets (25 fractions for each set) for human-in-mouse breast cancer xenograft tissues representative of basal and luminal subtypes. Such large-scale studies require the implementation of robust metrics to assess the contributions of technical and biological variability in the qualitative and quantitative data. Accordingly, we developed a quantification confidence score based on the quality of each peptide-spectrum match (PSM) to remove quantification outliers from each analysis. After combining confidence score filtering and statistical analysis, reproducible protein identification and quantitative results were achieved from LC-MS/MS datasets collected over a 16 month period.
bioWeb3D: an online webGL 3D data visualisation tool.
Pettit, Jean-Baptiste; Marioni, John C
2013-06-07
Data visualization is critical for interpreting biological data. However, in practice it can prove to be a bottleneck for untrained researchers; this is especially true for three dimensional (3D) data representation. Whilst existing software can provide all the necessary functionalities to represent and manipulate biological 3D datasets, very few are browser based, cross platform and accessible to non-expert users. An online HTML5/WebGL based 3D visualisation tool has been developed to allow biologists to quickly and easily view interactive and customizable three dimensional representations of their data along with multiple layers of information. Using the WebGL library Three.js written in Javascript, bioWeb3D allows the simultaneous visualisation of multiple large datasets input via a simple JSON, XML or CSV file, which can be read and analysed locally thanks to HTML5 capabilities. Using basic 3D representation techniques in a technologically innovative context, we provide a program that is not intended to compete with professional 3D representation software, but that instead enables a quick and intuitive representation of reasonably large 3D datasets.
A comparative study of two prediction models for brain tumor progression
NASA Astrophysics Data System (ADS)
Zhou, Deqi; Tran, Loc; Wang, Jihong; Li, Jiang
2015-03-01
MR diffusion tensor imaging (DTI) technique together with traditional T1 or T2 weighted MRI scans supplies rich information sources for brain cancer diagnoses. These images form large-scale, high-dimensional data sets. Due to the fact that significant correlations exist among these images, we assume low-dimensional geometry data structures (manifolds) are embedded in the high-dimensional space. Those manifolds might be hidden from radiologists because it is challenging for human experts to interpret high-dimensional data. Identification of the manifold is a critical step for successfully analyzing multimodal MR images. We have developed various manifold learning algorithms (Tran et al. 2011; Tran et al. 2013) for medical image analysis. This paper presents a comparative study of an incremental manifold learning scheme (Tran et al. 2013) versus the deep learning model (Hinton et al. 2006) in the application of brain tumor progression prediction. The incremental manifold learning is a variant of manifold learning that handles large-scale datasets, in which a representative subset of the original data is sampled first to construct a manifold skeleton and the remaining data points are then inserted into the skeleton by following their local geometry. The incremental manifold learning algorithm aims at mitigating the computational burden associated with traditional manifold learning methods for large-scale datasets. Deep learning is a recently developed multilayer perceptron model that has achieved state-of-the-art performance in many applications. A recent technique named "Dropout" can further boost the deep model by preventing weight coadaptation to avoid over-fitting (Hinton et al. 2012). We applied the two models to multiple MRI scans from four brain tumor patients to predict tumor progression and compared the performances of the two models in terms of average prediction accuracy, sensitivity, specificity and precision. The quantitative performance metrics were calculated as averages over the four patients. Experimental results show that both the manifold learning and deep neural network models produced better results compared to using raw data and principal component analysis (PCA), and the deep learning model is a better method than manifold learning on this data set. The averaged sensitivity and specificity by deep learning are comparable with those of the manifold learning approach, while its precision is considerably higher. This means that the abnormal points predicted by deep learning are more likely to correspond to the actual progression region.
Sparse Covariance Matrix Estimation With Eigenvalue Constraints
LIU, Han; WANG, Lie; ZHAO, Tuo
2014-01-01
We propose a new approach for estimating high-dimensional, positive-definite covariance matrices. Our method extends the generalized thresholding operator by adding an explicit eigenvalue constraint. The estimated covariance matrix simultaneously achieves sparsity and positive definiteness. The estimator is rate optimal in the minimax sense and we develop an efficient iterative soft-thresholding and projection algorithm based on the alternating direction method of multipliers. Empirically, we conduct thorough numerical experiments on simulated datasets as well as real data examples to illustrate the usefulness of our method. Supplementary materials for the article are available online. PMID:25620866
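The two ingredients named above, generalized soft-thresholding and an explicit eigenvalue constraint, can be combined in a toy alternating scheme as sketched below. This is only an illustration of the idea; the paper's estimator is obtained from a proper ADMM formulation with an iterative soft-thresholding and projection algorithm, and the threshold and eigenvalue floor here are arbitrary.

```python
import numpy as np

def soft_threshold(S, lam):
    T = np.sign(S) * np.maximum(np.abs(S) - lam, 0.0)
    np.fill_diagonal(T, np.diag(S))            # leave the diagonal un-penalized
    return T

def eigen_floor(S, eps=1e-3):
    w, V = np.linalg.eigh((S + S.T) / 2.0)
    return (V * np.maximum(w, eps)) @ V.T      # clip eigenvalues from below at eps

def sparse_pd_covariance(X, lam=0.1, eps=1e-3, n_iter=50):
    Sigma = np.cov(X, rowvar=False)
    for _ in range(n_iter):
        Sigma = eigen_floor(soft_threshold(Sigma, lam), eps)
    return Sigma                               # sparse and positive definite
```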
Vaccarino, Anthony L; Dharsee, Moyez; Strother, Stephen; Aldridge, Don; Arnott, Stephen R; Behan, Brendan; Dafnas, Costas; Dong, Fan; Edgecombe, Kenneth; El-Badrawi, Rachad; El-Emam, Khaled; Gee, Tom; Evans, Susan G; Javadi, Mojib; Jeanson, Francis; Lefaivre, Shannon; Lutz, Kristen; MacPhee, F Chris; Mikkelsen, Jordan; Mikkelsen, Tom; Mirotchnick, Nicholas; Schmah, Tanya; Studzinski, Christa M; Stuss, Donald T; Theriault, Elizabeth; Evans, Kenneth R
2018-01-01
Historically, research databases have existed in isolation with no practical avenue for sharing or pooling medical data into high dimensional datasets that can be efficiently compared across databases. To address this challenge, the Ontario Brain Institute's "Brain-CODE" is a large-scale neuroinformatics platform designed to support the collection, storage, federation, sharing and analysis of different data types across several brain disorders, as a means to understand common underlying causes of brain dysfunction and develop novel approaches to treatment. By providing researchers access to aggregated datasets that they otherwise could not obtain independently, Brain-CODE incentivizes data sharing and collaboration and facilitates analyses both within and across disorders and across a wide array of data types, including clinical, neuroimaging and molecular. The Brain-CODE system architecture provides the technical capabilities to support (1) consolidated data management to securely capture, monitor and curate data, (2) privacy and security best-practices, and (3) interoperable and extensible systems that support harmonization, integration, and query across diverse data modalities and linkages to external data sources. Brain-CODE currently supports collaborative research networks focused on various brain conditions, including neurodevelopmental disorders, cerebral palsy, neurodegenerative diseases, epilepsy and mood disorders. These programs are generating large volumes of data that are integrated within Brain-CODE to support scientific inquiry and analytics across multiple brain disorders and modalities. By providing access to very large datasets on patients with different brain disorders and enabling linkages to provincial, national and international databases, Brain-CODE will help to generate new hypotheses about the biological bases of brain disorders, and ultimately promote new discoveries to improve patient care.
Vaccarino, Anthony L.; Dharsee, Moyez; Strother, Stephen; Aldridge, Don; Arnott, Stephen R.; Behan, Brendan; Dafnas, Costas; Dong, Fan; Edgecombe, Kenneth; El-Badrawi, Rachad; El-Emam, Khaled; Gee, Tom; Evans, Susan G.; Javadi, Mojib; Jeanson, Francis; Lefaivre, Shannon; Lutz, Kristen; MacPhee, F. Chris; Mikkelsen, Jordan; Mikkelsen, Tom; Mirotchnick, Nicholas; Schmah, Tanya; Studzinski, Christa M.; Stuss, Donald T.; Theriault, Elizabeth; Evans, Kenneth R.
2018-01-01
Historically, research databases have existed in isolation with no practical avenue for sharing or pooling medical data into high dimensional datasets that can be efficiently compared across databases. To address this challenge, the Ontario Brain Institute’s “Brain-CODE” is a large-scale neuroinformatics platform designed to support the collection, storage, federation, sharing and analysis of different data types across several brain disorders, as a means to understand common underlying causes of brain dysfunction and develop novel approaches to treatment. By providing researchers access to aggregated datasets that they otherwise could not obtain independently, Brain-CODE incentivizes data sharing and collaboration and facilitates analyses both within and across disorders and across a wide array of data types, including clinical, neuroimaging and molecular. The Brain-CODE system architecture provides the technical capabilities to support (1) consolidated data management to securely capture, monitor and curate data, (2) privacy and security best-practices, and (3) interoperable and extensible systems that support harmonization, integration, and query across diverse data modalities and linkages to external data sources. Brain-CODE currently supports collaborative research networks focused on various brain conditions, including neurodevelopmental disorders, cerebral palsy, neurodegenerative diseases, epilepsy and mood disorders. These programs are generating large volumes of data that are integrated within Brain-CODE to support scientific inquiry and analytics across multiple brain disorders and modalities. By providing access to very large datasets on patients with different brain disorders and enabling linkages to provincial, national and international databases, Brain-CODE will help to generate new hypotheses about the biological bases of brain disorders, and ultimately promote new discoveries to improve patient care. PMID:29875648
Automatic three-dimensional registration of intravascular optical coherence tomography images
NASA Astrophysics Data System (ADS)
Ughi, Giovanni J.; Adriaenssens, Tom; Larsson, Matilda; Dubois, Christophe; Sinnaeve, Peter R.; Coosemans, Mark; Desmet, Walter; D'hooge, Jan
2012-02-01
Intravascular optical coherence tomography (IV-OCT) is a catheter-based high-resolution imaging technique able to visualize the inner wall of the coronary arteries and implanted devices in vivo with an axial resolution below 20 μm. IV-OCT is being used in several clinical trials aiming to quantify the vessel response to stent implantation over time. However, stent analysis is currently performed manually and corresponding images taken at different time points are matched through a very labor-intensive and subjective procedure. We present an automated method for the spatial registration of IV-OCT datasets. Stent struts are segmented through consecutive images and three-dimensional models of the stents are created for both datasets to be registered. The two models are initially roughly registered through an automatic initialization procedure and an iterative closest point algorithm is subsequently applied for a more precise registration. To correct for nonuniform rotational distortions (NURDs) and other potential acquisition artifacts, the registration is consecutively refined on a local level. The algorithm was first validated by using an in vitro experimental setup based on a polyvinyl-alcohol gel tubular phantom. Subsequently, an in vivo validation was obtained by exploiting stable vessel landmarks. The mean registration error in vitro was quantified to be 0.14 mm in the longitudinal axis and 7.3-deg mean rotation error. In vivo validation resulted in 0.23 mm in the longitudinal axis and 10.1-deg rotation error. These results indicate that the proposed methodology can be used for automatic registration of in vivo IV-OCT datasets. Such a tool will be indispensable for larger studies on vessel healing pathophysiology and reaction to stent implantation. As such, it will be valuable in testing the performance of new generations of intracoronary devices and new therapeutic drugs.
The 3D Reference Earth Model: Status and Preliminary Results
NASA Astrophysics Data System (ADS)
Moulik, P.; Lekic, V.; Romanowicz, B. A.
2017-12-01
In the 20th century, seismologists constructed models of how average physical properties (e.g. density, rigidity, compressibility, anisotropy) vary with depth in the Earth's interior. These one-dimensional (1D) reference Earth models (e.g. PREM) have proven indispensable in earthquake location, imaging of interior structure, understanding material properties under extreme conditions, and as a reference in other fields, such as particle physics and astronomy. Over the past three decades, new datasets motivated more sophisticated efforts that yielded models of how properties vary both laterally and with depth in the Earth's interior. Though these three-dimensional (3D) models exhibit compelling similarities at large scales, differences in the methodology, representation of structure, and dataset upon which they are based, have prevented the creation of 3D community reference models. As part of the REM-3D project, we are compiling and reconciling reference seismic datasets of body wave travel-time measurements, fundamental mode and overtone surface wave dispersion measurements, and normal mode frequencies and splitting functions. These reference datasets are being inverted for a long-wavelength, 3D reference Earth model that describes the robust long-wavelength features of mantle heterogeneity. As a community reference model with fully quantified uncertainties and tradeoffs and an associated publically available dataset, REM-3D will facilitate Earth imaging studies, earthquake characterization, inferences on temperature and composition in the deep interior, and be of improved utility to emerging scientific endeavors, such as neutrino geoscience. Here, we summarize progress made in the construction of the reference long period dataset and present a preliminary version of REM-3D in the upper-mantle. In order to determine the level of detail warranted for inclusion in REM-3D, we analyze the spectrum of discrepancies between models inverted with different subsets of the reference dataset. This procedure allows us to evaluate the extent of consistency in imaging heterogeneity at various depths and between spatial scales.
Washburne, Alex D; Silverman, Justin D; Leff, Jonathan W; Bennett, Dominic J; Darcy, John L; Mukherjee, Sayan; Fierer, Noah; David, Lawrence A
2017-01-01
Marker gene sequencing of microbial communities has generated big datasets of microbial relative abundances varying across environmental conditions, sample sites and treatments. These data often come with putative phylogenies, providing unique opportunities to investigate how shared evolutionary history affects microbial abundance patterns. Here, we present a method to identify the phylogenetic factors driving patterns in microbial community composition. We use the method, "phylofactorization," to re-analyze datasets from the human body and soil microbial communities, demonstrating how phylofactorization is a dimensionality-reducing tool, an ordination-visualization tool, and an inferential tool for identifying edges in the phylogeny along which putative functional ecological traits may have arisen.
Inference for High-dimensional Differential Correlation Matrices.
Cai, T Tony; Zhang, Anru
2016-01-01
Motivated by differential co-expression analysis in genomics, we consider in this paper estimation and testing of high-dimensional differential correlation matrices. An adaptive thresholding procedure is introduced and theoretical guarantees are given. The minimax rate of convergence is established and the proposed estimator is shown to be adaptively rate-optimal over collections of paired correlation matrices with approximately sparse differences. Simulation results show that the procedure significantly outperforms two other natural methods that are based on separate estimation of the individual correlation matrices. The procedure is also illustrated through an analysis of a breast cancer dataset, which provides evidence at the gene co-expression level that several genes, of which a subset has been previously verified, are associated with breast cancer. Hypothesis testing on the differential correlation matrices is also considered. A test, which is particularly well suited for testing against sparse alternatives, is introduced. In addition, other related problems, including estimation of a single sparse correlation matrix, estimation of the differential covariance matrices, and estimation of the differential cross-correlation matrices, are also discussed.
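In its crudest form, the estimation idea reduces to thresholding the entry-wise difference of two sample correlation matrices, as sketched below with a single universal threshold; the paper's adaptive procedure chooses entry-specific thresholds and comes with minimax guarantees.

```python
import numpy as np

def differential_correlation(X1, X2, thresh=0.2):
    # Entry-wise difference of the two sample correlation matrices,
    # with small entries set to zero by a hard threshold.
    D = np.corrcoef(X1, rowvar=False) - np.corrcoef(X2, rowvar=False)
    return np.where(np.abs(D) > thresh, D, 0.0)
```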
SEMIPARAMETRIC QUANTILE REGRESSION WITH HIGH-DIMENSIONAL COVARIATES
Zhu, Liping; Huang, Mian; Li, Runze
2012-01-01
This paper is concerned with quantile regression for a semiparametric regression model, in which both the conditional mean and conditional variance function of the response given the covariates admit a single-index structure. This semiparametric regression model enables us to reduce the dimension of the covariates and simultaneously retains the flexibility of nonparametric regression. Under mild conditions, we show that the simple linear quantile regression offers a consistent estimate of the index parameter vector. This is a surprising and interesting result because the single-index model is possibly misspecified under the linear quantile regression. With a root-n consistent estimate of the index vector, one may employ a local polynomial regression technique to estimate the conditional quantile function. This procedure is computationally efficient, which is very appealing in high-dimensional data analysis. We show that the resulting estimator of the quantile function performs asymptotically as efficiently as if the true value of the index vector were known. The methodologies are demonstrated through comprehensive simulation studies and an application to a real dataset. PMID:24501536
NASA Astrophysics Data System (ADS)
Wang, Xin; Zhang, Yanqi; Zhang, Limin; Li, Jiao; Zhou, Zhongxing; Zhao, Huijuan; Gao, Feng
2016-04-01
We present a generalized strategy for direct reconstruction in pharmacokinetic diffuse fluorescence tomography (DFT) with a CT-analogous scanning mode, which can accomplish one-step reconstruction of the indocyanine-green pharmacokinetic-rate images within in vivo small animals by incorporating the compartmental kinetic model into an adaptive extended Kalman filtering scheme and using an instantaneous sampling dataset. This scheme, compared with the established indirect and direct methods, eliminates the interim error of the DFT inversion and relaxes the expensive instrumentation requirement of obtaining highly time-resolved datasets of complete 360 deg projections. The scheme is validated by two-dimensional simulations for the two-compartment model and pilot phantom experiments for the one-compartment model, suggesting that the proposed method can estimate the compartmental concentrations and the pharmacokinetic-rates simultaneously with fair quantitative and localization accuracy, and is well suited for cost-effective and dense-sampling instrumentation based on the highly-sensitive photon counting technique.
Moon, Myungjin; Nakai, Kenta
2018-04-01
Currently, cancer biomarker discovery is one of the important research topics worldwide. In particular, detecting significant genes related to cancer is an important task for early diagnosis and treatment of cancer. Conventional studies mostly focus on genes that are differentially expressed in different states of cancer; however, noise in gene expression datasets and insufficient information in limited datasets impede precise analysis of novel candidate biomarkers. In this study, we propose an integrative analysis of gene expression and DNA methylation using normalization and unsupervised feature extractions to identify candidate biomarkers of cancer using renal cell carcinoma RNA-seq datasets. Gene expression and DNA methylation datasets are normalized by Box-Cox transformation and integrated into a one-dimensional dataset that retains the major characteristics of the original datasets by unsupervised feature extraction methods, and differentially expressed genes are selected from the integrated dataset. Use of the integrated dataset demonstrated improved performance as compared with conventional approaches that utilize gene expression or DNA methylation datasets alone. Validation based on the literature showed that a considerable number of top-ranked genes from the integrated dataset have known relationships with cancer, implying that novel candidate biomarkers can also be acquired from the proposed analysis method. Furthermore, we expect that the proposed method can be expanded for applications involving various types of multi-omics datasets.
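A minimal sketch of the integration step is shown below, with Box-Cox normalization applied column-wise and PCA used as a stand-in for the unsupervised feature extraction; the matrix orientation (genes as rows, samples as columns) and the positivity shift are assumptions made for illustration only.

```python
import numpy as np
from scipy.stats import boxcox
from sklearn.decomposition import PCA

def boxcox_columns(M, shift=1e-6):
    # Box-Cox requires strictly positive, non-constant columns.
    return np.column_stack([boxcox(col - col.min() + shift)[0] for col in M.T])

def integrate_omics(expr, meth, n_components=1):
    """expr, meth: (genes, samples) matrices measured on the same genes."""
    X = np.hstack([boxcox_columns(expr), boxcox_columns(meth)])
    # One integrated score per gene; extreme scores flag candidate biomarkers.
    return PCA(n_components=n_components).fit_transform(X)
```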
Human Activity Recognition in AAL Environments Using Random Projections.
Damaševičius, Robertas; Vasiljevas, Mindaugas; Šalkevičius, Justas; Woźniak, Marcin
2016-01-01
Automatic human activity recognition systems aim to capture the state of the user and its environment by exploiting heterogeneous sensors attached to the subject's body and permit continuous monitoring of numerous physiological signals reflecting the state of human actions. Successful identification of human activities can be immensely useful in healthcare applications for Ambient Assisted Living (AAL), for automatic and intelligent activity monitoring systems developed for elderly and disabled people. In this paper, we propose the method for activity recognition and subject identification based on random projections from high-dimensional feature space to low-dimensional projection space, where the classes are separated using the Jaccard distance between probability density functions of projected data. Two HAR domain tasks are considered: activity identification and subject identification. The experimental results using the proposed method with Human Activity Dataset (HAD) data are presented.
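The two ingredients of the method, random projection and a Jaccard-style distance between projected class densities, can be sketched as follows; the histogram-based density estimate, bin count and projection dimension are illustrative choices rather than the paper's settings.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

def histogram_density(Z, bins, ranges):
    h, _ = np.histogramdd(Z, bins=bins, range=ranges)
    return h / h.sum()

def jaccard_distance(p, q):
    # Generalized Jaccard distance between two non-negative density estimates.
    return 1.0 - np.minimum(p, q).sum() / np.maximum(p, q).sum()

def class_density_distances(X, y, n_components=2, bins=10, seed=0):
    Z = GaussianRandomProjection(n_components=n_components,
                                 random_state=seed).fit_transform(X)
    ranges = [(Z[:, j].min(), Z[:, j].max()) for j in range(n_components)]
    labels = np.unique(y)
    dens = {c: histogram_density(Z[y == c], bins, ranges) for c in labels}
    # Pairwise Jaccard distances between class densities in the projected space.
    return {(a, b): jaccard_distance(dens[a], dens[b])
            for i, a in enumerate(labels) for b in labels[i + 1:]}
```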
Human Activity Recognition in AAL Environments Using Random Projections
Damaševičius, Robertas; Vasiljevas, Mindaugas; Šalkevičius, Justas; Woźniak, Marcin
2016-01-01
Automatic human activity recognition systems aim to capture the state of the user and its environment by exploiting heterogeneous sensors attached to the subject's body and permit continuous monitoring of numerous physiological signals reflecting the state of human actions. Successful identification of human activities can be immensely useful in healthcare applications for Ambient Assisted Living (AAL), for automatic and intelligent activity monitoring systems developed for elderly and disabled people. In this paper, we propose the method for activity recognition and subject identification based on random projections from high-dimensional feature space to low-dimensional projection space, where the classes are separated using the Jaccard distance between probability density functions of projected data. Two HAR domain tasks are considered: activity identification and subject identification. The experimental results using the proposed method with Human Activity Dataset (HAD) data are presented. PMID:27413392
Non-model-based correction of respiratory motion using beat-to-beat 3D spiral fat-selective imaging.
Keegan, Jennifer; Gatehouse, Peter D; Yang, Guang-Zhong; Firmin, David N
2007-09-01
To demonstrate the feasibility of retrospective beat-to-beat correction of respiratory motion, without the need for a respiratory motion model. A high-resolution three-dimensional (3D) spiral black-blood scan of the right coronary artery (RCA) of six healthy volunteers was acquired over 160 cardiac cycles without respiratory gating. One spiral interleaf was acquired per cardiac cycle, prior to each of which a complete low-resolution fat-selective 3D spiral dataset was acquired. The respiratory motion (3D translation) on each cardiac cycle was determined by cross-correlating a region of interest (ROI) in the fat around the artery in the low-resolution datasets with that on a reference end-expiratory dataset. The measured translations were used to correct the raw data of the high-resolution spiral interleaves. Beat-to-beat correction provided consistently good results, with the image quality being better than that obtained with a fixed superior-inferior tracking factor of 0.6 and better than (N = 5) or equal to (N = 1) that achieved using a subject-specific retrospective 3D translation motion model. Non-model-based correction of respiratory motion using 3D spiral fat-selective imaging is feasible, and in this small group of volunteers produced better-quality images than a subject-specific retrospective 3D translation motion model. (c) 2007 Wiley-Liss, Inc.
Buettner, Florian; Moignard, Victoria; Göttgens, Berthold; Theis, Fabian J
2014-07-01
High-throughput single-cell quantitative real-time polymerase chain reaction (qPCR) is a promising technique allowing for new insights in complex cellular processes. However, the PCR reaction can be detected only up to a certain detection limit, whereas failed reactions could be due to low or absent expression, and the true expression level is unknown. Because this censoring can occur for high proportions of the data, it is one of the main challenges when dealing with single-cell qPCR data. Principal component analysis (PCA) is an important tool for visualizing the structure of high-dimensional data as well as for identifying subpopulations of cells. However, to date it is not clear how to perform a PCA of censored data. We present a probabilistic approach that accounts for the censoring and evaluate it for two typical datasets containing single-cell qPCR data. We use the Gaussian process latent variable model framework to account for censoring by introducing an appropriate noise model and allowing a different kernel for each dimension. We evaluate this new approach for two typical qPCR datasets (of mouse embryonic stem cells and blood stem/progenitor cells, respectively) by performing linear and non-linear probabilistic PCA. Taking the censoring into account results in a 2D representation of the data, which better reflects its known structure: in both datasets, our new approach results in a better separation of known cell types and is able to reveal subpopulations in one dataset that could not be resolved using standard PCA. The implementation was based on the existing Gaussian process latent variable model toolbox (https://github.com/SheffieldML/GPmat); extensions for noise models and kernels accounting for censoring are available at http://icb.helmholtz-muenchen.de/censgplvm. © The Author 2014. Published by Oxford University Press. All rights reserved.
Buettner, Florian; Moignard, Victoria; Göttgens, Berthold; Theis, Fabian J.
2014-01-01
Motivation: High-throughput single-cell quantitative real-time polymerase chain reaction (qPCR) is a promising technique allowing for new insights in complex cellular processes. However, the PCR reaction can be detected only up to a certain detection limit, whereas failed reactions could be due to low or absent expression, and the true expression level is unknown. Because this censoring can occur for high proportions of the data, it is one of the main challenges when dealing with single-cell qPCR data. Principal component analysis (PCA) is an important tool for visualizing the structure of high-dimensional data as well as for identifying subpopulations of cells. However, to date it is not clear how to perform a PCA of censored data. We present a probabilistic approach that accounts for the censoring and evaluate it for two typical datasets containing single-cell qPCR data. Results: We use the Gaussian process latent variable model framework to account for censoring by introducing an appropriate noise model and allowing a different kernel for each dimension. We evaluate this new approach for two typical qPCR datasets (of mouse embryonic stem cells and blood stem/progenitor cells, respectively) by performing linear and non-linear probabilistic PCA. Taking the censoring into account results in a 2D representation of the data, which better reflects its known structure: in both datasets, our new approach results in a better separation of known cell types and is able to reveal subpopulations in one dataset that could not be resolved using standard PCA. Availability and implementation: The implementation was based on the existing Gaussian process latent variable model toolbox (https://github.com/SheffieldML/GPmat); extensions for noise models and kernels accounting for censoring are available at http://icb.helmholtz-muenchen.de/censgplvm. Contact: fbuettner.phys@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online. PMID:24618470
Locally Linear Embedding of Local Orthogonal Least Squares Images for Face Recognition
NASA Astrophysics Data System (ADS)
Hafizhelmi Kamaru Zaman, Fadhlan
2018-03-01
Dimensionality reduction is very important in face recognition since it ensures that high-dimensional data can be mapped to a lower-dimensional space without losing salient and integral facial information. Locally Linear Embedding (LLE) has previously been used for this purpose; however, the process of acquiring LLE features requires considerable computation and resources. To overcome this limitation, we propose that a locally-applied Local Orthogonal Least Squares (LOLS) model be used for initial feature extraction before the application of LLE. By constructing least squares regression under orthogonal constraints, we can preserve more discriminant information in the local subspace of facial features while reducing the overall features into a more compact form that we call LOLS images. LLE can then be applied to the LOLS images to map their representation into a global coordinate system of much lower dimensionality. Several experiments carried out on publicly available face datasets such as AR, ORL, YaleB, and FERET under the Single Sample Per Person (SSPP) constraint demonstrate that our proposed method reduces the time required to compute LLE features while delivering better accuracy than either LLE or OLS alone. Comparisons against several other feature extraction methods and more recent feature-learning methods such as state-of-the-art Convolutional Neural Networks (CNN) also reveal the superiority of the proposed method under the SSPP constraint.
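Only the second stage of the pipeline is sketched below: LLE applied to pre-extracted features followed by a nearest-neighbour classifier. The LOLS feature extraction itself is not reproduced; any (n_samples, n_features) matrix of local least-squares features is assumed as input, and the neighbourhood size and embedding dimension are illustrative.

```python
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.neighbors import KNeighborsClassifier

def lle_then_nn(F_train, y_train, F_test, n_components=10, k=12):
    """F_train, F_test: (n_samples, n_features) matrices of pre-extracted features."""
    lle = LocallyLinearEmbedding(n_neighbors=k, n_components=n_components)
    Z_train = lle.fit_transform(F_train)       # global low-dimensional coordinates
    Z_test = lle.transform(F_test)             # out-of-sample mapping
    clf = KNeighborsClassifier(n_neighbors=1).fit(Z_train, y_train)
    return clf.predict(Z_test)
```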
NASA Astrophysics Data System (ADS)
Li, Guo; Xia, Jun; Li, Lei; Wang, Lidai; Wang, Lihong V.
2015-03-01
Linear transducer arrays are readily available for ultrasonic detection in photoacoustic computed tomography. They offer low cost, hand-held convenience, and conventional ultrasonic imaging. However, the elevational resolution of linear transducer arrays, which is usually determined by the weak focus of the cylindrical acoustic lens, is about one order of magnitude worse than the in-plane axial and lateral spatial resolutions. Therefore, conventional linear scanning along the elevational direction cannot provide high-quality three-dimensional photoacoustic images due to the anisotropic spatial resolutions. Here we propose an innovative method to achieve isotropic resolutions for three-dimensional photoacoustic images through combined linear and rotational scanning. We first scan the linear transducer array elevationally, then rotate it about its center by a small step and scan again, repeating until 180 degrees have been covered. To reconstruct isotropic three-dimensional images from the multiple-directional scanning dataset, we use the standard inverse Radon transform originating from X-ray CT. We acquired a three-dimensional microsphere phantom image through the inverse Radon transform method and compared it with a single-elevational-scan three-dimensional image. The comparison shows that our method improves the elevational resolution by up to one order of magnitude, approaching the in-plane lateral-direction resolution. In vivo rat images were also acquired.
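The reconstruction step maps directly onto standard filtered back-projection. The sketch below assumes one elevational slice has been resampled into a sinogram of shape (detectors, angles) covering 180 degrees; scikit-image's iradon then recovers the slice. Array names and shapes are assumptions for illustration.

```python
import numpy as np
from skimage.transform import iradon

def reconstruct_slice(sinogram, angles_deg):
    """sinogram: (n_detectors, n_angles) projections of a single elevational slice."""
    return iradon(sinogram, theta=angles_deg, circle=True)

# Example with synthetic shapes only:
# sinogram = np.random.rand(256, 180)
# slice_img = reconstruct_slice(sinogram, np.linspace(0.0, 180.0, 180, endpoint=False))
```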
Comparing Simulated and Observed Spectroscopic Signatures of Mix in Omega Capsules
NASA Astrophysics Data System (ADS)
Tregillis, I. L.; Shah, R. C.; Hakel, P.; Cobble, J. A.; Murphy, T. J.; Krasheninnikova, N. S.; Hsu, S. C.; Bradley, P. A.; Schmitt, M. J.; Batha, S. H.; Mancini, R. C.
2012-10-01
The Defect-Induced Mix Experiment (DIME) campaign at Los Alamos National Laboratory uses multi-monochromatic X-ray imaging (MMI) [T. Nagayama, R. C. Mancini, R. Florido, et al., J. Appl. Phys. 109, 093303 (2011)] to detect the migration of high-Z spectroscopic dopants into the hot core of an imploded capsule. We have developed an MMI post-processing tool for producing synthetic datasets from two- and three-dimensional Lagrangian numerical simulations of Omega and NIF shots. These synthetic datasets are of sufficient quality, and contain sufficient physics, that they can be analyzed in the same manner as actual MMI data. We have carried out an extensive comparison between simulated and observed MMI data for a series of polar direct-drive shots carried out at the Omega laser facility in January, 2011. The capsule diameter was 870 microns; the 15 micron CH ablators contained a 2 micron Ti-doped layer along the inner edge. All capsules were driven with 17 kJ; some capsules were manufactured with an equatorial "trench" defect. This talk will focus on the construction of spectroscopic-quality synthetic MMI datasets from numerical simulations, and their correlation with MMI measurements.
Comparing NetCDF and SciDB on managing and querying 5D hydrologic dataset
NASA Astrophysics Data System (ADS)
Liu, Haicheng; Xiao, Xiao
2016-11-01
Efficiently extracting information from high-dimensional hydro-meteorological modelling datasets requires smart solutions. Traditional methods are mostly file based, which makes the data easy to edit and access, but their contiguous storage structure causes efficiency problems. Others propose databases as an alternative because of advantages such as native functionality for manipulating multidimensional (MD) arrays, smart caching strategies and scalability. In this research, NetCDF file based solutions and the multidimensional array database management system (DBMS) SciDB, which applies a chunked storage structure, are benchmarked to determine the best solution for storing and querying a large 5D hydrologic modelling dataset. The effect of data storage configurations including chunk size, dimension order and compression on query performance is explored. Results indicate that the dimension order used to organize storage of the 5D data has a significant influence on query performance if the chunk size is very large, but the effect becomes insignificant when the chunk size is properly set. Compression in SciDB mostly has a negative influence on query performance. Caching is an advantage but may be influenced by the execution of different query processes. On the whole, the NetCDF solution without compression is in general more efficient than the SciDB DBMS.
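A hedged illustration of the chunked NetCDF configuration being benchmarked, using the netCDF4-python library; dimension names, sizes and chunk shapes are hypothetical.

```python
import numpy as np
from netCDF4 import Dataset

# Illustrative 5D layout (names and sizes are hypothetical, not those of the benchmarked dataset).
with Dataset("hydro5d.nc", "w") as nc:
    for name, size in [("run", 2), ("time", 120), ("level", 5), ("y", 50), ("x", 50)]:
        nc.createDimension(name, size)
    # Chunk shape and dimension order are the storage knobs the study benchmarks;
    # zlib=True would additionally enable compression.
    var = nc.createVariable("discharge", "f4", ("run", "time", "level", "y", "x"),
                            chunksizes=(1, 24, 5, 50, 50), zlib=False)
    var[:] = np.random.rand(2, 120, 5, 50, 50).astype("f4")

# A typical slice query: one run, one level, full time series for a small spatial window.
with Dataset("hydro5d.nc") as nc:
    series = nc.variables["discharge"][0, :, 0, 20:40, 20:40]
```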
Exploring Genome-Wide Expression Profiles Using Machine Learning Techniques.
Kebschull, Moritz; Papapanou, Panos N
2017-01-01
Although contemporary high-throughput -omics methods produce high-dimensional data, the resulting wealth of information is difficult to assess using traditional statistical procedures. Machine learning methods facilitate the detection of additional patterns, beyond the mere identification of lists of features that differ between groups. Here, we demonstrate the utility of (1) supervised classification algorithms in class validation, and (2) unsupervised clustering in class discovery. We use data from our previous work that described the transcriptional profiles of gingival tissue samples obtained from subjects suffering from chronic or aggressive periodontitis (1) to test whether the two diagnostic entities were also characterized by differences on the molecular level, and (2) to search for a novel, alternative classification of periodontitis based on the tissue transcriptomes. Using machine learning technology, we provide evidence for diagnostic imprecision in the currently accepted classification of periodontitis, and demonstrate that a novel, alternative classification based on differences in gingival tissue transcriptomes is feasible. The outlined procedures allow for the unbiased interrogation of high-dimensional datasets for characteristic underlying classes, and are applicable to a broad range of -omics data.
NASA Astrophysics Data System (ADS)
Hu, Yan-Yan; Li, Dong-Sheng
2016-01-01
Hyperspectral images (HSI) consist of many closely spaced bands carrying most of the object information. However, due to their high dimensionality and high data volume, it is hard to obtain satisfactory classification performance. In order to reduce the HSI data dimensionality in preparation for high classification accuracy, it is proposed to combine a band selection method based on artificial immune systems (AIS) with a hybrid-kernel support vector machine (SVM-HK) algorithm. After comparing different kernels for hyperspectral analysis, the approach mixes the radial basis function kernel (RBF-K) with the sigmoid kernel (Sig-K) and applies the optimized hybrid kernel in SVM classifiers. The SVM-HK algorithm is then used to guide the band selection of an improved version of the AIS. The AIS is composed of clonal selection and elite antibody mutation, and includes an evaluation process with an optimum index factor (OIF). Classification experiments were performed on a hyperspectral dataset of the San Diego Naval Base acquired by AVIRIS; the results show that the method is able to efficiently remove band redundancy while outperforming the traditional SVM classifier.
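A minimal sketch of the hybrid-kernel idea using scikit-learn's SVC with a callable kernel; the kernel weights and parameters are illustrative, and the AIS band-selection loop is omitted.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel, sigmoid_kernel

def hybrid_kernel(X, Y, w=0.7, gamma=0.5):
    # Convex combination of RBF and sigmoid kernels (weights/parameters are illustrative).
    return w * rbf_kernel(X, Y, gamma=gamma) + (1.0 - w) * sigmoid_kernel(X, Y, gamma=gamma)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))       # 100 pixels described by 30 selected bands (hypothetical)
y = rng.integers(0, 2, size=100)     # two land-cover classes

clf = SVC(kernel=hybrid_kernel).fit(X, y)
print(clf.score(X, y))
```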
Müllenbroich, M Caroline; Silvestri, Ludovico; Onofri, Leonardo; Costantini, Irene; Hoff, Marcel Van't; Sacconi, Leonardo; Iannello, Giulio; Pavone, Francesco S
2015-10-01
Comprehensive mapping and quantification of neuronal projections in the central nervous system requires high-throughput imaging of large volumes with microscopic resolution. To this end, we have developed a confocal light-sheet microscope that has been optimized for three-dimensional (3-D) imaging of structurally intact clarified whole-mount mouse brains. We describe the optical and electromechanical arrangement of the microscope and give details on the organization of the microscope management software. The software orchestrates all components of the microscope, coordinates critical timing and synchronization, and has been written in a versatile and modular structure using the LabVIEW language. It can easily be adapted and integrated to other microscope systems and has been made freely available to the light-sheet community. The tremendous amount of data routinely generated by light-sheet microscopy further requires novel strategies for data handling and storage. To complete the full imaging pipeline of our high-throughput microscope, we further elaborate on big data management from streaming of raw images up to stitching of 3-D datasets. The mesoscale neuroanatomy imaged at micron-scale resolution in those datasets allows characterization and quantification of neuronal projections in unsectioned mouse brains.
Parallel Planes Information Visualization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bush, Brian
2015-12-26
This software presents a user-provided multivariate dataset as an interactive three dimensional visualization so that the user can explore the correlation between variables in the observations and the distribution of observations among the variables.
Upscaling river biomass using dimensional analysis and hydrogeomorphic scaling
NASA Astrophysics Data System (ADS)
Barnes, Elizabeth A.; Power, Mary E.; Foufoula-Georgiou, Efi; Hondzo, Miki; Dietrich, William E.
2007-12-01
We propose a methodology for upscaling biomass in a river using a combination of dimensional analysis and hydro-geomorphologic scaling laws. We first demonstrate the use of dimensional analysis for determining local scaling relationships between Nostoc biomass and hydrologic and geomorphic variables. We then combine these relationships with hydraulic geometry and streamflow scaling in order to upscale biomass from point to reach-averaged quantities. The methodology is demonstrated through an illustrative example using an 18 year dataset of seasonal monitoring of biomass of a stream cyanobacterium (Nostoc parmeloides) in a northern California river.
Clustering by reordering of similarity and Laplacian matrices: Application to galaxy clusters
NASA Astrophysics Data System (ADS)
Mahmoud, E.; Shoukry, A.; Takey, A.
2018-04-01
Similarity metrics, kernels and similarity-based algorithms have gained much attention due to their increasing applications in information retrieval, data mining, pattern recognition and machine learning. Similarity graphs are often adopted as the underlying representation of similarity matrices and are at the origin of known clustering algorithms such as spectral clustering. Similarity matrices offer the advantage of working in object-object (two-dimensional) space, where visualization of cluster similarities is available, instead of object-features (multi-dimensional) space. In this paper, sparse ɛ-similarity graphs are constructed and decomposed into strong components using appropriate methods such as the Dulmage-Mendelsohn permutation (DMperm) and/or the Reverse Cuthill-McKee (RCM) algorithm. The obtained strong components correspond to groups (clusters) in the input (feature) space. The parameter ɛi is estimated locally, at each data point i, from a corresponding narrow range of the number of nearest neighbors. Although more advanced clustering techniques are available, our method has the advantages of simplicity, better complexity and direct visualization of the cluster similarities in a two-dimensional space. Also, no prior information about the number of clusters is needed. We conducted our experiments on two- and three-dimensional synthetic datasets of small and large size, as well as on a real astronomical dataset. The results are verified graphically and analyzed using gap statistics over a range of neighbors to verify the robustness of the algorithm and the stability of the results. Combining the proposed algorithm with gap statistics provides a promising tool for solving clustering problems. An astronomical application is conducted for confirming the existence of 45 galaxy clusters around the X-ray positions of galaxy clusters in the redshift range [0.1..0.8]. We re-estimate the photometric redshifts of the identified galaxy clusters and obtain acceptable values compared to published spectroscopic redshifts, with a standard deviation of their differences of 0.029.
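A simplified sketch of the graph construction and decomposition using SciPy and scikit-learn; a single global ɛ is used instead of the locally estimated ɛi, and the Dulmage-Mendelsohn step is not shown.

```python
import numpy as np
from sklearn.neighbors import radius_neighbors_graph
from scipy.sparse.csgraph import connected_components, reverse_cuthill_mckee

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])  # two synthetic blobs

# Sparse epsilon-similarity graph (one global epsilon here; the paper estimates it locally).
eps = 0.8
A = radius_neighbors_graph(X, radius=eps, mode="connectivity", include_self=False)

# Connected components of the (symmetric) graph correspond to clusters.
n_clusters, labels = connected_components(A, directed=False)

# Reordering by reverse Cuthill-McKee makes the block (cluster) structure visible in A.
perm = reverse_cuthill_mckee(A.tocsr(), symmetric_mode=True)
A_reordered = A[perm][:, perm]
print(n_clusters, labels[:5])
```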
Improving stability of prediction models based on correlated omics data by using network approaches.
Tissier, Renaud; Houwing-Duistermaat, Jeanine; Rodríguez-Girondo, Mar
2018-01-01
Building prediction models based on complex omics datasets such as transcriptomics, proteomics and metabolomics remains a challenge in bioinformatics and biostatistics. Regularized regression techniques are typically used to deal with the high dimensionality of these datasets. However, due to the presence of correlation in the datasets, it is difficult to select the best model, and application of these methods yields unstable results. We propose a novel strategy for model selection, where the obtained models also perform well in terms of overall predictability. Several three-step approaches are considered, where the steps are 1) network construction, 2) clustering to empirically derive modules or pathways, and 3) building a prediction model incorporating the information on the modules. For the first step, we use weighted correlation networks and Gaussian graphical modelling. Identification of groups of features is performed by hierarchical clustering. The grouping information is included in the prediction model by using group-based variable selection or group-specific penalization. We compare the performance of our new approaches with standard regularized regression via simulations. Based on these results we provide recommendations for selecting a strategy for building a prediction model given the specific goal of the analysis and the sizes of the datasets. Finally, we illustrate the advantages of our approach by applying the methodology to two problems, namely prediction of body mass index in the DIetary, Lifestyle, and Genetic determinants of Obesity and Metabolic syndrome study (DILGOM) and prediction of the response of each breast cancer cell line to treatment with specific drugs using a breast cancer cell lines pharmacogenomics dataset.
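A heavily simplified sketch of the three-step strategy; the group-based selection and group-specific penalties of the paper are replaced here by a plain ridge regression on module means, and all data are synthetic.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 400))                  # omics features (e.g., metabolites)
y = X[:, :5].sum(axis=1) + rng.normal(size=150)  # hypothetical outcome (e.g., BMI)

# Step 1: weighted correlation network between features.
corr = np.corrcoef(X, rowvar=False)

# Step 2: hierarchical clustering on correlation distance to derive modules.
dist = 1.0 - np.abs(corr)
Z = linkage(dist[np.triu_indices_from(dist, k=1)], method="average")  # condensed distances
modules = fcluster(Z, t=20, criterion="maxclust")

# Step 3: summarise each module (here by its mean) and fit a penalised regression.
M = np.column_stack([X[:, modules == m].mean(axis=1) for m in np.unique(modules)])
model = Ridge(alpha=1.0).fit(M, y)
print(model.score(M, y))
```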
Training Scalable Restricted Boltzmann Machines Using a Quantum Annealer
NASA Astrophysics Data System (ADS)
Kumar, V.; Bass, G.; Dulny, J., III
2016-12-01
Machine learning and the optimization involved therein are of critical importance for commercial and military applications. Due to the computational complexity of many-variable optimization, the conventional approach is to employ meta-heuristic techniques to find suboptimal solutions. Quantum Annealing (QA) hardware offers a completely novel approach with the potential to obtain significantly better solutions with large speed-ups compared to traditional computing. In this presentation, we describe our development of new machine learning algorithms tailored for QA hardware. We are training restricted Boltzmann machines (RBMs) using QA hardware on large, high-dimensional commercial datasets. Traditional optimization heuristics such as contrastive divergence and other closely related techniques are slow to converge, especially on large datasets. Recent studies have indicated that QA hardware, when used as a sampler, provides better training performance compared to conventional approaches. Most of these studies have been limited to moderately-sized datasets due to the hardware restrictions imposed by existing QA devices, which make it difficult to solve real-world problems at scale. In this work we develop novel strategies to circumvent this issue. We discuss scale-up techniques such as enhanced embedding and partitioned RBMs which allow large commercial datasets to be learned using QA hardware. We present our initial results obtained by training an RBM as an autoencoder on an image dataset. The results obtained so far indicate that the convergence rates can be improved significantly by increasing RBM network connectivity. These ideas can be readily applied to generalized Boltzmann machines and we are currently investigating this in an ongoing project.
Geraci, Joseph; Dharsee, Moyez; Nuin, Paulo; Haslehurst, Alexandria; Koti, Madhuri; Feilotter, Harriet E; Evans, Ken
2014-03-01
We introduce a novel method for visualizing high dimensional data via a discrete dynamical system. This method provides a 2D representation of the relationship between subjects according to a set of variables without geometric projections, transformed axes or principal components. The algorithm exploits a memory-type mechanism inherent in a certain class of discrete dynamical systems collectively referred to as the chaos game that are closely related to iterative function systems. The goal of the algorithm was to create a human readable representation of high dimensional patient data that was capable of detecting unrevealed subclusters of patients from within anticipated classifications. This provides a mechanism to further pursue a more personalized exploration of pathology when used with medical data. For clustering and classification protocols, the dynamical system portion of the algorithm is designed to come after some feature selection filter and before some model evaluation (e.g. clustering accuracy) protocol. In the version given here, a univariate features selection step is performed (in practice more complex feature selection methods are used), a discrete dynamical system is driven by this reduced set of variables (which results in a set of 2D cluster models), these models are evaluated for their accuracy (according to a user-defined binary classification) and finally a visual representation of the top classification models are returned. Thus, in addition to the visualization component, this methodology can be used for both supervised and unsupervised machine learning as the top performing models are returned in the protocol we describe here. Butterfly, the algorithm we introduce and provide working code for, uses a discrete dynamical system to classify high dimensional data and provide a 2D representation of the relationship between subjects. We report results on three datasets (two in the article; one in the appendix) including a public lung cancer dataset that comes along with the included Butterfly R package. In the included R script, a univariate feature selection method is used for the dimension reduction step, but in the future we wish to use a more powerful multivariate feature reduction method based on neural networks (Kriesel, 2007). A script written in R (designed to run on R studio) accompanies this article that implements this algorithm and is available at http://butterflygeraci.codeplex.com/. For details on the R package or for help installing the software refer to the accompanying document, Supporting Material and Appendix.
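A generic chaos-game iteration of the kind the Butterfly algorithm builds on, written in Python for illustration; the discretisation of features into symbols and the corner assignment are arbitrary choices, not the authors' protocol.

```python
import numpy as np

def chaos_game_trace(symbols, vertices):
    """Iterate the chaos game: move halfway toward the vertex indexed by each symbol."""
    point = np.array([0.5, 0.5])
    trace = []
    for s in symbols:
        point = (point + vertices[s]) / 2.0
        trace.append(point.copy())
    return np.array(trace)

rng = np.random.default_rng(0)
data = rng.normal(size=(60, 40))                       # 60 subjects x 40 selected features
vertices = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)

# Discretise each subject's feature vector into quartile symbols 0..3 and run the map;
# the final point of the trace gives a 2D coordinate for that subject.
coords = []
for row in data:
    symbols = np.digitize(row, np.quantile(row, [0.25, 0.5, 0.75]))
    coords.append(chaos_game_trace(symbols, vertices)[-1])
coords = np.array(coords)   # 2D representation of the 60 subjects
```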
Fisher, Charles K; Mehta, Pankaj
2015-06-01
Feature selection, identifying a subset of variables that are relevant for predicting a response, is an important and challenging component of many methods in statistics and machine learning. Feature selection is especially difficult and computationally intensive when the number of variables approaches or exceeds the number of samples, as is often the case for many genomic datasets. Here, we introduce a new approach, the Bayesian Ising Approximation (BIA), to rapidly calculate posterior probabilities for feature relevance in L2 penalized linear regression. In the regime where the regression problem is strongly regularized by the prior, we show that computing the marginal posterior probabilities for features is equivalent to computing the magnetizations of an Ising model with weak couplings. Using a mean field approximation, we show it is possible to rapidly compute the feature selection path described by the posterior probabilities as a function of the L2 penalty. We present simulations and analytical results illustrating the accuracy of the BIA on some simple regression problems. Finally, we demonstrate the applicability of the BIA to high-dimensional regression by analyzing a gene expression dataset with nearly 30 000 features. These results also highlight the impact of correlations between features on Bayesian feature selection. An implementation of the BIA in C++, along with data for reproducing our gene expression analyses, are freely available at http://physics.bu.edu/∼pankajm/BIACode. © The Author 2015. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.
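A hedged sketch of the kind of mean-field magnetization iteration referred to above; the fields h and couplings J are placeholders, whereas in the BIA they are derived from the regression quantities and the L2 penalty.

```python
import numpy as np

def mean_field_magnetizations(h, J, beta=1.0, n_iter=200):
    """Solve m_i = tanh(beta * (h_i + sum_j J_ij m_j)) by fixed-point iteration."""
    m = np.zeros_like(h)
    for _ in range(n_iter):
        m = np.tanh(beta * (h + J @ m))
    return m

# Placeholder fields and weak couplings; in the BIA these would be built from
# feature-response and feature-feature correlations and the penalty strength.
rng = np.random.default_rng(0)
h = rng.normal(scale=0.5, size=20)
J = 0.05 * rng.normal(size=(20, 20))
J = (J + J.T) / 2
np.fill_diagonal(J, 0.0)

m = mean_field_magnetizations(h, J)
relevance_like = (1.0 + m) / 2.0   # magnetizations map to probabilities in [0, 1]
```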
NASA Astrophysics Data System (ADS)
Kachach, Redouane; Cañas, José María
2016-05-01
Using video in traffic monitoring is one of the most active research domains in the computer vision community. TrafficMonitor, a system that employs a hybrid approach for automatic vehicle tracking and classification on highways using a simple stationary calibrated camera, is presented. The proposed system consists of three modules: vehicle detection, vehicle tracking, and vehicle classification. Moving vehicles are detected by an enhanced Gaussian mixture model background estimation algorithm. The design includes a technique to resolve the occlusion problem by using a combination of a two-dimensional proximity tracking algorithm and the Kanade-Lucas-Tomasi (KLT) feature tracking algorithm. The last module classifies the identified shapes into five vehicle categories: motorcycle, car, van, bus, and truck, by using three-dimensional templates and an algorithm based on histograms of oriented gradients and a support vector machine classifier. Several experiments have been performed using both real and simulated traffic in order to validate the system. The experiments were conducted on the GRAM-RTM dataset and on an additional real video dataset which is made publicly available as part of this work.
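A skeleton of the detection and tracking stages using standard OpenCV primitives that match the description (MOG2 background subtraction and KLT optical flow); the 3-D template and HOG/SVM classification stage is omitted, and the input file name is hypothetical.

```python
import cv2

cap = cv2.VideoCapture("highway.mp4")           # hypothetical input video
bg = cv2.createBackgroundSubtractorMOG2()        # Gaussian mixture background model (detection)
lk_params = dict(winSize=(21, 21), maxLevel=3)

ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
points = cv2.goodFeaturesToTrack(prev_gray, maxCorners=200, qualityLevel=0.01, minDistance=10)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    fg_mask = bg.apply(frame)                    # moving-vehicle mask for the detection stage
    if points is not None and len(points) > 0:
        # KLT feature tracking between consecutive frames (tracking stage)
        points, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, points, None, **lk_params)
        points = points[status.flatten() == 1].reshape(-1, 1, 2)
    prev_gray = gray
cap.release()
```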
Dimensionality Reduction Through Classifier Ensembles
NASA Technical Reports Server (NTRS)
Oza, Nikunj C.; Tumer, Kagan; Norwig, Peter (Technical Monitor)
1999-01-01
In data mining, one often needs to analyze datasets with a very large number of attributes. Performing machine learning directly on such data sets is often impractical because of extensive run times, excessive complexity of the fitted model (often leading to overfitting), and the well-known "curse of dimensionality." In practice, to avoid such problems, feature selection and/or extraction are often used to reduce data dimensionality prior to the learning step. However, existing feature selection/extraction algorithms either evaluate features by their effectiveness across the entire data set or simply disregard class information altogether (e.g., principal component analysis). Furthermore, feature extraction algorithms such as principal components analysis create new features that are often meaningless to human users. In this article, we present input decimation, a method that provides "feature subsets" that are selected for their ability to discriminate among the classes. These features are subsequently used in ensembles of classifiers, yielding results superior to single classifiers, ensembles that use the full set of features, and ensembles based on principal component analysis on both real and synthetic datasets.
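A hedged sketch of the input decimation idea: features are scored by correlation with each class, one base classifier is trained per class-specific subset, and the ensemble outputs are averaged. Subset size and base learner are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def input_decimation_ensemble(X, y, k=10):
    # Assumes integer class labels 0..K-1.
    members = []
    for c in np.unique(y):
        target = (y == c).astype(float)
        # Rank features by absolute correlation with the one-vs-rest class indicator.
        corr = np.array([abs(np.corrcoef(X[:, j], target)[0, 1]) for j in range(X.shape[1])])
        subset = np.argsort(corr)[::-1][:k]
        clf = DecisionTreeClassifier(max_depth=5).fit(X[:, subset], y)
        members.append((subset, clf))
    return members

def ensemble_predict(members, X):
    # Average the class-probability outputs of the decimated ensemble members.
    probs = np.mean([clf.predict_proba(X[:, subset]) for subset, clf in members], axis=0)
    return probs.argmax(axis=1)
```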
Convective dynamics - Panel report
NASA Technical Reports Server (NTRS)
Carbone, Richard; Foote, G. Brant; Moncrieff, Mitch; Gal-Chen, Tzvi; Cotton, William; Heymsfield, Gerald
1990-01-01
Aspects of highly organized forms of deep convection at midlatitudes are reviewed. Past emphasis in field work and cloud modeling has been directed toward severe weather as evidenced by research on tornadoes, hail, and strong surface winds. A number of specific issues concerning future thrusts, tactics, and techniques in convective dynamics are presented. These subjects include: convective modes and parameterization, global structure and scale interaction, convective energetics, transport studies, anvils and scale interaction, and scale selection. Also discussed are analysis workshops, four-dimensional data assimilation, matching models with observations, network Doppler analyses, mesoscale variability, and high-resolution/high-performance Doppler. It is also noted that classical surface measurements and soundings, flight-level research aircraft data, passive satellite data, and traditional photogrammetric studies are examples of datasets that require assimilation and integration.
Cluster Active Archive: lessons learnt
NASA Astrophysics Data System (ADS)
Laakso, H. E.; Perry, C. H.; Taylor, M. G.; Escoubet, C. P.; Masson, A.
2010-12-01
The ESA Cluster Active Archive (CAA) was opened to the public in February 2006 after an initial three-year development phase. It provides access (both a web GUI and a command-line tool are available) to the calibrated full-resolution datasets of the four-satellite Cluster mission. The data archive is publicly accessible and suitable for science use and publication by the world-wide scientific community. There are more than 350 datasets from each spacecraft, including high-resolution magnetic and electric DC and AC fields as well as full 3-dimensional electron and ion distribution functions and moments from a few eV to hundreds of keV. The Cluster mission has been in operation since February 2001, and although the CAA can currently provide access to some recent observations, the ingestion of some other datasets can be delayed by a few years due to the large and difficult calibration routines of aging detectors. The quality of the datasets is a central matter for the CAA. Having the same instrument on four spacecraft allows cross-instrument comparisons and provides confidence in some of the instrumental calibration parameters. Furthermore, it is highly important that many physical parameters are measured by more than one instrument, which allows extensive and continuous cross-calibration analyses to be performed. In addition, some of the instruments can be regarded as absolute or reference measurements for other instruments. The CAA attempts to avoid as much as possible mission-specific acronyms and concepts and tends to use more generic terms in describing the datasets and their contents in order to ease the usage of the CAA data by “non-Cluster” scientists. Currently the CAA has more than 1000 users, and every month more than 150 different users log in to the CAA for plotting and/or downloading observations. The users download about 1 terabyte of data every month. The CAA has separated the graphical tool from the download tool because full-resolution datasets can be visualized in many ways, so there is no one-to-one correspondence between graphical products and full-resolution datasets. The CAA encourages users to contact the CAA team for all kinds of issues, whether they concern the user interface, the content of the datasets, the quality of the observations or the provision of new types of services. The CAA runs regular annual reviews of the data products and the user services in order to improve the quality and usability of the CAA system for the world-wide user community. The CAA is continuously being upgraded in terms of datasets and services.
Face recognition by applying wavelet subband representation and kernel associative memory.
Zhang, Bai-Ling; Zhang, Haihong; Ge, Shuzhi Sam
2004-01-01
In this paper, we propose an efficient face recognition scheme which has two features: 1) representation of face images by two-dimensional (2-D) wavelet subband coefficients and 2) recognition by a modular, personalised classification method based on kernel associative memory models. Compared to PCA projections and low-resolution "thumb-nail" image representations, wavelet subband coefficients can efficiently capture substantial facial features while keeping computational complexity low. As there are usually very limited samples, we constructed an associative memory (AM) model for each person and proposed to improve the performance of AM models by kernel methods. Specifically, we first applied kernel transforms to each possible pair of training face samples and then mapped the high-dimensional feature space back to the input space. Our scheme of using modular autoassociative memory for face recognition is inspired by the same motivation as using autoencoders for optical character recognition (OCR), for which the advantages have been proven. By associative memory, all the prototypical faces of one particular person are used to reconstruct themselves, and the reconstruction error for a probe face image is used to decide if the probe face is from the corresponding person. We carried out extensive experiments on three standard face recognition datasets, the FERET data, the XM2VTS data, and the ORL data. Detailed comparisons with earlier published results are provided, and our proposed scheme offers better recognition accuracy on all of the face datasets.
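An illustrative sketch of the scheme's two ingredients using PyWavelets: subband coefficients as features and, per person, a linear autoassociative memory whose reconstruction error scores a probe; the kernel extension is omitted and all images are random stand-ins.

```python
import numpy as np
import pywt

def subband_features(img, wavelet="db2", level=2):
    # Low-resolution approximation subband as a compact facial feature vector.
    coeffs = pywt.wavedec2(img, wavelet, level=level)
    return coeffs[0].ravel()

def build_memory(prototype_vectors):
    # Linear autoassociative memory: projection onto the span of a person's prototypes.
    X = np.column_stack(prototype_vectors)   # d x n_prototypes
    return X @ np.linalg.pinv(X)

def reconstruction_error(memory, probe_vector):
    return np.linalg.norm(probe_vector - memory @ probe_vector)

# Usage sketch: the probe is assigned to the person whose memory reconstructs it best.
rng = np.random.default_rng(0)
faces_person_a = [rng.random((64, 64)) for _ in range(3)]   # hypothetical training faces
memory_a = build_memory([subband_features(f) for f in faces_person_a])
probe = subband_features(rng.random((64, 64)))
print(reconstruction_error(memory_a, probe))
```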
NASA Astrophysics Data System (ADS)
Wells, Robert R.; Momm, Henrique G.; Castillo, Carlos
2017-07-01
Spatio-temporal measurements of landform evolution provide the basis for process-based theory formulation and validation. Over time, field measurements of landforms have increased significantly worldwide, driven primarily by the availability of new surveying technologies. However, there is no standardized or coordinated effort within the scientific community to collect morphological data in a dependable and reproducible manner, specifically when performing long-term small-scale process investigation studies. Measurements of the same site using identical methods and equipment, but performed at different time periods, may lead to incorrect estimates of landform change as a result of three-dimensional registration errors. This work evaluated measurements of an ephemeral gully channel located on agricultural land using multiple independent survey techniques for locational accuracy and their applicability in generating information for model development and validation. Terrestrial and unmanned aerial vehicle photogrammetry platforms were compared to terrestrial lidar, defined herein as the reference dataset. Given the small scale of the measured landform, the alignment and ensemble equivalence between data sources was addressed through postprocessing. The utilization of ground control points was a prerequisite to three-dimensional registration between datasets and improved the confidence in the morphology information generated. None of the methods were without limitation; however, careful attention to project preplanning and data nature will ultimately guide the temporal efficacy and practicality of management decisions.
Beretta, Lorenzo; Santaniello, Alessandro; van Riel, Piet L C M; Coenen, Marieke J H; Scorza, Raffaella
2010-08-06
Epistasis is recognized as a fundamental part of the genetic architecture of individuals. Several computational approaches have been developed to model gene-gene interactions in case-control studies; however, none of them is suitable for time-dependent analysis. Herein we introduce the Survival Dimensionality Reduction (SDR) algorithm, a non-parametric method specifically designed to detect epistasis in lifetime datasets. The algorithm requires neither specification of the underlying survival distribution nor of the underlying interaction model, and proved satisfactorily powerful in detecting a set of causative genes in synthetic epistatic lifetime datasets with a limited number of samples and a high degree of right-censorship (up to 70%). The SDR method was then applied to a series of 386 Dutch patients with active rheumatoid arthritis who were treated with anti-TNF biological agents. Among a set of 39 candidate genes, none of which showed a detectable marginal effect on anti-TNF responses, the SDR algorithm did find that the rs1801274 SNP in the Fc gamma RIIa gene and the rs10954213 SNP in the IRF5 gene non-linearly interact to predict clinical remission after anti-TNF biologicals. Simulation studies and application in a real-world setting support the capability of the SDR algorithm to model epistatic interactions in candidate-gene studies in the presence of right-censored data. http://sourceforge.net/projects/sdrproject/.
Correlation coefficient based supervised locally linear embedding for pulmonary nodule recognition.
Wu, Panpan; Xia, Kewen; Yu, Hengyong
2016-11-01
Dimensionality reduction techniques are developed to suppress the negative effects of the high-dimensional feature space of lung CT images on classification performance in computer aided detection (CAD) systems for pulmonary nodule detection. An improved supervised locally linear embedding (SLLE) algorithm is proposed based on the concept of the correlation coefficient. The Spearman's rank correlation coefficient is introduced to adjust the distance metric in the SLLE algorithm to ensure that more suitable neighborhood points can be identified, and thus to enhance the discriminating power of the embedded data. The proposed Spearman's rank correlation coefficient based SLLE (SC(2)SLLE) is implemented and validated in our pilot CAD system using a clinical dataset collected from the publicly available Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI). Particularly, a representative CAD system for solitary pulmonary nodule detection is designed and implemented. After a sequence of medical image processing steps, 64 nodules and 140 non-nodules are extracted, and 34 representative features are calculated. The SC(2)SLLE, as well as the SLLE and LLE algorithms, are applied to reduce the dimensionality. Several quantitative measurements are also used to evaluate and compare the performances. Using a 5-fold cross-validation methodology, the proposed algorithm achieves 87.65% accuracy, 79.23% sensitivity, 91.43% specificity, and an 8.57% false positive rate, on average. Experimental results indicate that the proposed algorithm outperforms the original locally linear embedding and SLLE coupled with the support vector machine (SVM) classifier. Based on the preliminary results from a limited number of nodules in our dataset, this study demonstrates the great potential to improve the performance of a CAD system for nodule detection using the proposed SC(2)SLLE. Copyright © 2016 Elsevier Ireland Ltd. All rights reserved.
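One illustrative way to fold Spearman rank correlation into the neighbourhood metric before embedding, shown for intuition only; the paper's exact SC(2)SLLE adjustment and supervision term differ.

```python
import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 34))                  # 120 nodule candidates x 34 features

euclid = squareform(pdist(X))                   # baseline pairwise distance matrix
rho, _ = spearmanr(X, axis=1)                   # sample-by-sample rank correlations

# Illustrative adjustment: shrink distances between strongly rank-correlated samples
# so that they are preferred as embedding neighbours (not the paper's exact formula).
adjusted = euclid * (1.0 - np.abs(rho))
neighbors = np.argsort(adjusted, axis=1)[:, 1:11]   # 10 nearest neighbours per sample
```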
Slab2 - Updated Subduction Zone Geometries and Modeling Tools
NASA Astrophysics Data System (ADS)
Moore, G.; Hayes, G. P.; Portner, D. E.; Furtney, M.; Flamme, H. E.; Hearne, M. G.
2017-12-01
The U.S. Geological Survey database of global subduction zone geometries (Slab1.0) is a highly utilized dataset that has been applied to a wide range of geophysical problems. In 2017, these models have been improved and expanded upon as part of the Slab2 modeling effort. With a new data-driven approach that can be applied to a broader range of tectonic settings and geophysical data sets, we have generated a model set that will serve as a more comprehensive, reliable, and reproducible resource for three-dimensional slab geometries at all of the world's convergent margins. The newly developed framework of Slab2 is guided by: (1) a large integrated dataset, consisting of a variety of geophysical sources (e.g., earthquake hypocenters, moment tensors, active-source seismic survey images of the shallow slab, tomography models, receiver functions, bathymetry, trench ages, and sediment thickness information); (2) a dynamic filtering scheme aimed at constraining incorporated seismicity to only slab related events; (3) a 3-D data interpolation approach which captures both high resolution shallow geometries and instances of slab rollback and overlap at depth; and (4) an algorithm which incorporates uncertainties of contributing datasets to identify the most probable surface depth over the extent of each subduction zone. Further layers will also be added to the base geometry dataset, such as historic moment release, earthquake tectonic provenance, and interface coupling. Along with access to several queryable data formats, all components have been wrapped into an open source library in Python, such that suites of updated models can be released as further data becomes available. This presentation will discuss the extent of Slab2 development, as well as the current availability of the model and modeling tools.
NASA Astrophysics Data System (ADS)
Fotin, Sergei V.; Yin, Yin; Periaswamy, Senthil; Kunz, Justin; Haldankar, Hrishikesh; Muradyan, Naira; Cornud, François; Turkbey, Baris; Choyke, Peter L.
2012-02-01
Fully automated prostate segmentation helps to address several problems in prostate cancer diagnosis and treatment: it can assist in objective evaluation of multiparametric MR imagery, provides a prostate contour for MR-ultrasound (or CT) image fusion for computer-assisted image-guided biopsy or therapy planning, may facilitate reporting and enables direct prostate volume calculation. Among the challenges in automated analysis of MR images of the prostate are the variations of overall image intensities across scanners, the presence of a nonuniform multiplicative bias field within scans and differences in acquisition setup. Furthermore, images acquired with the presence of an endorectal coil suffer from localized high-intensity artifacts at the posterior part of the prostate. In this work, a three-dimensional method for fast automated prostate detection based on normalized gradient fields cross-correlation, insensitive to intensity variations and coil-induced artifacts, is presented and evaluated. The components of the method, offline template learning and the localization algorithm, are described in detail. The method was validated on a dataset of 522 T2-weighted MR images acquired at the National Cancer Institute, USA, which was split into two halves for development and testing. In addition, a second dataset of 29 MR exams from the Centre d'Imagerie Médicale Tourville, France, was used to test the algorithm. The 95% confidence intervals for the mean Euclidean distance between automatically and manually identified prostate centroids were 4.06 +/- 0.33 mm and 3.10 +/- 0.43 mm for the first and second test datasets respectively. Moreover, the algorithm provided the centroid within the true prostate volume in 100% of images from both datasets. The obtained results demonstrate the high utility of the detection method for a fully automated prostate segmentation.
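A 2-D sketch of the core operation, normalized gradient field cross-correlation, using NumPy and SciPy; the actual method operates in 3-D with learned templates.

```python
import numpy as np
from scipy.signal import fftconvolve

def normalized_gradient_field(img, eps=1e-3):
    gy, gx = np.gradient(img.astype(float))
    mag = np.sqrt(gx**2 + gy**2 + eps**2)
    return gx / mag, gy / mag

def ngf_cross_correlation(image, template):
    ix, iy = normalized_gradient_field(image)
    tx, ty = normalized_gradient_field(template)
    # Correlation = convolution with the flipped template, done per gradient component;
    # normalising the gradients makes the score insensitive to intensity scaling and bias.
    return (fftconvolve(ix, tx[::-1, ::-1], mode="same")
            + fftconvolve(iy, ty[::-1, ::-1], mode="same"))

rng = np.random.default_rng(0)
image = rng.random((128, 128))
template = image[40:72, 50:82].copy()          # hypothetical template patch
score_map = ngf_cross_correlation(image, template)
center = np.unravel_index(score_map.argmax(), score_map.shape)
```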
Nayak, Deepak Ranjan; Dash, Ratnakar; Majhi, Banshidhar
2017-01-01
This paper presents an automatic classification system for segregating pathological brains from normal brains in magnetic resonance imaging scans. The proposed system employs a contrast limited adaptive histogram equalization scheme to enhance the diseased region in brain MR images. A two-dimensional stationary wavelet transform is harnessed to extract features from the preprocessed images. The feature vector is constructed using the energy and entropy values computed from the level-2 SWT coefficients. Then, the relevant and uncorrelated features are selected using a symmetric uncertainty ranking filter. Subsequently, the selected features are given as input to the proposed AdaBoost with support vector machine classifier, where the SVM is used as the base classifier of the AdaBoost algorithm. To validate the proposed system, three standard MR image datasets, Dataset-66, Dataset-160, and Dataset-255, have been utilized. Results from 5 runs of k-fold stratified cross-validation indicate that the suggested scheme offers better performance than other existing schemes in terms of accuracy and number of features. The proposed system achieves ideal classification over Dataset-66 and Dataset-160, whereas for Dataset-255 an accuracy of 99.45% is achieved. Copyright© Bentham Science Publishers; For any queries, please email at epub@benthamscience.org.
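A sketch of the preprocessing and feature-extraction steps (CLAHE plus level-2 stationary wavelet transform with per-subband energy and entropy) using OpenCV and PyWavelets; the ranking filter and the AdaBoost-SVM classifier are not shown.

```python
import numpy as np
import cv2
import pywt

def swt_energy_entropy(img, wavelet="haar", level=2):
    # CLAHE enhancement, as in the preprocessing stage.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    img = clahe.apply(img.astype(np.uint8))
    # 2-D stationary wavelet transform (image sides must be divisible by 2**level).
    coeffs = pywt.swt2(img.astype(float), wavelet, level=level)
    features = []
    for approx, details in coeffs:
        for band in (approx, *details):
            energy = np.mean(band ** 2)
            p = np.abs(band).ravel()
            p = p / (p.sum() + 1e-12)
            entropy = -np.sum(p * np.log2(p + 1e-12))
            features.extend([energy, entropy])
    return np.array(features)

features = swt_energy_entropy(np.random.randint(0, 256, (256, 256)))
```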
NASA Astrophysics Data System (ADS)
Wheatland, Jonathan; Bushby, Andy; Droppo, Ian; Carr, Simon; Spencer, Kate
2015-04-01
Suspended estuarine sediments form flocs that are compositionally complex, fragile and irregularly shaped. The fate and transport of suspended particulate matter (SPM) is determined by the size, shape, density, porosity and stability of these flocs, and prediction of SPM transport requires accurate measurements of these three-dimensional (3D) physical properties. However, the multi-scaled nature of flocs, in addition to their fragility, makes their characterisation in 3D problematic. Correlative microscopy is a strategy involving the spatial registration of information collected at different scales using several imaging modalities. Previously, conventional optical microscopy (COM) and transmission electron microscopy (TEM) have enabled 2-dimensional (2D) floc characterisation at the gross (> 1 µm) and sub-micron scales respectively. Whilst this has proven insightful, there remains a critical spatial and dimensional gap preventing the accurate measurement of geometric properties and an understanding of how structures at different scales are related. Within the life sciences, volumetric imaging techniques such as 3D micro-computed tomography (3D µCT) and focused ion beam scanning electron microscopy [FIB-SEM (or FIB-tomography)] have been combined to characterise materials at the centimetre to micron scale. Combining these techniques with TEM enables an advanced correlative study, allowing material properties across multiple spatial and dimensional scales to be visualised. The aims of this study are: 1) to formulate an advanced correlative imaging strategy combining 3D µCT, FIB-tomography and TEM; 2) to acquire 3D datasets; 3) to produce a model allowing their co-visualisation; 4) to interpret 3D floc structure. To reduce the chance of structural alterations during analysis, samples were first 'fixed' in 2.5% glutaraldehyde/2% formaldehyde before being embedded in Durcupan resin. Intermediate steps were implemented to improve contrast and remove pore water, achieved by the addition of heavy metal stains and washing samples in a series of ethanol solutions and acetone. Gross-scale characterisation involved scanning samples using a Nikon Metrology HM X 225 µCT. For micro-scale analysis, a working surface was revealed by microtoming the sample. Ultrathin sections were then collected and analysed using a JEOL 1200 Ex II TEM, and FIB-tomography datasets were obtained using an FEI Quanta 3D FIB-SEM. Finally, to locate the surface and relate the TEM and FIB-tomography datasets to the original floc, samples were rescanned using the µCT. Image processing was initially conducted in ImageJ. Following this, datasets were imported into Amira 5.5, where pixel intensity thresholding allowed particle-matrix boundaries to be defined. Using 'landmarks', datasets were then registered to enable their co-visualisation in 3D models. Analysis of registered datasets reveals the complex non-fractal nature of flocs, whose properties span several orders of magnitude. Primary particles are organised into discrete 'bundles', the arrangement of which directly influences their gross morphology. This strategy, which allows the co-visualisation of spatially registered multi-scale 3D datasets, provides unique insights into the true nature of flocs which would otherwise have been impossible to obtain.
High-speed three-dimensional measurements with a fringe projection-based optical sensor
NASA Astrophysics Data System (ADS)
Bräuer-Burchardt, Christian; Breitbarth, Andreas; Kühmstedt, Peter; Notni, Gunther
2014-11-01
An optical three-dimensional (3-D) sensor based on a fringe projection technique that realizes the acquisition of the surface geometry of small objects was developed for highly resolved and ultrafast measurements. It achieves a data acquisition rate of up to 60 high-resolution 3-D datasets per second. The high measurement velocity was achieved by consequent fringe code reduction and parallel data processing. The reduction of the length of the fringe image sequence was obtained by omission of the Gray code sequence, using the geometric restrictions of the measurement objects and the geometric constraints of the sensor arrangement. The sensor covers three different measurement fields between 20 mm×20 mm and 40 mm×40 mm with a spatial resolution between 10 and 20 μm, respectively. In order to obtain a robust and fast recalibration of the sensor after a change of the measurement field, a calibration procedure based on single-shot analysis of a special test object was applied, which requires little effort and time. The sensor may be used, e.g., for quality inspection of conductor boards or plugs in real-time industrial applications.
bioWeb3D: an online webGL 3D data visualisation tool
2013-01-01
Background Data visualization is critical for interpreting biological data. However, in practice it can prove to be a bottleneck for untrained researchers; this is especially true for three dimensional (3D) data representation. Whilst existing software can provide all necessary functionalities to represent and manipulate biological 3D datasets, very few are easily accessible (browser based), cross-platform and accessible to non-expert users. Results An online HTML5/WebGL based 3D visualisation tool has been developed to allow biologists to quickly and easily view interactive and customizable three dimensional representations of their data along with multiple layers of information. Using the WebGL library Three.js written in Javascript, bioWeb3D allows the simultaneous visualisation of multiple large datasets inputted via a simple JSON, XML or CSV file, which can be read and analysed locally thanks to HTML5 capabilities. Conclusions Using basic 3D representation techniques in a technologically innovative context, we provide a program that is not intended to compete with professional 3D representation software, but that instead enables a quick and intuitive representation of reasonably large 3D datasets. PMID:23758781
NASA Astrophysics Data System (ADS)
Ehricke, Hans-Heino; Daiber, Gerhard; Sonntag, Ralf; Strasser, Wolfgang; Lochner, Mathias; Rudi, Lothar S.; Lorenz, Walter J.
1992-09-01
In stereotactic treatment planning the spatial relationships between a variety of objects have to be taken into account in order to avoid destruction of vital brain structures and rupture of vasculature. The visualization of these highly complex relations may be supported by 3-D computer graphics methods. In this context the three-dimensional display of the intracranial vascular tree and additional objects, such as neuroanatomy, pathology, stereotactic devices, or isodose surfaces, is of high clinical value. We report an advanced rendering method for a depth-enhanced maximum intensity projection from magnetic resonance angiography (MRA) and a walk-through approach to the analysis of MRA volume data. Furthermore, various methods for a multiple-object 3-D rendering in stereotaxy are discussed. The development of advanced applications in medical imaging can hardly be successful if image acquisition problems are disregarded. We put particular emphasis on the use of conventional MRI and MRA for stereotactic guidance. The problem of MR distortion is discussed and a novel three-dimensional approach to the quantification and correction of the distortion patterns is presented. Our results suggest that the sole use of MR for stereotactic guidance is highly practical. The true three-dimensionality of the acquired datasets opens up new perspectives for stereotactic treatment planning. For the first time, it is now possible to integrate all the necessary information into 3-D scenes, thus enabling interactive 3-D planning.
NASA Astrophysics Data System (ADS)
Han, Q.; Hu, X.; Cai, J.; Wei, W.
2016-12-01
The Xinzhou geothermal field is located in Guangdong province, adjacent to the South China Sea, and its hot springs can reach up to 92 degrees Celsius. Yanshanian granite is widely exposed in the south of this geothermal field, and four faults cut across each other over it. A dense grid of 176 broadband magnetotelluric (MT) sites has been acquired over the Xinzhou geothermal field and its surrounding area. Due to the related electromagnetic (EM) noise, one permanent observatory was placed as a remote reference to suppress this cultural EM noise interference. The datasets are processed using the mutual reference technique, static shift correction, and structural strike and dimensionality analysis based on tensor decomposition. Data analysis reveals that the underground conductivity structure has a clearly three-dimensional character. For a high-resolution result, two- and three-dimensional inversions are both applied in this area, employing the non-linear conjugate gradient (NLCG) method. These MT datasets are intended to resolve the deep subsurface resistivity structure correlated with the distribution of geothermal reservoirs (such as faults and fractured granite) and to investigate the channel of the upwelling magma. Intact, cold granite usually presents high resistivity, but once it functions as a reservoir the resistivity decreases, and sometimes it is hard to separate the reservoir from the cap layer. The 3D inversion results delineate three high-resistivity anomalies distributed in different locations. Finally, we propose that the large areas of granite form the major thermal source for the study area and discuss whether any melt exists under these magma intrusions.
Visualizing the ground motions of the 1906 San Francisco earthquake
Chourasia, A.; Cutchin, S.; Aagaard, Brad T.
2008-01-01
With advances in computational capabilities and refinement of seismic wave-propagation models in the past decade, large three-dimensional simulations of earthquake ground motion have become possible. The resulting datasets from these simulations are multivariate, temporal and multi-terabyte in size. Past visual representations of results from seismic studies have been largely confined to static two-dimensional maps. New visual representations provide scientists with alternate ways of viewing and interacting with these results, potentially leading to new and significant insight into the physical phenomena. Visualizations can also be used for pedagogic and general dissemination purposes. We present a workflow for visual representation of the data from a ground motion simulation of the great 1906 San Francisco earthquake. We have employed state-of-the-art animation tools for visualization of the ground motions with a high degree of accuracy and visual realism. © 2008 Elsevier Ltd.
Cross Validation Through Two-Dimensional Solution Surface for Cost-Sensitive SVM.
Gu, Bin; Sheng, Victor S; Tay, Keng Yeow; Romano, Walter; Li, Shuo
2017-06-01
Model selection plays an important role in cost-sensitive SVM (CS-SVM). It has been proven that the global minimum cross validation (CV) error can be efficiently computed based on the solution path for one-parameter learning problems. However, it is a challenge to obtain the global minimum CV error for CS-SVM based on a one-dimensional solution path and traditional grid search, because CS-SVM has two regularization parameters. In this paper, we propose a solution and error surfaces based CV approach (CV-SES). More specifically, we first compute a two-dimensional solution surface for CS-SVM based on a bi-parameter space partition algorithm, which can fit solutions of CS-SVM for all values of both regularization parameters. Then, we compute a two-dimensional validation error surface for each CV fold, which can fit validation errors of CS-SVM for all values of both regularization parameters. Finally, we obtain the CV error surface by superposing K validation error surfaces, which can find the global minimum CV error of CS-SVM. Experiments are conducted on seven datasets for cost-sensitive learning and on four datasets for imbalanced learning. Experimental results not only show that our proposed CV-SES has a better generalization ability than CS-SVM with various hybrids between grid search and solution path methods, and than the recently proposed cost-sensitive hinge loss SVM with three-dimensional grid search, but also show that CV-SES uses less running time.
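For intuition, a coarse grid-based illustration of superposing per-fold validation error surfaces over the two cost parameters; the paper replaces such a grid with an exact bi-parameter solution-surface computation.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

c_pos = np.logspace(-2, 2, 9)       # misclassification cost for the positive class
c_neg = np.logspace(-2, 2, 9)       # misclassification cost for the negative class
cv_error = np.zeros((len(c_pos), len(c_neg)))

for train, valid in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    for i, cp in enumerate(c_pos):
        for j, cn in enumerate(c_neg):
            # class_weight scales C per class, emulating the two cost parameters.
            clf = SVC(kernel="linear", C=1.0, class_weight={1: cp, 0: cn})
            clf.fit(X[train], y[train])
            # Superpose this fold's validation-error surface onto the CV error surface.
            cv_error[i, j] += np.mean(clf.predict(X[valid]) != y[valid])

best = np.unravel_index(cv_error.argmin(), cv_error.shape)
print(c_pos[best[0]], c_neg[best[1]])
```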
NASA Astrophysics Data System (ADS)
Brandl, Miriam B.; Beck, Dominik; Pham, Tuan D.
2011-06-01
The high dimensionality of image-based datasets can be a drawback for classification accuracy. In this study, we propose the application of fuzzy c-means clustering, cluster validity indices and the notion of a joint-feature-clustering matrix to find redundancies among image features. The introduced matrix indicates how frequently features are grouped in a mutual cluster. The resulting information can be used to find data-derived feature prototypes with a common biological meaning, reduce data storage as well as computation times, and improve the classification accuracy.
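A hedged sketch of the joint-feature-clustering matrix: features are clustered repeatedly and co-memberships counted; k-means is used here as a simple stand-in for fuzzy c-means, and all numbers are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 25))        # 80 images x 25 image-derived features

n_features = X.shape[1]
joint = np.zeros((n_features, n_features))

# Cluster the *features* (columns) over several cluster counts and count co-memberships.
for run, k in enumerate(range(3, 8)):
    labels = KMeans(n_clusters=k, n_init=10, random_state=run).fit_predict(X.T)
    joint += (labels[:, None] == labels[None, :]).astype(float)
joint /= joint.max()

# Frequently co-clustered feature pairs (large joint[i, j]) are candidates for redundancy removal.
redundant_pairs = np.argwhere(np.triu(joint, k=1) > 0.8)
```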
Onder, Devrim; Sarioglu, Sulen; Karacali, Bilge
2013-04-01
Quasi-supervised learning is a statistical learning algorithm that contrasts two datasets by computing an estimate of the posterior probability of each sample in either dataset. This method has not been applied to histopathological images before. The purpose of this study is to evaluate the performance of the method in identifying colorectal tissues with or without adenocarcinoma. Light microscopic digital images from histopathological sections were obtained from 30 colorectal radical surgery materials including adenocarcinoma and non-neoplastic regions. The texture features were extracted by using local histograms and co-occurrence matrices. The quasi-supervised learning algorithm operates on two datasets, one containing samples of normal tissues labelled only indirectly, and the other containing an unlabeled collection of samples of both normal and cancer tissues. As such, the algorithm eliminates the need for manually labelled samples of normal and cancer tissues for conventional supervised learning and significantly reduces the expert intervention. Several texture feature vector datasets corresponding to different extraction parameters were tested within the proposed framework. The Independent Component Analysis dimensionality reduction approach was also identified as the one improving the labelling performance evaluated in this series. In this series, the proposed method was applied to a dataset of 22,080 vectors with the dimensionality reduced from 132 to 119. Regions containing cancer tissue could be identified accurately, with false and true positive rates up to 19% and 88% respectively, without using manually labelled ground-truth datasets in a quasi-supervised strategy. The resulting labelling performances were compared to that of a conventional powerful supervised classifier using manually labelled ground-truth data. The supervised classifier results were calculated as 3.5% and 95% for the same case. The results in this series, in comparison with the benchmark classifier, suggest that quasi-supervised image texture labelling may be a useful method in the analysis and classification of pathological slides, but further study is required to improve the results. Copyright © 2013 Elsevier Ltd. All rights reserved.
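A sketch of the texture-feature extraction mentioned above (local grey-level histograms and co-occurrence statistics) using scikit-image; parameters are illustrative and the quasi-supervised labelling step itself is not shown.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def tile_texture_features(tile, levels=32):
    tile = (tile / tile.max() * (levels - 1)).astype(np.uint8)
    # Local grey-level histogram of the tile.
    hist, _ = np.histogram(tile, bins=levels, range=(0, levels), density=True)
    # Co-occurrence matrix statistics at a few offsets/angles.
    glcm = graycomatrix(tile, distances=[1, 2], angles=[0, np.pi / 2],
                        levels=levels, symmetric=True, normed=True)
    stats = [graycoprops(glcm, prop).ravel()
             for prop in ("contrast", "homogeneity", "energy", "correlation")]
    return np.concatenate([hist] + stats)

features = tile_texture_features(np.random.rand(64, 64))
```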
Arisha, Mohammed J; Hsiung, Ming C; Nanda, Navin C; ElKaryoni, Ahmed; Mohamed, Ahmed H; Wei, Jeng
2017-08-01
Hemangiomas are rarely found in the heart and pericardial involvement is even more rare. We report a case of primary pericardial hemangioma, in which three-dimensional transesophageal echocardiography (3DTEE) provided incremental benefit over standard two-dimensional images. Our case also highlights the importance of systematic cropping of the 3D datasets in making a diagnosis of pericardial hemangioma with a greater degree of certainty. In addition, we also provide a literature review of the features of cardiac/pericardial hemangiomas in a tabular form. © 2017, Wiley Periodicals, Inc.
Spatiotemporal Permutation Entropy as a Measure for Complexity of Cardiac Arrhythmia
NASA Astrophysics Data System (ADS)
Schlemmer, Alexander; Berg, Sebastian; Lilienkamp, Thomas; Luther, Stefan; Parlitz, Ulrich
2018-05-01
Permutation entropy (PE) is a robust quantity for measuring the complexity of time series. In the cardiac community it is predominantly used in the context of electrocardiogram (ECG) signal analysis for diagnoses and predictions, with a major application found in heart rate variability parameters. In this article we combine spatial and temporal PE to form a spatiotemporal PE that captures both the complexity of spatial structures and temporal complexity at the same time. We demonstrate that the spatiotemporal PE (STPE) quantifies complexity using two datasets from simulated cardiac arrhythmia and compare it to phase singularity analysis and spatial PE (SPE). These datasets simulate ventricular fibrillation (VF) on a two-dimensional and a three-dimensional medium using the Fenton-Karma model. We show that SPE and STPE are robust against noise and demonstrate their usefulness for extracting complexity features at different spatial scales.
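A minimal permutation entropy implementation for a single time series, for reference; the spatiotemporal extension combines such estimates across space and is not reproduced here.

```python
import numpy as np
from math import factorial
from collections import Counter

def permutation_entropy(x, order=3, delay=1):
    """Normalised Shannon entropy of ordinal patterns of length `order`."""
    counts = Counter()
    n = len(x) - (order - 1) * delay
    for i in range(n):
        window = x[i:i + order * delay:delay]
        counts[tuple(np.argsort(window))] += 1     # ordinal pattern of the window
    p = np.array(list(counts.values()), dtype=float)
    p /= p.sum()
    return -np.sum(p * np.log2(p)) / np.log2(factorial(order))

signal = np.sin(np.linspace(0, 20 * np.pi, 2000)) + 0.1 * np.random.randn(2000)
print(permutation_entropy(signal, order=4))
```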
A High-Resolution Merged Wind Dataset for DYNAMO: Progress and Future Plans
NASA Technical Reports Server (NTRS)
Lang, Timothy J.; Mecikalski, John; Li, Xuanli; Chronis, Themis; Castillo, Tyler; Hoover, Kacie; Brewer, Alan; Churnside, James; McCarty, Brandi; Hein, Paul;
2015-01-01
In order to support research on optimal data assimilation methods for the Cyclone Global Navigation Satellite System (CYGNSS), launching in 2016, work has been ongoing to produce a high-resolution merged wind dataset for the Dynamics of the Madden Julian Oscillation (DYNAMO) field campaign, which took place during late 2011/early 2012. The winds are produced by assimilating DYNAMO observations into the Weather Research and Forecasting (WRF) three-dimensional variational (3DVAR) system. Data sources from the DYNAMO campaign include the upper-air sounding network, radial velocities from the radar network, vector winds from the Advanced Scatterometer (ASCAT) and Oceansat-2 Scatterometer (OSCAT) satellite instruments, the NOAA High Resolution Doppler Lidar (HRDL), and several others. In order to prepare them for 3DVAR, significant additional quality control work is being done for the currently available TOGA and SMART-R radar datasets, including automatically dealiasing radial velocities and correcting for intermittent TOGA antenna azimuth angle errors. The assimilated winds are being made available as model output fields from WRF on two separate grids with different horizontal resolutions - a 3-km grid focusing on the main DYNAMO quadrilateral (i.e., Gan Island, the R/V Revelle, the R/V Mirai, and Diego Garcia), and a 1-km grid focusing on the Revelle. The wind dataset is focused on three separate approximately 2-week periods during the Madden Julian Oscillation (MJO) onsets that occurred in October, November, and December 2011. Work is ongoing to convert the 10-m surface winds from these model fields to simulated CYGNSS observations using the CYGNSS End-To-End Simulator (E2ES), and these simulated satellite observations are being compared to radar observations of DYNAMO precipitation systems to document the anticipated ability of CYGNSS to provide information on the relationships between surface winds and oceanic precipitation at the mesoscale level. This research will improve our understanding of the future utility of CYGNSS for documenting key MJO processes.
Discriminant locality preserving projections based on L1-norm maximization.
Zhong, Fujin; Zhang, Jiashu; Li, Defang
2014-11-01
Conventional discriminant locality preserving projection (DLPP) is a dimensionality reduction technique based on manifold learning, which has demonstrated good performance in pattern recognition. However, because its objective function is based on the distance criterion using L2-norm, conventional DLPP is not robust to outliers which are present in many applications. This paper proposes an effective and robust DLPP version based on L1-norm maximization, which learns a set of local optimal projection vectors by maximizing the ratio of the L1-norm-based locality preserving between-class dispersion and the L1-norm-based locality preserving within-class dispersion. The proposed method is proven to be feasible and also robust to outliers while overcoming the small sample size problem. The experimental results on artificial datasets, Binary Alphadigits dataset, FERET face dataset and PolyU palmprint dataset have demonstrated the effectiveness of the proposed method.
X-ray computed tomography library of shark anatomy and lower jaw surface models.
Kamminga, Pepijn; De Bruin, Paul W; Geleijns, Jacob; Brazeau, Martin D
2017-04-11
The cranial diversity of sharks reflects disparate biomechanical adaptations to feeding. In order to investigate and better understand the ecomorphology of extant shark feeding systems, we created an X-ray computed tomography (CT) library of shark cranial anatomy with three-dimensional (3D) lower jaw reconstructions. This library is used to examine and quantify lower jaw disparity in extant shark species in a separate study. The library is divided into two datasets. The first comprises medical CT scans of 122 sharks (Selachimorpha, Chondrichthyes) representing 73 extant species, including digitized morphology of entire shark specimens. This CT dataset, together with additional data provided by other researchers, was used to reconstruct a second dataset containing 3D models of the left lower jaw for 153 individuals representing 94 extant shark species. These datasets form an extensive anatomical record of shark skeletal anatomy, necessary for comparative morphological, biomechanical, ecological and phylogenetic studies.
Ambient Seismic Noise Interferometry on the Island of Hawai`i
NASA Astrophysics Data System (ADS)
Ballmer, Silke
Ambient seismic noise interferometry has been successfully applied in a variety of tectonic settings to gain information about the subsurface. As a passive seismic technique, it extracts the coherent part of ambient seismic noise in-between pairs of seismic receivers. Measurements of subtle temporal changes in seismic velocities and high-resolution tomographic imaging are then possible - two applications of particular interest for volcano monitoring. Promising results from other volcanic settings motivate its application in Hawai'i, with this work being the first to explore its potential. The dataset used for this purpose was recorded by the Hawaiian Volcano Observatory's permanent seismic network on the Island of Hawai'i. It spans 2.5 years from 5/2007 to 12/2009 and covers two distinct sources of volcanic tremor. After applying standard processing for ambient seismic noise interferometry, we find that volcanic tremor strongly affects the extracted noise information not only close to the tremor source but, unexpectedly, throughout the island-wide network. Besides demonstrating how this long-range observability of volcanic tremor can be used to monitor volcanic activity in the absence of a dense seismic array, our results suggest that care must be taken when applying ambient seismic noise interferometry in volcanic settings. In a second step, we thus exclude days that show signs of volcanic tremor, reducing the dataset to three months, and perform ambient seismic noise tomography. The resulting two-dimensional Rayleigh wave group velocity maps for 0.1 - 0.9 Hz compare very well with images from previous travel time tomography, both for the main volcanic structures at low frequencies and for smaller features at mid-to-high frequencies - a remarkable observation for the temporally truncated dataset. These robust results suggest that ambient seismic noise tomography in Hawai'i is suitable 1) to provide a three-dimensional S-wave model for the volcanoes and 2) to be used for repeated time-sensitive tomography, even though volcanic tremor frequently obscures ambient noise analyses. However, the noise characteristics and the wavefield in Hawai'i in general remain to be investigated in more detail in order to measure unbiased temporal velocity changes.
Application of constrained k-means clustering in ground motion simulation validation
NASA Astrophysics Data System (ADS)
Khoshnevis, N.; Taborda, R.
2017-12-01
The validation of ground motion synthetics has received increased attention over the last few years due to advances in physics-based deterministic and hybrid simulation methods. Unlike for low-frequency simulations (f ≤ 0.5 Hz), for which it has become reasonable to expect a good match between synthetics and data, in the case of high-frequency simulations (f ≥ 1 Hz) it is not possible to match results on a wiggle-by-wiggle basis. This is mostly due to the various complexities and uncertainties involved in earthquake ground motion modeling. Therefore, in order to compare synthetics with data we turn to different time series metrics, which are used as a means to characterize how the synthetics match the data in a qualitative and statistical sense. In general, these metrics provide goodness-of-fit (GOF) scores that measure the level of similarity in the time and frequency domains. It is common for these scores to be scaled from 0 to 10, with 10 representing a perfect match. Although individual metrics are considered more appropriate for particular applications, there is no consensus or unified method to classify the comparison between a set of synthetic and recorded seismograms when the various metrics offer different scores. We study the relationship among these metrics through a constrained k-means clustering approach. We define four hypothetical stations with scores 3, 5, 7, and 9 for all metrics and treat these stations as cannot-link constraints. We generate the dataset through the validation of the results from a deterministic (physics-based) ground motion simulation for a moderate magnitude earthquake in the greater Los Angeles basin using three velocity models. The maximum frequency of the simulation is 4 Hz. The dataset involves over 300 stations and 11 metrics, or features, as they are understood in the clustering process, where the metrics form a multi-dimensional space. We address the high-dimensional feature effects with a subspace-clustering analysis, generate a final labeled dataset of stations, and discuss the within-class statistical characteristics of each metric. Labeling these stations is the first step towards developing a unified metric to evaluate ground motion simulations in an application-independent manner.
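A minimal sketch of the clustering step described above, under simplifying assumptions: plain scikit-learn KMeans seeded at the four hypothetical anchor stations stands in for the constrained (cannot-link) k-means variant used in the study, and the 300-station, 11-metric GOF table is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Synthetic GOF table: 300 stations x 11 metrics, scores on a 0-10 scale (illustrative only).
scores = np.clip(rng.normal(loc=6.0, scale=2.0, size=(300, 11)), 0, 10)

# Four hypothetical "anchor" stations with uniform scores of 3, 5, 7 and 9 across all metrics.
anchors = np.array([[3.0] * 11, [5.0] * 11, [7.0] * 11, [9.0] * 11])

# Plain k-means seeded at the anchors approximates the constrained (cannot-link) variant,
# which additionally forbids the four anchors from ever sharing a cluster.
km = KMeans(n_clusters=4, init=anchors, n_init=1, random_state=0)
labels = km.fit_predict(np.vstack([anchors, scores]))

print("anchor labels:", labels[:4])              # ideally four distinct labels
print("stations per class:", np.bincount(labels[4:]))
```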
Serag, Ahmed; Wilkinson, Alastair G.; Telford, Emma J.; Pataky, Rozalia; Sparrow, Sarah A.; Anblagan, Devasuda; Macnaught, Gillian; Semple, Scott I.; Boardman, James P.
2017-01-01
Quantitative volumes from brain magnetic resonance imaging (MRI) acquired across the life course may be useful for investigating long term effects of risk and resilience factors for brain development and healthy aging, and for understanding early life determinants of adult brain structure. Therefore, there is an increasing need for automated segmentation tools that can be applied to images acquired at different life stages. We developed an automatic segmentation method for human brain MRI, where a sliding window approach and a multi-class random forest classifier were applied to high-dimensional feature vectors for accurate segmentation. The method performed well on brain MRI data acquired from 179 individuals, analyzed in three age groups: newborns (38–42 weeks gestational age), children and adolescents (4–17 years) and adults (35–71 years). As the method can learn from partially labeled datasets, it can be used to segment large-scale datasets efficiently. It could also be applied to different populations and imaging modalities across the life course. PMID:28163680
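A minimal sketch of the general approach described above (sliding-window patches as high-dimensional feature vectors classified by a random forest), not the authors' pipeline: the image, labels and patch size are synthetic, and the example is two-dimensional and binary for brevity.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def extract_patches(image, half=2):
    """Return one flattened (2*half+1)^2 patch per interior pixel."""
    h, w = image.shape
    feats, coords = [], []
    for r in range(half, h - half):
        for c in range(half, w - half):
            feats.append(image[r - half:r + half + 1, c - half:c + half + 1].ravel())
            coords.append((r, c))
    return np.array(feats), coords


rng = np.random.default_rng(0)
image = rng.random((64, 64))
image[16:48, 16:48] += 1.0                      # bright "tissue" region
truth = (image > 1.0).astype(int)               # synthetic ground-truth labels

X, coords = extract_patches(image)
y = np.array([truth[r, c] for r, c in coords])

# Train on a random subset of labelled pixels, predict the rest (partial labelling).
train = rng.random(len(y)) < 0.2
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X[train], y[train])
print("held-out accuracy:", clf.score(X[~train], y[~train]))
```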
Data Shared Lasso: A Novel Tool to Discover Uplift.
Gross, Samuel M; Tibshirani, Robert
2016-09-01
A model is presented for the supervised learning problem where the observations come from a fixed number of pre-specified groups, and the regression coefficients may vary sparsely between groups. The model spans the continuum between individual models for each group and one model for all groups. The resulting algorithm is designed with a high dimensional framework in mind. The approach is applied to a sentiment analysis dataset to show its efficacy and interpretability. One particularly useful application is for finding sub-populations in a randomized trial for which an intervention (treatment) is beneficial, often called the uplift problem. Some new concepts are introduced that are useful for uplift analysis. The value is demonstrated in an application to a real world credit card promotion dataset. In this example, although sending the promotion has a very small average effect, by targeting a particular subgroup with the promotion one can obtain a 15% increase in the proportion of people who purchase the new credit card.
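One common way to fit a model of this form is to augment the design matrix with group-specific copies of the features and apply an ordinary lasso to the augmented problem. The sketch below illustrates that construction on synthetic data with scikit-learn; the penalty level and the omission of group-specific weighting are simplifying assumptions, not the paper's exact procedure.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_per_group, p, n_groups = 100, 20, 3

# Shared coefficients plus a sparse group-specific deviation for group 2.
beta_shared = np.zeros(p); beta_shared[:3] = [2.0, -1.5, 1.0]
delta = np.zeros((n_groups, p)); delta[2, 5] = 3.0

X_list, y_list, g_list = [], [], []
for g in range(n_groups):
    Xg = rng.normal(size=(n_per_group, p))
    y_list.append(Xg @ (beta_shared + delta[g]) + 0.1 * rng.normal(size=n_per_group))
    X_list.append(Xg); g_list.append(np.full(n_per_group, g))
X, y, groups = np.vstack(X_list), np.concatenate(y_list), np.concatenate(g_list)

# Augmented design: [X | X*1(g=0) | X*1(g=1) | X*1(g=2)], so the lasso estimates a shared
# coefficient vector plus sparse per-group deviations in a single fit.
blocks = [X] + [X * (groups == g)[:, None] for g in range(n_groups)]
X_aug = np.hstack(blocks)

fit = Lasso(alpha=0.05).fit(X_aug, y)
beta_hat = fit.coef_[:p]                                  # shared part
deltas_hat = fit.coef_[p:].reshape(n_groups, p)           # group-specific deviations
print("largest group-2 deviation at feature", int(np.argmax(np.abs(deltas_hat[2]))))
```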
Investigating Bacterial-Animal Symbioses with Light Sheet Microscopy
Taormina, Michael J.; Jemielita, Matthew; Stephens, W. Zac; Burns, Adam R.; Troll, Joshua V.; Parthasarathy, Raghuveer; Guillemin, Karen
2014-01-01
Microbial colonization of the digestive tract is a crucial event in vertebrate development, required for maturation of host immunity and establishment of normal digestive physiology. Advances in genomic, proteomic, and metabolomic technologies are providing a more detailed picture of the constituents of the intestinal habitat, but these approaches lack the spatial and temporal resolution needed to characterize the assembly and dynamics of microbial communities in this complex environment. We report the use of light sheet microscopy to provide high resolution imaging of bacterial colonization of the zebrafish intestine. The methodology allows us to characterize bacterial population dynamics across the entire organ and the behaviors of individual bacterial and host cells throughout the colonization process. The large four-dimensional datasets generated by these imaging approaches require new strategies for image analysis. When integrated with other “omics” datasets, information about the spatial and temporal dynamics of microbial cells within the vertebrate intestine will provide new mechanistic insights into how microbial communities assemble and function within hosts. PMID:22983029
Data Shared Lasso: A Novel Tool to Discover Uplift
Gross, Samuel M.; Tibshirani, Robert
2017-01-01
A model is presented for the supervised learning problem where the observations come from a fixed number of pre-specified groups, and the regression coefficients may vary sparsely between groups. The model spans the continuum between individual models for each group and one model for all groups. The resulting algorithm is designed with a high dimensional framework in mind. The approach is applied to a sentiment analysis dataset to show its efficacy and interpretability. One particularly useful application is for finding sub-populations in a randomized trial for which an intervention (treatment) is beneficial, often called the uplift problem. Some new concepts are introduced that are useful for uplift analysis. The value is demonstrated in an application to a real world credit card promotion dataset. In this example, although sending the promotion has a very small average effect, by targeting a particular subgroup with the promotion one can obtain a 15% increase in the proportion of people who purchase the new credit card. PMID:29056802
Multi-test decision tree and its application to microarray data classification.
Czajkowski, Marcin; Grześ, Marek; Kretowski, Marek
2014-05-01
A desirable property of tools used to investigate biological data is that they produce easy-to-understand models and predictive decisions. Decision trees are particularly promising in this regard due to their comprehensible nature, which resembles the hierarchical process of human decision making. However, existing algorithms for learning decision trees have a tendency to underfit gene expression data. The main aim of this work is to improve the performance and stability of decision trees with only a small increase in their complexity. We propose a multi-test decision tree (MTDT); our main contribution is the application of several univariate tests in each non-terminal node of the decision tree. We also search for alternative, lower-ranked features in order to obtain more stable and reliable predictions. Experimental validation was performed on several real-life gene expression datasets. Comparison results with eight classifiers show that MTDT has a statistically significantly higher accuracy than popular decision tree classifiers, and it was highly competitive with ensemble learning algorithms. The proposed solution outperformed its baseline algorithm on 14 datasets by an average of 6%. A study performed on one of the datasets showed that the discovered genes used in the MTDT classification model are supported by biological evidence in the literature. This paper introduces a new type of decision tree which is more suitable for solving biological problems. MTDTs are relatively easy to analyze and much more powerful in modeling high dimensional microarray data than their popular counterparts. Copyright © 2014 Elsevier B.V. All rights reserved.
3D Analysis of Human Embryos and Fetuses Using Digitized Datasets From the Kyoto Collection.
Takakuwa, Tetsuya
2018-06-01
Three-dimensional (3D) analysis of the human embryonic and early-fetal period has been performed using digitized datasets obtained from the Kyoto Collection, in which the digital datasets play a primary role in research. Datasets include magnetic resonance imaging (MRI) acquired with 1.5 T, 2.35 T, and 7 T magnet systems, phase-contrast X-ray computed tomography (CT), and digitized histological serial sections. Large, high-resolution datasets covering a broad range of developmental periods, obtained with various methods of acquisition, are key elements of the studies. The digital data offer clear advantages that enabled us to develop various analyses. Digital data analysis accelerated morphological observation using precise and improved methods by providing a suitable plane for morphometric analysis of staged human embryos. Morphometric data are useful for quantitatively evaluating and demonstrating the features of development and for screening abnormal samples, which may be suggestive in the pathogenesis of congenital malformations. Morphometric data are also valuable for comparison with sonographic data in a process known as "sonoembryology." The 3D coordinates of anatomical landmarks may be useful tools for analyzing the positional change of landmarks of interest and their relationships during development. Several dynamic events could be explained by differential growth using 3D coordinates. Moreover, 3D coordinates can be utilized in mathematical as well as statistical analysis. The 3D analysis in our study may serve to provide accurate morphologic data, including the dynamics of embryonic structures related to developmental stages, which is required for insights into the dynamic and complex processes occurring during organogenesis. Anat Rec, 301:960-969, 2018. © 2018 Wiley Periodicals, Inc.
Parameterized Spectral Bathymetric Roughness Using the Nonequispaced Fast Fourier Transform
NASA Astrophysics Data System (ADS)
Fabre, David Hanks
The ocean and acoustic modeling community has specifically asked for roughness derived from bathymetry. An effort has been undertaken to provide what can be thought of as the high-frequency content of bathymetry; by contrast, the low-frequency content of bathymetry is the set of contours. The two-dimensional amplitude spectrum calculated with the nonequispaced fast Fourier transform (Kunis, 2006) is exploited as the statistic to provide several parameters of roughness following the method of Fox (1996). When an area is uniformly rough, it is termed isotropically rough. When an area exhibits lineation effects (as in a trough or a ridge line in the bathymetry), the term anisotropically rough is used. A predominant spatial azimuth of lineation summarizes anisotropic roughness. The power law model fit produces a roll-off parameter that also provides insight into the roughness of the area. These four parameters give rise to several derived parameters. Algorithmic accomplishments include reviving Fox's method (1985, 1996) and improving it with the possibly more geophysically appropriate nonequispaced fast Fourier transform. A new composite parameter, simply the overall integral length of the nonlinear parameterizing function, is used to make within-dataset comparisons. A synthetic dataset and six multibeam datasets covering practically all depth regimes have been analyzed with the tools that have been developed. Data-specific contributions include the possible discovery of an aspect-ratio isotropy cutoff level (less than 1.2) and a range of spectral fall-off values from about -0.5 for a sandy-bottomed Gulf of Mexico area to about -1.8 for a coral reef area just outside Saipan harbor. We also rank the targeted type of dataset, the best-resolution gridded datasets, from smoothest to roughest using a factor based on the kernel dimensions and a percentage from the windowing operation, multiplied by the overall integration length.
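A brief sketch of the spectral roll-off estimation on a synthetic grid: a uniform-grid FFT (numpy) stands in here for the nonequispaced fast Fourier transform used in the study, and the windowing, radial averaging and fitted wavenumber range are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 256x256 "bathymetry" grid; real analyses would use gridded multibeam soundings.
n = 256
depth = np.cumsum(np.cumsum(rng.normal(size=(n, n)), axis=0), axis=1)  # red-noise surface
depth -= depth.mean()

# Uniform-grid FFT stands in for the nonequispaced FFT; a Hann window reduces edge effects.
window = np.outer(np.hanning(n), np.hanning(n))
amp = np.abs(np.fft.fftshift(np.fft.fft2(depth * window)))

# Radially average the amplitude spectrum and fit a power law |A| ~ k^s in log-log space.
ky, kx = np.indices((n, n)) - n // 2
k = np.hypot(kx, ky).astype(int)
counts = np.bincount(k.ravel())
sums = np.bincount(k.ravel(), weights=amp.ravel())
radial = sums / np.maximum(counts, 1)

kk = np.arange(2, n // 2)                      # skip DC and the lowest wavenumber
slope, intercept = np.polyfit(np.log(kk), np.log(radial[kk]), 1)
print("spectral roll-off (slope):", round(slope, 2))
```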
Information extraction from dynamic PS-InSAR time series using machine learning
NASA Astrophysics Data System (ADS)
van de Kerkhof, B.; Pankratius, V.; Chang, L.; van Swol, R.; Hanssen, R. F.
2017-12-01
Due to the increasing number of SAR satellites, with shorter repeat intervals and higher resolutions, SAR data volumes are exploding. Time series analyses of SAR data, i.e. Persistent Scatterer (PS) InSAR, enable the deformation monitoring of the built environment at an unprecedented scale, with hundreds of scatterers per km², updated weekly. Potential hazards, e.g. due to failure of aging infrastructure, can be detected at an early stage. Yet, this requires the operational data processing of billions of measurement points over hundreds of epochs, updating this data set dynamically as new data come in, and testing whether points (start to) behave in an anomalous way. Moreover, the quality of PS-InSAR measurements is ambiguous and heterogeneous, which will yield false positives and false negatives. Such analyses are numerically challenging. Here we extract relevant information from PS-InSAR time series using machine learning algorithms. We cluster (group together) time series with similar behaviour, even though they may not be spatially close, such that the results can be used for further analysis. First we reduce the dimensionality of the dataset in order to be able to cluster the data, since applying clustering techniques to high-dimensional datasets often yields unsatisfactory results. Our approach is to apply t-distributed Stochastic Neighbor Embedding (t-SNE), a machine learning algorithm for dimensionality reduction of high-dimensional data to a 2D or 3D map, and to cluster this result using Density-Based Spatial Clustering of Applications with Noise (DBSCAN). The results show that we are able to detect and cluster time series with similar behaviour, which is the starting point for more extensive analysis into the underlying driving mechanisms. The results of the methods are compared to conventional hypothesis testing as well as a Self-Organising Map (SOM) approach. Hypothesis testing is robust and takes the stochastic nature of the observations into account, but is time consuming. Therefore, we successively apply our machine learning approach and the hypothesis testing approach in order to benefit both from the reduced computation time of the machine learning approach and from the robust quality metrics of hypothesis testing. We acknowledge support from NASA AISTNNX15AG84G (PI V. Pankratius)
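A minimal sketch of the two-step reduction-and-clustering workflow described above, using scikit-learn's t-SNE and DBSCAN on synthetic stand-in time series; the series, perplexity and DBSCAN parameters are illustrative and not tuned to real PS-InSAR data.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Synthetic stand-in for PS-InSAR displacement time series: 1500 points x 60 epochs,
# mixing stable, linearly subsiding and seasonal scatterers.
t = np.arange(60)
stable = 0.5 * rng.normal(size=(500, 60))
linear = -0.3 * t + 0.5 * rng.normal(size=(500, 60))
seasonal = 5 * np.sin(2 * np.pi * t / 12) + 0.5 * rng.normal(size=(500, 60))
series = np.vstack([stable, linear, seasonal])

# Step 1: non-linear dimensionality reduction to a 2D map with t-SNE.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(series)

# Step 2: density-based clustering of the embedding; label -1 marks noise/anomalous series.
labels = DBSCAN(eps=3.0, min_samples=20).fit_predict(embedding)
print("clusters found:", sorted(set(labels)))
```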
NASA Astrophysics Data System (ADS)
Wickersham, Andrew Joseph
There are two critical research needs for the study of hydrocarbon combustion in high speed flows: 1) combustion diagnostics with adequate temporal and spatial resolution, and 2) mathematical techniques that can extract key information from large datasets. The goal of this work is to address these needs, respectively, by the use of high speed and multi-perspective chemiluminescence and advanced mathematical algorithms. To obtain the measurements, this work explored the application of high speed chemiluminescence diagnostics and the use of fiber-based endoscopes (FBEs) for non-intrusive and multi-perspective chemiluminescence imaging up to 20 kHz. Non-intrusive and full-field imaging measurements provide a wealth of information for model validation and design optimization of propulsion systems. However, it is challenging to obtain such measurements due to various implementation difficulties such as optical access, thermal management, and equipment cost. This work therefore explores the application of FBEs for non-intrusive imaging of supersonic propulsion systems. The FBEs used in this work are demonstrated to overcome many of the aforementioned difficulties and provided datasets from multiple angular positions at up to 20 kHz in a supersonic combustor. The combustor operated on ethylene fuel at Mach 2 with an inlet stagnation temperature and pressure of approximately 640 degrees Fahrenheit and 70 psia, respectively. The imaging measurements were obtained from eight perspectives simultaneously, providing full-field datasets under such flow conditions for the first time and allowing the possibility of inferring multi-dimensional measurements. Due to their high speed and multi-perspective nature, such new diagnostic capabilities generate a large volume of data and call for analysis algorithms that can process the data and extract key physics effectively. To extract the key combustion dynamics from the measurements, three mathematical methods were investigated in this work: Fourier analysis, proper orthogonal decomposition (POD), and wavelet analysis (WA). These algorithms were first demonstrated and tested on imaging measurements obtained from one perspective in a sub-sonic combustor (up to Mach 0.2). The results show that these algorithms are effective in extracting the key physics from large datasets, including the characteristic frequencies of flow-flame interactions, especially during transient processes such as lean blow-off and ignition. After these relatively simple tests and demonstrations, the algorithms were applied to process the measurements obtained from multiple perspectives in the supersonic combustor. Compared to past analyses (which have been limited to data obtained from one perspective only), the availability of data from multiple perspectives provides further insights into the flame and flow structures in high speed flows. In summary, this work shows that high speed chemiluminescence is a simple yet powerful combustion diagnostic. Especially when combined with FBEs and the analysis algorithms described in this work, such diagnostics provide full-field imaging at high repetition rates in challenging flows. Based on such measurements, a wealth of information can be obtained from proper analysis algorithms, including characteristic frequencies, dominant flame modes, and even multi-dimensional flame and flow structures.
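Of the three analysis methods mentioned, proper orthogonal decomposition is the most compact to illustrate. The sketch below computes snapshot POD via the SVD of a mean-subtracted synthetic image sequence and extracts a characteristic frequency from the leading temporal coefficient; the data, frame rate and frequencies are synthetic, not the combustor measurements.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic chemiluminescence image sequence: 200 frames of 32x32 pixels containing one
# coherent oscillating structure plus noise (a stand-in for real combustor data).
frames, ny, nx = 200, 32, 32
x = np.linspace(0, 2 * np.pi, nx)
mode = np.outer(np.hanning(ny), np.sin(3 * x))                 # spatial structure
t = np.arange(frames)
data = (np.sin(2 * np.pi * 0.05 * t)[:, None, None] * mode
        + 0.2 * rng.normal(size=(frames, ny, nx)))

# Snapshot POD: subtract the mean image, flatten each frame, take the SVD.
snapshots = data.reshape(frames, -1)
snapshots -= snapshots.mean(axis=0)
U, S, Vt = np.linalg.svd(snapshots, full_matrices=False)

energy = S**2 / np.sum(S**2)
print("energy captured by first 3 POD modes:", np.round(energy[:3], 3))

# Temporal coefficient of the dominant mode; its spectrum reveals the characteristic frequency.
a1 = U[:, 0] * S[0]
freqs = np.fft.rfftfreq(frames, d=1.0)
peak = freqs[np.argmax(np.abs(np.fft.rfft(a1))[1:]) + 1]
print("dominant frequency of mode 1 (cycles/frame):", round(peak, 3))
```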
A Realistic Seizure Prediction Study Based on Multiclass SVM.
Direito, Bruno; Teixeira, César A; Sales, Francisco; Castelo-Branco, Miguel; Dourado, António
2017-05-01
A patient-specific algorithm for epileptic seizure prediction, based on multiclass support-vector machines (SVM) and using multi-channel high-dimensional feature sets, is presented. The feature sets, combined with multiclass classification and post-processing schemes, aim at the generation of alarms and reduced influence of false positives. This study considers 216 patients from the European Epilepsy Database, and includes 185 patients with scalp EEG recordings and 31 with intracranial data. The strategy was tested over a total of 16,729.80 h of inter-ictal data, including 1206 seizures. We found an overall sensitivity of 38.47% and a false positive rate per hour of 0.20. The performance of the method achieved statistical significance in 24 patients (11% of the patients). Despite the encouraging results previously reported on specific datasets, prospective demonstration on long-term EEG recordings has been limited. Our study presents a prospective analysis of a large, heterogeneous, multicentric dataset. The statistical framework, based on conservative assumptions, reflects a realistic approach compared to constrained datasets and/or in-sample evaluations. The improvement of these results, with the definition of an appropriate set of features able to improve the distinction between the pre-ictal and non-pre-ictal states, hence minimizing the effect of confounding variables, remains a key aspect.
Bahrami, Sheyda; Shamsi, Mousa
2017-01-01
Functional magnetic resonance imaging (fMRI) is a popular method to probe the functional organization of the brain using hemodynamic responses. In this method, volume images of the entire brain are obtained with very good spatial resolution but low temporal resolution. However, they always suffer from high dimensionality when fed to classification algorithms. In this work, we combine a support vector machine (SVM) with a self-organizing map (SOM) to perform feature-based classification, and a linear-kernel SVM is used for detecting the active areas. Here, we use the SOM for feature extraction and for labeling the datasets. The SOM has two major advantages: (i) it reduces the dimension of the data sets, lowering computational complexity, and (ii) it is useful for identifying brain regions with small onset differences in hemodynamic responses. Our non-parametric model is compared with parametric and non-parametric methods. We use simulated fMRI data sets and block design inputs in this paper and consider a contrast-to-noise ratio (CNR) value equal to 0.6 for the simulated datasets. The simulated fMRI dataset has contrast of 1-4% in active areas. The accuracy of our proposed method is 93.63% and the error rate is 6.37%.
NASA Astrophysics Data System (ADS)
Delgado, Juan A.; Altuve, Miguel; Nabhan Homsi, Masun
2015-12-01
This paper introduces a robust method based on the Support Vector Machine (SVM) algorithm to detect the presence of fetal QRS (fQRS) complexes in electrocardiogram (ECG) recordings provided by the PhysioNet/CinC challenge 2013. ECG signals are first segmented into contiguous frames of 250 ms duration and then labeled in six classes. Fetal segments are tagged according to the position of the fQRS complex within each one. Next, segment feature extraction and dimensionality reduction are obtained by applying principal component analysis to the Haar-wavelet transform. After that, two sub-datasets are generated to separate representative segments from atypical ones. The imbalanced class problem is dealt with by applying sampling without replacement on each sub-dataset. Finally, two SVMs are trained and cross-validated using the two balanced sub-datasets separately. Experimental results show that the proposed approach achieves high performance rates in fetal heartbeat detection, reaching up to 90.95% accuracy, 92.16% sensitivity, 88.51% specificity, 94.13% positive predictive value and 84.96% negative predictive value. A comparative study is also carried out to show the performance of two other machine learning algorithms for fQRS complex estimation, namely K-nearest neighbors and Bayesian networks.
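A minimal sketch of the described feature pipeline (Haar wavelet decomposition, PCA, SVM classification) on synthetic 250 ms segments; the segment generator, wavelet level, number of components and SVM settings are illustrative assumptions, and the six-class labeling and class-balancing steps of the study are omitted.

```python
import numpy as np
import pywt
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic 250 ms ECG segments (assuming 1 kHz sampling -> 250 samples per segment).
# Class 1 segments contain a small spike mimicking a fetal QRS complex.
def make_segment(has_fqrs):
    seg = 0.1 * rng.normal(size=250)
    if has_fqrs:
        pos = rng.integers(50, 200)
        seg[pos:pos + 5] += np.array([0.2, 0.8, 1.0, 0.6, 0.2])
    return seg

y = rng.integers(0, 2, size=600)
X = np.array([make_segment(label == 1) for label in y])

# Haar wavelet decomposition of each segment, then PCA for dimensionality reduction.
coeffs = np.array([np.concatenate(pywt.wavedec(seg, "haar", level=4)) for seg in X])
features = PCA(n_components=20, random_state=0).fit_transform(coeffs)

X_tr, X_te, y_tr, y_te = train_test_split(features, y, test_size=0.3, random_state=0)
clf = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)
print("test accuracy:", round(clf.score(X_te, y_te), 3))
```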
A New Method for Predicting Patient Survivorship Using Efficient Bayesian Network Learning
Jiang, Xia; Xue, Diyang; Brufsky, Adam; Khan, Seema; Neapolitan, Richard
2014-01-01
The purpose of this investigation is to develop and evaluate a new Bayesian network (BN)-based patient survivorship prediction method. The central hypothesis is that the method predicts patient survivorship well, while having the capability to handle high-dimensional data and be incorporated into a clinical decision support system (CDSS). We have developed EBMC_Survivorship (EBMC_S), which predicts survivorship for each year individually. EBMC_S is based on the EBMC BN algorithm, which has been shown to handle high-dimensional data. BNs have excellent architecture for decision support systems. In this study, we evaluate EBMC_S using the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) dataset, which concerns breast tumors. A 5-fold cross-validation study indicates that EBMC_S performs better than the Cox proportional hazard model and is comparable to the random survival forest method. We show that EBMC_S provides additional information such as sensitivity analyses, which covariates predict each year, and yearly areas under the ROC curve (AUROCs). We conclude that our investigation supports the central hypothesis. PMID:24558297
A new method for predicting patient survivorship using efficient bayesian network learning.
Jiang, Xia; Xue, Diyang; Brufsky, Adam; Khan, Seema; Neapolitan, Richard
2014-01-01
The purpose of this investigation is to develop and evaluate a new Bayesian network (BN)-based patient survivorship prediction method. The central hypothesis is that the method predicts patient survivorship well, while having the capability to handle high-dimensional data and be incorporated into a clinical decision support system (CDSS). We have developed EBMC_Survivorship (EBMC_S), which predicts survivorship for each year individually. EBMC_S is based on the EBMC BN algorithm, which has been shown to handle high-dimensional data. BNs have excellent architecture for decision support systems. In this study, we evaluate EBMC_S using the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) dataset, which concerns breast tumors. A 5-fold cross-validation study indicates that EBMC_S performs better than the Cox proportional hazard model and is comparable to the random survival forest method. We show that EBMC_S provides additional information such as sensitivity analyses, which covariates predict each year, and yearly areas under the ROC curve (AUROCs). We conclude that our investigation supports the central hypothesis.
Yoshimoto, Junichiro; Shimizu, Yu; Okada, Go; Takamura, Masahiro; Okamoto, Yasumasa; Yamawaki, Shigeto; Doya, Kenji
2017-01-01
We propose a novel method for multiple clustering, which is useful for analysis of high-dimensional data containing heterogeneous types of features. Our method is based on nonparametric Bayesian mixture models in which features are automatically partitioned (into views) for each clustering solution. This feature partition works as feature selection for a particular clustering solution, which screens out irrelevant features. To make our method applicable to high-dimensional data, a co-clustering structure is newly introduced for each view. Further, the outstanding novelty of our method is that we simultaneously model different distribution families, such as Gaussian, Poisson, and multinomial distributions in each cluster block, which widens areas of application to real data. We apply the proposed method to synthetic and real data, and show that our method outperforms other multiple clustering methods both in recovering true cluster structures and in computation time. Finally, we apply our method to a depression dataset with no true cluster structure available, from which useful inferences are drawn about possible clustering structures of the data. PMID:29049392
Inference for High-dimensional Differential Correlation Matrices *
Cai, T. Tony; Zhang, Anru
2015-01-01
Motivated by differential co-expression analysis in genomics, we consider in this paper estimation and testing of high-dimensional differential correlation matrices. An adaptive thresholding procedure is introduced and theoretical guarantees are given. The minimax rate of convergence is established and the proposed estimator is shown to be adaptively rate-optimal over collections of paired correlation matrices with approximately sparse differences. Simulation results show that the procedure significantly outperforms two other natural methods that are based on separate estimation of the individual correlation matrices. The procedure is also illustrated through an analysis of a breast cancer dataset, which provides evidence at the gene co-expression level that several genes, of which a subset has been previously verified, are associated with breast cancer. Hypothesis testing on the differential correlation matrices is also considered. A test, which is particularly well suited for testing against sparse alternatives, is introduced. In addition, other related problems, including estimation of a single sparse correlation matrix, estimation of the differential covariance matrices, and estimation of the differential cross-correlation matrices, are also discussed. PMID:26500380
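A small numerical sketch of the differential correlation idea on synthetic data: a simple universal threshold stands in here for the adaptive, entry-wise thresholding procedure analysed in the paper, and the threshold constant is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2, p = 80, 90, 200                      # two groups, 200 "genes" (synthetic)

# Group 1: genes 0 and 1 strongly co-expressed; Group 2: that correlation is absent.
X1 = rng.normal(size=(n1, p)); X1[:, 1] = 0.9 * X1[:, 0] + 0.4 * rng.normal(size=n1)
X2 = rng.normal(size=(n2, p))

R1 = np.corrcoef(X1, rowvar=False)
R2 = np.corrcoef(X2, rowvar=False)
D = R1 - R2                                   # differential correlation matrix

# Universal threshold (illustrative) in place of the paper's adaptive entry-wise rule.
threshold = 2.0 * np.sqrt(np.log(p) / min(n1, n2))
D_hat = D * (np.abs(D) > threshold)

i, j = np.unravel_index(np.argmax(np.abs(np.triu(D_hat, k=1))), D_hat.shape)
print(f"largest differential correlation: genes {i} and {j}, value {D_hat[i, j]:.2f}")
```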
NASA Astrophysics Data System (ADS)
Fehm, Thomas Felix; Deán-Ben, Xosé Luís; Razansky, Daniel
2014-10-01
Ultrasonography and optoacoustic imaging share powerful advantages related to the natural aptitude for real-time image rendering with high resolution, the hand-held operation, and lack of ionizing radiation. The two methods also possess very different yet highly complementary advantages of the mechanical and optical contrast in living tissues. Nonetheless, efficient integration of these modalities remains challenging owing to the fundamental differences in the underlying physical contrast, optimal signal acquisition, and image reconstruction approaches. We report on a method for hybrid acquisition and reconstruction of three-dimensional pulse-echo ultrasound and optoacoustic images in real time based on passive ultrasound generation with an optical absorber, thus avoiding the hardware complexity of active ultrasound generation. In this way, complete hybrid datasets are generated with a single laser interrogation pulse, resulting in simultaneous rendering of ultrasound and optoacoustic images at an unprecedented rate of 10 volumetric frames per second. Performance is subsequently showcased in phantom experiments and in-vivo measurements from a healthy human volunteer, confirming general clinical applicability of the method.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fehm, Thomas Felix; Razansky, Daniel, E-mail: dr@tum.de; Faculty of Medicine, Technische Universität München, Munich
2014-10-27
Ultrasonography and optoacoustic imaging share powerful advantages related to the natural aptitude for real-time image rendering with high resolution, the hand-held operation, and lack of ionizing radiation. The two methods also possess very different yet highly complementary advantages of the mechanical and optical contrast in living tissues. Nonetheless, efficient integration of these modalities remains challenging owing to the fundamental differences in the underlying physical contrast, optimal signal acquisition, and image reconstruction approaches. We report on a method for hybrid acquisition and reconstruction of three-dimensional pulse-echo ultrasound and optoacoustic images in real time based on passive ultrasound generation with an optical absorber, thus avoiding the hardware complexity of active ultrasound generation. In this way, complete hybrid datasets are generated with a single laser interrogation pulse, resulting in simultaneous rendering of ultrasound and optoacoustic images at an unprecedented rate of 10 volumetric frames per second. Performance is subsequently showcased in phantom experiments and in-vivo measurements from a healthy human volunteer, confirming general clinical applicability of the method.
Yang, Hao; MacLaren, Ian; Jones, Lewys; ...
2017-04-01
Recent development in fast pixelated detector technology has allowed a two dimensional diffraction pattern to be recorded at every probe position of a two dimensional raster scan in a scanning transmission electron microscope (STEM), forming an information-rich four dimensional (4D) dataset. Electron ptychography has been shown to enable efficient coherent phase imaging of weakly scattering objects from a 4D dataset recorded using a focused electron probe, which is optimised for simultaneous incoherent Z-contrast imaging and spectroscopy in STEM. Thus coherent phase contrast and incoherent Z-contrast imaging modes can be efficiently combined to provide good sensitivity to both light and heavy elements at atomic resolution. Here, we explore the application of electron ptychography for atomic resolution imaging of strongly scattering crystalline specimens, and present experiments on imaging crystalline specimens, including samples containing defects, under dynamical channelling conditions using an aberration corrected microscope. A ptychographic reconstruction method called Wigner distribution deconvolution (WDD) was implemented. Our experimental and simulation results suggest that ptychography provides a readily interpretable phase image and great sensitivity for imaging light elements at atomic resolution in relatively thin crystalline materials.
NASA Astrophysics Data System (ADS)
Yoshimura, K.; Oki, T.; Ohte, N.; Kanae, S.; Ichiyanagi, K.
2004-12-01
A simple water isotope circulation model on a global scale that includes a Rayleigh equation and the use of "realistic" external meteorological forcings estimates short-term variability of precipitation 18O. The results are validated against Global Network of Isotopes in Precipitation (GNIP) monthly observations and against daily observations at three sites in Thailand. The good agreement highlights the importance of large-scale transport and mixing of vapor masses as a control factor for spatial and temporal variability of precipitation isotopes, rather than in-cloud micro-processes. It also indicates the usefulness of the model and the isotope observation databases for evaluating two-dimensional atmospheric water circulation fields in forcing datasets. In this regard, two offline simulations for 1978-1993 with major reanalyses, i.e. NCEP and ERA15, were implemented, and the results show that over Europe ERA15 better matched observations at both monthly and interannual time scales, mainly owing to better precipitation fields in ERA15, while in the tropics both produced similarly accurate isotopic fields. The isotope analyses diagnose the accuracy of two-dimensional water circulation fields in datasets with a particular focus on precipitation processes.
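For reference, the Rayleigh distillation relation at the core of such a model can be evaluated in a few lines; the fractionation factor, reference ratio and initial composition below are illustrative textbook values, not the forcings or parameters used in the study.

```python
# Rayleigh distillation for the isotopic ratio of the remaining vapour:
#   R = R0 * f**(alpha - 1),   delta = (R / R_std - 1) * 1000  (permil)
# alpha > 1 is the liquid-vapour equilibrium fractionation factor (illustrative value).
alpha = 1.0098                # approximate 18O/16O fractionation near 20 degrees C (assumed)
delta0 = -12.0                # initial vapour delta-18O in permil (assumed)
R_std = 2.0052e-3             # VSMOW 18O/16O reference ratio

R0 = (delta0 / 1000.0 + 1.0) * R_std
for f in (1.0, 0.8, 0.5, 0.2):                      # fraction of vapour remaining
    R = R0 * f ** (alpha - 1.0)
    delta = (R / R_std - 1.0) * 1000.0
    print(f"f = {f:.1f}  ->  delta18O of vapour = {delta:6.2f} permil")
```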
Robust learning for optimal treatment decision with NP-dimensionality
Shi, Chengchun; Song, Rui; Lu, Wenbin
2016-01-01
In order to identify important variables that are involved in making optimal treatment decisions, Lu, Zhang and Zeng (2013) proposed a penalized least squares regression framework for a fixed number of predictors, which is robust against misspecification of the conditional mean model. Two problems arise: (i) in a world of explosively big data, effective methods are needed to handle ultra-high dimensional data sets, for example, when the dimension of the predictors is of non-polynomial (NP) order in the sample size; (ii) both the propensity score and conditional mean models need to be estimated from data under NP dimensionality. In this paper, we propose a robust procedure for estimating the optimal treatment regime under NP dimensionality. In both steps, penalized regressions are employed with a non-concave penalty function, where the conditional mean model of the response given predictors may be misspecified. The asymptotic properties, such as weak oracle properties, selection consistency and oracle distributions, of the proposed estimators are investigated. In addition, we study the limiting distribution of the estimated value function for the obtained optimal treatment regime. The empirical performance of the proposed estimation method is evaluated by simulations and an application to a depression dataset from the STAR*D study. PMID:28781717
Creation of three-dimensional craniofacial standards from CBCT images
NASA Astrophysics Data System (ADS)
Subramanyan, Krishna; Palomo, Martin; Hans, Mark
2006-03-01
Low-dose three-dimensional Cone Beam Computed Tomography (CBCT) is becoming increasingly popular in the clinical practice of dental medicine. Two-dimensional Bolton Standards of dentofacial development are routinely used to identify deviations from normal craniofacial anatomy. With the advent of CBCT three-dimensional imaging, we propose a set of methods to extend these 2D Bolton Standards to anatomically correct, surface-based 3D standards that allow analysis of morphometric changes seen in the craniofacial complex. To create 3D surface standards, we have implemented a series of steps: 1) converting bi-plane 2D tracings into sets of splines; 2) converting the 2D spline curves from the bi-plane projection into 3D space curves; 3) creating a labeled template of facial and skeletal shapes; and 4) creating 3D average surface Bolton standards. We used datasets from patients scanned with a Hitachi MercuRay CBCT scanner, which provides high-resolution, isotropic CT volume images, together with digitized Bolton Standards from ages 3 to 18 years (lateral and frontal male, female and average tracings), and converted them into facial and skeletal 3D space curves. This new 3D standard will help in assessing shape variations due to aging in the young population and provide a reference for correcting facial anomalies in dental medicine.
Information Visualization Techniques for Effective Cross-Discipline Communication
NASA Astrophysics Data System (ADS)
Fisher, Ward
2013-04-01
Collaboration between research groups in different fields is a common occurrence, but it can often be frustrating due to the absence of a common vocabulary. This lack of a shared context can make expressing important concepts and discussing results difficult. This problem may be further exacerbated when communicating to an audience of laypeople. Without a clear frame of reference, simple concepts are often rendered difficult-to-understand at best, and unintelligible at worst. An easy way to alleviate this confusion is with the use of clear, well-designed visualizations to illustrate an idea, process or conclusion. There exist a number of well-described machine-learning and statistical techniques which can be used to illuminate the information present within complex high-dimensional datasets. Once the information has been separated from the data, clear communication becomes a matter of selecting an appropriate visualization. Ideally, the visualization is information-rich but data-scarce. Anything from a simple bar chart, to a line chart with confidence intervals, to an animated set of 3D point-clouds can be used to render a complex idea as an easily understood image. Several case studies will be presented in this work. In the first study, we will examine how a complex statistical analysis was applied to a high-dimensional dataset, and how the results were succinctly communicated to an audience of microbiologists and chemical engineers. Next, we will examine a technique used to illustrate the concept of the singular value decomposition, as used in the field of computer vision, to a lay audience of undergraduate students from mixed majors. We will then examine a case where a simple animated line plot was used to communicate an approach to signal decomposition, and will finish with a discussion of the tools available to create these visualizations.
Zhao, Guangjun; Wang, Xuchu; Niu, Yanmin; Tan, Liwen; Zhang, Shao-Xiang
2016-01-01
Cryosection brain images in the Chinese Visible Human (CVH) dataset contain rich anatomical structure information of tissues because of their high resolution (e.g., 0.167 mm per pixel). Fast and accurate segmentation of these images into white matter, gray matter, and cerebrospinal fluid plays a critical role in analyzing and measuring the anatomical structures of the human brain. However, most existing automated segmentation methods are designed for computed tomography or magnetic resonance imaging data, and they may not be applicable to cryosection images due to the imaging difference. In this paper, we propose a supervised learning-based CVH brain tissue segmentation method that uses stacked autoencoders (SAE) to automatically learn the deep feature representations. Specifically, our model includes two successive parts where two three-layer SAEs take image patches as input to learn the complex anatomical feature representation, and then these features are sent to a Softmax classifier for inferring the labels. Experimental results validated the effectiveness of our method and showed that it outperformed four other classical brain tissue detection strategies. Furthermore, we reconstructed three-dimensional surfaces of these tissues, which show their potential in exploring the high-resolution anatomical structures of the human brain. PMID:27057543
Becht, Etienne; Simoni, Yannick; Coustan-Smith, Elaine; Maximilien, Evrard; Cheng, Yang; Ng, Lai Guan; Campana, Dario; Newell, Evan
2018-06-21
Recent flow and mass cytometers generate datasets of 20 to 40 dimensions and around a million single cells. From these, many tools facilitate the discovery of new cell populations associated with diseases or physiology. These new cell populations require the identification of new gating strategies, but gating strategies become exponentially more difficult to optimize when dimensionality increases. To facilitate this step, we developed Hypergate, an algorithm which, given a cell population of interest, identifies a gating strategy optimized for high yield and purity. Hypergate achieves higher yield and purity than human experts, Support Vector Machines and Random Forests on public datasets. We use it to revisit some established gating strategies for the identification of innate lymphoid cells, which identifies concise and efficient strategies that allow gating these cells with fewer parameters but higher yield and purity than the current standards. For phenotypic description, Hypergate's outputs are consistent with the field's knowledge and sparser than those from a competing method. Hypergate is implemented in R and available on CRAN. The source code is published at http://github.com/ebecht/hypergate under an Open Source Initiative-compliant licence. Supplementary data are available at Bioinformatics online.
Zhao, Guangjun; Wang, Xuchu; Niu, Yanmin; Tan, Liwen; Zhang, Shao-Xiang
2016-01-01
Cryosection brain images in the Chinese Visible Human (CVH) dataset contain rich anatomical structure information of tissues because of their high resolution (e.g., 0.167 mm per pixel). Fast and accurate segmentation of these images into white matter, gray matter, and cerebrospinal fluid plays a critical role in analyzing and measuring the anatomical structures of the human brain. However, most existing automated segmentation methods are designed for computed tomography or magnetic resonance imaging data, and they may not be applicable to cryosection images due to the imaging difference. In this paper, we propose a supervised learning-based CVH brain tissue segmentation method that uses stacked autoencoders (SAE) to automatically learn the deep feature representations. Specifically, our model includes two successive parts where two three-layer SAEs take image patches as input to learn the complex anatomical feature representation, and then these features are sent to a Softmax classifier for inferring the labels. Experimental results validated the effectiveness of our method and showed that it outperformed four other classical brain tissue detection strategies. Furthermore, we reconstructed three-dimensional surfaces of these tissues, which show their potential in exploring the high-resolution anatomical structures of the human brain.
Big Data Analytics for Demand Response: Clustering Over Space and Time
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chelmis, Charalampos; Kolte, Jahanvi; Prasanna, Viktor K.
The pervasive deployment of advanced sensing infrastructure in Cyber-Physical systems, such as the Smart Grid, has resulted in an unprecedented data explosion. Such data exhibit both large volumes and high velocity characteristics, two of the three pillars of Big Data, and have a time-series notion, as datasets in this context typically consist of successive measurements made over a time interval. Time-series data can be valuable for data mining and analytics tasks such as identifying the “right” customers among a diverse population to target for Demand Response programs. However, time series are challenging to mine due to their high dimensionality. In this paper, we motivate this problem using a real application from the smart grid domain. We explore novel representations of time-series data for Big Data analytics, and propose a clustering technique for determining a natural segmentation of customers and identification of temporal consumption patterns. Our method is generalizable to large-scale, real-world scenarios, without making any assumptions about the data. We evaluate our technique using real datasets from smart meters, totaling ~ 18,200,000 data points, and show the efficacy of our technique in efficiently detecting the optimal number of clusters.
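A minimal sketch of the kind of consumption-pattern segmentation described above: synthetic morning-peak, evening-peak and flat daily load profiles are normalised to their daily totals and clustered with k-means, with the number of clusters chosen by silhouette score rather than the authors' own technique.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic daily load profiles (kWh per hour) for 300 customers: morning-peak,
# evening-peak and flat consumers. Real analyses would use smart-meter readings.
hours = np.arange(24)
morning = 2 + 3 * np.exp(-((hours - 8) ** 2) / 8)
evening = 2 + 3 * np.exp(-((hours - 19) ** 2) / 8)
flat = np.full(24, 3.0)
profiles = np.vstack([base + 0.3 * rng.normal(size=(100, 24))
                      for base in (morning, evening, flat)])

# Normalise each customer's profile to its daily total so clusters reflect *shape*,
# then pick the number of clusters by silhouette score.
shapes = profiles / profiles.sum(axis=1, keepdims=True)
best_k, best_score = None, -1.0
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(shapes)
    score = silhouette_score(shapes, labels)
    if score > best_score:
        best_k, best_score = k, score
print("selected k:", best_k, "silhouette:", round(best_score, 3))
```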
AMUC: Associated Motion capture User Categories.
Norman, Sally Jane; Lawson, Sian E M; Olivier, Patrick; Watson, Paul; Chan, Anita M-A; Dade-Robertson, Martyn; Dunphy, Paul; Green, Dave; Hiden, Hugo; Hook, Jonathan; Jackson, Daniel G
2009-07-13
The AMUC (Associated Motion capture User Categories) project consisted of building a prototype sketch retrieval client for exploring motion capture archives. High-dimensional datasets reflect the dynamic process of motion capture and comprise high-rate sampled data of a performer's joint angles; in response to multiple query criteria, these data can potentially yield different kinds of information. The AMUC prototype harnesses graphic input via an electronic tablet as a query mechanism, time and position signals obtained from the sketch being mapped to the properties of data streams stored in the motion capture repository. As well as proposing a pragmatic solution for exploring motion capture datasets, the project demonstrates the conceptual value of iterative prototyping in innovative interdisciplinary design. The AMUC team was composed of live performance practitioners and theorists conversant with a variety of movement techniques, bioengineers who recorded and processed motion data for integration into the retrieval tool, and computer scientists who designed and implemented the retrieval system and server architecture, scoped for Grid-based applications. Creative input on information system design and navigation, and digital image processing, underpinned implementation of the prototype, which has undergone preliminary trials with diverse users, allowing identification of rich potential development areas.
Evolutionary fields can explain patterns of high-dimensional complexity in ecology
NASA Astrophysics Data System (ADS)
Wilsenach, James; Landi, Pietro; Hui, Cang
2017-04-01
One of the properties that make ecological systems so unique is the range of complex behavioral patterns that can be exhibited by even the simplest communities with only a few species. Much of this complexity is commonly attributed to stochastic factors that have very high degrees of freedom. Orthodox study of the evolution of these simple networks has generally been limited in its ability to explain complexity, since it restricts evolutionary adaptation to an inertia-free process with few degrees of freedom in which only gradual, moderately complex behaviors are possible. We propose a model inspired by particle-mediated field phenomena in classical physics, in combination with fundamental concepts in adaptation, which suggests that small but high-dimensional chaotic dynamics near the adaptive trait optimum could help explain complex properties shared by most ecological datasets, such as aperiodicity and pink, fractal noise spectra. By examining a simple predator-prey model and appealing to real ecological data, we show that this type of complexity could easily be confused for, or confounded by, stochasticity, especially when spurred on or amplified by stochastic factors that share variational and spectral properties with the underlying dynamics.
COVARIANCE ESTIMATION USING CONJUGATE GRADIENT FOR 3D CLASSIFICATION IN CRYO-EM.
Andén, Joakim; Katsevich, Eugene; Singer, Amit
2015-04-01
Classifying structural variability in noisy projections of biological macromolecules is a central problem in Cryo-EM. In this work, we build on a previous method for estimating the covariance matrix of the three-dimensional structure present in the molecules being imaged. Our proposed method allows for incorporation of contrast transfer function and non-uniform distribution of viewing angles, making it more suitable for real-world data. We evaluate its performance on a synthetic dataset and an experimental dataset obtained by imaging a 70S ribosome complex.
Data science for mental health: a UK perspective on a global challenge.
McIntosh, Andrew M; Stewart, Robert; John, Ann; Smith, Daniel J; Davis, Katrina; Sudlow, Cathie; Corvin, Aiden; Nicodemus, Kristin K; Kingdon, David; Hassan, Lamiece; Hotopf, Matthew; Lawrie, Stephen M; Russ, Tom C; Geddes, John R; Wolpert, Miranda; Wölbert, Eva; Porteous, David J
2016-10-01
Data science uses computer science and statistics to extract new knowledge from high-dimensional datasets (ie, those with many different variables and data types). Mental health research, diagnosis, and treatment could benefit from data science that uses cohort studies, genomics, and routine health-care and administrative data. The UK is well placed to trial these approaches through robust NHS-linked data science projects, such as the UK Biobank, Generation Scotland, and the Clinical Record Interactive Search (CRIS) programme. Data science has great potential as a low-cost, high-return catalyst for improved mental health recognition, understanding, support, and outcomes. Lessons learnt from such studies could have global implications. Copyright © 2016 Elsevier Ltd. All rights reserved.
Leung, Yuk Yee; Chang, Chun Qi; Hung, Yeung Sam
2012-01-01
Using a hybrid approach for gene selection and classification is common, as the results obtained are generally better than performing the two tasks independently. Yet, for some microarray datasets, both the classification accuracy and the stability of the gene sets obtained still have room for improvement. This may be due to the presence of samples with wrong class labels (i.e. outliers). Outlier detection algorithms proposed so far are either not suitable for microarray data, or only solve the outlier detection problem on their own. We tackle the outlier detection problem based on a previously proposed Multiple-Filter-Multiple-Wrapper (MFMW) model, which was demonstrated to yield promising results when compared to other hybrid approaches (Leung and Hung, 2010). To incorporate outlier detection and overcome limitations of the existing MFMW model, three new features are introduced in our proposed MFMW-outlier approach: 1) an unbiased external Leave-One-Out Cross-Validation framework is developed to replace internal cross-validation in the previous MFMW model; 2) wrongly labeled samples are identified within the MFMW-outlier model; and 3) a stable set of genes is selected using an L1-norm SVM that removes any redundant genes present. Six binary-class microarray datasets were tested. Compared with outlier detection studies on the same datasets, MFMW-outlier could detect all the outliers found in the original paper (for which the data was provided for analysis), and the genes selected after outlier removal were shown to have biological relevance. We also compared MFMW-outlier with PRAPIV (Zhang et al., 2006) on the same synthetic datasets. MFMW-outlier gave better average precision and recall values in three different settings. Lastly, artificially flipped microarray datasets were created by removing our detected outliers and flipping some of the remaining samples' labels. Almost all of the 'wrong' (artificially flipped) samples were detected, suggesting that MFMW-outlier is sufficiently powerful to detect outliers in high-dimensional microarray datasets.
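A minimal sketch of the external leave-one-out idea used to flag potentially mislabeled samples: each sample is predicted by a classifier that never saw it, and misclassified samples become outlier candidates. A plain linear SVM on synthetic data stands in for the full MFMW-outlier pipeline.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic two-class "microarray-like" data: 60 samples x 500 genes, 10 informative genes.
n, p = 60, 500
y = np.array([0] * 30 + [1] * 30)
X = rng.normal(size=(n, p))
X[y == 1, :10] += 1.5

# Flip the labels of three samples to simulate wrongly labeled outliers.
flipped = [5, 20, 45]
y_noisy = y.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

# External leave-one-out: each sample is predicted by a model that never saw it.
misclassified = []
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = LinearSVC(C=0.1, dual=False).fit(X[train_idx], y_noisy[train_idx])
    if clf.predict(X[test_idx])[0] != y_noisy[test_idx][0]:
        misclassified.append(test_idx[0])

print("samples flagged as potential outliers:", misclassified)  # ideally contains 5, 20, 45
```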
Wu, Dingming; Wang, Dongfang; Zhang, Michael Q; Gu, Jin
2015-12-01
One major goal of large-scale cancer omics studies is to identify molecular subtypes for more accurate cancer diagnoses and treatments. To deal with high-dimensional cancer multi-omics data, a promising strategy is to find an effective low-dimensional subspace of the original data and then cluster cancer samples in the reduced subspace. However, due to data-type diversity and big data volume, few methods can integratively and efficiently find the principal low-dimensional manifold of high-dimensional cancer multi-omics data. In this study, we proposed a novel low-rank approximation based integrative probabilistic model to quickly find the shared principal subspace across multiple data types: the convexity of the low-rank regularized likelihood function of the probabilistic model ensures efficient and stable model fitting. Candidate molecular subtypes can be identified by unsupervised clustering of hundreds of cancer samples in the reduced low-dimensional subspace. On testing datasets, our method LRAcluster (low-rank approximation based multi-omics data clustering) runs much faster with better clustering performance than the existing method. We then applied LRAcluster to large-scale cancer multi-omics data from TCGA. The pan-cancer analysis results show that cancers of different tissue origins are generally grouped as independent clusters, except squamous-like carcinomas, while the single cancer type analysis suggests that the omics data have different subtyping abilities for different cancer types. LRAcluster is a very useful method for fast dimension reduction and unsupervised clustering of large-scale multi-omics data. LRAcluster is implemented in R and freely available via http://bioinfo.au.tsinghua.edu.cn/software/lracluster/ .
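LRAcluster itself is an R package built on a low-rank regularized probabilistic model; the sketch below only illustrates the general strategy (standardise and concatenate multi-omics blocks, project to a shared low-dimensional subspace, cluster) using TruncatedSVD and k-means on synthetic data.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 150

# Synthetic multi-omics blocks for the same samples: "expression", "methylation", "CNV".
# Three hidden subtypes shift a subset of features in each block.
subtype = rng.integers(0, 3, size=n_samples)

def omics_block(p, scale):
    block = rng.normal(size=(n_samples, p))
    for s in range(3):
        block[subtype == s, s * 10:(s + 1) * 10] += scale
    return block

expr, meth, cnv = omics_block(2000, 1.0), omics_block(1000, 0.8), omics_block(500, 0.6)

# Standardise each data type, concatenate, then find a shared low-dimensional subspace.
combined = np.hstack([StandardScaler().fit_transform(b) for b in (expr, meth, cnv)])
low_dim = TruncatedSVD(n_components=10, random_state=0).fit_transform(combined)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(low_dim)
print("cluster sizes:", np.bincount(labels))
```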
Volonghi, Paola; Tresoldi, Daniele; Cadioli, Marcello; Usuelli, Antonio M; Ponzini, Raffaele; Morbiducci, Umberto; Esposito, Antonio; Rizzo, Giovanna
2016-02-01
To propose and assess a new method that automatically extracts a three-dimensional (3D) geometric model of the thoracic aorta (TA) from 3D cine phase contrast MRI (PCMRI) acquisitions. The proposed method is composed of two steps: segmentation of the TA and creation of the 3D geometric model. The segmentation algorithm, based on Level Set, was set up and applied to healthy subjects acquired in three different modalities (with and without SENSE reduction factors). Accuracy was evaluated using standard quality indices. The 3D model is characterized by the vessel surface mesh and its centerline; the comparison of models obtained from the three different datasets was also carried out in terms of radius of curvature (RC) and average tortuosity (AT). In all datasets, the segmentation quality indices confirmed very good agreement between manual and automatic contours (average symmetric distance < 1.44 mm, DICE Similarity Coefficient > 0.88). The 3D models extracted from the three datasets were found to be comparable, with differences of less than 10% for RC and 11% for AT. Our method proved effective on PCMRI data for providing a 3D geometric model of the TA to support morphometric and hemodynamic characterization of the aorta. © 2015 Wiley Periodicals, Inc.
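As a small aside, a sketch of one of the quality indices reported above, the Dice Similarity Coefficient, computed on hypothetical boolean masks; the paper's actual evaluation pipeline is not reproduced here.

```python
# Hedged sketch of one quality index used above: the Dice Similarity
# Coefficient between a manual and an automatic segmentation mask.
# The boolean masks are hypothetical stand-ins for real contours.
import numpy as np

def dice_coefficient(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """DSC = 2 |A ∩ B| / (|A| + |B|); 1.0 means perfect overlap."""
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

manual = np.zeros((64, 64), dtype=bool); manual[20:40, 20:40] = True
auto = np.zeros((64, 64), dtype=bool); auto[22:42, 21:41] = True
print(round(dice_coefficient(manual, auto), 3))   # overlap score (≈0.85 here)
```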
2010-01-01
Background Epistasis is recognized as a fundamental part of the genetic architecture of individuals. Several computational approaches have been developed to model gene-gene interactions in case-control studies; however, none of them is suitable for time-dependent analysis. Herein we introduce the Survival Dimensionality Reduction (SDR) algorithm, a non-parametric method specifically designed to detect epistasis in lifetime datasets. Results The algorithm requires neither specification about the underlying survival distribution nor about the underlying interaction model and proved sufficiently powerful to detect a set of causative genes in synthetic epistatic lifetime datasets with a limited number of samples and a high degree of right-censorship (up to 70%). The SDR method was then applied to a series of 386 Dutch patients with active rheumatoid arthritis who were treated with anti-TNF biological agents. Among a set of 39 candidate genes, none of which showed a detectable marginal effect on anti-TNF responses, the SDR algorithm did find that the rs1801274 SNP in the FcγRIIa gene and the rs10954213 SNP in the IRF5 gene non-linearly interact to predict clinical remission after anti-TNF biologicals. Conclusions Simulation studies and application in a real-world setting support the capability of the SDR algorithm to model epistatic interactions in candidate-gene studies in the presence of right-censored data. Availability: http://sourceforge.net/projects/sdrproject/ PMID:20691091
Feature Screening for Ultrahigh Dimensional Categorical Data with Applications.
Huang, Danyang; Li, Runze; Wang, Hansheng
2014-01-01
Ultrahigh dimensional data with both categorical responses and categorical covariates are frequently encountered in the analysis of big data, for which feature screening has become an indispensable statistical tool. We propose a Pearson chi-square based feature screening procedure for a categorical response with ultrahigh dimensional categorical covariates. The proposed procedure can be directly applied to the detection of important interaction effects. We further show that the proposed procedure possesses the screening consistency property in the terminology of Fan and Lv (2008). We investigate the finite sample performance of the proposed procedure by Monte Carlo simulation studies, and illustrate the proposed method with two empirical datasets.
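A hedged sketch of marginal chi-square screening in the spirit of the procedure described, on hypothetical categorical data; the exact statistic, thresholding rule and interaction screening of the paper are not reproduced.

```python
# Hedged sketch of marginal feature screening for categorical data:
# rank each categorical covariate by its Pearson chi-square statistic
# against the categorical response and keep the top-d features.
# The data and the cutoff d are hypothetical.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(2)
n, p, d = 500, 2000, 50
X = rng.integers(0, 3, size=(n, p))      # p categorical covariates with 3 levels
y = rng.integers(0, 2, size=n)           # binary categorical response

def chi2_stat(x, y):
    table = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(table, (x, y), 1)          # contingency table of counts
    return chi2_contingency(table)[0]

scores = np.array([chi2_stat(X[:, j], y) for j in range(p)])
screened = np.argsort(scores)[::-1][:d]  # indices of the d highest-scoring features
print(screened[:10])
```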
Luck, Margaux; Bertho, Gildas; Bateson, Mathilde; Karras, Alexandre; Yartseva, Anastasia; Thervet, Eric
2016-01-01
1H Nuclear Magnetic Resonance (NMR)-based metabolic profiling is very promising for the diagnosis of the stages of chronic kidney disease (CKD). Because of the high dimension of NMR spectra datasets and the complex mixture of metabolites in biological samples, the identification of discriminant biomarkers of a disease is challenging. None of the widely used chemometric methods in NMR metabolomics performs a local exhaustive exploration of the data. We developed a descriptive and easily understandable approach that searches for discriminant local phenomena using an original exhaustive rule-mining algorithm in order to predict two groups of patients: 1) patients having low to mild CKD stages with no renal failure and 2) patients having moderate to established CKD stages with renal failure. Our predictive algorithm explores the m-dimensional variable space to capture the local overdensities of the two groups of patients in the form of easily interpretable rules. Afterwards, an L2-penalized logistic regression on the discriminant rules was used to build predictive models of the CKD stages. We explored a complex multi-source dataset that included the clinical, demographic, clinical chemistry, renal pathology and urine metabolomic data of a cohort of 110 patients. Given this multi-source dataset and the complex nature of metabolomic data, we analyzed 1- and 2-dimensional rules in order to integrate the information carried by the interactions between the variables. The results indicated that our local algorithm is a valuable analytical method for the precise characterization of multivariate CKD stage profiles and is as efficient as the classical global model using chi2 variable selection, with approximately 70% correct classification. The resulting predictive models predominantly identify urinary metabolites (such as 3-hydroxyisovalerate, carnitine, citrate, dimethylsulfone, creatinine and N-methylnicotinamide) as relevant variables, indicating that CKD significantly affects the urinary metabolome. In addition, knowledge of urinary metabolite concentrations alone correctly classifies the patients' CKD stage. PMID:27861591
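A minimal sketch of the final modelling step described above, assuming a hypothetical binary rule matrix produced by the rule-mining stage; the L2-penalized logistic regression is from scikit-learn and the data are placeholders.

```python
# Hedged sketch of the final modelling step described above: an L2-penalized
# logistic regression fitted on binary "rule" indicators (1 if a patient falls
# inside a discriminant local region, 0 otherwise). The rule matrix is a
# hypothetical placeholder for the output of the exhaustive rule-mining step.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
rules = rng.integers(0, 2, size=(110, 40))   # 110 patients x 40 mined rules
stage = rng.integers(0, 2, size=110)         # 0 = no renal failure, 1 = renal failure

clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
clf.fit(rules, stage)
print(clf.score(rules, stage))               # training accuracy of the rule-based model
```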
DOE Office of Scientific and Technical Information (OSTI.GOV)
Moriya, S; National Cancer Center, Kashiwa, Chiba; Tachibana, H
Purpose: A daily CT-based three-dimensional image-guided and adaptive (CTIGRT-ART) proton therapy system was designed and developed. We also evaluated the effectiveness of the CTIGRT-ART approach. Methods: Retrospective analysis was performed in three lung cancer patients. Proton treatment planning was performed using CT image datasets acquired with a Toshiba Aquilion ONE. The planning target volume and surrounding organs were contoured by a well-trained radiation oncologist. Dose distribution was optimized using two fields (180 deg. and 270 deg.) in passive scattering proton therapy. A well-commissioned simplified Monte Carlo algorithm was used as the dose calculation engine. Daily consecutive CT image datasets were acquired with an in-room CT (Toshiba Aquilion LB). In our in-house program, two image registrations, for bone and tumor, were performed to shift the isocenter using the treatment CT image dataset. Subsequently, dose recalculation was performed after the shift of the isocenter. When the dose distribution after tumor registration exhibited a change in the CTV D90% dosimetric parameter compared to the initial plan, an additional step was performed in which the range shifter thickness was optimized. CTV D90% for the bone registration, the tumor registration only, and the adaptive plan with tumor registration was compared to the initial plan. Results: With bone registration, tumor dose coverage was decreased by 16% on average (maximum: 56%). Tumor registration showed better coverage than bone registration; however, the coverage was still decreased by 9% (maximum: 22%). The adaptive plan showed tumor dose coverage similar to the initial plan (average: 2%, maximum: 7%). Conclusion: There is a high possibility that image registration alone, whether for bone or tumor, may reduce tumor coverage. Thus, our proposed methodology of image guidance and adaptive planning using range adaptation after tumor registration would be effective for proton therapy. This research is partially supported by the Japan Agency for Medical Research and Development (AMED).
A Probabilistic Embedding Clustering Method for Urban Structure Detection
NASA Astrophysics Data System (ADS)
Lin, X.; Li, H.; Zhang, Y.; Gao, L.; Zhao, L.; Deng, M.
2017-09-01
Urban structure detection is a basic task in urban geography. Clustering is a core technology for detecting patterns of urban spatial structure, urban functional regions, and so on. In the big data era, diverse urban sensing datasets recording information such as human behaviour and social activity suffer from both high dimensionality and high noise. Unfortunately, state-of-the-art clustering methods do not handle high dimensionality and high noise concurrently. In this paper, a probabilistic embedding clustering method is proposed. Firstly, we introduce a Probabilistic Embedding Model (PEM) that finds latent features in high-dimensional urban sensing data by "learning" via a probabilistic model. The latent features capture the essential patterns hidden in the high-dimensional data, while the probabilistic model reduces the uncertainty caused by high noise. Secondly, by tuning the parameters, our model can discover two kinds of urban structure, homophily and structural equivalence, i.e., communities with intensive interaction or communities playing the same roles in the urban structure. Experiments with real-world data from Shanghai (China) confirmed that our method discovers both kinds of structure.
NASA Astrophysics Data System (ADS)
Hamrouni, Sameh; Rougon, Nicolas; Prêteux, Françoise
2011-03-01
In perfusion MRI (p-MRI) exams, short-axis (SA) image sequences are captured at multiple slice levels along the long-axis of the heart during the transit of a vascular contrast agent (Gd-DTPA) through the cardiac chambers and muscle. Compensating cardio-thoracic motions is a requirement for enabling computer-aided quantitative assessment of myocardial ischaemia from contrast-enhanced p-MRI sequences. The classical paradigm consists of registering each sequence frame on a reference image using some intensity-based matching criterion. In this paper, we introduce a novel unsupervised method for the spatio-temporal groupwise registration of cardiac p-MRI exams based on normalized mutual information (NMI) between high-dimensional feature distributions. Here, local contrast enhancement curves are used as a dense set of spatio-temporal features, and statistically matched through variational optimization to a target feature distribution derived from a registered reference template. The hard issue of probability density estimation in high-dimensional state spaces is bypassed by using consistent geometric entropy estimators, allowing NMI to be computed directly from feature samples. Specifically, a computationally efficient kth-nearest neighbor (kNN) estimation framework is retained, leading to closed-form expressions for the gradient flow of NMI over finite- and infinite-dimensional motion spaces. This approach is applied to the groupwise alignment of cardiac p-MRI exams using a free-form Deformation (FFD) model for cardio-thoracic motions. Experiments on simulated and natural datasets suggest its accuracy and robustness for registering p-MRI exams comprising more than 30 frames.
NASA Astrophysics Data System (ADS)
Sankey, T.; Donald, J.; McVay, J.
2015-12-01
High resolution remote sensing images and datasets are typically acquired at a large cost, which poses a big challenge for many scientists. Northern Arizona University recently acquired a custom-engineered, cutting-edge UAV and we can now generate our own images with the instrument. The UAV has a unique capability to carry a large payload including a hyperspectral sensor, which images the Earth's surface in over 350 spectral bands at 5 cm resolution, and a lidar scanner, which images the land surface and vegetation in three dimensions. Both sensors represent the newest available technology with very high resolution, precision, and accuracy. Using the UAV sensors, we are monitoring the effects of regional forest restoration treatment efforts. Individual tree canopy width and height are measured in the field and via the UAV sensors. The high-resolution UAV images are then used to segment individual tree canopies and to derive 3-dimensional estimates. The UAV image-derived variables are then correlated to the field-based measurements and scaled to satellite-derived tree canopy measurements. The relationships between the field-based and UAV-derived estimates are then extrapolated to a larger area to scale the tree canopy dimensions and to estimate tree density within restored and control forest sites.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hazelaar, Colien, E-mail: c.hazelaar@vumc.nl; Dahele, Max; Mostafavi, Hassan
Purpose: Spine stereotactic body radiation therapy (SBRT) requires highly accurate positioning. We report our experience with markerless template matching and triangulation of kilovoltage images routinely acquired during spine SBRT, to determine spine position. Methods and Materials: Kilovoltage images, continuously acquired at 7, 11 or 15 frames/s during volumetric modulated spine SBRT of 18 patients, consisting of 93 fluoroscopy datasets (1 dataset/arc), were analyzed off-line. Four patients were immobilized in a head/neck mask, 14 had no immobilization. Two-dimensional (2D) templates were created for each gantry angle from planning computed tomography data and registered to prefiltered kilovoltage images to determine 2D shifts between actual and planned spine position. Registrations were considered valid if the normalized cross correlation score was ≥0.15. Multiple registrations were triangulated to determine 3D position. For each spine position dataset, average positional offset and standard deviation were calculated. To verify the accuracy and precision of the technique, mean positional offset and standard deviation for twenty stationary phantom datasets with different baseline shifts were measured. Results: For the phantom, average standard deviations were 0.18 mm for left-right (LR), 0.17 mm for superior-inferior (SI), and 0.23 mm for the anterior-posterior (AP) direction. Maximum difference in average detected and applied shift was 0.09 mm. For the 93 clinical datasets, the percentage of valid matched frames was, on average, 90.7% (range: 49.9-96.1%) per dataset. Average standard deviations for all datasets were 0.28, 0.19, and 0.28 mm for LR, SI, and AP, respectively. Spine position offsets were, on average, −0.05 (range: −1.58 to 2.18), −0.04 (range: −3.56 to 0.82), and −0.03 mm (range: −1.16 to 1.51), respectively. Average positional deviation was <1 mm in all directions in 92% of the arcs. Conclusions: Template matching and triangulation using kilovoltage images acquired during irradiation allows spine position detection with submillimeter accuracy at subsecond intervals. Although the majority of patients were not immobilized, most vertebrae were stable at the sub-mm level during spine SBRT delivery.
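A small illustrative sketch of the matching criterion mentioned above: a zero-mean normalized cross correlation score with the ≥0.15 validity cutoff, on hypothetical arrays; the clinical template-matching and triangulation software is not reproduced here.

```python
# Hedged sketch (not the clinical software): score one candidate alignment of a
# 2D spine template against a kilovoltage-frame patch with the normalized cross
# correlation, and accept the match only if the score is >= 0.15 as described.
import numpy as np

def ncc(patch: np.ndarray, template: np.ndarray) -> float:
    """Zero-mean normalized cross correlation between two equally sized patches."""
    a = patch - patch.mean()
    b = template - template.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

rng = np.random.default_rng(4)
template = rng.normal(size=(64, 64))
frame_patch = template + 0.5 * rng.normal(size=(64, 64))   # noisy kV patch

score = ncc(frame_patch, template)
valid = score >= 0.15                                       # validity criterion from the text
print(round(score, 3), valid)
```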
Data-driven approaches in the investigation of social perception
Adolphs, Ralph; Nummenmaa, Lauri; Todorov, Alexander; Haxby, James V.
2016-01-01
The complexity of social perception poses a challenge to traditional approaches to understand its psychological and neurobiological underpinnings. Data-driven methods are particularly well suited to tackling the often high-dimensional nature of stimulus spaces and of neural representations that characterize social perception. Such methods are more exploratory, capitalize on rich and large datasets, and attempt to discover patterns often without strict hypothesis testing. We present four case studies here: behavioural studies on face judgements, two neuroimaging studies of movies, and eyetracking studies in autism. We conclude with suggestions for particular topics that seem ripe for data-driven approaches, as well as caveats and limitations. PMID:27069045
Big geo data surface approximation using radial basis functions: A comparative study
NASA Astrophysics Data System (ADS)
Majdisova, Zuzana; Skala, Vaclav
2017-12-01
Approximation of scattered data is a frequent task in many engineering problems. The Radial Basis Function (RBF) approximation is appropriate for big scattered datasets in n-dimensional space. It is a non-separable approximation, as it is based on the distance between two points. This method leads to the solution of an overdetermined linear system of equations. In this paper the RBF approximation methods are briefly described, a new approach to the RBF approximation of big datasets is presented, and a comparison for different Compactly Supported RBFs (CS-RBFs) is made with respect to the accuracy of the computation. The proposed approach exploits the symmetry of the matrix, partitions the matrix into blocks, and uses data structures suited to storing the sparse matrix. The experiments are performed on synthetic and real datasets.
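A hedged sketch of the basic RBF least-squares formulation with a Wendland-type compactly supported RBF; the centres, support radius and data are hypothetical, and the paper's block-partitioning and sparse-storage optimizations are omitted.

```python
# Hedged sketch of RBF approximation as an overdetermined least-squares problem,
# using a Wendland-type compactly supported RBF on hypothetical 2D data.
import numpy as np

def wendland_c2(r):
    """Compactly supported RBF: (1 - r)^4 (4 r + 1) for r < 1, else 0."""
    r = np.clip(r, 0.0, 1.0)
    return (1.0 - r) ** 4 * (4.0 * r + 1.0)

rng = np.random.default_rng(5)
points = rng.uniform(size=(2000, 2))                 # scattered 2D sample points
values = np.sin(points[:, 0] * 6) * np.cos(points[:, 1] * 6)
centres = rng.uniform(size=(100, 2))                 # far fewer centres than points
support = 0.3                                        # support radius of the CS-RBF

dist = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
A = wendland_c2(dist / support)                      # 2000 x 100 collocation matrix
coeffs, *_ = np.linalg.lstsq(A, values, rcond=None)  # least-squares RBF coefficients
print(coeffs.shape)
```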
Zhao, Yan; Chang, Cheng; Qin, Peibin; Cao, Qichen; Tian, Fang; Jiang, Jing; Li, Xianyu; Yu, Wenfeng; Zhu, Yunping; He, Fuchu; Ying, Wantao; Qian, Xiaohong
2016-01-21
Human plasma is a readily available clinical sample that reflects the status of the body in normal physiological and disease states. Although the wide dynamic range and immense complexity of plasma proteins are obstacles, comprehensive proteomic analysis of human plasma is necessary for biomarker discovery and further verification. Various methods such as immunodepletion, protein equalization and hyper fractionation have been applied to reduce the influence of high-abundance proteins (HAPs) and to reduce the high level of complexity. However, the depth at which the human plasma proteome has been explored in a relatively short time frame has been limited, which impedes the transfer of proteomic techniques to clinical research. Development of an optimal strategy is expected to improve the efficiency of human plasma proteome profiling. Here, five three-dimensional strategies combining HAP depletion (the 1st dimension) and protein fractionation (the 2nd dimension), followed by LC-MS/MS analysis (the 3rd dimension) were developed and compared for human plasma proteome profiling. Pros and cons of the five strategies are discussed for two issues: HAP depletion and complexity reduction. Strategies A and B used proteome equalization and tandem Seppro IgY14 immunodepletion, respectively, as the first dimension. Proteome equalization (strategy A) was biased toward the enrichment of basic and low-molecular weight proteins and had limited ability to enrich low-abundance proteins. By tandem removal of HAPs (strategy B), the efficiency of HAP depletion was significantly increased, whereas more off-target proteins were subtracted simultaneously. In the comparison of complexity reduction, strategy D involved a deglycosylation step before high-pH RPLC separation. However, the increase in sequence coverage did not increase the protein number as expected. Strategy E introduced SDS-PAGE separation of proteins, and the results showed oversampling of HAPs and identification of fewer proteins. Strategy C combined single Seppro IgY14 immunodepletion, high-pH RPLC fractionation and LC-MS/MS analysis. It generated the largest dataset, containing 1544 plasma protein groups and 258 newly identified proteins in a 30-h-machine-time analysis, making it the optimum three-dimensional strategy in our study. Further analysis of the integrated data from the five strategies showed identical distribution patterns in terms of sequence features and GO functional analysis with the 1929-plasma-protein dataset, further supporting the reliability of our plasma protein identifications. The characterization of 20 cytokines in the concentration range from sub-nanograms/milliliter to micrograms/milliliter demonstrated the sensitivity of the current strategies. Copyright © 2015 Elsevier B.V. All rights reserved.
Fuzzy support vector machine for microarray imbalanced data classification
NASA Astrophysics Data System (ADS)
Ladayya, Faroh; Purnami, Santi Wulan; Irhamah
2017-11-01
DNA microarray data contain gene expression measurements with small sample sizes and a high number of features. Furthermore, imbalanced classes are a common problem in microarray data; this occurs when a dataset is dominated by a class that has significantly more instances than the minority classes. Therefore, a classification method is needed that addresses both high-dimensional and imbalanced data. The Support Vector Machine (SVM) is one of the classification methods capable of handling large or small samples, nonlinearity, high dimensionality, overfitting and local minima. SVM has been widely applied to DNA microarray data classification and it has been shown that SVM provides the best performance among other machine learning methods. However, imbalanced data remain a problem because SVM treats all samples with the same importance; thus, the results are biased against the minority class. To overcome the imbalanced data, the Fuzzy SVM (FSVM) is proposed. This method applies a fuzzy membership to each input point and reformulates the SVM such that different input points contribute differently to the classifier. Minority-class samples are given large fuzzy memberships, so FSVM pays more attention to them. Because DNA microarray data are high dimensional with a very large number of features, feature selection is first performed using the Fast Correlation-Based Filter (FCBF). In this study, SVM and FSVM are analyzed both with and without FCBF, and their classification performance is compared. Based on the overall results, FSVM on the selected features has the best classification performance compared to SVM.
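A hedged illustration of the core FSVM idea on hypothetical data; scikit-learn's per-sample weights are used as a stand-in for the full fuzzy-membership reformulation, and the FCBF feature-selection step is omitted.

```python
# Hedged illustration of the core FSVM idea (not the authors' code): give each
# sample a fuzzy membership and let it reweight the SVM loss so the minority
# class is not dominated. scikit-learn's sample_weight stands in for a full
# FSVM formulation; the FCBF feature-selection step is omitted.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = rng.normal(size=(100, 200))                # 100 samples, 200 gene features
y = np.array([0] * 90 + [1] * 10)              # heavily imbalanced classes

# Simple membership scheme: weight each class inversely to its frequency.
class_counts = np.bincount(y)
membership = 1.0 / class_counts[y]
membership /= membership.max()                 # memberships in (0, 1]

clf = SVC(kernel="rbf", C=10.0)
clf.fit(X, y, sample_weight=membership)
print(clf.score(X, y))
```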
SpectralNET – an application for spectral graph analysis and visualization
Forman, Joshua J; Clemons, Paul A; Schreiber, Stuart L; Haggarty, Stephen J
2005-01-01
Background Graph theory provides a computational framework for modeling a variety of datasets including those emerging from genomics, proteomics, and chemical genetics. Networks of genes, proteins, small molecules, or other objects of study can be represented as graphs of nodes (vertices) and interactions (edges) that can carry different weights. SpectralNET is a flexible application for analyzing and visualizing these biological and chemical networks. Results Available both as a standalone .NET executable and as an ASP.NET web application, SpectralNET was designed specifically with the analysis of graph-theoretic metrics in mind, a computational task not easily accessible using currently available applications. Users can choose either to upload a network for analysis using a variety of input formats, or to have SpectralNET generate an idealized random network for comparison to a real-world dataset. Whichever graph-generation method is used, SpectralNET displays detailed information about each connected component of the graph, including graphs of degree distribution, clustering coefficient by degree, and average distance by degree. In addition, extensive information about the selected vertex is shown, including degree, clustering coefficient, various distance metrics, and the corresponding components of the adjacency, Laplacian, and normalized Laplacian eigenvectors. SpectralNET also displays several graph visualizations, including a linear dimensionality reduction for uploaded datasets (Principal Components Analysis) and a non-linear dimensionality reduction that provides an elegant view of global graph structure (Laplacian eigenvectors). Conclusion SpectralNET provides an easily accessible means of analyzing graph-theoretic metrics for data modeling and dimensionality reduction. SpectralNET is publicly available as both a .NET application and an ASP.NET web application from . Source code is available upon request. PMID:16236170
SpectralNET--an application for spectral graph analysis and visualization.
Forman, Joshua J; Clemons, Paul A; Schreiber, Stuart L; Haggarty, Stephen J
2005-10-19
Graph theory provides a computational framework for modeling a variety of datasets including those emerging from genomics, proteomics, and chemical genetics. Networks of genes, proteins, small molecules, or other objects of study can be represented as graphs of nodes (vertices) and interactions (edges) that can carry different weights. SpectralNET is a flexible application for analyzing and visualizing these biological and chemical networks. Available both as a standalone .NET executable and as an ASP.NET web application, SpectralNET was designed specifically with the analysis of graph-theoretic metrics in mind, a computational task not easily accessible using currently available applications. Users can choose either to upload a network for analysis using a variety of input formats, or to have SpectralNET generate an idealized random network for comparison to a real-world dataset. Whichever graph-generation method is used, SpectralNET displays detailed information about each connected component of the graph, including graphs of degree distribution, clustering coefficient by degree, and average distance by degree. In addition, extensive information about the selected vertex is shown, including degree, clustering coefficient, various distance metrics, and the corresponding components of the adjacency, Laplacian, and normalized Laplacian eigenvectors. SpectralNET also displays several graph visualizations, including a linear dimensionality reduction for uploaded datasets (Principal Components Analysis) and a non-linear dimensionality reduction that provides an elegant view of global graph structure (Laplacian eigenvectors). SpectralNET provides an easily accessible means of analyzing graph-theoretic metrics for data modeling and dimensionality reduction. SpectralNET is publicly available as both a .NET application and an ASP.NET web application from http://chembank.broad.harvard.edu/resources/. Source code is available upon request.
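For orientation, a hedged sketch of the kind of spectral quantity SpectralNET reports, computed with networkx/numpy on a toy random graph; this is not SpectralNET's own code, and the graph and its size are assumptions.

```python
# Hedged sketch of the spectral quantities described above, computed here with
# networkx/numpy on a toy random graph (not SpectralNET itself): the normalized
# Laplacian eigenvectors that drive the non-linear graph layout.
import networkx as nx
import numpy as np

G = nx.erdos_renyi_graph(n=50, p=0.1, seed=7)      # toy network of 50 nodes
L_norm = nx.normalized_laplacian_matrix(G).toarray()
eigvals, eigvecs = np.linalg.eigh(L_norm)

# Eigenvectors for the smallest non-trivial eigenvalues give 2D coordinates
# analogous to the Laplacian-eigenvector view of global graph structure.
coords = eigvecs[:, 1:3]
print(eigvals[:3], coords.shape)
```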
Benchmarking the mesoscale variability in global ocean eddy-permitting numerical systems
NASA Astrophysics Data System (ADS)
Cipollone, Andrea; Masina, Simona; Storto, Andrea; Iovino, Doroteaciro
2017-10-01
The role of data assimilation procedures in representing ocean mesoscale variability is assessed by applying eddy statistics to a state-of-the-art global ocean reanalysis (C-GLORS), a free global ocean simulation (performed with the NEMO system) and an observation-based dataset (ARMOR3D) used as an independent benchmark. Numerical results are computed on a 1/4° horizontal grid (ORCA025) and share the same resolution as the ARMOR3D dataset. This "eddy-permitting" resolution is sufficient to allow ocean eddies to form. Further to assessing the eddy statistics from the three different datasets, a global three-dimensional eddy detection system is implemented in order to bypass the need for region-dependent threshold definitions, typical of commonly adopted eddy detection algorithms. It thus provides full three-dimensional eddy statistics, segmenting vertical profiles from local rotational velocities. This criterion is crucial for discerning real eddies from the transient surface noise that inevitably affects any two-dimensional algorithm. Data assimilation enhances and corrects mesoscale variability for a wide range of features that cannot be well reproduced otherwise. The free simulation fairly reproduces eddies emerging from western boundary currents and deep baroclinic instabilities, while underestimating the shallower vortices that populate the full basin. The ocean reanalysis recovers most of the missing turbulence, shown by satellite products, that is not generated by the model itself and consistently projects surface variability deep into the water column. The comparison with the statistically reconstructed vertical profiles from ARMOR3D shows that ocean data assimilation is able to embed variability into the model dynamics, constraining eddies with in situ and altimetry observations and generating them consistently with the local environment.
Pataky, Todd C; Vanrenterghem, Jos; Robinson, Mark A
2015-05-01
Biomechanical processes are often manifested as one-dimensional (1D) trajectories. It has been shown that 1D confidence intervals (CIs) are biased when based on 0D statistical procedures, and the non-parametric 1D bootstrap CI has emerged in the Biomechanics literature as a viable solution. The primary purpose of this paper was to clarify that, for 1D biomechanics datasets, the distinction between 0D and 1D methods is much more important than the distinction between parametric and non-parametric procedures. A secondary purpose was to demonstrate that a parametric equivalent to the 1D bootstrap exists in the form of a random field theory (RFT) correction for multiple comparisons. To emphasize these points we analyzed six datasets consisting of force and kinematic trajectories in one-sample, paired, two-sample and regression designs. Results showed, first, that the 1D bootstrap and other 1D non-parametric CIs were qualitatively identical to RFT CIs, and all were very different from 0D CIs. Second, 1D parametric and 1D non-parametric hypothesis testing results were qualitatively identical for all six datasets. Last, we highlight the limitations of 1D CIs by demonstrating that they are complex, design-dependent, and thus non-generalizable. These results suggest that (i) analyses of 1D data based on 0D models of randomness are generally biased unless one explicitly identifies 0D variables before the experiment, and (ii) parametric and non-parametric 1D hypothesis testing provide an unambiguous framework for analysis when one's hypothesis explicitly or implicitly pertains to whole 1D trajectories. Copyright © 2015 Elsevier Ltd. All rights reserved.
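A minimal sketch of a non-parametric 1D bootstrap band on hypothetical trajectories; note that a pointwise percentile band like this understates simultaneous (whole-trajectory) coverage, which is exactly the 0D-vs-1D distinction the paper makes, and the RFT-corrected parametric alternative is not shown.

```python
# Hedged sketch of a non-parametric 1D bootstrap confidence band for a mean
# trajectory. The 20 simulated force curves of 101 time nodes are hypothetical.
import numpy as np

rng = np.random.default_rng(8)
t = np.linspace(0, 1, 101)
trials = np.sin(2 * np.pi * t) + 0.3 * rng.normal(size=(20, 101))   # 20 x 101 trajectories

n_boot = 2000
boot_means = np.empty((n_boot, t.size))
for b in range(n_boot):
    resample = trials[rng.integers(0, trials.shape[0], trials.shape[0])]
    boot_means[b] = resample.mean(axis=0)

# Pointwise 95% band; a simultaneous (whole-trajectory) band needs a further
# correction, which is the 0D-vs-1D issue discussed in the abstract above.
lower, upper = np.percentile(boot_means, [2.5, 97.5], axis=0)
print(lower.shape, upper.shape)
```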
A Fast, Open EEG Classification Framework Based on Feature Compression and Channel Ranking
Han, Jiuqi; Zhao, Yuwei; Sun, Hongji; Chen, Jiayun; Ke, Ang; Xu, Gesen; Zhang, Hualiang; Zhou, Jin; Wang, Changyong
2018-01-01
Superior feature extraction, channel selection and classification methods are essential for designing electroencephalography (EEG) classification frameworks. However, the performance of most frameworks is limited by their improper channel selection methods and overly specific designs, leading to high computational complexity, non-convergent procedures and narrow extensibility. In this paper, to remedy these drawbacks, we propose a fast, open EEG classification framework centered on EEG feature compression, low-dimensional representation, and convergent iterative channel ranking. First, to reduce the complexity, we use data clustering to compress the EEG features channel-wise, packing the high-dimensional EEG signals and endowing them with numerical signatures. Second, to provide easy access to alternative superior methods, we structurally represent each EEG trial in a feature vector with its corresponding numerical signature. Thus, the recorded signals of many trials shrink to a low-dimensional structural matrix compatible with most pattern recognition methods. Third, a series of effective iterative feature selection approaches with theoretical convergence is introduced to rank the EEG channels and remove redundant ones, further accelerating the EEG classification process and ensuring its stability. Finally, a classical linear discriminant analysis (LDA) model is employed to classify a single EEG trial with the selected channels. Experimental results on two real-world brain-computer interface (BCI) competition datasets demonstrate the promising performance of the proposed framework over state-of-the-art methods. PMID:29713262
Al-Aziz, Jameel; Christou, Nicolas; Dinov, Ivo D.
2011-01-01
The amount, complexity and provenance of data have dramatically increased in the past five years. Visualization of observed and simulated data is a critical component of any social, environmental, biomedical or scientific quest. Dynamic, exploratory and interactive visualization of multivariate data, without preprocessing by dimensionality reduction, remains a nearly insurmountable challenge. The Statistics Online Computational Resource (www.SOCR.ucla.edu) provides portable online aids for probability and statistics education, technology-based instruction and statistical computing. We have developed a new Java-based infrastructure, SOCR Motion Charts, for discovery-based exploratory analysis of multivariate data. This interactive data visualization tool enables the visualization of high-dimensional longitudinal data. SOCR Motion Charts allows mapping of ordinal, nominal and quantitative variables onto time, 2D axes, size, colors, glyphs and appearance characteristics, which facilitates the interactive display of multidimensional data. We validated this new visualization paradigm using several publicly available multivariate datasets including Ice-Thickness, Housing Prices, Consumer Price Index, and California Ozone Data. SOCR Motion Charts is designed using object-oriented programming, implemented as a Java Web-applet and is available to the entire community on the web at www.socr.ucla.edu/SOCR_MotionCharts. It can be used as an instructional tool for rendering and interrogating high-dimensional data in the classroom, as well as a research tool for exploratory data analysis. PMID:21479108
Matta, Ragai-Edward; Bergauer, Bastian; Adler, Werner; Wichmann, Manfred; Nickenig, Hans-Joachim
2017-06-01
The use of a surgical template is a well-established method in advanced implantology. In addition to conventional fabrication, computer-aided design and computer-aided manufacturing (CAD/CAM) work-flow provides an opportunity to engineer implant drilling templates via a three-dimensional printer. In order to transfer the virtual planning to the oral situation, a highly accurate surgical guide is needed. The aim of this study was to evaluate the impact of the fabrication method on the three-dimensional accuracy. The same virtual planning based on a scanned plaster model was used to fabricate a conventional thermo-formed and a three-dimensional printed surgical guide for each of 13 patients (single tooth implants). Both templates were acquired individually on the respective plaster model using an optical industrial white-light scanner (ATOS II, GOM mbh, Braunschweig, Germany), and the virtual datasets were superimposed. Using the three-dimensional geometry of the implant sleeve, the deviation between both surgical guides was evaluated. The mean discrepancy of the angle was 3.479° (standard deviation, 1.904°) based on data from 13 patients. Concerning the three-dimensional position of the implant sleeve, the highest deviation was in the Z-axis at 0.594 mm. The mean deviation of the Euclidian distance, dxyz, was 0.864 mm. Although the two different fabrication methods delivered statistically significantly different templates, the deviations ranged within a decimillimeter span. Both methods are appropriate for clinical use. Copyright © 2017 European Association for Cranio-Maxillo-Facial Surgery. Published by Elsevier Ltd. All rights reserved.
SPHERE Sheds New Light on the Collisional History of Main-belt Asteroids
NASA Astrophysics Data System (ADS)
Marsset, M.; Carry, B.; Pajuelo, M.; Viikinkoski, M.; Hanuš, J.; Vernazza, P.; Dumas, C.; Yang, B.
2017-09-01
The Spectro-Polarimetric High-contrast Exoplanet REsearch (SPHERE) instrument has unveiled unprecedented details of the three-dimensional shape, surface topography and cratering record of four medium-sized (≈200 km) asteroids, opening the prospect of a new era of ground-based exploration of the asteroid belt. Although two of the targets, (130) Elektra and (107) Camilla, have been observed extensively for more than fifteen years by the first-generation adaptive optics imagers, two new moonlets were discovered around these targets, illustrating the unique power of SPHERE. In the next two years SPHERE will continue to collect high-angular-resolution and high-contrast measurements of about 40 asteroids. These observations of a large number of asteroids will provide a unique dataset to better understand the collisional history and multiplicity rate of the asteroid belt.
Hoffmann, Nils; Keck, Matthias; Neuweger, Heiko; Wilhelm, Mathias; Högy, Petra; Niehaus, Karsten; Stoye, Jens
2012-08-27
Modern analytical methods in biology and chemistry use separation techniques coupled to sensitive detectors, such as gas chromatography-mass spectrometry (GC-MS) and liquid chromatography-mass spectrometry (LC-MS). These hyphenated methods provide high-dimensional data. Comparing such data manually to find corresponding signals is a laborious task, as each experiment usually consists of thousands of individual scans, each containing hundreds or even thousands of distinct signals. In order to allow for successful identification of metabolites or proteins within such data, especially in the context of metabolomics and proteomics, an accurate alignment and matching of corresponding features between two or more experiments is required. Such a matching algorithm should capture fluctuations in the chromatographic system which lead to non-linear distortions on the time axis, as well as systematic changes in recorded intensities. Many different algorithms for the retention time alignment of GC-MS and LC-MS data have been proposed and published, but all of them focus either on aligning previously extracted peak features or on aligning and comparing the complete raw data containing all available features. In this paper we introduce two algorithms for retention time alignment of multiple GC-MS datasets: multiple alignment by bidirectional best hits peak assignment and cluster extension (BIPACE) and center-star multiple alignment by pairwise partitioned dynamic time warping (CeMAPP-DTW). We show how the similarity-based peak group matching method BIPACE may be used for multiple alignment calculation individually and how it can be used as a preprocessing step for the pairwise alignments performed by CeMAPP-DTW. We evaluate the algorithms individually and in combination on a previously published small GC-MS dataset studying the Leishmania parasite and on a larger GC-MS dataset studying grains of wheat (Triticum aestivum). We have shown that BIPACE achieves very high precision and recall and a very low number of false positive peak assignments on both evaluation datasets. CeMAPP-DTW finds a high number of true positives when executed on its own, but achieves even better results when BIPACE is used to constrain its search space. The source code of both algorithms is included in the OpenSource software framework Maltcms, which is available from http://maltcms.sf.net. The evaluation scripts of the present study are available from the same source.
2012-01-01
Background Modern analytical methods in biology and chemistry use separation techniques coupled to sensitive detectors, such as gas chromatography-mass spectrometry (GC-MS) and liquid chromatography-mass spectrometry (LC-MS). These hyphenated methods provide high-dimensional data. Comparing such data manually to find corresponding signals is a laborious task, as each experiment usually consists of thousands of individual scans, each containing hundreds or even thousands of distinct signals. In order to allow for successful identification of metabolites or proteins within such data, especially in the context of metabolomics and proteomics, an accurate alignment and matching of corresponding features between two or more experiments is required. Such a matching algorithm should capture fluctuations in the chromatographic system which lead to non-linear distortions on the time axis, as well as systematic changes in recorded intensities. Many different algorithms for the retention time alignment of GC-MS and LC-MS data have been proposed and published, but all of them focus either on aligning previously extracted peak features or on aligning and comparing the complete raw data containing all available features. Results In this paper we introduce two algorithms for retention time alignment of multiple GC-MS datasets: multiple alignment by bidirectional best hits peak assignment and cluster extension (BIPACE) and center-star multiple alignment by pairwise partitioned dynamic time warping (CeMAPP-DTW). We show how the similarity-based peak group matching method BIPACE may be used for multiple alignment calculation individually and how it can be used as a preprocessing step for the pairwise alignments performed by CeMAPP-DTW. We evaluate the algorithms individually and in combination on a previously published small GC-MS dataset studying the Leishmania parasite and on a larger GC-MS dataset studying grains of wheat (Triticum aestivum). Conclusions We have shown that BIPACE achieves very high precision and recall and a very low number of false positive peak assignments on both evaluation datasets. CeMAPP-DTW finds a high number of true positives when executed on its own, but achieves even better results when BIPACE is used to constrain its search space. The source code of both algorithms is included in the OpenSource software framework Maltcms, which is available from http://maltcms.sf.net. The evaluation scripts of the present study are available from the same source. PMID:22920415
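A minimal dynamic-time-warping sketch to illustrate the pairwise warping that CeMAPP-DTW builds on; the actual Maltcms implementation is in Java and adds partitioning, anchoring on BIPACE peak groups and multiple-alignment logic not shown here.

```python
# Minimal dynamic-time-warping sketch to illustrate the pairwise alignment idea
# behind CeMAPP-DTW (not the Maltcms code). The two traces are hypothetical.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(len(a)*len(b)) DTW on two 1D intensity profiles."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

x = np.sin(np.linspace(0, 6, 80))            # a chromatographic trace
y = np.sin(np.linspace(0.5, 6.5, 100))       # the same trace, stretched and shifted
print(round(dtw_distance(x, y), 3))
```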
Hidden discriminative features extraction for supervised high-order time series modeling.
Nguyen, Ngoc Anh Thi; Yang, Hyung-Jeong; Kim, Sunhee
2016-11-01
In this paper, an orthogonal Tucker-decomposition-based extraction of high-order discriminative subspaces from a tensor-based time series data structure is presented, named Tensor Discriminative Feature Extraction (TDFE). TDFE relies on the employment of category information for the maximization of the between-class scatter and the minimization of the within-class scatter to extract optimal hidden discriminative feature subspaces that are simultaneously spanned by every modality for supervised tensor modeling. In this context, the proposed tensor-decomposition method provides the following benefits: i) reduces dimensionality while robustly mining the underlying discriminative features, ii) results in effective interpretable features that lead to an improved classification and visualization, and iii) reduces the processing time during the training stage and the filtering of the projection by solving the generalized eigenvalue issue at each alternation step. Two real third-order tensor-structures of time series datasets (an epilepsy electroencephalogram (EEG) that is modeled as channel×frequency bin×time frame and a microarray dataset that is modeled as gene×sample×time) were used for the evaluation of the TDFE. The experimental results corroborate the advantages of the proposed method with averages of 98.26% and 89.63% for the classification accuracies of the epilepsy dataset and the microarray dataset, respectively. These performance averages represent an improvement on those of the matrix-based algorithms and recent tensor-based, discriminant-decomposition approaches; this is especially the case considering the small number of samples that are used in practice. Copyright © 2016 Elsevier Ltd. All rights reserved.
3-D QSARS FOR RANKING AND PRIORITIZATION OF LARGE CHEMICAL DATASETS: AN EDC CASE STUDY
The COmmon REactivity Pattern (COREPA) approach is a three-dimensional structure-activity (3-D QSAR) technique that permits identification and quantification of specific global and local stereoelectronic characteristics associated with a chemical's biological activity. It goes bey...
Multidimensional Compressed Sensing MRI Using Tensor Decomposition-Based Sparsifying Transform
Yu, Yeyang; Jin, Jin; Liu, Feng; Crozier, Stuart
2014-01-01
Compressed Sensing (CS) has been applied in dynamic Magnetic Resonance Imaging (MRI) to accelerate the data acquisition without noticeably degrading the spatial-temporal resolution. A suitable sparsity basis is one of the key components to successful CS applications. Conventionally, a multidimensional dataset in dynamic MRI is treated as a series of two-dimensional matrices, and then various matrix/vector transforms are used to explore the image sparsity. Traditional methods typically sparsify the spatial and temporal information independently. In this work, we propose a novel concept of tensor sparsity for the application of CS in dynamic MRI, and present the Higher-order Singular Value Decomposition (HOSVD) as a practical example. Applications to three- and four-dimensional MRI data demonstrate that HOSVD simultaneously exploits the correlations within the spatial and temporal dimensions. Validations based on cardiac datasets indicate that the proposed method achieved reconstruction accuracy comparable to the low-rank matrix recovery methods and outperformed the conventional sparse recovery methods. PMID:24901331
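A hedged numpy sketch of the HOSVD itself on a toy tensor (factor matrices from SVDs of the mode unfoldings, core from projection onto those factors); the CS reconstruction pipeline of the paper is not reproduced, and the tensor shape is an assumption.

```python
# Hedged numpy sketch of the Higher-order SVD used as the sparsifying transform:
# factor matrices come from SVDs of the mode unfoldings, and the (compressible)
# core is obtained by projecting the tensor onto those factors. Toy data only.
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_dot(T, M, mode):
    """Multiply tensor T by matrix M along the given mode."""
    T_moved = np.moveaxis(T, mode, 0)
    out = np.tensordot(M, T_moved, axes=(1, 0))
    return np.moveaxis(out, 0, mode)

rng = np.random.default_rng(9)
X = rng.normal(size=(32, 32, 16))                      # e.g. x, y, time frames

factors = [np.linalg.svd(unfold(X, n), full_matrices=False)[0] for n in range(X.ndim)]
core = X
for n, U in enumerate(factors):
    core = mode_dot(core, U.T, n)                      # project onto each factor

X_rec = core
for n, U in enumerate(factors):
    X_rec = mode_dot(X_rec, U, n)                      # reconstruction check
print(np.allclose(X, X_rec))                           # True: full-rank HOSVD is lossless
```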
Fukunaga-Koontz transform based dimensionality reduction for hyperspectral imagery
NASA Astrophysics Data System (ADS)
Ochilov, S.; Alam, M. S.; Bal, A.
2006-05-01
The Fukunaga-Koontz Transform (FKT) based technique offers some attractive properties for desired-class-oriented dimensionality reduction in hyperspectral imagery. In the FKT, feature selection is performed by transforming into a new space where the feature classes have complementary eigenvectors. The dimensionality reduction technique based on this complementary eigenvector analysis can be described in terms of two classes, the desired class and background clutter, such that each basis function best represents one class while carrying the least amount of information about the second class. By selecting a few eigenvectors that are most relevant to the desired class, one can reduce the dimension of the hyperspectral cube. Since the FKT-based technique reduces the data size, it provides significant advantages for near-real-time detection applications in hyperspectral imagery. Furthermore, the eigenvector selection approach significantly reduces the computational burden via the dimensionality reduction process. The performance of the proposed dimensionality reduction algorithm has been tested using a real-world hyperspectral dataset.
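A hedged sketch of the two-class Fukunaga-Koontz construction on hypothetical spectra: whiten the summed scatter so the two whitened scatters share eigenvectors whose eigenvalues sum to one, then keep the most target-like directions. Class sizes, band count and the number of retained eigenvectors are assumptions, not values from the paper.

```python
# Hedged sketch of the Fukunaga-Koontz transform for a two-class
# (target vs. background clutter) dimensionality reduction on toy spectra.
import numpy as np

rng = np.random.default_rng(10)
target = rng.normal(size=(200, 50))          # 200 target pixels, 50 bands
clutter = rng.normal(size=(2000, 50)) * 2.0  # background clutter pixels

S1 = np.cov(target, rowvar=False)
S2 = np.cov(clutter, rowvar=False)

# Whiten the summed scatter so that S1_hat + S2_hat = I.
vals, vecs = np.linalg.eigh(S1 + S2)
P = vecs / np.sqrt(vals)                     # whitening matrix
S1_hat = P.T @ S1 @ P

# Shared eigenvectors: eigenvalues near 1 favour the target class,
# eigenvalues near 0 favour the clutter class (they sum to 1 pairwise).
lam, V = np.linalg.eigh(S1_hat)
order = np.argsort(lam)[::-1]
W = (P @ V)[:, order[:5]]                    # keep 5 most target-like directions

reduced = target @ W                         # dimensionality-reduced target pixels
print(reduced.shape)
```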
Lie, Octavian V; van Mierlo, Pieter
2017-01-01
The visual interpretation of intracranial EEG (iEEG) is the standard method used in complex epilepsy surgery cases to map the regions of seizure onset targeted for resection. Still, visual iEEG analysis is labor-intensive and biased due to interpreter dependency. Multivariate parametric functional connectivity measures using adaptive autoregressive (AR) modeling of the iEEG signals based on the Kalman filter algorithm have been used successfully to localize the electrographic seizure onsets. Due to their high computational cost, these methods have been applied to a limited number of iEEG time-series (<60). The aim of this study was to test two Kalman filter implementations, a well-known multivariate adaptive AR model (Arnold et al. 1998) and a simplified, computationally efficient derivation of it, for their potential application to connectivity analysis of high-dimensional (up to 192 channels) iEEG data. When used on simulated seizures together with a multivariate connectivity estimator, the partial directed coherence, the two AR models were compared for their ability to reconstitute the designed seizure signal connections from noisy data. Next, focal seizures from iEEG recordings (73-113 channels) in three patients rendered seizure-free after surgery were mapped with the outdegree, a graph-theory index of outward directed connectivity. Simulation results indicated high levels of mapping accuracy for the two models in the presence of low-to-moderate noise cross-correlation. Accordingly, both AR models correctly mapped the real seizure onset to the resection volume. This study supports the possibility of conducting fully data-driven multivariate connectivity estimations on high-dimensional iEEG datasets using the Kalman filter approach.
Williams, Alex H; Kim, Tony Hyun; Wang, Forea; Vyas, Saurabh; Ryu, Stephen I; Shenoy, Krishna V; Schnitzer, Mark; Kolda, Tamara G; Ganguli, Surya
2018-06-27
Perceptions, thoughts, and actions unfold over millisecond timescales, while learned behaviors can require many days to mature. While recent experimental advances enable large-scale and long-term neural recordings with high temporal fidelity, it remains a formidable challenge to extract unbiased and interpretable descriptions of how rapid single-trial circuit dynamics change slowly over many trials to mediate learning. We demonstrate that a simple tensor component analysis (TCA) can meet this challenge by extracting three interconnected, low-dimensional descriptions of neural data: neuron factors, reflecting cell assemblies; temporal factors, reflecting rapid circuit dynamics mediating perceptions, thoughts, and actions within each trial; and trial factors, describing both long-term learning and trial-to-trial changes in cognitive state. We demonstrate the broad applicability of TCA by revealing insights into diverse datasets derived from artificial neural networks, large-scale calcium imaging of rodent prefrontal cortex during maze navigation, and multielectrode recordings of macaque motor cortex during brain machine interface learning. Copyright © 2018 Elsevier Inc. All rights reserved.
Topological patterns of mesh textures in serpentinites
NASA Astrophysics Data System (ADS)
Miyazawa, M.; Suzuki, A.; Shimizu, H.; Okamoto, A.; Hiraoka, Y.; Obayashi, I.; Tsuji, T.; Ito, T.
2017-12-01
Serpentinization is a hydration process that forms serpentine minerals and magnetite within the oceanic lithosphere. Microfractures crosscut these minerals during the reactions, and the structures look like mesh textures. It is known that the patterns of microfractures and the evolution of the system are affected by the hydration reaction and by fluid transport in fractures and within matrices. This study aims at quantifying the topological patterns of the mesh textures and understanding possible conditions of fluid transport and reaction during serpentinization in the oceanic lithosphere. Two-dimensional simulations with the distinct element method (DEM) generate fracture patterns due to serpentinization. The microfracture patterns are evaluated by persistent homology, which measures features of connected components of a topological space and encodes multi-scale topological features in persistence diagrams. The persistence diagrams of the different mesh textures are evaluated by principal component analysis to bring out their dominant patterns. This approach helps extract feature values of fracture patterns from high-dimensional and complex datasets.
A Generic multi-dimensional feature extraction method using multiobjective genetic programming.
Zhang, Yang; Rockett, Peter I
2009-01-01
In this paper, we present a generic feature extraction method for pattern classification using multiobjective genetic programming. This not only evolves the (near-)optimal set of mappings from a pattern space to a multi-dimensional decision space, but also simultaneously optimizes the dimensionality of that decision space. The presented framework evolves vector-to-vector feature extractors that maximize class separability. We demonstrate the efficacy of our approach by making statistically-founded comparisons with a wide variety of established classifier paradigms over a range of datasets and find that for most of the pairwise comparisons, our evolutionary method delivers statistically smaller misclassification errors. At very worst, our method displays no statistical difference in a few pairwise comparisons with established classifier/dataset combinations; crucially, none of the misclassification results produced by our method is worse than any comparator classifier. Although principally focused on feature extraction, feature selection is also performed as an implicit side effect; we show that both feature extraction and selection are important to the success of our technique. The presented method has the practical consequence of obviating the need to exhaustively evaluate a large family of conventional classifiers when faced with a new pattern recognition problem in order to attain a good classification accuracy.
CT-based manual segmentation and evaluation of paranasal sinuses.
Pirner, S; Tingelhoff, K; Wagner, I; Westphal, R; Rilk, M; Wahl, F M; Bootz, F; Eichhorn, Klaus W G
2009-04-01
Manual segmentation of computed tomography (CT) datasets was performed for robot-assisted endoscope movement during functional endoscopic sinus surgery (FESS). Segmented 3D models are needed for the robots' workspace definition. A total of 50 preselected CT datasets were each segmented in 150-200 coronal slices with 24 landmarks being set. Three different colors for segmentation represent diverse risk areas. Extension and volumetric measurements were performed. Three-dimensional reconstruction was generated after segmentation. Manual segmentation took 8-10 h for each CT dataset. The mean volumes were: right maxillary sinus 17.4 cm(3), left side 17.9 cm(3), right frontal sinus 4.2 cm(3), left side 4.0 cm(3), total frontal sinuses 7.9 cm(3), sphenoid sinus right side 5.3 cm(3), left side 5.5 cm(3), total sphenoid sinus volume 11.2 cm(3). Our manually segmented 3D-models present the patient's individual anatomy with a special focus on structures in danger according to the diverse colored risk areas. For safe robot assistance, the high-accuracy models represent an average of the population for anatomical variations, extension and volumetric measurements. They can be used as a database for automatic model-based segmentation. None of the segmentation methods so far described provide risk segmentation. The robot's maximum distance to the segmented border can be adjusted according to the differently colored areas.
Lepre, Jorge; Rice, J Jeremy; Tu, Yuhai; Stolovitzky, Gustavo
2004-05-01
Despite the growing literature devoted to finding differentially expressed genes in assays probing different tissues types, little attention has been paid to the combinatorial nature of feature selection inherent to large, high-dimensional gene expression datasets. New flexible data analysis approaches capable of searching relevant subgroups of genes and experiments are needed to understand multivariate associations of gene expression patterns with observed phenotypes. We present in detail a deterministic algorithm to discover patterns of multivariate gene associations in gene expression data. The patterns discovered are differential with respect to a control dataset. The algorithm is exhaustive and efficient, reporting all existent patterns that fit a given input parameter set while avoiding enumeration of the entire pattern space. The value of the pattern discovery approach is demonstrated by finding a set of genes that differentiate between two types of lymphoma. Moreover, these genes are found to behave consistently in an independent dataset produced in a different laboratory using different arrays, thus validating the genes selected using our algorithm. We show that the genes deemed significant in terms of their multivariate statistics will be missed using other methods. Our set of pattern discovery algorithms including a user interface is distributed as a package called Genes@Work. This package is freely available to non-commercial users and can be downloaded from our website (http://www.research.ibm.com/FunGen).
NASA Astrophysics Data System (ADS)
Shah, Syed Muhammad Saqlain; Batool, Safeera; Khan, Imran; Ashraf, Muhammad Usman; Abbas, Syed Hussnain; Hussain, Syed Adnan
2017-09-01
Automatic diagnosis of human diseases is mostly achieved through decision support systems. The performance of these systems depends mainly on the selection of the most relevant features. This becomes harder when the dataset contains missing values for the different features. Probabilistic Principal Component Analysis (PPCA) is well suited to dealing with the problem of missing attribute values. This research presents a methodology that uses the results of medical tests as input, extracts a reduced-dimensional feature subset and provides a diagnosis of heart disease. The proposed methodology extracts high-impact features in a new projection by using Probabilistic Principal Component Analysis (PPCA). PPCA extracts the projection vectors that contribute the highest covariance, and these projection vectors are used to reduce the feature dimension. The selection of projection vectors is done through Parallel Analysis (PA). The reduced-dimensional feature subset is provided to a radial basis function (RBF) kernel-based Support Vector Machine (SVM). The RBF-based SVM performs classification into two categories, i.e., Heart Patient (HP) and Normal Subject (NS). The proposed methodology is evaluated through accuracy, specificity and sensitivity on three UCI datasets, i.e., Cleveland, Switzerland and Hungarian. The statistical results achieved by the proposed technique are presented in comparison to existing research, showing its impact. The proposed technique achieved an accuracy of 82.18%, 85.82% and 91.30% for the Cleveland, Hungarian and Switzerland datasets, respectively.
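A hedged sketch of the pipeline's overall shape on hypothetical data; scikit-learn's PCA stands in for probabilistic PCA, the parallel-analysis step is replaced by a fixed component count, and none of this is the authors' code.

```python
# Hedged sketch of the described pipeline's shape (not the authors' code):
# project clinical features onto a few principal directions, then classify
# with an RBF-kernel SVM. sklearn's PCA is a stand-in for probabilistic PCA,
# and the parallel-analysis choice of components is replaced by a fixed value.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(11)
X = rng.normal(size=(300, 13))            # e.g. 13 clinical attributes per subject
y = rng.integers(0, 2, size=300)          # 1 = heart patient, 0 = normal subject

model = make_pipeline(StandardScaler(), PCA(n_components=5), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())                      # cross-validated accuracy (chance level on toy data)
```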
DataPflex: a MATLAB-based tool for the manipulation and visualization of multidimensional datasets.
Hendriks, Bart S; Espelin, Christopher W
2010-02-01
DataPflex is a MATLAB-based application that facilitates the manipulation and visualization of multidimensional datasets. The strength of DataPflex lies in the intuitive graphical user interface for the efficient incorporation, manipulation and visualization of high-dimensional data that can be generated by multiplexed protein measurement platforms including, but not limited to, Luminex or Meso-Scale Discovery. Such data can generally be represented in the form of multidimensional datasets [for example (time x stimulation x inhibitor x inhibitor concentration x cell type x measurement)]. For cases where measurements are made in a combinatorial fashion across multiple dimensions, there is a need for a tool to efficiently manipulate and reorganize such data for visualization. DataPflex accepts data consisting of up to five arbitrary dimensions in addition to a measurement dimension. Data are imported from a simple .xls format and can be exported to MATLAB or .xls. Data dimensions can be reordered, subdivided, merged, normalized and visualized in the form of collections of line graphs, bar graphs, surface plots, heatmaps, IC50s and other custom plots. Open source implementation in MATLAB enables easy extension for custom plotting routines and integration with more sophisticated analysis tools. DataPflex is distributed under the GPL license (http://www.gnu.org/licenses/) together with documentation, source code and sample data files at: http://code.google.com/p/datapflex. Supplementary data are available at Bioinformatics online.
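As an illustration of the kind of reordering and normalization described above (not DataPflex itself, which is a MATLAB GUI), the following sketch uses xarray on a toy measurement cube; all dimension names, coordinates, and sizes are invented for the example.

```python
# Illustrative only: an xarray stand-in for reordering, baseline-normalizing, and
# reducing a multidimensional measurement cube (dimension names are invented).
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
data = xr.DataArray(
    rng.random((6, 3, 4, 5, 2)),
    dims=("time", "stimulation", "inhibitor", "concentration", "cell_type"),
    coords={"time": [0, 5, 10, 30, 60, 120]},  # minutes
    name="phospho_signal",
)

# Reorder dimensions, normalize to the t=0 baseline, then average over cell types.
reordered = data.transpose("cell_type", "inhibitor", "concentration", "stimulation", "time")
normalized = data / data.sel(time=0)
summary = normalized.mean(dim="cell_type")
print(summary.sizes)
```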
NASA Astrophysics Data System (ADS)
Mitra, S.; Dey, S.; Siddartha, G.; Bhattacharya, S.
2016-12-01
We estimate 1-dimensional path average fundamental mode group velocity dispersion curves from regional Rayleigh and Love waves sampling the Indian subcontinent. The path average measurements are combined through a tomographic inversion to obtain 2-dimensional group velocity variation maps between periods of 10 and 80 s. The region of study is parametrised as triangular grids with 1° sides for the tomographic inversion. Rayleigh and Love wave dispersion curves from each node point are subsequently extracted and jointly inverted to obtain a radially anisotropic shear wave velocity model through global optimisation using a Genetic Algorithm. The parametrization of the model space is done using three crustal layers and four mantle layers over a half-space with varying VpH, VsV and VsH. The anisotropic parameter (η) is calculated from empirical relations and the density of the layers is taken from PREM. Misfit for the model is calculated as an error-weighted sum over the dispersion curves. The 1-dimensional anisotropic shear wave velocity at each node point is combined using linear interpolation to obtain 3-dimensional structure beneath the region. Synthetic tests are performed to estimate the resolution of the tomographic maps, which will be presented with our results. We plan to extend this to a larger dataset in the near future to obtain high-resolution anisotropic shear wave velocity structure beneath India, the Himalaya and Tibet.
Kupinski, M. K.; Clarkson, E.
2015-01-01
We present a new method for computing optimized channels for channelized quadratic observers (CQO) that is feasible for high-dimensional image data. The method for calculating channels is applicable in general and optimal for Gaussian distributed image data. Gradient-based algorithms for determining the channels are presented for five different information-based figures of merit (FOMs). Analytic solutions for the optimum channels for each of the five FOMs are derived for the case of equal mean data for both classes. The optimum channels for three of the FOMs under the equal mean condition are shown to be the same. This result is critical since some of the FOMs are much easier to compute. Implementing the CQO requires a set of channels and the first- and second-order statistics of channelized image data from both classes. The dimensionality reduction from M measurements to L channels is a critical advantage of CQO since estimating image statistics from channelized data requires smaller sample sizes and inverting a smaller covariance matrix is easier. In a simulation study we compare the performance of ideal and Hotelling observers to CQO. The optimal CQO channels are calculated using both eigenanalysis and a new gradient-based algorithm for maximizing Jeffrey's divergence (J). Optimal channel selection without eigenanalysis makes the J-CQO on large-dimensional image data feasible. PMID:26366764
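To make the dimensionality-reduction step concrete, the sketch below shows only the channelization from M measurements to L channel outputs and a channelized Hotelling discriminant built from the reduced first- and second-order statistics. The channel matrix here is a random placeholder and the image data are Gaussian toys; the paper's gradient-based J-CQO channel optimization is not reproduced.

```python
# Sketch of the channelization step only: reduce M-dimensional images to L channel
# outputs and build a channelized Hotelling discriminant from the reduced statistics.
import numpy as np

rng = np.random.default_rng(0)
M, L, n = 4096, 10, 500                        # pixels per image, channels, images per class
T = rng.standard_normal((L, M)) / np.sqrt(M)   # placeholder (non-optimized) channel matrix

class0 = rng.standard_normal((n, M))           # class 0 images (signal absent)
class1 = rng.standard_normal((n, M)) + 0.05    # class 1 images (signal present)

v0, v1 = class0 @ T.T, class1 @ T.T            # channelized data, shape (n, L)
mean_diff = v1.mean(axis=0) - v0.mean(axis=0)
S = 0.5 * (np.cov(v0, rowvar=False) + np.cov(v1, rowvar=False))

w = np.linalg.solve(S, mean_diff)              # channelized Hotelling template (L x L solve)
scores = class1 @ T.T @ w                      # channelized test statistic for a set of images
print(scores.shape)
```

Estimating an L x L covariance instead of an M x M one is exactly the sample-size and inversion advantage the abstract describes.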
3D Printing of CT Dataset: Validation of an Open Source and Consumer-Available Workflow.
Bortolotto, Chandra; Eshja, Esmeralda; Peroni, Caterina; Orlandi, Matteo A; Bizzotto, Nicola; Poggi, Paolo
2016-02-01
The broad availability of cheap three-dimensional (3D) printing equipment has raised the need for a thorough analysis of its effects on clinical accuracy. Our aim is to determine whether the accuracy of the 3D printing process is affected by the use of a low-budget workflow based on open source software and consumer-grade, commercially available 3D printers. A group of test objects was scanned with a 64-slice computed tomography (CT) scanner in order to build their 3D copies. CT datasets were processed using a software chain based on three free and open-source software packages. Objects were printed out with a commercially available 3D printer. Both the 3D copies and the test objects were measured using a digital professional caliper. Overall, the mean absolute difference between test objects and 3D copies is 0.23 mm and the mean relative difference amounts to 0.55%. Our results demonstrate that the accuracy of the 3D printing process remains high despite the use of a low-budget workflow.
Knowledge discovery by accuracy maximization
Cacciatore, Stefano; Luchinat, Claudio; Tenori, Leonardo
2014-01-01
Here we describe KODAMA (knowledge discovery by accuracy maximization), an unsupervised and semisupervised learning algorithm that performs feature extraction from noisy and high-dimensional data. Unlike other data mining methods, the peculiarity of KODAMA is that it is driven by an integrated procedure of cross-validation of the results. The discovery of a local manifold’s topology is led by a classifier through a Monte Carlo procedure of maximization of cross-validated predictive accuracy. Briefly, our approach differs from previous methods in that it has an integrated procedure of validation of the results. In this way, the method ensures the highest robustness of the obtained solution. This robustness is demonstrated on experimental datasets of gene expression and metabolomics, where KODAMA compares favorably with other existing feature extraction methods. KODAMA is then applied to an astronomical dataset, revealing unexpected features. Interesting and not easily predictable features are also found in the analysis of the State of the Union speeches by American presidents: KODAMA reveals an abrupt linguistic transition sharply separating all post-Reagan from all pre-Reagan speeches. The transition occurs during Reagan’s presidency and not from its beginning. PMID:24706821
3D local feature BKD to extract road information from mobile laser scanning point clouds
NASA Astrophysics Data System (ADS)
Yang, Bisheng; Liu, Yuan; Dong, Zhen; Liang, Fuxun; Li, Bijun; Peng, Xiangyang
2017-08-01
Extracting road information from point clouds obtained through mobile laser scanning (MLS) is essential for autonomous vehicle navigation, and has hence garnered a growing amount of research interest in recent years. However, the performance of such systems is seriously degraded by varying point density and noise. This paper proposes a novel three-dimensional (3D) local feature called the binary kernel descriptor (BKD) to extract road information from MLS point clouds. The BKD consists of Gaussian kernel density estimation and binarization components to encode the shape and intensity information of the 3D point clouds that are fed to a random forest classifier to extract curbs and markings on the road. These are then used to derive road information, such as the number of lanes, the lane width, and intersections. In experiments, the precision and recall of the proposed feature for the detection of curbs and road markings on an urban dataset and a highway dataset were as high as 90%, thus showing that the BKD is accurate and robust against varying point density and noise.
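A rough, toy-scale sketch of the same idea (Gaussian-weighted neighborhood features, binarization, then a random forest) is shown below. The neighborhood radius, histogram ranges, and binarization rule are illustrative assumptions and do not reproduce the paper's BKD.

```python
# Toy stand-in for the BKD pipeline (not the paper's exact descriptor): build a
# Gaussian-weighted neighborhood histogram of height and intensity per point,
# binarize it, and classify points (curb / marking / other) with a random forest.
import numpy as np
from scipy.spatial import cKDTree
from sklearn.ensemble import RandomForestClassifier

def binary_kernel_features(xyz, intensity, radius=2.0, bins=8):
    tree = cKDTree(xyz[:, :2])
    feats = np.zeros((len(xyz), 2 * bins))
    for i, nbrs in enumerate(tree.query_ball_point(xyz[:, :2], r=radius)):
        nbrs = np.asarray(nbrs)
        w = np.exp(-np.sum((xyz[nbrs, :2] - xyz[i, :2]) ** 2, axis=1) / (2 * (radius / 3) ** 2))
        h_hist, _ = np.histogram(xyz[nbrs, 2] - xyz[i, 2], bins=bins, range=(-0.3, 0.3), weights=w)
        i_hist, _ = np.histogram(intensity[nbrs], bins=bins, range=(0.0, 1.0), weights=w)
        feats[i] = np.concatenate([h_hist, i_hist])
    return (feats > np.median(feats, axis=0)).astype(np.uint8)   # binarization step

rng = np.random.default_rng(0)
xyz = rng.random((2000, 3)) * [50, 50, 0.4]     # toy MLS points (x, y, z in metres)
intensity = rng.random(2000)
labels = rng.integers(0, 3, 2000)               # 0 = road, 1 = curb, 2 = marking (toy labels)

X = binary_kernel_features(xyz, intensity)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)
print(clf.score(X, labels))
```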
NASA Astrophysics Data System (ADS)
Candeo, Alessia; Sana, Ilenia; Ferrari, Eleonora; Maiuri, Luigi; D'Andrea, Cosimo; Valentini, Gianluca; Bassi, Andrea
2016-05-01
Light sheet fluorescence microscopy has proven to be a powerful tool to image fixed and chemically cleared samples, providing in-depth and high-resolution reconstructions of intact mouse organs. We applied light sheet microscopy to image the mouse intestine. We found that large portions of the sample can be readily visualized, assessing the organ status and highlighting the presence of regions with impaired morphology. Yet, three-dimensional (3-D) sectioning of the intestine leads to a large dataset that imposes an unnecessary storage and processing burden. We developed a routine that extracts the relevant information from a large image stack and provides quantitative analysis of the intestine morphology. This result was achieved by a three-step procedure consisting of: (1) virtually unfolding the 3-D reconstruction of the intestine; (2) observing it layer by layer; and (3) identifying distinct villi and statistically analyzing multiple samples belonging to different intestinal regions. Although the procedure was developed for the murine intestine, most of the underlying concepts have a general applicability.
Statistical Significance for Hierarchical Clustering
Kimes, Patrick K.; Liu, Yufeng; Hayes, D. Neil; Marron, J. S.
2017-01-01
Cluster analysis has proved to be an invaluable tool for the exploratory and unsupervised analysis of high dimensional datasets. Among methods for clustering, hierarchical approaches have enjoyed substantial popularity in genomics and other fields for their ability to simultaneously uncover multiple layers of clustering structure. A critical and challenging question in cluster analysis is whether the identified clusters represent important underlying structure or are artifacts of natural sampling variation. Few approaches have been proposed for addressing this problem in the context of hierarchical clustering, for which the problem is further complicated by the natural tree structure of the partition, and the multiplicity of tests required to parse the layers of nested clusters. In this paper, we propose a Monte Carlo based approach for testing statistical significance in hierarchical clustering which addresses these issues. The approach is implemented as a sequential testing procedure guaranteeing control of the family-wise error rate. Theoretical justification is provided for our approach, and its power to detect true clustering structure is illustrated through several simulation studies and applications to two cancer gene expression datasets. PMID:28099990
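The core Monte Carlo idea can be sketched for a single split, in the spirit of (but not identical to) the paper's sequential, family-wise-error-controlled procedure: compare the observed two-cluster index against the same index computed on Gaussian null data. The null model, number of simulations, and cluster index below are illustrative choices.

```python
# Minimal Monte Carlo significance test for one split only: compare the observed
# 2-means cluster index against a Gaussian null with matching marginal variances.
import numpy as np
from sklearn.cluster import KMeans

def cluster_index(X):
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    within = sum(((X[km.labels_ == k] - X[km.labels_ == k].mean(axis=0)) ** 2).sum() for k in (0, 1))
    total = ((X - X.mean(axis=0)) ** 2).sum()
    return within / total            # smaller value = stronger two-cluster structure

def monte_carlo_pvalue(X, n_sim=200, seed=0):
    rng = np.random.default_rng(seed)
    observed = cluster_index(X)
    sd = X.std(axis=0, ddof=1)
    null = [cluster_index(rng.standard_normal(X.shape) * sd) for _ in range(n_sim)]
    return (1 + np.sum(np.array(null) <= observed)) / (n_sim + 1)

X = np.vstack([np.random.default_rng(1).normal(0.0, 1, (40, 50)),
               np.random.default_rng(2).normal(1.5, 1, (40, 50))])
print(monte_carlo_pvalue(X))
```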
Manifold regularized multitask learning for semi-supervised multilabel image classification.
Luo, Yong; Tao, Dacheng; Geng, Bo; Xu, Chao; Maybank, Stephen J
2013-02-01
It is a significant challenge to classify images with multiple labels by using only a small number of labeled samples. One option is to learn a binary classifier for each label and use manifold regularization to improve the classification performance by exploring the underlying geometric structure of the data distribution. However, such an approach does not perform well in practice when images from multiple concepts are represented by high-dimensional visual features. Thus, manifold regularization is insufficient to control the model complexity. In this paper, we propose a manifold regularized multitask learning (MRMTL) algorithm. MRMTL learns a discriminative subspace shared by multiple classification tasks by exploiting the common structure of these tasks. It effectively controls the model complexity because different tasks limit one another's search volume, and the manifold regularization ensures that the functions in the shared hypothesis space are smooth along the data manifold. We conduct extensive experiments, on the PASCAL VOC'07 dataset with 20 classes and the MIR dataset with 38 classes, by comparing MRMTL with popular image classification algorithms. The results suggest that MRMTL is effective for image classification.
Diverse power iteration embeddings: Theory and practice
Huang, Hao; Yoo, Shinjae; Yu, Dantong; ...
2015-11-09
Manifold learning, especially spectral embedding, is known as one of the most effective learning approaches on high dimensional data, but for real-world applications it raises a serious computational burden in constructing spectral embeddings for large datasets. To overcome this computational complexity, we propose a novel efficient embedding construction, Diverse Power Iteration Embedding (DPIE). DPIE shows almost the same effectiveness as spectral embeddings and yet is three orders of magnitude faster than spectral embeddings computed from eigen-decomposition. Our DPIE is unique in that (1) it finds linearly independent embeddings and thus shows diverse aspects of the dataset; (2) the proposed regularized DPIE is effective if we need many embeddings; (3) we show how to efficiently orthogonalize DPIE if needed; and (4) the Diverse Power Iteration Value (DPIV) provides the importance of each DPIE like an eigenvalue. As a result, such various aspects of DPIE and DPIV ensure that our algorithm is easy to apply to various applications, and we also show the effectiveness and efficiency of DPIE on clustering, anomaly detection, and feature selection as our case studies.
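A simplified sketch of power-iteration embeddings is given below: iterate a row-normalized affinity matrix and keep successive iterates linearly independent by Gram-Schmidt orthogonalization. This is a stand-in for the general idea, not the authors' regularized DPIE or their DPIV importance scores; the Gaussian affinity, iteration count, and number of embeddings are assumptions.

```python
# Sketch of power-iteration embeddings on a row-normalized affinity matrix.
# Successive iterates are orthogonalized (Gram-Schmidt) to keep them linearly
# independent, a simplified stand-in for the paper's "diverse" embeddings.
import numpy as np

def power_iteration_embeddings(A, n_embed=3, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    W = A / A.sum(axis=1, keepdims=True)        # row-normalized affinity (random-walk matrix)
    embeddings = []
    for _ in range(n_embed):
        v = rng.standard_normal(A.shape[0])
        for _ in range(n_iter):
            v = W @ v
            for u in embeddings:                # remove components along earlier embeddings
                v -= (v @ u) * u
            v /= np.linalg.norm(v)
        embeddings.append(v)
    return np.column_stack(embeddings)          # (n_points, n_embed)

X = np.random.default_rng(1).random((300, 20))
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
A = np.exp(-d2 / d2.mean())                     # Gaussian affinity matrix
E = power_iteration_embeddings(A)
print(E.shape)
```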
A clustering algorithm for sample data based on environmental pollution characteristics
NASA Astrophysics Data System (ADS)
Chen, Mei; Wang, Pengfei; Chen, Qiang; Wu, Jiadong; Chen, Xiaoyun
2015-04-01
Environmental pollution has become an issue of serious international concern in recent years. Among the receptor-oriented pollution models, CMB, PMF, UNMIX, and PCA are widely used as source apportionment models. To improve the accuracy of source apportionment and classify the sample data for these models, this study proposes an easy-to-use, high-dimensional EPC algorithm that not only organizes all of the sample data into different groups according to the similarities in pollution characteristics such as pollution sources and concentrations but also simultaneously detects outliers. The main clustering process consists of selecting the first unlabelled point as the cluster centre, then assigning each data point in the sample dataset to its most similar cluster centre according to both the user-defined threshold and the value of the similarity function in each iteration, and finally modifying the clusters using a method similar to k-Means. The validity and accuracy of the algorithm are tested using both real and synthetic datasets, and the results show that the EPC algorithm is practical and effective for appropriately classifying sample data for source apportionment models and helpful for better understanding and interpreting the sources of pollution.
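The clustering loop described above can be sketched as follows: seed a cluster with the first unlabelled sample, attach all sufficiently similar samples, then refine centres k-means-style while flagging unassigned samples as outliers. The similarity function, threshold, and toy data below are illustrative choices, not the paper's exact definitions.

```python
# Sketch of the described loop: take the first unlabelled sample as a new centre,
# attach all sufficiently similar samples, then refine centres by averaging.
import numpy as np

def epc_like_clustering(X, threshold=0.8, n_refine=5):
    similarity = lambda a, b: np.exp(-np.linalg.norm(a - b, axis=-1))
    labels = np.full(len(X), -1)
    centres = []
    for i in range(len(X)):                         # seeding pass
        if labels[i] != -1:
            continue
        members = (labels == -1) & (similarity(X, X[i]) >= threshold)
        labels[members] = len(centres)
        centres.append(X[members].mean(axis=0))
    for _ in range(n_refine):                       # k-means-like refinement
        sims = np.stack([similarity(X, c) for c in centres], axis=1)
        labels = np.where(sims.max(axis=1) >= threshold, sims.argmax(axis=1), -1)  # -1 = outlier
        centres = [X[labels == k].mean(axis=0) if np.any(labels == k) else centres[k]
                   for k in range(len(centres))]
    return labels, centres

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.05, (60, 6)) for loc in (0.2, 0.5, 0.8)])  # toy pollution samples
labels, centres = epc_like_clustering(X)
print(len(set(labels[labels >= 0].tolist())), int(np.sum(labels == -1)))
```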
Discovering biclusters in gene expression data based on high-dimensional linear geometries
Gan, Xiangchao; Liew, Alan Wee-Chung; Yan, Hong
2008-01-01
Background: In DNA microarray experiments, discovering groups of genes that share similar transcriptional characteristics is instrumental in functional annotation, tissue classification and motif identification. However, in many situations a subset of genes only exhibits consistent pattern over a subset of conditions. Conventional clustering algorithms that deal with the entire row or column in an expression matrix would therefore fail to detect these useful patterns in the data. Recently, biclustering has been proposed to detect a subset of genes exhibiting consistent pattern over a subset of conditions. However, most existing biclustering algorithms are based on searching for sub-matrices within a data matrix by optimizing certain heuristically defined merit functions. Moreover, most of these algorithms can only detect a restricted set of bicluster patterns. Results: In this paper, we present a novel geometric perspective for the biclustering problem. The biclustering process is interpreted as the detection of linear geometries in a high dimensional data space. Such a new perspective views biclusters with different patterns as hyperplanes in a high dimensional space, and allows us to handle different types of linear patterns simultaneously by matching a specific set of linear geometries. This geometric viewpoint also inspires us to propose a generic bicluster pattern, i.e. the linear coherent model that unifies the seemingly incompatible additive and multiplicative bicluster models. As a particular realization of our framework, we have implemented a Hough transform-based hyperplane detection algorithm. The experimental results on human lymphoma gene expression dataset show that our algorithm can find biologically significant subsets of genes. Conclusion: We have proposed a novel geometric interpretation of the biclustering problem. We have shown that many common types of bicluster are just different spatial arrangements of hyperplanes in a high dimensional data space. An implementation of the geometric framework using the Fast Hough transform for hyperplane detection can be used to discover biologically significant subsets of genes under subsets of conditions for microarray data analysis. PMID:18433477
A web-based instruction module for interpretation of craniofacial cone beam CT anatomy.
Hassan, B A; Jacobs, R; Scarfe, W C; Al-Rawi, W T
2007-09-01
To develop a web-based module for learner instruction in the interpretation and recognition of osseous anatomy on craniofacial cone-beam CT (CBCT) images. Volumetric datasets from three CBCT systems were acquired (i-CAT, NewTom 3G and AccuiTomo FPD) for various subjects using equipment-specific scanning protocols. The datasets were processed using multiple software packages to provide two-dimensional (2D) multiplanar reformatted (MPR) images (e.g. sagittal, coronal and axial) and three-dimensional (3D) visual representations (e.g. maximum intensity projection, minimum intensity projection, ray sum, surface and volume rendering). Distinct didactic modules which illustrate the principles of CBCT systems, guided navigation of the volumetric dataset, and anatomic correlation of 3D models and 2D MPR graphics were developed using a hybrid combination of web authoring and image analysis techniques. Interactive web multimedia instruction was facilitated by the use of dynamic highlighting and labelling, and rendered video illustrations, supplemented with didactic textual material. HTML and JavaScript were used extensively to integrate the educational modules. An interactive, multimedia educational tool for visualizing the morphology and interrelationships of osseous craniofacial anatomy, as depicted on CBCT MPR and 3D images, was designed and implemented. The present design of a web-based instruction module may assist radiologists and clinicians in learning how to recognize and interpret the craniofacial anatomy of CBCT-based images more efficiently.
Application of 3D Zernike descriptors to shape-based ligand similarity searching.
Venkatraman, Vishwesh; Chakravarthy, Padmasini Ramji; Kihara, Daisuke
2009-12-17
The identification of promising drug leads from a large database of compounds is an important step in the preliminary stages of drug design. Although shape is known to play a key role in the molecular recognition process, its application to virtual screening poses significant hurdles both in terms of the encoding scheme and speed. In this study, we have examined the efficacy of the alignment independent three-dimensional Zernike descriptor (3DZD) for fast shape based similarity searching. Performance of this approach was compared with several other methods including the statistical moments based ultrafast shape recognition scheme (USR) and SIMCOMP, a graph matching algorithm that compares atom environments. Three benchmark datasets are used to thoroughly test the methods in terms of their ability for molecular classification, retrieval rate, and performance under the situation that simulates actual virtual screening tasks over a large pharmaceutical database. The 3DZD performed better than or comparable to the other methods examined, depending on the datasets and evaluation metrics used. Reasons for the success and the failure of the shape based methods for specific cases are investigated. Based on the results for the three datasets, general conclusions are drawn with regard to their efficiency and applicability. The 3DZD has unique ability for fast comparison of three-dimensional shape of compounds. Examples analyzed illustrate the advantages and the room for improvements for the 3DZD.
Application of 3D Zernike descriptors to shape-based ligand similarity searching
2009-01-01
Background: The identification of promising drug leads from a large database of compounds is an important step in the preliminary stages of drug design. Although shape is known to play a key role in the molecular recognition process, its application to virtual screening poses significant hurdles both in terms of the encoding scheme and speed. Results: In this study, we have examined the efficacy of the alignment independent three-dimensional Zernike descriptor (3DZD) for fast shape based similarity searching. Performance of this approach was compared with several other methods including the statistical moments based ultrafast shape recognition scheme (USR) and SIMCOMP, a graph matching algorithm that compares atom environments. Three benchmark datasets are used to thoroughly test the methods in terms of their ability for molecular classification, retrieval rate, and performance under the situation that simulates actual virtual screening tasks over a large pharmaceutical database. The 3DZD performed better than or comparable to the other methods examined, depending on the datasets and evaluation metrics used. Reasons for the success and the failure of the shape based methods for specific cases are investigated. Based on the results for the three datasets, general conclusions are drawn with regard to their efficiency and applicability. Conclusion: The 3DZD has a unique ability for fast comparison of three-dimensional shape of compounds. Examples analyzed illustrate the advantages and the room for improvements for the 3DZD. PMID:20150998
Yasui, Yutaka; Pepe, Margaret; Thompson, Mary Lou; Adam, Bao-Ling; Wright, George L; Qu, Yinsheng; Potter, John D; Winget, Marcy; Thornquist, Mark; Feng, Ziding
2003-07-01
With recent advances in mass spectrometry techniques, it is now possible to investigate proteins over a wide range of molecular weights in small biological specimens. This advance has generated data-analytic challenges in proteomics, similar to those created by microarray technologies in genetics, namely, discovery of 'signature' protein profiles specific to each pathologic state (e.g. normal vs. cancer) or differential profiles between experimental conditions (e.g. treated by a drug of interest vs. untreated) from high-dimensional data. We propose a data-analytic strategy for discovering protein biomarkers based on such high-dimensional mass spectrometry data. A real biomarker-discovery project on prostate cancer is taken as a concrete example throughout the paper: the project aims to identify proteins in serum that distinguish cancer, benign hyperplasia, and normal states of prostate using the Surface Enhanced Laser Desorption/Ionization (SELDI) technology, a recently developed mass spectrometry technique. Our data-analytic strategy takes properties of the SELDI mass spectrometer into account: the SELDI output of a specimen contains about 48,000 (x, y) points where x is the protein mass divided by the number of charges introduced by ionization and y is the protein intensity of the corresponding mass per charge value, x, in that specimen. Given high coefficients of variation and other characteristics of protein intensity measures (y values), we reduce the measures of protein intensities to a set of binary variables that indicate peaks in the y-axis direction in the nearest neighborhoods of each mass per charge point in the x-axis direction. We then account for a shifting (measurement error) problem of the x-axis in SELDI output. After this pre-analysis processing of data, we combine the binary predictors to generate classification rules for cancer, benign hyperplasia, and normal states of prostate. Our approach is to apply the boosting algorithm to select binary predictors and construct a summary classifier. We empirically evaluate sensitivity and specificity of the resulting summary classifiers with a test dataset that is independent from the training dataset used to construct the summary classifiers. The proposed method performed nearly perfectly in distinguishing cancer and benign hyperplasia from normal. In the classification of cancer vs. benign hyperplasia, however, an appreciable proportion of the benign specimens were classified incorrectly as cancer. We discuss practical issues associated with our proposed approach to the analysis of SELDI output and its application in cancer biomarker discovery.
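As an illustration of this style of pre-processing and classification, the sketch below binarizes local maxima within a small mass-per-charge neighborhood (tolerating a small x-axis shift) and combines the binary indicators with boosted decision stumps. The window size, shift tolerance, toy spectra, and labels are all assumptions for the example, not the paper's SELDI data or tuning.

```python
# Illustrative sketch: binarize each spectrum into "peak present near this m/z bin"
# indicators, then combine the binary predictors with boosted decision stumps.
import numpy as np
from scipy.signal import find_peaks
from sklearn.ensemble import AdaBoostClassifier

def binarize_peaks(spectra, min_separation=5, shift_tol=2):
    n, m = spectra.shape
    peaks = np.zeros((n, m), dtype=np.uint8)
    for i in range(n):
        idx, _ = find_peaks(spectra[i], distance=min_separation)  # local maxima along the x-axis
        for j in idx:
            lo, hi = max(0, j - shift_tol), min(m, j + shift_tol + 1)
            peaks[i, lo:hi] = 1                  # tolerate a small m/z (x-axis) shift
    return peaks

rng = np.random.default_rng(0)
spectra = rng.random((120, 1000)) + np.sin(np.linspace(0, 40, 1000))  # toy spectra
y = rng.integers(0, 2, 120)                                           # toy class labels

X = binarize_peaks(spectra)
clf = AdaBoostClassifier(n_estimators=100).fit(X, y)  # default base learner: decision stumps
print(clf.score(X, y))
```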
Temkin, Bharti; Acosta, Eric; Malvankar, Ameya; Vaidyanath, Sreeram
2006-04-01
The Visible Human digital datasets make it possible to develop computer-based anatomical training systems that use virtual anatomical models (virtual body structures, VBS). Medical schools are combining these virtual training systems with classical anatomy teaching methods that use labeled images and cadaver dissection. In this paper we present a customizable web-based three-dimensional anatomy training system, W3D-VBS. W3D-VBS uses the National Library of Medicine's (NLM) Visible Human Male datasets to interactively locate, explore, select, extract, highlight, label, and visualize realistic 2D (using axial, coronal, and sagittal views) and 3D virtual structures. A real-time self-guided virtual tour of the entire body is designed to provide detailed anatomical information about structures, substructures, and proximal structures. The system thus facilitates learning of visuospatial relationships at a level of detail that may not be possible by any other means. The use of volumetric structures allows for repeated real-time virtual dissections, from any angle, at the convenience of the user. Volumetric (3D) virtual dissections are performed by adding, removing, highlighting, and labeling individual structures (and/or entire anatomical systems). The resultant virtual explorations (consisting of anatomical 2D/3D illustrations and animations), with user-selected highlighting colors and label positions, can be saved and used for generating lesson plans and evaluation systems. Tracking users' progress using the evaluation system helps customize the curriculum, making W3D-VBS a powerful learning tool. Our plan is to incorporate other Visible Human segmented datasets, especially datasets with higher resolutions, that make it possible to include finer anatomical structures such as nerves and small vessels. (c) 2006 Wiley-Liss, Inc.
Uncertainty propagation for statistical impact prediction of space debris
NASA Astrophysics Data System (ADS)
Hoogendoorn, R.; Mooij, E.; Geul, J.
2018-01-01
Predictions of the impact time and location of space debris in a decaying trajectory are highly influenced by uncertainties. The traditional Monte Carlo (MC) method can be used to perform accurate statistical impact predictions, but requires a large computational effort. A method is investigated that directly propagates a Probability Density Function (PDF) in time, which has the potential to obtain more accurate results with less computational effort. The decaying trajectory of Delta-K rocket stages was used to test the methods using a six degrees-of-freedom state model. The PDF of the state of the body was propagated in time to obtain impact-time distributions. This Direct PDF Propagation (DPP) method results in a multi-dimensional scattered dataset of the PDF of the state, which is highly challenging to process. No accurate results could be obtained, because of the structure of the DPP data and the high dimensionality. Therefore, the DPP method is less suitable for practical uncontrolled entry problems and the traditional MC method remains superior. Additionally, the MC method was used with two improved uncertainty models to obtain impact-time distributions, which were validated using observations of true impacts. For one of the two uncertainty models, statistically more valid impact-time distributions were obtained than in previous research.
PLANETarium - Visualizing Earth Sciences in the Planetarium
NASA Astrophysics Data System (ADS)
Ballmer, M. D.; Wiethoff, T.; Kraupe, T. W.
2013-12-01
In the past decade, projection systems in most planetariums, traditional sites of outreach and public education, have advanced from instruments that can visualize the motion of stars as beam spots moving over spherical projection areas to systems that are able to display multicolor, high-resolution, immersive full-dome videos or images. These extraordinary capabilities are ideally suited for visualization of global processes occurring on the surface and within the interior of the Earth, a spherical body just like the full dome. So far, however, our community has largely ignored this wonderful interface for outreach and education. A few documentaries on e.g. climate change or volcanic eruptions have been brought to planetariums, but are taking little advantage of the true potential of the medium, as mostly based on standard two-dimensional videos and cartoon-style animations. Along these lines, we here propose a framework to convey recent scientific results on the origin and evolution of our PLANET to the >100,000,000 per-year worldwide audience of planetariums, making the traditionally astronomy-focussed interface a true PLANETarium. In order to do this most efficiently, we intend to directly show visualizations of scientific datasets or models, originally designed for basic research. Such visualizations in solid-Earth, as well as atmospheric and ocean sciences, are expected to be renderable to the dome with little or no effort. For example, showing global geophysical datasets (e.g., surface temperature, gravity, magnetic field), or horizontal slices of seismic-tomography images and of spherical computer simulations (e.g., climate evolution, mantle flow or ocean currents) requires almost no rendering at all. Three-dimensional Cartesian datasets or models can be rendered using standard methods. With the appropriate audio support, present-day science visualizations are typically as intuitive as cartoon-style animations, yet more appealing visually, and clearly more informative, as they reveal the complexity and beauty of our planet. In addition to e.g. climate change and natural hazards, themes of interest may include the coupled evolution of the Earth's interior and life, from the accretion of our planet to the generation and sustainment of the magnetic field as well as of habitable conditions in the atmosphere and oceans. We believe that high-quality tax-funded science visualizations should not exclusively be used to facilitate communication among scientists, but also be directly recycled to raise the public's awareness and appreciation of geosciences.
ClimateSpark: An in-memory distributed computing framework for big climate data analytics
NASA Astrophysics Data System (ADS)
Hu, Fei; Yang, Chaowei; Schnase, John L.; Duffy, Daniel Q.; Xu, Mengchao; Bowen, Michael K.; Lee, Tsengdar; Song, Weiwei
2018-06-01
The unprecedented growth of climate data creates new opportunities for climate studies, and yet big climate data pose a grand challenge to climatologists to efficiently manage and analyze big data. The complexity of climate data content and analytical algorithms increases the difficulty of implementing algorithms on high performance computing systems. This paper proposes an in-memory, distributed computing framework, ClimateSpark, to facilitate complex big data analytics and time-consuming computational tasks. A chunked data structure improves parallel I/O efficiency, while a spatiotemporal index is built for the chunks to avoid unnecessary data reading and preprocessing. An integrated, multi-dimensional, array-based data model (ClimateRDD) and ETL operations are developed to address big climate data variety by integrating the processing components of the climate data lifecycle. ClimateSpark utilizes Spark SQL and Apache Zeppelin to develop a web portal to facilitate the interaction among climatologists, climate data, analytic operations and computing resources (e.g., using SQL query and Scala/Python notebook). Experimental results show that ClimateSpark conducts different spatiotemporal data queries/analytics with high efficiency and data locality. ClimateSpark is easily adaptable to other big multi-dimensional, array-based datasets in various geoscience domains.
Sparse PCA with Oracle Property.
Gu, Quanquan; Wang, Zhaoran; Liu, Han
In this paper, we study the estimation of the k-dimensional sparse principal subspace of covariance matrix Σ in the high-dimensional setting. We aim to recover the oracle principal subspace solution, i.e., the principal subspace estimator obtained assuming the true support is known a priori. To this end, we propose a family of estimators based on the semidefinite relaxation of sparse PCA with novel regularizations. In particular, under a weak assumption on the magnitude of the population projection matrix, one estimator within this family exactly recovers the true support with high probability, has exact rank-k, and attains an s/n statistical rate of convergence with s being the subspace sparsity level and n the sample size. Compared to existing support recovery results for sparse PCA, our approach does not hinge on the spiked covariance model or the limited correlation condition. As a complement to the first estimator that enjoys the oracle property, we prove that another estimator within the family achieves a sharper statistical rate of convergence than the standard semidefinite relaxation of sparse PCA, even when the previous assumption on the magnitude of the projection matrix is violated. We validate the theoretical results by numerical experiments on synthetic datasets.
Sparse PCA with Oracle Property
Gu, Quanquan; Wang, Zhaoran; Liu, Han
2014-01-01
In this paper, we study the estimation of the k-dimensional sparse principal subspace of covariance matrix Σ in the high-dimensional setting. We aim to recover the oracle principal subspace solution, i.e., the principal subspace estimator obtained assuming the true support is known a priori. To this end, we propose a family of estimators based on the semidefinite relaxation of sparse PCA with novel regularizations. In particular, under a weak assumption on the magnitude of the population projection matrix, one estimator within this family exactly recovers the true support with high probability, has exact rank-k, and attains a s/n statistical rate of convergence with s being the subspace sparsity level and n the sample size. Compared to existing support recovery results for sparse PCA, our approach does not hinge on the spiked covariance model or the limited correlation condition. As a complement to the first estimator that enjoys the oracle property, we prove that, another estimator within the family achieves a sharper statistical rate of convergence than the standard semidefinite relaxation of sparse PCA, even when the previous assumption on the magnitude of the projection matrix is violated. We validate the theoretical results by numerical experiments on synthetic datasets. PMID:25684971
Application of a fast skyline computation algorithm for serendipitous searching problems
NASA Astrophysics Data System (ADS)
Koizumi, Kenichi; Hiraki, Kei; Inaba, Mary
2018-02-01
Skyline computation is a method of extracting interesting entries from a large population with multiple attributes. These entries, called skyline or Pareto optimal entries, are known to have extreme characteristics that cannot be found by outlier detection methods. Skyline computation is an important task for characterizing large amounts of data and selecting interesting entries with extreme features. When the population changes dynamically, the task of calculating a sequence of skyline sets is called continuous skyline computation. This task is known to be difficult to perform for the following reasons: (1) information of non-skyline entries must be stored since they may join the skyline in the future; (2) the appearance or disappearance of even a single entry can change the skyline drastically; (3) it is difficult to adopt a geometric acceleration algorithm for skyline computation tasks with high-dimensional datasets. Our new algorithm, called the jointed rooted-tree (JR-tree), manages entries using a rooted tree structure. JR-tree delays extending the tree to deep levels to accelerate tree construction and traversal. In this study, we presented the difficulties in extracting entries tagged with a rare label in high-dimensional space and the potential of fast skyline computation in low-latency cell identification technology.
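For readers unfamiliar with skyline queries, a plain static skyline (the set of Pareto-optimal, non-dominated entries) can be computed with simple pairwise dominance checks, as sketched below. This O(n²) baseline assumes larger values are better in every attribute and does not implement the JR-tree or the continuous (streaming) setting.

```python
# Plain skyline extraction by pairwise dominance checks (an O(n^2) baseline,
# not the JR-tree). A point is on the skyline iff no other point dominates it.
import numpy as np

def skyline(points):
    pts = np.asarray(points, dtype=float)
    keep = np.ones(len(pts), dtype=bool)
    for i in range(len(pts)):
        if not keep[i]:
            continue
        # j dominates i if j is >= i in every attribute and > i in at least one
        dominated_by_any = np.all(pts >= pts[i], axis=1) & np.any(pts > pts[i], axis=1)
        if dominated_by_any.any():
            keep[i] = False
        else:
            # i is on the skyline; anything i dominates can be discarded early
            keep &= ~(np.all(pts[i] >= pts, axis=1) & np.any(pts[i] > pts, axis=1))
            keep[i] = True
    return pts[keep]

data = np.random.default_rng(0).random((5000, 4))   # toy population with 4 attributes
print(len(skyline(data)))
```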
Previous modelling of the median lethal dose (oral rat LD50) has indicated that local class-based models yield better correlations than global models. We evaluated the hypothesis that dividing the dataset by pesticidal mechanisms would improve prediction accuracy. A linear discri...
NASA Technical Reports Server (NTRS)
Smith, Peter; Kempler, Steven; Leptoukh, Gregory; Chen, Aijun
2010-01-01
This poster paper describes the NASA-funded project that set out to employ the latest three-dimensional visualization technology to explore and provide direct data access to heterogeneous A-Train datasets. Google Earth (tm) provides a foundation for organizing, visualizing, publishing and synergizing Earth science data.
A Graphics Design Framework to Visualize Multi-Dimensional Economic Datasets
ERIC Educational Resources Information Center
Chandramouli, Magesh; Narayanan, Badri; Bertoline, Gary R.
2013-01-01
This study implements a prototype graphics visualization framework to visualize multidimensional data. This graphics design framework serves as a "visual analytical database" for visualization and simulation of economic models. One of the primary goals of any kind of visualization is to extract useful information from colossal volumes of…
Assessment of metal artifact reduction methods in pelvic CT
DOE Office of Scientific and Technical Information (OSTI.GOV)
Abdoli, Mehrsima; Mehranian, Abolfazl; Ailianou, Angeliki
2016-04-15
Purpose: Metal artifact reduction (MAR) produces images with improved quality potentially leading to confident and reliable clinical diagnosis and therapy planning. In this work, the authors evaluate the performance of five MAR techniques for the assessment of computed tomography images of patients with hip prostheses. Methods: Five MAR algorithms were evaluated using simulation and clinical studies. The algorithms included one-dimensional linear interpolation (LI) of the corrupted projection bins in the sinogram, two-dimensional interpolation (2D), a normalized metal artifact reduction (NMAR) technique, a metal deletion technique, and a maximum a posteriori completion (MAPC) approach. The algorithms were applied to ten simulated datasets as well as 30 clinical studies of patients with metallic hip implants. Qualitative evaluations were performed by two blinded experienced radiologists who ranked overall artifact severity and pelvic organ recognition for each algorithm by assigning scores from zero to five (zero indicating totally obscured organs with no structures identifiable and five indicating recognition with high confidence). Results: Simulation studies revealed that 2D, NMAR, and MAPC techniques performed almost equally well in all regions. LI falls behind the other approaches in terms of reducing dark streaking artifacts as well as preserving unaffected regions (p < 0.05). Visual assessment of clinical datasets revealed the superiority of NMAR and MAPC in the evaluated pelvic organs and in terms of overall image quality. Conclusions: Overall, all methods, except LI, performed equally well in artifact-free regions. Considering both clinical and simulation studies, 2D, NMAR, and MAPC seem to outperform the other techniques.
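The simplest of the evaluated methods, one-dimensional linear interpolation across the metal trace, can be sketched in a few lines. The metal mask here is assumed to be given (e.g. from thresholding a first-pass reconstruction), and the toy sinogram sizes are illustrative.

```python
# Sketch of the LI method only: replace the metal-corrupted bins of each projection
# row by linear interpolation from the neighbouring, uncorrupted detector bins.
import numpy as np

def li_mar(sinogram, metal_mask):
    corrected = sinogram.astype(float).copy()
    bins = np.arange(sinogram.shape[1])
    for row in range(sinogram.shape[0]):           # one projection angle per row
        bad = metal_mask[row]
        if bad.any() and not bad.all():
            corrected[row, bad] = np.interp(bins[bad], bins[~bad], sinogram[row, ~bad])
    return corrected

rng = np.random.default_rng(0)
sino = rng.random((180, 256))                      # toy sinogram: angles x detector bins
mask = np.zeros_like(sino, dtype=bool)
mask[:, 120:136] = True                            # toy metal trace
print(np.abs(li_mar(sino, mask) - sino)[mask].mean())
```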
Statistical mechanics of complex neural systems and high dimensional data
NASA Astrophysics Data System (ADS)
Advani, Madhu; Lahiri, Subhaneil; Ganguli, Surya
2013-03-01
Recent experimental advances in neuroscience have opened new vistas into the immense complexity of neuronal networks. This proliferation of data challenges us on two parallel fronts. First, how can we form adequate theoretical frameworks for understanding how dynamical network processes cooperate across widely disparate spatiotemporal scales to solve important computational problems? Second, how can we extract meaningful models of neuronal systems from high dimensional datasets? To aid in these challenges, we give a pedagogical review of a collection of ideas and theoretical methods arising at the intersection of statistical physics, computer science and neurobiology. We introduce the interrelated replica and cavity methods, which originated in statistical physics as powerful ways to quantitatively analyze large highly heterogeneous systems of many interacting degrees of freedom. We also introduce the closely related notion of message passing in graphical models, which originated in computer science as a distributed algorithm capable of solving large inference and optimization problems involving many coupled variables. We then show how both the statistical physics and computer science perspectives can be applied in a wide diversity of contexts to problems arising in theoretical neuroscience and data analysis. Along the way we discuss spin glasses, learning theory, illusions of structure in noise, random matrices, dimensionality reduction and compressed sensing, all within the unified formalism of the replica method. Moreover, we review recent conceptual connections between message passing in graphical models, and neural computation and learning. Overall, these ideas illustrate how statistical physics and computer science might provide a lens through which we can uncover emergent computational functions buried deep within the dynamical complexities of neuronal networks.
Statistical Analysis of Big Data on Pharmacogenomics
Fan, Jianqing; Liu, Han
2013-01-01
This paper discusses statistical methods for estimating complex correlation structure from large pharmacogenomic datasets. We selectively review several prominent statistical methods for estimating large covariance matrix for understanding correlation structure, inverse covariance matrix for network modeling, large-scale simultaneous tests for selecting significantly differently expressed genes and proteins and genetic markers for complex diseases, and high dimensional variable selection for identifying important molecules for understanding molecule mechanisms in pharmacogenomics. Their applications to gene network estimation and biomarker selection are used to illustrate the methodological power. Several new challenges of Big data analysis, including complex data distribution, missing data, measurement error, spurious correlation, endogeneity, and the need for robust statistical methods, are also discussed. PMID:23602905
Wolters, Mark A; Dean, C B
2017-01-01
Remote sensing images from Earth-orbiting satellites are a potentially rich data source for monitoring and cataloguing atmospheric health hazards that cover large geographic regions. A method is proposed for classifying such images into hazard and nonhazard regions using the autologistic regression model, which may be viewed as a spatial extension of logistic regression. The method includes a novel and simple approach to parameter estimation that makes it well suited to handling the large and high-dimensional datasets arising from satellite-borne instruments. The methodology is demonstrated on both simulated images and a real application to the identification of forest fire smoke.
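One common, simple way to estimate an autologistic model (not necessarily the authors' novel estimator) is a pseudolikelihood-style fit: include the number of positive neighbours of each pixel as an extra covariate in an ordinary logistic regression, as sketched below on toy image data.

```python
# Pseudolikelihood-style autologistic sketch: ordinary logistic regression with the
# count of positive 4-neighbours of each pixel added as a spatial covariate.
import numpy as np
from scipy.ndimage import convolve
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
h, w = 64, 64
reflectance = rng.random((h, w, 3))                # toy multispectral image
labels = (reflectance[..., 0] + 0.2 * rng.standard_normal((h, w)) > 0.6).astype(int)  # toy hazard map

kernel = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
neighbour_sum = convolve(labels, kernel, mode="constant", cval=0)  # positive 4-neighbour count

X = np.column_stack([reflectance.reshape(-1, 3), neighbour_sum.reshape(-1, 1)])
model = LogisticRegression(max_iter=1000).fit(X, labels.ravel())
print(model.coef_)                                  # last coefficient ~ strength of spatial dependence
```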
Hyperspectral image classification based on local binary patterns and PCANet
NASA Astrophysics Data System (ADS)
Yang, Huizhen; Gao, Feng; Dong, Junyu; Yang, Yang
2018-04-01
Hyperspectral image classification has been well acknowledged as one of the challenging tasks of hyperspectral data processing. In this paper, we propose a novel hyperspectral image classification framework based on local binary pattern (LBP) features and PCANet. In the proposed method, linear prediction error (LPE) is first employed to select a subset of informative bands, and LBP is utilized to extract texture features. Then, spectral and texture features are stacked into high-dimensional vectors. Next, the extracted features of a specified position are transformed into a 2-D image. The obtained images of all pixels are fed into PCANet for classification. Experimental results on a real hyperspectral dataset demonstrate the effectiveness of the proposed method.
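A small sketch of the texture-feature step is given below, using scikit-image's uniform LBP on a single representative band and stacking it with each pixel's spectrum. The LPE band selection and the PCANet classifier are not reproduced, and the band choice and LBP parameters are assumptions.

```python
# Sketch of the spectral + texture stacking step only (LPE band selection and the
# PCANet classifier from the paper are not reproduced here).
import numpy as np
from skimage.feature import local_binary_pattern

rng = np.random.default_rng(0)
cube = rng.random((100, 100, 50))                  # toy hyperspectral cube (H, W, bands)

band = cube.mean(axis=2)                           # stand-in for an LPE-selected band
lbp = local_binary_pattern(band, P=8, R=1, method="uniform")

spectral = cube.reshape(-1, cube.shape[2])         # per-pixel spectra
texture = lbp.reshape(-1, 1)                       # per-pixel LBP texture code
features = np.hstack([spectral, texture])          # stacked spectral + texture vector
print(features.shape)                              # (10000, 51)
```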
High-speed upper-airway imaging using full-range optical coherence tomography
NASA Astrophysics Data System (ADS)
Jing, Joseph; Zhang, Jun; Loy, Anthony Chin; Wong, Brian J. F.; Chen, Zhongping
2012-11-01
Obstruction in the upper airway can often cause reductions in breathing or gas exchange efficiency and lead to sleep disorders such as sleep apnea. Imaging diagnosis of the obstruction region has been accomplished using computed tomography (CT) and magnetic resonance imaging (MRI). However, CT requires the use of ionizing radiation, and MRI typically requires sedation of the patient to prevent motion artifacts. Long-range optical coherence tomography (OCT) has the potential to provide high-speed three-dimensional tomographic images with high resolution and without the use of ionizing radiation. In this paper, we present work on the development of a long-range OCT endoscopic probe with a 1.2 mm OD and a 20 mm working distance, used in conjunction with a modified Fourier domain swept source OCT system to acquire structural and anatomical datasets of the human airway. Imaging from the bottom of the larynx to the end of the nasal cavity is completed within 40 s.
Rising dough and baking bread at the Australian synchrotron
NASA Astrophysics Data System (ADS)
Mayo, S. C.; McCann, T.; Day, L.; Favaro, J.; Tuhumury, H.; Thompson, D.; Maksimenko, A.
2016-01-01
Wheat protein quality and the amount of common salt added in dough formulation can have a significant effect on the microstructure and loaf volume of bread. High-speed synchrotron micro-CT provides an ideal tool for observing the three dimensional structure of bread dough in situ during proving (rising) and baking. In this work, the synchrotron micro-CT technique was used to observe the structure and time evolution of doughs made from high and low protein flour and three different salt additives. These experiments showed that, as expected, high protein flour produces a higher volume loaf compared to low protein flour regardless of salt additives. Furthermore the results show that KCl in particular has a very negative effect on dough properties resulting in much reduced porosity. The hundreds of datasets produced and analysed during this experiment also provided a valuable test case for handling large quantities of data using tools on the Australian Synchrotron's MASSIVE cluster.
Automated measurement of retinal blood vessel tortuosity
NASA Astrophysics Data System (ADS)
Joshi, Vinayak; Reinhardt, Joseph M.; Abramoff, Michael D.
2010-03-01
Abnormalities in the vascular pattern of the retina are associated with retinal diseases and are also risk factors for systemic diseases, especially cardiovascular diseases. The three-dimensional retinal vascular pattern is mostly formed congenitally, but is then modified over life, in response to aging, vessel wall dystrophies and long term changes in blood flow and pressure. A characteristic of the vascular pattern that is appreciated by clinicians is vascular tortuosity, i.e., how curved or kinked a blood vessel, either vein or artery, appears along its course. We developed a new quantitative metric for vascular tortuosity, based on the vessel's angle of curvature, the length of the curved vessel over its chord length (arc-to-chord ratio), and the number of curvature sign changes, and combined these into a unidimensional metric, the Tortuosity Index (TI). In contrast to other published methods, this method can also estimate an appropriate TI for vessels with a constant curvature sign and for vessels with equal arc-to-chord ratios. We applied this method to a dataset of 15 digital fundus images of 8 patients with Facioscapulohumeral muscular dystrophy (FSHD), and to another publicly available dataset of 60 fundus images of normal cases and patients with hypertensive retinopathy, for which the arterial and venous tortuosities have also been graded by masked experts (ophthalmologists). The method produced exactly the same rank-ordered list of vessel tortuosity (TI) values as obtained by averaging the tortuosity grading given by 3 ophthalmologists for the FSHD dataset, and a list of TI values with high ranking correlation with the ophthalmologists' grading for the other dataset. Our results show that TI has potential to detect and evaluate abnormal retinal vascular structure in early diagnosis and prognosis of retinopathies.
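The named ingredients can be computed for a sampled vessel centerline as sketched below (arc-to-chord ratio, curvature-sign changes, mean turning angle). The way they are combined here, a simple product, is an illustrative assumption, not the paper's exact TI formula.

```python
# Sketch of the tortuosity ingredients for a sampled 2-D vessel centerline.
import numpy as np

def tortuosity_index(points):
    p = np.asarray(points, dtype=float)            # (n, 2) centerline samples
    seg = np.diff(p, axis=0)
    arc = np.sum(np.linalg.norm(seg, axis=1))
    chord = np.linalg.norm(p[-1] - p[0])
    arc_to_chord = arc / chord                     # arc-to-chord ratio

    angles = np.arctan2(seg[:, 1], seg[:, 0])
    turning = np.diff(angles)
    turning = (turning + np.pi) % (2 * np.pi) - np.pi    # wrap turning angles to (-pi, pi]
    sign_changes = np.sum(np.diff(np.sign(turning)) != 0)

    # Illustrative combination of the three ingredients into a single index.
    return arc_to_chord * (1 + sign_changes) * (1 + np.abs(turning).mean())

t = np.linspace(0, 4 * np.pi, 200)
wiggly = np.column_stack([t, 0.3 * np.sin(t)])     # curved vessel
straight = np.column_stack([t, np.zeros_like(t)])  # straight vessel (index = 1)
print(tortuosity_index(wiggly), tortuosity_index(straight))
```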
Long, Yi; Du, Zhi-Jiang; Chen, Chao-Feng; Dong, Wei; Wang, Wei-Dong
2017-07-01
The most important step for a lower extremity exoskeleton is to infer human motion intent (HMI), which contributes to achieving human-exoskeleton collaboration. Since the user is in the control loop, the relationship between human-robot interaction (HRI) information and HMI is nonlinear and complicated, which is difficult to model using mathematical approaches. The nonlinear approximation can be learned by using machine learning approaches. Gaussian Process (GP) regression is suitable for high-dimensional and small-sample nonlinear regression problems. GP regression is restrictive for large data sets due to its computational complexity. In this paper, an online sparse GP algorithm is constructed to learn the HMI. The original training dataset is collected while the user wears the exoskeleton system with friction compensation and performs movement that is as unconstrained as possible. The dataset has two kinds of data, i.e., (1) physical HRI, which is collected by torque sensors placed at the interaction cuffs for the active joints, i.e., the knee joints; and (2) joint angular position, which is measured by optical position sensors. To reduce the computational complexity of GP, grey relational analysis (GRA) is utilized to condense the original dataset and provide the final training dataset. The hyper-parameters are optimized offline by maximizing the marginal likelihood and are then applied in the online GP regression algorithm. The HMI, i.e., the angular position of the human joints, is regarded as the reference trajectory for the mechanical legs. To verify the effectiveness of the proposed algorithm, experiments are performed on a subject at a natural speed. The experimental results show that the HMI can be obtained in real time, and the approach can be extended and employed in similar exoskeleton systems.
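The regression step alone can be sketched with scikit-learn's GaussianProcessRegressor on a small, sparsified subset of toy torque/angle data, as below. The random subset stands in for the GRA-selected training set, and neither the GRA step nor the online sparse update from the paper is reproduced.

```python
# Sketch of the regression step only: a GP fit on a sparsified subset, mapping
# interaction-torque features to the knee joint angle (toy data throughout).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 2000)
torque = np.column_stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)])          # toy HRI features
angle = 0.4 * np.sin(2 * np.pi * t - 0.3) + 0.02 * rng.standard_normal(len(t))    # toy knee angle

subset = rng.choice(len(t), size=150, replace=False)     # stand-in for the GRA-selected subset
kernel = ConstantKernel() * RBF(length_scale=1.0) + WhiteKernel()
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(torque[subset], angle[subset])                    # hyper-parameters via marginal likelihood

pred, std = gp.predict(torque[:5], return_std=True)      # query: predicted joint angle + uncertainty
print(pred, std)
```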
Next-Generation Machine Learning for Biological Networks.
Camacho, Diogo M; Collins, Katherine M; Powers, Rani K; Costello, James C; Collins, James J
2018-06-14
Machine learning, a collection of data-analytical techniques aimed at building predictive models from multi-dimensional datasets, is becoming integral to modern biological research. By enabling one to generate models that learn from large datasets and make predictions on likely outcomes, machine learning can be used to study complex cellular systems such as biological networks. Here, we provide a primer on machine learning for life scientists, including an introduction to deep learning. We discuss opportunities and challenges at the intersection of machine learning and network biology, which could impact disease biology, drug discovery, microbiome research, and synthetic biology. Copyright © 2018 Elsevier Inc. All rights reserved.
Coman, Daniel; de Graaf, Robin A; Rothman, Douglas L; Hyder, Fahmeed
2013-11-01
Spectroscopic signals which emanate from complexes between paramagnetic lanthanide (III) ions (e.g. Tm(3+)) and macrocyclic chelates (e.g. 1,4,7,10-tetramethyl-1,4,7,10-tetraazacyclododecane-1,4,7,10-tetraacetate, or DOTMA(4-)) are sensitive to physiology (e.g. temperature). Because nonexchanging protons from these lanthanide-based macrocyclic agents have relaxation times on the order of a few milliseconds, rapid data acquisition is possible with chemical shift imaging (CSI). Thus, Biosensor Imaging of Redundant Deviation in Shifts (BIRDS) which originate from nonexchanging protons of these paramagnetic agents, but exclude water proton detection, can allow molecular imaging. Previous two-dimensional CSI experiments with such lanthanide-based macrocyclics allowed acquisition from ~12-μL voxels in rat brain within 5 min using rectangular encoding of k space. Because cubical encoding of k space in three dimensions for whole-brain coverage increases the CSI acquisition time to several tens of minutes or more, a faster CSI technique is required for BIRDS to be of practical use. Here, we demonstrate a CSI acquisition method to improve three-dimensional molecular imaging capabilities with lanthanide-based macrocyclics. Using TmDOTMA(-), we show datasets from a 20 × 20 × 20-mm(3) field of view with voxels of ~1 μL effective volume acquired within 5 min (at 11.7 T) for temperature mapping. By employing reduced spherical encoding with Gaussian weighting (RESEGAW) instead of cubical encoding of k space, a significant increase in CSI signal is obtained. In vitro and in vivo three-dimensional CSI data with TmDOTMA(-), and presumably similar lanthanide-based macrocyclics, suggest that acquisition using RESEGAW can be used for high spatiotemporal resolution molecular mapping with BIRDS. Copyright © 2013 John Wiley & Sons, Ltd.
An empirical approach to inversion of an unconventional helicopter electromagnetic dataset
Pellerin, L.; Labson, V.F.
2003-01-01
A helicopter electromagnetic (HEM) survey acquired at the U.S. Idaho National Engineering and Environmental Laboratory (INEEL) used a modification of a traditional mining airborne method flown at low levels for detailed characterization of shallow waste sites. The low sensor height, used to increase resolution, invalidates standard assumptions used in processing HEM data. Although the survey design strategy was sound, traditional interpretation techniques, routinely used in industry, proved ineffective. Processed data and apparent resistivity maps were severely distorted, and hence unusable, due to low flight height effects, high magnetic permeability of the basalt host, and the conductive, three-dimensional nature of the waste site targets. To accommodate these interpretation challenges, we modified a one-dimensional inversion routine to include a linear term in the objective function that allows for the magnetic and three-dimensional electromagnetic responses in the in-phase data. Although somewhat ad hoc, the use of this term in the inverse routine, referred to as the shift factor, was successful in defining the waste sites and reducing noise due to the low flight height and magnetic characteristics of the host rock. Many inversion scenarios were applied to the data and careful analysis was necessary to determine the parameters appropriate for interpretation, hence the approach was empirical. Data from three areas were processed with this scheme to highlight different interpretational aspects of the method. Waste sites were delineated with the shift terms in two of the areas, allowing for separation of the anthropogenic targets from the natural one-dimensional host. In the third area, the estimated resistivity and the shift factor were used for geological mapping. The high magnetic content of the native soil enabled the mapping of disturbed soil with the shift term. Published by Elsevier Science B.V.
Cao, Peng; Liu, Xiaoli; Yang, Jinzhu; Zhao, Dazhe; Huang, Min; Zhang, Jian; Zaiane, Osmar
2017-12-01
Alzheimer's disease (AD) is not only a substantial financial burden to the health care system but also an emotional burden to patients and their families. Making an accurate diagnosis of AD based on brain magnetic resonance imaging (MRI), particularly at the earliest stages, is becoming increasingly critical. However, high dimensionality and imbalanced data are two major challenges in the study of computer-aided AD diagnosis. The greatest limitation of existing dimensionality reduction and over-sampling methods is that they assume a linear relationship between the MRI features (predictor) and the disease status (response). To better capture the complicated but more flexible relationship, we propose multi-kernel-based dimensionality reduction and over-sampling approaches. We combined Marginal Fisher Analysis with ℓ2,1-norm-based multi-kernel learning (MKMFA) to achieve sparsity at the region-of-interest (ROI) level, which leads to simultaneously selecting a subset of the relevant brain regions and learning a dimensionality transformation. Meanwhile, a multi-kernel over-sampling (MKOS) method was developed to generate synthetic instances in the optimal kernel space induced by MKMFA, so as to compensate for the class-imbalanced distribution. We comprehensively evaluate the proposed models for diagnostic classification (binary-class and multi-class classification) including all subjects from the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset. The experimental results not only demonstrate that the proposed method has superior performance over multiple comparable methods, but also identify relevant imaging biomarkers that are consistent with prior medical knowledge. Copyright © 2017 Elsevier Ltd. All rights reserved.
Parsimony and goodness-of-fit in multi-dimensional NMR inversion
NASA Astrophysics Data System (ADS)
Babak, Petro; Kryuchkov, Sergey; Kantzas, Apostolos
2017-01-01
Multi-dimensional nuclear magnetic resonance (NMR) experiments are often used for the study of molecular structure and dynamics of matter in core analysis and reservoir evaluation. Industrial applications of multi-dimensional NMR involve a high-dimensional measurement dataset with a complicated correlation structure and require rapid and stable inversion algorithms from the time domain to the relaxation rate and/or diffusion domains. In practice, applying existing inverse algorithms with a large number of parameter values leads to an infinite number of solutions with a reasonable fit to the NMR data. The interpretation of such variability of multiple solutions and the selection of the most appropriate solution can be a very complex problem. In most cases the characteristics of materials have sparse signatures, and investigators would like to distinguish the most significant relaxation and diffusion values of the materials. To produce an easy-to-interpret and unique NMR distribution with a finite number of principal parameter values, we introduce a new method for NMR inversion. The method is constructed based on the trade-off between the conventional goodness-of-fit approach to multivariate data and the principle of parsimony, which guarantees an inversion with the smallest number of parameter values. We suggest performing the inversion of NMR data using a forward stepwise regression selection algorithm. To account for the trade-off between goodness-of-fit and parsimony, the objective function is selected based on the Akaike Information Criterion (AIC). The performance of the developed multi-dimensional NMR inversion method and its comparison with conventional methods are illustrated using real data for samples with bitumen, water and clay.
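To make the parsimony-versus-goodness-of-fit idea above concrete, the following sketch (not the authors' implementation) runs a forward stepwise selection of T2 relaxation components on a synthetic one-dimensional decay curve, scoring each candidate set with a Gaussian-noise AIC and stopping when no additional component improves the score; the echo-time grid, the candidate T2 grid and the noise level are illustrative assumptions.

    import numpy as np
    from scipy.optimize import nnls

    rng = np.random.default_rng(0)
    t = np.linspace(0.001, 1.0, 200)                       # echo times (s), illustrative
    true_T2, true_amp = np.array([0.02, 0.2]), np.array([1.0, 0.5])
    y = (true_amp * np.exp(-t[:, None] / true_T2)).sum(axis=1) + 0.01 * rng.standard_normal(t.size)

    grid = np.logspace(-3, 0, 60)                          # candidate T2 values
    kernel = np.exp(-t[:, None] / grid[None, :])           # dictionary of decay curves

    def aic(residual_ss, n_obs, n_params):
        # Gaussian-noise AIC: n*ln(RSS/n) + 2k
        return n_obs * np.log(residual_ss / n_obs) + 2 * n_params

    selected, best_aic = [], np.inf
    while True:
        best_candidate = None
        for j in range(grid.size):
            if j in selected:
                continue
            cols = selected + [j]
            amp, rnorm = nnls(kernel[:, cols], y)          # non-negative amplitudes
            score = aic(rnorm ** 2, t.size, len(cols))
            if score < best_aic:
                best_aic, best_candidate = score, j
        if best_candidate is None:                         # no candidate improves AIC: stop
            break
        selected.append(best_candidate)

    print("selected T2 values (s):", grid[selected])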
Rectified factor networks for biclustering of omics data.
Clevert, Djork-Arné; Unterthiner, Thomas; Povysil, Gundula; Hochreiter, Sepp
2017-07-15
Biclustering has become a major tool for analyzing large datasets given as a matrix of samples times features and has been successfully applied in life sciences and e-commerce for drug design and recommender systems, respectively. Factor Analysis for Bicluster Acquisition (FABIA), one of the most successful biclustering methods, is a generative model that represents each bicluster by two sparse membership vectors: one for the samples and one for the features. However, FABIA is restricted to about 20 code units because of the high computational complexity of computing the posterior. Furthermore, code units are sometimes insufficiently decorrelated and sample membership is difficult to determine. We propose to use the recently introduced unsupervised Deep Learning approach Rectified Factor Networks (RFNs) to overcome the drawbacks of existing biclustering methods. RFNs efficiently construct very sparse, non-linear, high-dimensional representations of the input via their posterior means. RFN learning is a generalized alternating minimization algorithm based on the posterior regularization method which enforces non-negative and normalized posterior means. Each code unit represents a bicluster, where samples for which the code unit is active belong to the bicluster and features that have activating weights to the code unit belong to the bicluster. On 400 benchmark datasets and on three gene expression datasets with known clusters, RFN outperformed 13 other biclustering methods including FABIA. On data of the 1000 Genomes Project, RFN could identify DNA segments which indicate that interbreeding with other hominins started already before the ancestors of modern humans left Africa. https://github.com/bioinf-jku/librfn. djork-arne.clevert@bayer.com or hochreit@bioinf.jku.at. © The Author 2017. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
VisIVO: A Library and Integrated Tools for Large Astrophysical Dataset Exploration
NASA Astrophysics Data System (ADS)
Becciani, U.; Costa, A.; Ersotelos, N.; Krokos, M.; Massimino, P.; Petta, C.; Vitello, F.
2012-09-01
VisIVO provides an integrated suite of tools and services that can be used in many scientific fields. VisIVO development starts in the Virtual Observatory framework. VisIVO allows users to visualize meaningfully highly-complex, large-scale datasets and create movies of these visualizations based on distributed infrastructures. VisIVO supports high-performance, multi-dimensional visualization of large-scale astrophysical datasets. Users can rapidly obtain meaningful visualizations while preserving full and intuitive control of the relevant parameters. VisIVO consists of VisIVO Desktop - a stand-alone application for interactive visualization on standard PCs, VisIVO Server - a platform for high performance visualization, VisIVO Web - a custom designed web portal, VisIVOSmartphone - an application to exploit the VisIVO Server functionality and the latest VisIVO features: VisIVO Library allows a job running on a computational system (grid, HPC, etc.) to produce movies directly with the code internal data arrays without the need to produce intermediate files. This is particularly important when running on large computational facilities, where the user wants to have a look at the results during the data production phase. For example, in grid computing facilities, images can be produced directly in the grid catalogue while the user code is running in a system that cannot be directly accessed by the user (a worker node). The deployment of VisIVO on the DG and gLite is carried out with the support of EDGI and EGI-Inspire projects. Depending on the structure and size of datasets under consideration, the data exploration process could take several hours of CPU for creating customized views and the production of movies could potentially last several days. For this reason an MPI parallel version of VisIVO could play a fundamental role in increasing performance, e.g. it could be automatically deployed on nodes that are MPI aware. A central concept in our development is thus to produce unified code that can run either on serial nodes or in parallel by using HPC oriented grid nodes. Another important aspect, to obtain as high performance as possible, is the integration of VisIVO processes with grid nodes where GPUs are available. We have selected CUDA for implementing a range of computationally heavy modules. VisIVO is supported by EGI-Inspire, EDGI and SCI-BUS projects.
Privacy preserving data publishing of categorical data through k-anonymity and feature selection.
Aristodimou, Aristos; Antoniades, Athos; Pattichis, Constantinos S
2016-03-01
In healthcare, there is a vast amount of patients' data, which can lead to important discoveries if combined. Due to legal and ethical issues, such data cannot be shared and hence such information is underused. A new area of research has emerged, called privacy preserving data publishing (PPDP), which aims at sharing data in a way that privacy is preserved while the information lost is kept at a minimum. In this Letter, a new anonymisation algorithm for PPDP is proposed, which is based on k-anonymity through pattern-based multidimensional suppression (kPB-MS). The algorithm uses feature selection for reducing the data dimensionality and then combines attribute and record suppression for obtaining k-anonymity. Five datasets from different areas of life sciences [RETINOPATHY, Single Proton Emission Computed Tomography imaging, gene sequencing and drug discovery (two datasets)] were anonymised with kPB-MS. The produced anonymised datasets were evaluated using four different classifiers and in 74% of the test cases, they produced similar or better accuracies than using the full datasets.
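As a minimal illustration of the k-anonymity requirement that kPB-MS enforces (and only of that requirement, not of the pattern-based multidimensional suppression itself), the sketch below groups records by hypothetical quasi-identifier columns and suppresses any record whose group is smaller than k.

    import pandas as pd

    def suppress_to_k_anonymity(df, quasi_identifiers, k):
        # Keep only records whose quasi-identifier combination occurs at least k times.
        group_sizes = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
        return df[group_sizes >= k].reset_index(drop=True)

    records = pd.DataFrame({
        "age_band":   ["30-39", "30-39", "30-39", "40-49", "40-49"],
        "zip_prefix": ["100", "100", "100", "200", "201"],
        "diagnosis":  ["A", "B", "A", "C", "C"],           # sensitive attribute, left untouched
    })
    anonymised = suppress_to_k_anonymity(records, ["age_band", "zip_prefix"], k=2)
    print(anonymised)                                      # the two singleton 40-49 groups are suppressed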
Optimal SVM parameter selection for non-separable and unbalanced datasets.
Jiang, Peng; Missoum, Samy; Chen, Zhao
2014-10-01
This article presents a study of three validation metrics used for the selection of optimal parameters of a support vector machine (SVM) classifier in the case of non-separable and unbalanced datasets. This situation is often encountered when the data is obtained experimentally or clinically. The three metrics selected in this work are the area under the ROC curve (AUC), accuracy, and balanced accuracy. These validation metrics are tested using computational data only, which enables the creation of fully separable sets of data. This way, non-separable datasets, representative of a real-world problem, can be created by projection onto a lower dimensional sub-space. The knowledge of the separable dataset, unknown in real-world problems, provides a reference to compare the three validation metrics using a quantity referred to as the "weighted likelihood". As an application example, the study investigates a classification model for hip fracture prediction. The data is obtained from a parameterized finite element model of a femur. The performance of the various validation metrics is studied for several levels of separability, ratios of unbalance, and training set sizes.
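A hedged sketch of the kind of comparison described above, using scikit-learn as a stand-in (not the authors' code): the same SVM hyperparameter grid is tuned three times on an unbalanced synthetic dataset, once per validation metric, so the selected parameters can be compared across metrics.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # unbalanced, non-separable synthetic data standing in for the clinical/experimental case
    X, y = make_classification(n_samples=600, n_features=10, weights=[0.9, 0.1],
                               class_sep=0.8, random_state=0)
    param_grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1]}

    for metric in ("roc_auc", "accuracy", "balanced_accuracy"):
        search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring=metric, cv=5)
        search.fit(X, y)
        print(f"{metric:18s} best params: {search.best_params_}  CV score: {search.best_score_:.3f}")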
Yilmaz, E; Kayikcioglu, T; Kayipmaz, S
2017-07-01
In this article, we propose a decision support system for effective classification of dental periapical cyst and keratocystic odontogenic tumor (KCOT) lesions obtained via cone beam computed tomography (CBCT). CBCT has been effectively used in recent years for diagnosing dental pathologies and determining their boundaries and content. Unlike other imaging techniques, CBCT provides detailed and distinctive information about the pathologies by enabling a three-dimensional (3D) image of the region to be displayed. We employed 50 CBCT 3D image dataset files as the full dataset of our study. These datasets were identified by experts as periapical cyst and KCOT lesions according to the clinical, radiographic and histopathologic features. Segmentation operations were performed on the CBCT images using viewer software that we developed. Using the tools of this software, we marked the lesional volume of interest and calculated and applied the order statistics and 3D gray-level co-occurrence matrix for each CBCT dataset. A feature vector of the lesional region, including 636 different feature items, was created from those statistics. Six classifiers were used for the classification experiments. The Support Vector Machine (SVM) classifier achieved the best classification performance, with 100% accuracy and a 100% F-score (F1), in the experiments in which a ten-fold cross-validation method was used with a forward feature selection algorithm. SVM achieved the best classification performance, with 96.00% accuracy and a 96.00% F1 score, in the experiments in which a split-sample validation method was used with a forward feature selection algorithm. SVM additionally achieved the best performance, with 94.00% accuracy and a 93.88% F1 score, in the experiments in which a leave-one-out cross-validation (LOOCV) method was used with a forward feature selection algorithm. Based on the results, we determined that periapical cyst and KCOT lesions can be classified with high accuracy using the models that we built with the new dataset selected for this study. The studies described in this article, along with the selected 3D dataset, the 3D statistics calculated from the dataset, and the performance results of the different classifiers, comprise an important contribution to the field of computer-aided diagnosis of dental apical lesions. Copyright © 2017 Elsevier B.V. All rights reserved.
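The evaluation protocol described above (a forward feature selection wrapper around an SVM, scored by cross-validation) can be sketched generically as follows; the CBCT feature matrix is not public, so random data of reduced dimensionality stands in for the 636-dimensional lesion descriptors, and scikit-learn's SequentialFeatureSelector is a stand-in for the authors' forward selection step.

    import numpy as np
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    rng = np.random.default_rng(1)
    X = rng.standard_normal((50, 60))     # stand-in for 50 lesions x (truncated) texture features
    y = rng.integers(0, 2, size=50)       # 0 = periapical cyst, 1 = KCOT (labels simulated)

    svm = SVC(kernel="linear")
    selector = SequentialFeatureSelector(svm, n_features_to_select=5, direction="forward", cv=5)
    model = make_pipeline(StandardScaler(), selector, svm)

    scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
    print(f"10-fold CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")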
Ramachandra, Ranjan; de Jonge, Niels
2012-01-01
Three-dimensional (3D) data sets were recorded of gold nanoparticles placed on both sides of silicon nitride membranes using focal series aberration-corrected scanning transmission electron microscopy (STEM). The deconvolution of the 3D datasets was optimized to obtain the highest possible axial resolution. The deconvolution involved two different point spread functions (PSFs), each calculated iteratively via blind deconvolution. Supporting membranes of different thicknesses were tested to study the effect of beam broadening on the deconvolution. It was found that several iterations of deconvolution were efficient in reducing the imaging noise. With an increasing number of iterations, the axial resolution was increased, and most of the structural information was preserved. Additional iterations improved the axial resolution by up to a factor of 4 to 6, depending on the particular dataset, to a best value of 8 nm, but at the cost of a reduction of the lateral size of the nanoparticles in the image. Thus, the deconvolution procedure optimized for highest axial resolution is best suited for applications where one is interested in the 3D locations of nanoparticles only. PMID:22152090
A Four Dimensional Spatio-Temporal Analysis of an Agricultural Dataset
Donald, Margaret R.; Mengersen, Kerrie L.; Young, Rick R.
2015-01-01
While a variety of statistical models now exist for the spatio-temporal analysis of two-dimensional (surface) data collected over time, there are few published examples of analogous models for the spatial analysis of data taken over four dimensions: latitude, longitude, height or depth, and time. When taking account of the autocorrelation of data within and between dimensions, the notion of closeness often differs for each of the dimensions. Here, we consider a number of approaches to the analysis of such a dataset, which arises from an agricultural experiment exploring the impact of different cropping systems on soil moisture. The proposed models vary in their representation of the spatial correlation in the data, the assumed temporal pattern and choice of conditional autoregressive (CAR) and other priors. In terms of the substantive question, we find that response cropping is generally more effective than long fallow cropping in reducing soil moisture at the depths considered (100 cm to 220 cm). Thus, if we wish to reduce the possibility of deep drainage and increased groundwater salinity, the recommended cropping system is response cropping. PMID:26513746
NASA Astrophysics Data System (ADS)
Bentaieb, Samia; Ouamri, Abdelaziz; Nait-Ali, Amine; Keche, Mokhtar
2018-01-01
We propose and evaluate a three-dimensional (3D) face recognition approach that applies the speeded up robust feature (SURF) algorithm to the depth representation of shape index map, under real-world conditions, using only a single gallery sample for each subject. First, the 3D scans are preprocessed, then SURF is applied on the shape index map to find interest points and their descriptors. Each 3D face scan is represented by keypoints descriptors, and a large dictionary is built from all the gallery descriptors. At the recognition step, descriptors of a probe face scan are sparsely represented by the dictionary. A multitask sparse representation classification is used to determine the identity of each probe face. The feasibility of the approach that uses the SURF algorithm on the shape index map for face identification/authentication is checked through an experimental investigation conducted on Bosphorus, University of Milano Bicocca, and CASIA 3D datasets. It achieves an overall rank one recognition rate of 97.75%, 80.85%, and 95.12%, respectively, on these datasets.
Discriminative clustering on manifold for adaptive transductive classification.
Zhang, Zhao; Jia, Lei; Zhang, Min; Li, Bing; Zhang, Li; Li, Fanzhang
2017-10-01
In this paper, we mainly propose a novel adaptive transductive label propagation approach by joint discriminative clustering on manifolds for representing and classifying high-dimensional data. Our framework seamlessly combines the unsupervised manifold learning, discriminative clustering and adaptive classification into a unified model. Also, our method incorporates the adaptive graph weight construction with label propagation. Specifically, our method is capable of propagating label information using adaptive weights over low-dimensional manifold features, which is different from most existing studies that usually predict the labels and construct the weights in the original Euclidean space. For transductive classification by our formulation, we first perform the joint discriminative K-means clustering and manifold learning to capture the low-dimensional nonlinear manifolds. Then, we construct the adaptive weights over the learnt manifold features, where the adaptive weights are calculated through performing the joint minimization of the reconstruction errors over features and soft labels so that the graph weights can be joint-optimal for data representation and classification. Using the adaptive weights, we can easily estimate the unknown labels of samples. After that, our method returns the updated weights for further updating the manifold features. Extensive simulations on image classification and segmentation show that our proposed algorithm can deliver the state-of-the-art performance on several public datasets. Copyright © 2017 Elsevier Ltd. All rights reserved.
NASA Astrophysics Data System (ADS)
Pham, Tien-Lam; Nguyen, Nguyen-Duong; Nguyen, Van-Doan; Kino, Hiori; Miyake, Takashi; Dam, Hieu-Chi
2018-05-01
We have developed a descriptor named Orbital Field Matrix (OFM) for representing material structures in datasets of multi-element materials. The descriptor is based on information regarding atomic valence shell electrons and their coordination. In this work, we develop an extension of OFM called OFM1. We have shown that these descriptors are highly applicable in predicting the physical properties of materials and in providing insights on the materials space by mapping into a low-dimensional embedded space. Our experiments with transition metal/lanthanide metal alloys show that the local magnetic moments and formation energies can be accurately reproduced using simple nearest-neighbor regression, thus confirming the relevance of our descriptors. Using kernel ridge regression, we could accurately reproduce local magnetic moments and formation energies calculated from first principles, with mean absolute errors of 0.03 μB and 0.10 eV/atom, respectively. We show that meaningful low-dimensional representations can be extracted from the original descriptor using descriptive learning algorithms. An intuitive grasp of the materials space, qualitative evaluation of the similarities in local structures or crystalline materials, and inference in the design of new materials by element substitution can be performed effectively based on these low-dimensional representations.
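A minimal sketch of the regression step mentioned above: kernel ridge regression with a cross-validated hyperparameter search, mapping a random stand-in for the OFM descriptor matrix to a synthetic target (the OFM construction itself and the first-principles reference data are not reproduced here).

    import numpy as np
    from sklearn.kernel_ridge import KernelRidge
    from sklearn.model_selection import GridSearchCV, cross_val_score

    rng = np.random.default_rng(0)
    X = rng.random((300, 200))                 # stand-in for flattened OFM descriptors
    y = X[:, :20].sum(axis=1) + 0.05 * rng.standard_normal(300)   # synthetic "formation energy"

    krr = GridSearchCV(KernelRidge(kernel="laplacian"),
                       {"alpha": [1e-3, 1e-2, 1e-1], "gamma": [1e-4, 1e-3, 1e-2]}, cv=5)
    mae = -cross_val_score(krr, X, y, cv=5, scoring="neg_mean_absolute_error")
    print(f"cross-validated MAE: {mae.mean():.3f} (synthetic units)")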
Optical 3D surface digitizing in forensic medicine: 3D documentation of skin and bone injuries.
Thali, Michael J; Braun, Marcel; Dirnhofer, Richard
2003-11-26
The photographic process reduces a three-dimensional (3D) wound to a two-dimensional level. If there is a need for a high-resolution 3D dataset of an object, it needs to be three-dimensionally scanned. No-contact optical 3D digitizing surface scanners can be used as a powerful tool for wound and injury-causing instrument analysis in trauma cases. The documentation of 3D skin wounds and a bone injury using the optical scanner Advanced TOpometric Sensor (ATOS II, GOM International, Switzerland) is demonstrated using two illustrative cases. Using this 3D optical digitizing method, the wounds (the virtual 3D computer model of the skin and the bone injuries) and the virtual 3D model of the injury-causing tool are graphically documented in 3D in real-life size and shape and can be rotated in the CAD program on the computer screen. In addition, the virtual 3D models of the bone injuries and tool can now be compared in a 3D CAD program against one another in virtual space, to see if there are matching areas. Further steps in forensic medicine will be a full 3D surface documentation of the human body and all the forensically relevant injuries using optical 3D scanners.
Factor structure and dimensionality of the two depression scales in STAR*D using level 1 datasets.
Bech, P; Fava, M; Trivedi, M H; Wisniewski, S R; Rush, A J
2011-08-01
The factor structure and dimensionality of the HAM-D(17) and the IDS-C(30) are as yet uncertain, because psychometric analyses of these scales have been performed without a clear separation between factor structure profile and dimensionality (total scores being a sufficient statistic). The first treatment step (Level 1) in the STAR*D study provided a dataset of 4041 outpatients with DSM-IV nonpsychotic major depression. The HAM-D(17) and IDS-C(30) were evaluated by principal component analysis (PCA) without rotation. Mokken analysis tested the unidimensionality of the IDS-C(6), which corresponds to the unidimensional HAM-D(6.) For both the HAM-D(17) and IDS-C(30), PCA identified a bi-directional factor contrasting the depressive symptoms versus the neurovegetative symptoms. The HAM-D(6) and the corresponding IDS-C(6) symptoms all emerged in the depression factor. Both the HAM-D(6) and IDS-C(6) were found to be unidimensional scales, i.e., their total scores are each a sufficient statistic for the measurement of depressive states. STAR*D used only one medication in Level 1. The unidimensional HAM-D(6) and IDS-C(6) should be used when evaluating the pure clinical effect of antidepressive treatment, whereas the multidimensional HAM-D(17) and IDS-C(30) should be considered when selecting antidepressant treatment. Copyright © 2011 Elsevier B.V. All rights reserved.
Improved method for predicting protein fold patterns with ensemble classifiers.
Chen, W; Liu, X; Huang, Y; Jiang, Y; Zou, Q; Lin, C
2012-01-27
Protein folding is recognized as a critical problem in the field of biophysics in the 21st century. Predicting protein-folding patterns is challenging due to the complex structure of proteins. In an attempt to solve this problem, we employed ensemble classifiers to improve prediction accuracy. In our experiments, 188-dimensional features were extracted based on the composition and physical-chemical property of proteins and 20-dimensional features were selected using a coupled position-specific scoring matrix. Compared with traditional prediction methods, these methods were superior in terms of prediction accuracy. The 188-dimensional feature-based method achieved 71.2% accuracy in five cross-validations. The accuracy rose to 77% when we used a 20-dimensional feature vector. These methods were used on recent data, with 54.2% accuracy. Source codes and dataset, together with web server and software tools for prediction, are available at: http://datamining.xmu.edu.cn/main/~cwc/ProteinPredict.html.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Levakhina, Y. M.; Mueller, J.; Buzug, T. M.
Purpose: This paper introduces a nonlinear weighting scheme into the backprojection operation within the simultaneous algebraic reconstruction technique (SART). It is designed for tomosynthesis imaging of objects with high-attenuation features in order to reduce limited angle artifacts. Methods: The algorithm estimates which projections potentially produce artifacts in a voxel. The contribution of those projections into the updating term is reduced. In order to identify those projections automatically, a four-dimensional backprojected space representation is used. Weighting coefficients are calculated based on a dissimilarity measure, evaluated in this space. For each combination of an angular view direction and a voxel position an individual weighting coefficient for the updating term is calculated. Results: The feasibility of the proposed approach is shown based on reconstructions of the following real three-dimensional tomosynthesis datasets: a mammography quality phantom, an apple with metal needles, a dried finger bone in water, and a human hand. Datasets have been acquired with a Siemens Mammomat Inspiration tomosynthesis device and reconstructed using SART with and without suggested weighting. Out-of-focus artifacts are described using line profiles and measured using standard deviation (STD) in the plane and below the plane which contains artifact-causing features. Artifacts distribution in axial direction is measured using an artifact spread function (ASF). The volumes reconstructed with the weighting scheme demonstrate the reduction of out-of-focus artifacts, lower STD (meaning reduction of artifacts), and narrower ASF compared to nonweighted SART reconstruction. It is achieved successfully for different kinds of structures: point-like structures such as phantom features, long structures such as metal needles, and fine structures such as trabecular bone structures. Conclusions: Results indicate the feasibility of the proposed algorithm to reduce typical tomosynthesis artifacts produced by high-attenuation features. The proposed algorithm assigns weighting coefficients automatically and no segmentation or tissue-classification steps are required. The algorithm can be included into various iterative reconstruction algorithms with an additive updating strategy. It can also be extended to the computed tomography case with the complete set of angular data.
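The additive SART update with a per-(view, voxel) weight, which is where the dissimilarity-based coefficients described above enter, can be sketched on a toy system as follows; the system matrix, data and weights are illustrative, and the four-dimensional backprojected-space weight computation of the paper is not reproduced.

    import numpy as np

    def weighted_sart_update(x, A, p, weights, relaxation=0.3):
        """One SART-style iteration. x: flattened volume, A: (n_rays x n_voxels) projection
        matrix, p: measured projections, weights: (n_rays x n_voxels) coefficients that
        down-weight artifact-producing ray/voxel combinations."""
        residual = p - A @ x
        ray_norm = A.sum(axis=1)
        ray_norm[ray_norm == 0] = 1.0                    # avoid division by zero
        numer = (weights * A * (residual / ray_norm)[:, None]).sum(axis=0)
        denom = (weights * A).sum(axis=0)
        denom[denom == 0] = 1.0
        return x + relaxation * numer / denom

    # toy 2-voxel, 3-ray example; uniform weights reduce to standard SART
    A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    x_true = np.array([2.0, 5.0])
    p = A @ x_true
    x, w = np.zeros(2), np.ones_like(A)
    for _ in range(50):
        x = weighted_sart_update(x, A, p, w)
    print(x)                                             # approaches [2, 5]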
Guo, Yang; Liu, Shuhui; Li, Zhanhuai; Shang, Xuequn
2018-04-11
The classification of cancer subtypes is of great importance to cancer disease diagnosis and therapy. Many supervised learning approaches have been applied to cancer subtype classification in the past few years, especially deep learning based approaches. Recently, the deep forest model has been proposed as an alternative to deep neural networks to learn hyper-representations by using cascade ensemble decision trees. It has been shown that the deep forest model has competitive or even better performance than deep neural networks to some extent. However, the standard deep forest model may face overfitting and ensemble diversity challenges when dealing with small sample size and high-dimensional biology data. In this paper, we propose a deep learning model, called BCDForest, to address cancer subtype classification on small-scale biology datasets, which can be viewed as a modification of the standard deep forest model. BCDForest is distinguished from the standard deep forest model by the following two main contributions: First, a method named multi-class-grained scanning is proposed to train multiple binary classifiers to encourage diversity of the ensemble. Meanwhile, the fitting quality of each classifier is considered in representation learning. Second, we propose a boosting strategy to emphasize more important features in cascade forests, and thus to propagate the benefits of discriminative features among cascade layers to improve the classification performance. Systematic comparison experiments on both microarray and RNA-Seq gene expression datasets demonstrate that our method consistently outperforms the state-of-the-art methods in the application of cancer subtype classification. The multi-class-grained scanning and boosting strategy in our model provide an effective solution to ease the overfitting challenge and improve the robustness of the deep forest model working on small-scale data. Our model provides a useful approach to the classification of cancer subtypes by using deep learning on high-dimensional and small-scale biology data.
A reversible-jump Markov chain Monte Carlo algorithm for 1D inversion of magnetotelluric data
NASA Astrophysics Data System (ADS)
Mandolesi, Eric; Ogaya, Xenia; Campanyà, Joan; Piana Agostinetti, Nicola
2018-04-01
This paper presents a new computer code developed to solve the 1D magnetotelluric (MT) inverse problem using a Bayesian trans-dimensional Markov chain Monte Carlo algorithm. MT data are sensitive to the depth-distribution of rock electric conductivity (or its reciprocal, resistivity). The solution provided is a probability distribution - the so-called posterior probability distribution (PPD) for the conductivity at depth, together with the PPD of the interface depths. The PPD is sampled via a reversible-jump Markov Chain Monte Carlo (rjMcMC) algorithm, using a modified Metropolis-Hastings (MH) rule to accept or discard candidate models along the chains. As the optimal parameterization for the inversion process is generally unknown, a trans-dimensional approach is used to allow the dataset itself to indicate the most probable number of parameters needed to sample the PPD. The algorithm is tested against two simulated datasets and a set of MT data acquired in the Clare Basin (County Clare, Ireland). For the simulated datasets the correct number of conductive layers at depth and the associated electrical conductivity values are retrieved, together with reasonable estimates of the uncertainties on the investigated parameters. Results from the inversion of field measurements are compared with results obtained using a deterministic method and with well-log data from a nearby borehole. The PPD is in good agreement with the well-log data, showing as a main structure a highly conductive layer associated with the Clare Shale formation. In this study, we demonstrate that our new code goes beyond algorithms developed using a linear inversion scheme, as it can be used: (1) to by-pass the subjective choices in the 1D parameterizations, i.e. the number of horizontal layers in the 1D parameterization, and (2) to estimate realistic uncertainties on the retrieved parameters. The algorithm is implemented using a simple MPI approach, where independent chains run on isolated CPUs, to take full advantage of parallel computer architectures. In the case of a large number of data, a master/slave approach can be used, where the master CPU samples the parameter space and the slave CPUs compute forward solutions.
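The fixed-dimension Metropolis-Hastings step at the core of such a sampler can be sketched as follows; the trans-dimensional birth/death moves and the actual 1D MT forward solver are omitted, and the placeholder forward model, noise level and step size are assumptions made only so the sketch runs end to end.

    import numpy as np

    rng = np.random.default_rng(42)

    def forward(model):
        # Placeholder for the 1D MT forward response of a layered conductivity model.
        return np.cumsum(model)

    def log_likelihood(model, data, sigma):
        predicted = forward(model)
        return -0.5 * np.sum(((data - predicted) / sigma) ** 2)

    def metropolis_step(model, data, sigma, step_size=0.1):
        proposal = model + step_size * rng.standard_normal(model.size)   # symmetric proposal
        log_alpha = log_likelihood(proposal, data, sigma) - log_likelihood(model, data, sigma)
        if np.log(rng.random()) < log_alpha:             # MH acceptance (flat prior assumed)
            return proposal, True
        return model, False

    true_model = np.array([2.0, 1.0, 3.0])               # e.g. log10 resistivity of 3 layers
    data = forward(true_model) + 0.05 * rng.standard_normal(3)
    model, accepted = np.zeros(3), 0
    for _ in range(5000):
        model, ok = metropolis_step(model, data, sigma=0.05)
        accepted += ok
    print("acceptance rate:", accepted / 5000, "last sample:", model)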
Sepulveda, W; Wong, A E; Viñals, F; Andreeva, E; Adzehova, N; Martinez-Ten, P
2012-02-01
To describe a new ultrasound technique that may be useful for the diagnosis of micrognathia in the first trimester of pregnancy. The retronasal triangle (RNT) view is a technique that captures the coronal plane of the face in which the primary palate and the frontal processes of the maxilla are visualized simultaneously. Normal first-trimester fetuses display a characteristic gap between the right and left body of the mandible in this view (the 'mandibular gap'). The presence or absence of this gap was evaluated and measured prospectively during real-time scanning (n = 154) and retrospectively by analyzing three-dimensional (3D) datasets (n = 50) in normal first-trimester fetuses undergoing screening for aneuploidy at 11-13 weeks' gestation. 3D datasets from 12 fetuses with suspected micrognathia were also collected and examined retrospectively for the same features. The mandibular gap was identified in all 204 normal fetuses and increased linearly with increasing crown-rump length (y = 0.033x + 0.435; R(2) = 0.316), with no statistically significant differences between measurements obtained by two-dimensional ultrasound and 3D offline analysis. Among fetuses with suspected micrognathia, three 3D datasets were excluded from analysis because of poor image quality in one and the diagnosis of a normal chin in two. In the remaining nine fetuses, the mandibular gap was absent and was replaced by a bony structure representing the receding chin in seven (77.8%) cases and was not visualized due to severe retrognathia in the remaining two (22.2%) cases. All fetuses with micrognathia had associated anomalies, including seven with aneuploidy and two with skeletal dysplasia. The RNT view may be a helpful technique for detecting micrognathia in the first trimester. The absence of the mandibular gap or failure to identify the mandible in this view is highly suggestive of micrognathia and should prompt a targeted ultrasound scan to assess for other anomalies. Further research is needed to determine the false-positive and false-negative rates of this technique. Copyright © 2012 ISUOG. Published by John Wiley & Sons, Ltd.
Server-based Approach to Web Visualization of Integrated Three-dimensional Brain Imaging Data
Poliakov, Andrew V.; Albright, Evan; Hinshaw, Kevin P.; Corina, David P.; Ojemann, George; Martin, Richard F.; Brinkley, James F.
2005-01-01
The authors describe a client-server approach to three-dimensional (3-D) visualization of neuroimaging data, which enables researchers to visualize, manipulate, and analyze large brain imaging datasets over the Internet. All computationally intensive tasks are done by a graphics server that loads and processes image volumes and 3-D models, renders 3-D scenes, and sends the renderings back to the client. The authors discuss the system architecture and implementation and give several examples of client applications that allow visualization and analysis of integrated language map data from single and multiple patients. PMID:15561787
Bouhrara, Mustapha; Spencer, Richard G.
2015-01-01
Myelin water fraction (MWF) mapping with magnetic resonance imaging has led to the ability to directly observe myelination and demyelination in both the developing brain and in disease. Multicomponent driven equilibrium single pulse observation of T1 and T2 (mcDESPOT) has been proposed as a rapid approach for multicomponent relaxometry and has been applied to map MWF in human brain. However, even for the simplest two-pool signal model consisting of MWF and non-myelin-associated water, the dimensionality of the parameter space for obtaining MWF estimates remains high. This renders parameter estimation difficult, especially at low-to-moderate signal-to-noise ratios (SNR), due to the presence of local minima and the flatness of the fit residual energy surface used for parameter determination using conventional nonlinear least squares (NLLS)-based algorithms. In this study, we introduce three Bayesian approaches for analysis of the mcDESPOT signal model to determine MWF. Given the high dimensional nature of mcDESPOT signal model, and, thereby, the high dimensional marginalizations over nuisance parameters needed to derive the posterior probability distribution of MWF parameter, the introduced Bayesian analyses use different approaches to reduce the dimensionality of the parameter space. The first approach uses normalization by average signal amplitude, and assumes that noise can be accurately estimated from signal-free regions of the image. The second approach likewise uses average amplitude normalization, but incorporates a full treatment of noise as an unknown variable through marginalization. The third approach does not use amplitude normalization and incorporates marginalization over both noise and signal amplitude. Through extensive Monte Carlo numerical simulations and analysis of in-vivo human brain datasets exhibiting a range of SNR and spatial resolution, we demonstrated the markedly improved accuracy and precision in the estimation of MWF using these Bayesian methods as compared to the stochastic region contraction (SRC) implementation of NLLS. PMID:26499810
Effects of VR system fidelity on analyzing isosurface visualization of volume datasets.
Laha, Bireswar; Bowman, Doug A; Socha, John J
2014-04-01
Volume visualization is an important technique for analyzing datasets from a variety of different scientific domains. Volume data analysis is inherently difficult because volumes are three-dimensional, dense, and unfamiliar, requiring scientists to precisely control the viewpoint and to make precise spatial judgments. Researchers have proposed that more immersive (higher fidelity) VR systems might improve task performance with volume datasets, and significant results tied to different components of display fidelity have been reported. However, more information is needed to generalize these results to different task types, domains, and rendering styles. We visualized isosurfaces extracted from synchrotron microscopic computed tomography (SR-μCT) scans of beetles, in a CAVE-like display. We ran a controlled experiment evaluating the effects of three components of system fidelity (field of regard, stereoscopy, and head tracking) on a variety of abstract task categories that are applicable to various scientific domains, and also compared our results with those from our prior experiment using 3D texture-based rendering. We report many significant findings. For example, for search and spatial judgment tasks with isosurface visualization, a stereoscopic display provides better performance, but for tasks with 3D texture-based rendering, displays with higher field of regard were more effective, independent of the levels of the other display components. We also found that systems with high field of regard and head tracking improve performance in spatial judgment tasks. Our results extend existing knowledge and produce new guidelines for designing VR systems to improve the effectiveness of volume data analysis.
The Kalman Filter and High Performance Computing at NASA's Data Assimilation Office (DAO)
NASA Technical Reports Server (NTRS)
Lyster, Peter M.
1999-01-01
Atmospheric data assimilation is a method of combining actual observations with model simulations to produce a more accurate description of the earth system than the observations alone provide. The output of data assimilation, sometimes called "the analysis", is a set of accurate, regular, gridded datasets of observed and unobserved variables. This is used not only for weather forecasting but is becoming increasingly important for climate research. For example, these datasets may be used to retrospectively assess energy budgets or the effects of trace gases such as ozone. This allows researchers to understand processes driving weather and climate, which have important scientific and policy implications. The primary goal of NASA's Data Assimilation Office (DAO) is to provide datasets for climate research and to support NASA satellite and aircraft missions. This presentation will: (1) describe ongoing work on the advanced Kalman/Lagrangian filter parallel algorithm for the assimilation of trace gases in the stratosphere; and (2) discuss the Kalman filter in relation to other presentations from the DAO on Four Dimensional Data Assimilation at this meeting. Although the designation "Kalman filter" is often used to describe the overarching work, the series of talks will show that the scientific software and the kind of parallelization techniques that are being developed at the DAO are very different depending on the type of problem being considered, the extent to which the problem is mission critical, and the degree of Software Engineering that has to be applied.
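For readers unfamiliar with the analysis step, the textbook Kalman filter predict/update cycle is sketched below on a two-variable toy state; the operational DAO system works with vastly larger state vectors and a Kalman/Lagrangian formulation, so this is only the core idea, with all matrices chosen for illustration.

    import numpy as np

    def kalman_step(x, P, z, F, H, Q, R):
        # Predict: propagate the state estimate and covariance with the model F.
        x_pred = F @ x
        P_pred = F @ P @ F.T + Q
        # Update: blend the prediction with observation z via the Kalman gain.
        S = H @ P_pred @ H.T + R
        K = P_pred @ H.T @ np.linalg.inv(S)
        x_new = x_pred + K @ (z - H @ x_pred)
        P_new = (np.eye(P.shape[0]) - K @ H) @ P_pred
        return x_new, P_new

    # toy "atmosphere": state = (value, tendency); only the value is observed
    F = np.array([[1.0, 1.0], [0.0, 1.0]])
    H = np.array([[1.0, 0.0]])
    Q, R = 0.01 * np.eye(2), np.array([[0.25]])
    x, P = np.zeros(2), np.eye(2)
    for z in [1.1, 2.0, 2.9, 4.2, 5.0]:                  # synthetic observations
        x, P = kalman_step(x, P, np.array([z]), F, H, Q, R)
    print("analysis state:", x)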
FHSA-SED: Two-Locus Model Detection for Genome-Wide Association Study with Harmony Search Algorithm.
Tuo, Shouheng; Zhang, Junying; Yuan, Xiguo; Zhang, Yuanyuan; Liu, Zhaowen
2016-01-01
Two-locus models are typical significant disease models to be identified in genome-wide association studies (GWAS). Due to the intensive computational burden and the diversity of disease models, existing methods suffer from low detection power, high computation cost, and a preference for some types of disease models. In this study, two scoring functions (a Bayesian network based K2-score and a Gini-score) are used for characterizing a pair of SNP loci as a candidate model; the two criteria are adopted simultaneously to improve identification power and to tackle the preference problem towards disease models. The harmony search algorithm (HSA) is improved for quickly finding the most likely candidate models among all two-locus models, in which a local search algorithm with a two-dimensional tabu table is presented to avoid repeatedly evaluating some disease models that have a strong marginal effect. Finally, the G-test statistic is used to further test the candidate models. We investigate our method, named FHSA-SED, on 82 simulated datasets and a real AMD dataset, and compare it with two typical methods (MACOED and CSE) which have been developed recently based on swarm intelligence search algorithms. The results of the simulation experiments indicate that our method outperforms the two compared algorithms in terms of detection power, computation time, evaluation times, sensitivity (TPR), specificity (SPC), positive predictive value (PPV) and accuracy (ACC). Our method has identified two SNPs (rs3775652 and rs10511467) that may also be associated with disease in the AMD dataset.
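The final G-test step mentioned above amounts to a log-likelihood-ratio test on a contingency table; a minimal illustration on a hypothetical genotype-by-status table is given below (scipy's log-likelihood option of chi2_contingency computes the G statistic).

    import numpy as np
    from scipy.stats import chi2_contingency

    # rows: genotype classes, columns: case/control counts (toy numbers)
    table = np.array([[120,  80],
                      [ 90, 110],
                      [ 40,  60]])
    g_stat, p_value, dof, expected = chi2_contingency(table, lambda_="log-likelihood")
    print(f"G = {g_stat:.2f}, dof = {dof}, p = {p_value:.4f}")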
FHSA-SED: Two-Locus Model Detection for Genome-Wide Association Study with Harmony Search Algorithm
Tuo, Shouheng; Zhang, Junying; Yuan, Xiguo; Zhang, Yuanyuan; Liu, Zhaowen
2016-01-01
Motivation Two-locus models are typical significant disease models to be identified in genome-wide association studies (GWAS). Due to the intensive computational burden and the diversity of disease models, existing methods suffer from low detection power, high computation cost, and a preference for some types of disease models. Method In this study, two scoring functions (a Bayesian network based K2-score and a Gini-score) are used for characterizing a pair of SNP loci as a candidate model; the two criteria are adopted simultaneously to improve identification power and to tackle the preference problem towards disease models. The harmony search algorithm (HSA) is improved for quickly finding the most likely candidate models among all two-locus models, in which a local search algorithm with a two-dimensional tabu table is presented to avoid repeatedly evaluating some disease models that have a strong marginal effect. Finally, the G-test statistic is used to further test the candidate models. Results We investigate our method, named FHSA-SED, on 82 simulated datasets and a real AMD dataset, and compare it with two typical methods (MACOED and CSE) which have been developed recently based on swarm intelligence search algorithms. The results of the simulation experiments indicate that our method outperforms the two compared algorithms in terms of detection power, computation time, evaluation times, sensitivity (TPR), specificity (SPC), positive predictive value (PPV) and accuracy (ACC). Our method has identified two SNPs (rs3775652 and rs10511467) that may also be associated with disease in the AMD dataset. PMID:27014873
EarthServer: Visualisation and use of uncertainty as a data exploration tool
NASA Astrophysics Data System (ADS)
Walker, Peter; Clements, Oliver; Grant, Mike
2013-04-01
The Ocean Science/Earth Observation community generates huge datasets from satellite observation. Until recently it has been difficult to obtain matching uncertainty information for these datasets and to apply this to their processing. In order to make use of uncertainty information when analysing "Big Data" we need both the uncertainty itself (attached to the underlying data) and a means of working with the combined product without requiring the entire dataset to be downloaded. The European Commission FP7 project EarthServer (http://earthserver.eu) is addressing the problem of accessing and ad-hoc analysis of extreme-size Earth Science data using cutting-edge Array Database technology. The core software (Rasdaman) and web services wrapper (Petascope) allow huge datasets to be accessed using Open Geospatial Consortium (OGC) standard interfaces including the well established standards, Web Coverage Service (WCS) and Web Map Service (WMS) as well as the emerging standard, Web Coverage Processing Service (WCPS). The WCPS standard allows the running of ad-hoc queries on any of the data stored within Rasdaman, creating an infrastructure where users are not restricted by bandwidth when manipulating or querying huge datasets. The ESA Ocean Colour - Climate Change Initiative (OC-CCI) project (http://www.esa-oceancolour-cci.org/), is producing high-resolution, global ocean colour datasets over the full time period (1998-2012) where high quality observations were available. This climate data record includes per-pixel uncertainty data for each variable, based on an analytic method that classifies how much and which types of water are present in a pixel, and assigns uncertainty based on robust comparisons to global in-situ validation datasets. These uncertainty values take two forms, Root Mean Square (RMS) and Bias uncertainty, respectively representing the expected variability and expected offset error. By combining the data produced through the OC-CCI project with the software from the EarthServer project we can produce a novel data offering that allows the use of traditional exploration and access mechanisms such as WMS and WCS. However the real benefits can be seen when utilising WCPS to explore the data . We will show two major benefits to this infrastructure. Firstly we will show that the visualisation of the combined chlorophyll and uncertainty datasets through a web based GIS portal gives users the ability to instantaneously assess the quality of the data they are exploring using traditional web based plotting techniques as well as through novel web based 3 dimensional visualisation. Secondly we will showcase the benefits available when combining these data with the WCPS standard. The uncertainty data can be utilised in queries using the standard WCPS query language. This allows selection of data either for download or use within the query, based on the respective uncertainty values as well as the possibility of incorporating both the chlorophyll data and uncertainty data into complex queries to produce additional novel data products. By filtering with uncertainty at the data source rather than the client we can minimise traffic over the network allowing huge datasets to be worked on with a minimal time penalty.
Relevant Feature Set Estimation with a Knock-out Strategy and Random Forests
Ganz, Melanie; Greve, Douglas N.; Fischl, Bruce; Konukoglu, Ender
2015-01-01
Group analysis of neuroimaging data is a vital tool for identifying anatomical and functional variations related to diseases as well as normal biological processes. The analyses are often performed on a large number of highly correlated measurements using a relatively smaller number of samples. Despite the correlation structure, the most widely used approach is to analyze the data using univariate methods followed by post-hoc corrections that try to account for the data’s multivariate nature. Although widely used, this approach may fail to recover from the adverse effects of the initial analysis when local effects are not strong. Multivariate pattern analysis (MVPA) is a powerful alternative to the univariate approach for identifying relevant variations. Jointly analyzing all the measures, MVPA techniques can detect global effects even when individual local effects are too weak to detect with univariate analysis. Current approaches are successful in identifying variations that yield highly predictive and compact models. However, they suffer from lessened sensitivity and instabilities in identification of relevant variations. Furthermore, current methods’ user-defined parameters are often unintuitive and difficult to determine. In this article, we propose a novel MVPA method for group analysis of high-dimensional data that overcomes the drawbacks of the current techniques. Our approach explicitly aims to identify all relevant variations using a “knock-out” strategy and the Random Forest algorithm. In evaluations with synthetic datasets the proposed method achieved substantially higher sensitivity and accuracy than the state-of-the-art MVPA methods, and outperformed the univariate approach when the effect size is low. In experiments with real datasets the proposed method identified regions beyond the univariate approach, while other MVPA methods failed to replicate the univariate results. More importantly, in a reproducibility study with the well-known ADNI dataset the proposed method yielded higher stability and power than the univariate approach. PMID:26272728
Accuracy of Digitally Fabricated Wax Denture Bases and Conventional Completed Complete Dentures.
Stawarczyk, Bogna; Lümkemann, Nina; Eichberger, Marlis; Wimmer, Timea
2017-12-19
The purpose of this investigation was to analyze the accuracy of digitally fabricated wax trial dentures and conventionally finalized complete dentures in comparison to a surface tessellation language (STL)-dataset. A generated data set for the denture bases and the tooth sockets was used, converted into STL-format, and saved as reference. Five mandibular and 5 maxillary denture bases were milled from wax blanks and denture teeth were waxed into their tooth sockets. Each complete denture was checked on fit, waxed onto the dental cast, and digitized using an optical laboratory scanning device. The complete dentures were completed conventionally using the injection method, finished, and scanned. The resulting STL-datasets were exported into the three-dimensional (3D) software GOM Inspect. Each of the 5 mandibular and 5 maxillary complete dentures was aligned with the STL- and the wax trial denture dataset. Alignment was performed based on a best-fit algorithm. A three-dimensional analysis of the spatial divergences in the x-, y-, and z-axes was performed by the 3D software and visualized in a color-coded illustration. The mean positive and negative deviations between the datasets were calculated automatically. In a direct comparison between maxillary wax trial dentures and complete dentures, complete dentures showed higher deviations from the STL-dataset than the wax trial dentures. The deviations occurred in the area of the teeth as well as in the distal area of the denture bases. In contrast, the highest deviations in both the mandibular wax trial dentures and the mandibular complete dentures were observed in the distal area. The complete dentures showed higher deviations on the occlusal surfaces of the teeth compared to the wax dentures. Computer-aided design/computer-aided manufacturing (CAD/CAM)-fabricated wax dentures exhibited fewer deviations from the STL-reference than the complete dentures. The deviations were significantly greater in the vicinity of the denture teeth area and the bases. The conventional transfer of CAD/CAM-fabricated wax dentures into acrylic resin leads to the highest deviations from the STL-reference.
Robust L1-norm two-dimensional linear discriminant analysis.
Li, Chun-Na; Shao, Yuan-Hai; Deng, Nai-Yang
2015-05-01
In this paper, we propose an L1-norm two-dimensional linear discriminant analysis (L1-2DLDA) with robust performance. Different from the conventional two-dimensional linear discriminant analysis with L2-norm (L2-2DLDA), where the optimization problem is transferred to a generalized eigenvalue problem, the optimization problem in our L1-2DLDA is solved by a simple justifiable iterative technique, and its convergence is guaranteed. Compared with L2-2DLDA, our L1-2DLDA is more robust to outliers and noise since the L1-norm is used. This is supported by our preliminary experiments on a toy example and on face datasets, which show the improvement of our L1-2DLDA over L2-2DLDA. Copyright © 2015 Elsevier Ltd. All rights reserved.
NASA Technical Reports Server (NTRS)
Dunagan, Stephen E.; Norman, Thomas R.
1987-01-01
A wind tunnel experiment simulating a steady three-dimensional helicopter rotor blade/vortex interaction is reported. The experimental configuration consisted of a vertical semispan vortex-generating wing, mounted upstream of a horizontal semispan rotor blade airfoil. A three-dimensional laser velocimeter was used to measure the velocity field in the region of the blade. Sectional lift coefficients were calculated by integrating the velocity field to obtain the bound vorticity. Total lift values, obtained by using an internal strain-gauge balance, verified the laser velocimeter data. Parametric variations of vortex strength, rotor blade angle of attack, and vortex position relative to the rotor blade were explored. These data are reported (with attention to experimental limitations) to provide a dataset for the validation of analytical work.
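The lift-from-circulation reduction described above can be sketched numerically: integrate the measured velocity field around a closed contour enclosing the blade section to obtain the bound circulation, then convert it to a sectional lift coefficient via the Kutta-Joukowski relation c_l = 2*Gamma/(U_inf*c). The velocity samples, contour radius, vortex strength, freestream speed and chord below are all synthetic stand-ins for the laser-velocimeter data.

    import numpy as np

    def circulation(points, velocities):
        """Discrete line integral of v . dl around a closed contour (points in order)."""
        gamma, n = 0.0, len(points)
        for k in range(n):
            dl = points[(k + 1) % n] - points[k]
            v_mid = 0.5 * (velocities[(k + 1) % n] + velocities[k])
            gamma += np.dot(v_mid, dl)
        return gamma

    U_inf, chord = 50.0, 0.2                               # freestream speed (m/s), chord (m)
    theta = np.linspace(0, 2 * np.pi, 200, endpoint=False)
    contour = 0.3 * np.stack([np.cos(theta), np.sin(theta)], axis=1)   # circle around the section
    # synthetic velocity field: freestream plus a point vortex of strength 2.0 m^2/s at the origin
    r2 = (contour ** 2).sum(axis=1)
    v = np.stack([U_inf - 2.0 / (2 * np.pi) * contour[:, 1] / r2,
                  2.0 / (2 * np.pi) * contour[:, 0] / r2], axis=1)
    gamma = circulation(contour, v)
    c_l = 2 * gamma / (U_inf * chord)
    print(f"Gamma = {gamma:.3f} m^2/s, c_l = {c_l:.3f}")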
User's manual for Interactive Data Display System (IDDS)
NASA Technical Reports Server (NTRS)
Stegeman, James D.
1992-01-01
A computer graphics package for the visualization of three-dimensional flow in turbomachinery has been developed and tested. This graphics package, called IDDS (Interactive Data Display System), is able to 'unwrap' the volumetric data cone associated with a centrifugal compressor and display the results in an easy to understand two-dimensional manner. IDDS will provide the majority of the visualization and analysis capability for the ICE (Integrated CFD and Experiment) system. This document is intended to serve as a user's manual for IDDS in a stand-alone mode. Currently, IDDS is capable of plotting two- or three-dimensional simulation data, but work is under way to expand IDDS so that experimental data can be accepted, plotted, and compared with a simulation dataset of the actual hardware being tested.
Xarray: multi-dimensional data analysis in Python
NASA Astrophysics Data System (ADS)
Hoyer, Stephan; Hamman, Joe; Maussion, Fabien
2017-04-01
xarray (http://xarray.pydata.org) is an open source project and Python package that provides a toolkit and data structures for N-dimensional labeled arrays, which are the bread and butter of modern geoscientific data analysis. Key features of the package include label-based indexing and arithmetic, interoperability with the core scientific Python packages (e.g., pandas, NumPy, Matplotlib, Cartopy), out-of-core computation on datasets that don't fit into memory, a wide range of input/output options, and advanced multi-dimensional data manipulation tools such as group-by and resampling. In this contribution we will present the key features of the library and demonstrate its great potential for a wide range of applications, from (big-)data processing on super computers to data exploration in front of a classroom.
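A short usage example of the features listed above (labeled dimensions, group-by, resampling, broadcasting by name); the dataset is synthetic.

    import numpy as np
    import pandas as pd
    import xarray as xr

    rng = np.random.default_rng(0)
    time = pd.date_range("2016-01-01", periods=365, freq="D")
    data = xr.DataArray(
        20 + 10 * np.sin(2 * np.pi * np.arange(365) / 365)[:, None, None]
        + rng.standard_normal((365, 4, 5)),
        dims=("time", "lat", "lon"),
        coords={"time": time, "lat": np.linspace(-45, 45, 4), "lon": np.linspace(0, 320, 5)},
        name="temperature",
    )

    monthly_mean = data.resample(time="MS").mean()          # label-based resampling
    seasonal_cycle = data.groupby("time.season").mean()     # group-by on a datetime accessor
    zonal_anomaly = data - data.mean(dim="lon")             # broadcasting by dimension name
    print(monthly_mean.sizes, seasonal_cycle.sizes, zonal_anomaly.sizes)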
Wang, Shunfang; Liu, Shuhui
2015-12-19
An effective representation of a protein sequence plays a crucial role in protein sub-nuclear localization. The existing representations, such as dipeptide composition (DipC), pseudo-amino acid composition (PseAAC) and the position specific scoring matrix (PSSM), are insufficient to represent a protein sequence due to their single perspectives. Thus, this paper proposes two fusion feature representations, DipPSSM and PseAAPSSM, to integrate PSSM with DipC and PseAAC, respectively. When constructing each fusion representation, we introduce balance factors to weight the importance of its components. The optimal values of the balance factors are sought by a genetic algorithm. Due to the high dimensionality of the proposed representations, linear discriminant analysis (LDA) is used to find their important low-dimensional structure, which is essential for classification and location prediction. The numerical experiments on two public datasets with a KNN classifier and cross-validation tests showed that, in terms of the common indexes of sensitivity, specificity, accuracy and MCC, the proposed fusion representations outperform the traditional representations in protein sub-nuclear localization, and the representations treated by LDA outperform the untreated ones.
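A generic scikit-learn sketch of the final pipeline described above (LDA-based dimensionality reduction followed by a KNN classifier under cross-validation); the DipPSSM/PseAAPSSM feature extraction is not reproduced, so random features and hypothetical class labels stand in.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.standard_normal((300, 420))      # stand-in for fused sequence descriptors
    y = rng.integers(0, 4, size=300)         # four hypothetical sub-nuclear locations

    model = make_pipeline(LinearDiscriminantAnalysis(n_components=3),
                          KNeighborsClassifier(n_neighbors=5))
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"cross-validated accuracy: {scores.mean():.2f}")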
Saini, Sanjay; Zakaria, Nordin; Rambli, Dayang Rohaya Awang; Sulaiman, Suziah
2015-01-01
The high-dimensional search space involved in markerless full-body articulated human motion tracking from multiple-view video sequences has led to a number of solutions based on metaheuristics, the most recent form of which is Particle Swarm Optimization (PSO). However, the classical PSO suffers from premature convergence and is easily trapped in local optima, significantly affecting the tracking accuracy. To overcome these drawbacks, we have developed a method for the problem based on Hierarchical Multi-Swarm Cooperative Particle Swarm Optimization (H-MCPSO). The tracking problem is formulated as a non-linear 34-dimensional function optimization problem where the fitness function quantifies the difference between the observed image and a projection of the model configuration. Both the silhouette and edge likelihoods are used in the fitness function. Experiments using the Brown and HumanEva-II datasets demonstrated that H-MCPSO performs better than two leading alternative approaches - Annealed Particle Filter (APF) and Hierarchical Particle Swarm Optimization (HPSO). Further, the proposed tracking method is capable of automatic initialization and self-recovery from temporary tracking failures. Comprehensive experimental results are presented to support the claims.
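For readers unfamiliar with the underlying metaheuristic, the following is a minimal generic particle swarm optimizer over a 34-dimensional search space; it is not the hierarchical multi-swarm variant (H-MCPSO) of the paper, and the quadratic toy fitness stands in for the silhouette/edge image likelihood.

```python
import numpy as np

def pso(fitness, dim=34, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal particle swarm optimizer (minimization)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, size=(n_particles, dim))   # particle positions
    v = np.zeros_like(x)                              # particle velocities
    pbest, pbest_f = x.copy(), np.array([fitness(p) for p in x])
    g = pbest[np.argmin(pbest_f)].copy()              # global best position
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = x + v
        f = np.array([fitness(p) for p in x])
        improved = f < pbest_f
        pbest[improved], pbest_f[improved] = x[improved], f[improved]
        g = pbest[np.argmin(pbest_f)].copy()
    return g, pbest_f.min()

# Toy stand-in for the image-likelihood fitness of the tracking problem.
best, best_f = pso(lambda p: np.sum(p ** 2))
print(best_f)
```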
Multi-Dimensional Pattern Discovery of Trajectories Using Contextual Information
NASA Astrophysics Data System (ADS)
Sharif, M.; Alesheikh, A. A.
2017-10-01
The movement of point objects is highly sensitive to the underlying situations and conditions during the movement, which are known as contexts. Analyzing movement patterns, while accounting for the contextual information, helps to better understand how point objects behave in various contexts and how contexts affect their trajectories. One potential solution for discovering moving-object patterns is analyzing the similarities of their trajectories. This article, therefore, contextualizes the similarity measure of trajectories by not only their spatial footprints but also a notion of internal and external contexts. The dynamic time warping (DTW) method is employed to assess the multi-dimensional similarities of trajectories. Then, the results of similarity searches are utilized in discovering the relative movement patterns of the moving point objects. Several experiments are conducted on real datasets that were obtained from commercial airplanes and the weather information during the flights. The results demonstrated the robustness of the DTW method in quantifying the commonalities of trajectories and discovering movement patterns with 80% accuracy. Moreover, the results revealed the importance of exploiting contextual information because it can both enhance and restrict movements.
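A minimal dynamic-programming implementation of the DTW distance used above is sketched below for 1-D sequences; the paper's multi-dimensional, context-aware similarity is not reproduced, and the toy trajectories are synthetic.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW between two 1-D sequences."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Two toy trajectories (e.g., one attribute of a flight track) sampled at different rates.
t1 = np.sin(np.linspace(0, 3, 50))
t2 = np.sin(np.linspace(0, 3, 70) + 0.1)
print(dtw_distance(t1, t2))
```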
Real-Time Three-Dimensional Cell Segmentation in Large-Scale Microscopy Data of Developing Embryos.
Stegmaier, Johannes; Amat, Fernando; Lemon, William C; McDole, Katie; Wan, Yinan; Teodoro, George; Mikut, Ralf; Keller, Philipp J
2016-01-25
We present the Real-time Accurate Cell-shape Extractor (RACE), a high-throughput image analysis framework for automated three-dimensional cell segmentation in large-scale images. RACE is 55-330 times faster and 2-5 times more accurate than state-of-the-art methods. We demonstrate the generality of RACE by extracting cell-shape information from entire Drosophila, zebrafish, and mouse embryos imaged with confocal and light-sheet microscopes. Using RACE, we automatically reconstructed cellular-resolution tissue anisotropy maps across developing Drosophila embryos and quantified differences in cell-shape dynamics in wild-type and mutant embryos. We furthermore integrated RACE with our framework for automated cell lineaging and performed joint segmentation and cell tracking in entire Drosophila embryos. RACE processed these terabyte-sized datasets on a single computer within 1.4 days. RACE is easy to use, as it requires adjustment of only three parameters, takes full advantage of state-of-the-art multi-core processors and graphics cards, and is available as open-source software for Windows, Linux, and Mac OS. Copyright © 2016 Elsevier Inc. All rights reserved.
Wang, Shunfang; Liu, Shuhui
2015-01-01
An effective representation of a protein sequence plays a crucial role in protein sub-nuclear localization. The existing representations, such as dipeptide composition (DipC), pseudo-amino acid composition (PseAAC) and position specific scoring matrix (PSSM), are insufficient to represent protein sequence due to their single perspectives. Thus, this paper proposes two fusion feature representations of DipPSSM and PseAAPSSM to integrate PSSM with DipC and PseAAC, respectively. When constructing each fusion representation, we introduce the balance factors to value the importance of its components. The optimal values of the balance factors are sought by genetic algorithm. Due to the high dimensionality of the proposed representations, linear discriminant analysis (LDA) is used to find its important low dimensional structure, which is essential for classification and location prediction. The numerical experiments on two public datasets with KNN classifier and cross-validation tests showed that in terms of the common indexes of sensitivity, specificity, accuracy and MCC, the proposed fusing representations outperform the traditional representations in protein sub-nuclear localization, and the representation treated by LDA outperforms the untreated one. PMID:26703574
NASA Astrophysics Data System (ADS)
Gillespie, D.; La Pensée, A.; Cooper, M.
2013-07-01
Three dimensional (3D) laser scanning is an important documentation technique for cultural heritage. This technology has been adopted from the engineering and aeronautical industry and is an invaluable tool for the documentation of objects within museum collections (La Pensée, 2008). The datasets created via close range laser scanning are extremely accurate and the created 3D dataset allows for a more detailed analysis in comparison to other documentation technologies such as photography. The dataset can be used for a range of different applications including: documentation; archiving; surface monitoring; replication; gallery interactives; educational sessions; conservation and visualization. However, the novel nature of a 3D dataset presents a rather unique challenge with respect to its sharing and dissemination. This is in part due to the need for specialised 3D software and a supported graphics card to display high-resolution 3D models. This can be detrimental to one of the main goals of cultural institutions, which is to share knowledge and enable activities such as research, education and entertainment. This has limited the presentation of 3D models of cultural heritage objects to mainly either images or videos. Yet with recent developments in computer graphics, increased internet speed and emerging technologies such as Adobe's Stage 3D (Adobe, 2013) and WebGL (Khronos, 2013), it is now possible to share a dataset directly within a webpage. This allows website visitors to interact with the 3D dataset and explore every angle of the object, gaining an insight into its shape and nature. This can be very important considering that it is difficult to offer the same level of understanding of the object through the use of traditional media such as photographs and videos. Yet this presents a range of problems: this is a very novel experience and very few people have engaged with 3D objects outside of 3D software packages or games. This paper presents results of research that aims to provide a methodology for museums and cultural institutions for prototyping a 3D viewer within a webpage, thereby not only allowing institutions to promote their collections via the internet but also providing a tool for users to engage in a meaningful way with cultural heritage datasets. The design process encompasses evaluation as the central part of the design methodology, focusing on how slight changes to navigation, object engagement and aesthetic appearance can influence the user's experience. The prototype used in this paper was created using WebGL with the Three.Js (Three.JS, 2013) library, and datasets were loaded in the OpenCTM (Geelnard, 2010) file format. The overall design is centred on creating an easy-to-learn interface allowing non-skilled users to interact with the datasets, while also providing tools allowing skilled users to discover more about the cultural heritage object. User testing was carried out, allowing users to interact with 3D datasets within the interactive viewer. The results are analysed and the insights learned are discussed in relation to an interface designed to interact with 3D content. The results will lead to the design of interfaces for interacting with 3D objects, which allow both skilled and non-skilled users to engage with 3D cultural heritage objects in a meaningful way.
Foundational Principles for Large-Scale Inference: Illustrations Through Correlation Mining.
Hero, Alfred O; Rajaratnam, Bala
2016-01-01
When can reliable inference be drawn in the "Big Data" context? This paper presents a framework for answering this fundamental question in the context of correlation mining, with implications for general large scale inference. In large scale data applications like genomics, connectomics, and eco-informatics, the dataset is often variable-rich but sample-starved: a regime where the number n of acquired samples (statistical replicates) is far fewer than the number p of observed variables (genes, neurons, voxels, or chemical constituents). Much of recent work has focused on understanding the computational complexity of proposed methods for "Big Data". Sample complexity, however, has received relatively less attention, especially in the setting when the sample size n is fixed, and the dimension p grows without bound. To address this gap, we develop a unified statistical framework that explicitly quantifies the sample complexity of various inferential tasks. Sampling regimes can be divided into several categories: 1) the classical asymptotic regime where the variable dimension is fixed and the sample size goes to infinity; 2) the mixed asymptotic regime where both variable dimension and sample size go to infinity at comparable rates; 3) the purely high dimensional asymptotic regime where the variable dimension goes to infinity and the sample size is fixed. Each regime has its niche but only the latter regime applies to exascale data dimension. We illustrate this high dimensional framework for the problem of correlation mining, where it is the matrix of pairwise and partial correlations among the variables that are of interest. Correlation mining arises in numerous applications and subsumes the regression context as a special case. We demonstrate various regimes of correlation mining based on the unifying perspective of high dimensional learning rates and sample complexity for different structured covariance models and different inference tasks.
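The sample-starved regime described above (n far smaller than p) can be illustrated with a small sketch that computes and thresholds a pairwise sample correlation matrix on pure noise, showing how spuriously large correlations arise; the threshold and dimensions are illustrative only and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 2000           # sample-starved: far fewer replicates than variables
X = rng.standard_normal((n, p))

# Pairwise sample correlation matrix (p x p) and a simple hard threshold.
R = np.corrcoef(X, rowvar=False)
np.fill_diagonal(R, 0.0)
rho = 0.6                 # screening threshold; with n << p even pure noise
hits = np.argwhere(np.abs(R) > rho)   # produces spuriously large correlations
print("pairs exceeding threshold under pure noise:", len(hits) // 2)
```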
Jang, Dae -Heung; Anderson-Cook, Christine Michaela
2016-11-22
With many predictors in regression, fitting the full model can induce multicollinearity problems. The Least Absolute Shrinkage and Selection Operator (LASSO) is useful when the effects of many explanatory variables are sparse in a high-dimensional dataset. Influential points can have a disproportionate impact on the estimated values of model parameters. This paper describes a new influence plot that can be used to increase understanding of the contributions of individual observations and the robustness of results. This can serve as a complement to other regression diagnostic techniques in the LASSO regression setting. Using this influence plot, we can find influential points and their impact on shrinkage of model parameters and model selection. Lastly, we provide two examples to illustrate the methods.
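The paper's influence plot itself is not reproduced here; the sketch below only illustrates one related ingredient, refitting a LASSO with each observation deleted in turn and recording how far the coefficient vector moves, on synthetic data with an assumed penalty level.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 60, 100
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:5] = 2.0            # sparse true effects
y = X @ beta + rng.standard_normal(n)

full = Lasso(alpha=0.1).fit(X, y)

# Case-deletion influence: how much the coefficient vector shifts when one point is dropped.
influence = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    fit_i = Lasso(alpha=0.1).fit(X[keep], y[keep])
    influence[i] = np.linalg.norm(fit_i.coef_ - full.coef_)

print("most influential observations:", np.argsort(influence)[-5:])
```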
Marchi, S; Bonora, M; Patergnani, S; Giorgi, C; Pinton, P
2017-01-01
It is widely acknowledged that mitochondria are highly active structures that rapidly respond to cellular and environmental perturbations by changing their shape, number, and distribution. Mitochondrial remodeling is a key component of diverse biological processes, ranging from cell cycle progression to autophagy. In this chapter, we describe different methodologies for the morphological study of the mitochondrial network. Instructions are given for the preparation of samples for fluorescent microscopy, based on genetically encoded strategies or the employment of synthetic fluorescent dyes. We also propose detailed protocols to analyze mitochondrial morphometric parameters from both three-dimensional and bidimensional datasets. Finally, we describe a protocol for the visualization and quantification of mitochondrial structures through electron microscopy. © 2017 Elsevier Inc. All rights reserved.
Model-Averaged ℓ1 Regularization using Markov Chain Monte Carlo Model Composition
Fraley, Chris; Percival, Daniel
2014-01-01
Bayesian Model Averaging (BMA) is an effective technique for addressing model uncertainty in variable selection problems. However, current BMA approaches have computational difficulty dealing with data in which there are many more measurements (variables) than samples. This paper presents a method for combining ℓ1 regularization and Markov chain Monte Carlo model composition techniques for BMA. By treating the ℓ1 regularization path as a model space, we propose a method to resolve the model uncertainty issues arising in model averaging from solution path point selection. We show that this method is computationally and empirically effective for regression and classification in high-dimensional datasets. We apply our technique in simulations, as well as to some applications that arise in genomics. PMID:25642001
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jang, Dae -Heung; Anderson-Cook, Christine Michaela
With many predictors in regression, fitting the full model can induce multicollinearity problems. The Least Absolute Shrinkage and Selection Operator (LASSO) is useful when the effects of many explanatory variables are sparse in a high-dimensional dataset. Influential points can have a disproportionate impact on the estimated values of model parameters. This paper describes a new influence plot that can be used to increase understanding of the contributions of individual observations and the robustness of results. This can serve as a complement to other regression diagnostic techniques in the LASSO regression setting. Using this influence plot, we can find influential points and their impact on shrinkage of model parameters and model selection. Lastly, we provide two examples to illustrate the methods.
Particle Swarm Optimization approach to defect detection in armour ceramics.
Kesharaju, Manasa; Nagarajah, Romesh
2017-03-01
In this research, various extracted features were used in the development of an automated ultrasonic sensor-based inspection system that enables defect classification in each ceramic component prior to despatch to the field. Classification is an important task, and the large number of irrelevant, redundant features commonly introduced to a dataset reduces the classifier's performance. Feature selection aims to reduce the dimensionality of the dataset while improving the performance of a classification system. In the context of a multi-criteria optimization problem (i.e. to minimize the classification error rate and reduce the number of features) such as the one discussed in this research, the literature suggests that evolutionary algorithms offer good results. Besides, it is noted that Particle Swarm Optimization (PSO) has not been explored, especially in the field of classification of high-frequency ultrasonic signals. Hence, a binary-coded Particle Swarm Optimization (BPSO) technique is investigated for feature subset selection and to optimize the classification error rate. In the proposed method, the population data is used as input to an Artificial Neural Network (ANN) based classification system to obtain the error rate, as the ANN serves as an evaluator of the PSO fitness function. Copyright © 2016. Published by Elsevier B.V.
A Guideline to Univariate Statistical Analysis for LC/MS-Based Untargeted Metabolomics-Derived Data
Vinaixa, Maria; Samino, Sara; Saez, Isabel; Duran, Jordi; Guinovart, Joan J.; Yanes, Oscar
2012-01-01
Several metabolomic software programs provide methods for peak picking, retention time alignment and quantification of metabolite features in LC/MS-based metabolomics. Statistical analysis, however, is needed in order to discover those features significantly altered between samples. By comparing the retention time and MS/MS data of a model compound to that from the altered feature of interest in the research sample, metabolites can then be unequivocally identified. This paper reports on a comprehensive overview of a workflow for statistical analysis to rank relevant metabolite features that will be selected for further MS/MS experiments. We focus on univariate data analysis applied in parallel on all detected features. Characteristics and challenges of this analysis are discussed and illustrated using four different real LC/MS untargeted metabolomic datasets. We demonstrate the influence of considering or violating mathematical assumptions on which univariate statistical tests rely, using high-dimensional LC/MS datasets. Issues in data analysis such as determination of sample size, analytical variation, assumption of normality and homoscedasticity, or correction for multiple testing are discussed and illustrated in the context of our four untargeted LC/MS working examples. PMID:24957762
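A minimal sketch of the workflow described above, feature-wise univariate testing applied in parallel with a multiple-testing correction, is given below on synthetic data; the group sizes, number of features, and FDR level are illustrative assumptions.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_features = 500
group_a = rng.normal(size=(20, n_features))          # e.g., control samples
group_b = rng.normal(size=(20, n_features))
group_b[:, :25] += 1.0                                # 25 truly altered features

# Welch's t-test applied in parallel to every detected feature.
t, pvals = stats.ttest_ind(group_a, group_b, axis=0, equal_var=False)

# Benjamini-Hochberg correction for multiple testing.
reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print("features flagged after FDR control:", reject.sum())
```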
A Guideline to Univariate Statistical Analysis for LC/MS-Based Untargeted Metabolomics-Derived Data.
Vinaixa, Maria; Samino, Sara; Saez, Isabel; Duran, Jordi; Guinovart, Joan J; Yanes, Oscar
2012-10-18
Several metabolomic software programs provide methods for peak picking, retention time alignment and quantification of metabolite features in LC/MS-based metabolomics. Statistical analysis, however, is needed in order to discover those features significantly altered between samples. By comparing the retention time and MS/MS data of a model compound to that from the altered feature of interest in the research sample, metabolites can then be unequivocally identified. This paper reports on a comprehensive overview of a workflow for statistical analysis to rank relevant metabolite features that will be selected for further MS/MS experiments. We focus on univariate data analysis applied in parallel on all detected features. Characteristics and challenges of this analysis are discussed and illustrated using four different real LC/MS untargeted metabolomic datasets. We demonstrate the influence of considering or violating mathematical assumptions on which univariate statistical tests rely, using high-dimensional LC/MS datasets. Issues in data analysis such as determination of sample size, analytical variation, assumption of normality and homoscedasticity, or correction for multiple testing are discussed and illustrated in the context of our four untargeted LC/MS working examples.
R package MVR for Joint Adaptive Mean-Variance Regularization and Variance Stabilization
Dazard, Jean-Eudes; Xu, Hua; Rao, J. Sunil
2015-01-01
We present an implementation in the R language for statistical computing of our recent non-parametric joint adaptive mean-variance regularization and variance stabilization procedure. The method is specifically suited for handling difficult problems posed by high-dimensional multivariate datasets (p ≫ n paradigm), such as in ‘omics’-type data, among which are that the variance is often a function of the mean, variable-specific estimators of variances are not reliable, and test statistics have low power due to a lack of degrees of freedom. The implementation offers a complete set of features including: (i) normalization and/or variance stabilization function, (ii) computation of mean-variance-regularized t and F statistics, (iii) generation of diverse diagnostic plots, (iv) synthetic and real ‘omics’ test datasets, (v) computationally efficient implementation, using C interfacing, and an option for parallel computing, (vi) manual and documentation on how to set up a cluster. To make each feature as user-friendly as possible, only one subroutine per functionality is to be handled by the end-user. It is available as an R package, called MVR (‘Mean-Variance Regularization’), downloadable from the CRAN. PMID:26819572
Huang, Weilin; Wang, Runqiu; Li, Huijian; Chen, Yangkang
2017-09-20
The microseismic method is an essential technique for monitoring the dynamic status of hydraulic fracturing during the development of unconventional reservoirs. However, one of the challenges in microseismic monitoring is that the seismic signals generated by microseismicity have extremely low amplitude. We develop a methodology to unveil the signals that are smeared in strong ambient noise and thus facilitate more accurate arrival-time picking that will ultimately improve the localization accuracy. In the proposed technique, we decompose the recorded data into several morphological multi-scale components. In order to unveil weak signals, we propose an orthogonalization operator that acts as a time-varying weighting in the morphological reconstruction. The orthogonalization operator is obtained using an inversion process. This orthogonalized morphological reconstruction can be interpreted as a projection of the higher-dimensional vector. We first test the proposed technique using a synthetic dataset. Then the proposed technique is applied to a field dataset recorded in a project in China, in which the signals induced from hydraulic fracturing are recorded by twelve three-component (3-C) geophones in a monitoring well. The result demonstrates that the orthogonalized morphological reconstruction can make the extremely weak microseismic signals detectable.
Anastasi, Giuseppe; Cutroneo, Giuseppina; Bruschetta, Daniele; Trimarchi, Fabio; Ielitro, Giuseppe; Cammaroto, Simona; Duca, Antonio; Bramanti, Placido; Favaloro, Angelo; Vaccarino, Gianluigi; Milardi, Demetrio
2009-11-01
We have applied high-quality medical imaging techniques to study the structure of the human ankle. Direct volume rendering, using specific algorithms, transforms conventional two-dimensional (2D) magnetic resonance image (MRI) series into 3D volume datasets. This tool allows high-definition visualization of single or multiple structures for diagnostic, research, and teaching purposes. No other image reformatting technique so accurately highlights each anatomic relationship and preserves soft tissue definition. Here, we used this method to study the structure of the human ankle to analyze tendon-bone-muscle relationships. We compared ankle MRI and computerized tomography (CT) images from 17 healthy volunteers, aged 18-30 years (mean 23 years). An additional subject had a partial rupture of the Achilles tendon. The MRI images demonstrated superiority in overall quality of detail compared to the CT images. The MRI series accurately rendered soft tissue and bone in simultaneous image acquisition, whereas CT required several window-reformatting algorithms, with loss of image data quality. We obtained high-quality digital images of the human ankle that were sufficiently accurate for surgical and clinical intervention planning, as well as for teaching human anatomy. Our approach demonstrates that complex anatomical structures such as the ankle, which is rich in articular facets and ligaments, can be easily studied non-invasively using MRI data.
PCA leverage: outlier detection for high-dimensional functional magnetic resonance imaging data.
Mejia, Amanda F; Nebel, Mary Beth; Eloyan, Ani; Caffo, Brian; Lindquist, Martin A
2017-07-01
Outlier detection for high-dimensional (HD) data is a popular topic in modern statistical research. However, one source of HD data that has received relatively little attention is functional magnetic resonance images (fMRI), which consists of hundreds of thousands of measurements sampled at hundreds of time points. At a time when the availability of fMRI data is rapidly growing-primarily through large, publicly available grassroots datasets-automated quality control and outlier detection methods are greatly needed. We propose principal components analysis (PCA) leverage and demonstrate how it can be used to identify outlying time points in an fMRI run. Furthermore, PCA leverage is a measure of the influence of each observation on the estimation of principal components, which are often of interest in fMRI data. We also propose an alternative measure, PCA robust distance, which is less sensitive to outliers and has controllable statistical properties. The proposed methods are validated through simulation studies and are shown to be highly accurate. We also conduct a reliability study using resting-state fMRI data from the Autism Brain Imaging Data Exchange and find that removal of outliers using the proposed methods results in more reliable estimation of subject-level resting-state networks using independent components analysis. © The Author 2017. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.
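Reading PCA leverage as the diagonal of the hat matrix of the retained principal-component scores, a minimal sketch on a synthetic time-by-voxel matrix is given below; the robust-distance alternative proposed in the paper is not shown, and the dimensions and number of retained components are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
T, V, Q = 200, 5000, 10             # time points, voxels, retained components
X = rng.standard_normal((T, V))
X[137] += 5.0                        # an artificially corrupted time point

# Center over time and take the left singular vectors (PC time courses).
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# PCA leverage: diagonal of the hat matrix of the retained scores.
leverage = np.sum(U[:, :Q] ** 2, axis=1)
print("highest-leverage time point:", int(np.argmax(leverage)))
```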
Anastasi, Giuseppe; Cutroneo, Giuseppina; Bruschetta, Daniele; Trimarchi, Fabio; Ielitro, Giuseppe; Cammaroto, Simona; Duca, Antonio; Bramanti, Placido; Favaloro, Angelo; Vaccarino, Gianluigi; Milardi, Demetrio
2009-01-01
We have applied high-quality medical imaging techniques to study the structure of the human ankle. Direct volume rendering, using specific algorithms, transforms conventional two-dimensional (2D) magnetic resonance image (MRI) series into 3D volume datasets. This tool allows high-definition visualization of single or multiple structures for diagnostic, research, and teaching purposes. No other image reformatting technique so accurately highlights each anatomic relationship and preserves soft tissue definition. Here, we used this method to study the structure of the human ankle to analyze tendon–bone–muscle relationships. We compared ankle MRI and computerized tomography (CT) images from 17 healthy volunteers, aged 18–30 years (mean 23 years). An additional subject had a partial rupture of the Achilles tendon. The MRI images demonstrated superiority in overall quality of detail compared to the CT images. The MRI series accurately rendered soft tissue and bone in simultaneous image acquisition, whereas CT required several window-reformatting algorithms, with loss of image data quality. We obtained high-quality digital images of the human ankle that were sufficiently accurate for surgical and clinical intervention planning, as well as for teaching human anatomy. Our approach demonstrates that complex anatomical structures such as the ankle, which is rich in articular facets and ligaments, can be easily studied non-invasively using MRI data. PMID:19678857
The mode of toxic action (MoA) has been recognized as a key determinant of chemical toxicity but MoA classification in aquatic toxicology has been limited. We developed a Bayesian network model to classify aquatic toxicity mode of action using a recently published dataset contain...
Spotting Incorrect Rules in Signed-Number Arithmetic by the Individual Consistency Index.
ERIC Educational Resources Information Center
Tatsuoka, Kikumi K.; Tatsuoka, Maurice M.
Criterion-referenced testing is an important area in the theory and practice of educational measurement. This study demonstrated that even these tests must be closely examined for construct validity. The dimensionality of a dataset will be affected by the examinee's cognitive processes as well as by the nature of the content domain. The methods of…
Scientific Visualization Tools for Enhancement of Undergraduate Research
NASA Astrophysics Data System (ADS)
Rodriguez, W. J.; Chaudhury, S. R.
2001-05-01
Undergraduate research projects that utilize remote sensing satellite instrument data to investigate atmospheric phenomena pose many challenges. A significant challenge is processing large amounts of multi-dimensional data. Remote sensing data initially requires mining; filtering of undesirable spectral, instrumental, or environmental features; and subsequently sorting and reformatting to files for easy and quick access. The data must then be transformed according to the needs of the investigation(s) and displayed for interpretation. These multidimensional datasets require views that can range from two-dimensional plots to multivariable-multidimensional scientific visualizations with animations. Science undergraduate students generally find these data processing tasks daunting. Generally, researchers are required to fully understand the intricacies of the dataset and write computer programs or rely on commercially available software, which may not be trivial to use. In the time that undergraduate researchers have available for their research projects, learning the data formats, programming languages, and/or visualization packages is impractical. When dealing with large multi-dimensional data sets appropriate Scientific Visualization tools are imperative in allowing students to have a meaningful and pleasant research experience, while producing valuable scientific research results. The BEST Lab at Norfolk State University has been creating tools for multivariable-multidimensional analysis of Earth Science data. EzSAGE and SAGE4D have been developed to sort, analyze and visualize SAGE II (Stratospheric Aerosol and Gas Experiment) data with ease. Three- and four-dimensional visualizations in interactive environments can be produced. EzSAGE provides atmospheric slices in three-dimensions where the researcher can change the scales in the three-dimensions, color tables and degree of smoothing interactively to focus on particular phenomena. SAGE4D provides a navigable four-dimensional interactive environment. These tools allow students to make higher order decisions based on large multidimensional sets of data while diminishing the level of frustration that results from dealing with the details of processing large data sets.
Three-dimensional spatiotemporal features for fast content-based retrieval of focal liver lesions.
Roy, Sharmili; Chi, Yanling; Liu, Jimin; Venkatesh, Sudhakar K; Brown, Michael S
2014-11-01
Content-based image retrieval systems for 3-D medical datasets still largely rely on 2-D image-based features extracted from a few representative slices of the image stack. Most 2-D features that are currently used in the literature not only model a 3-D tumor incompletely but are also highly expensive in terms of computation time, especially for high-resolution datasets. Radiologist-specified semantic labels are sometimes used along with image-based 2-D features to improve the retrieval performance. Since radiological labels show large interuser variability, are often unstructured, and require user interaction, their use as lesion characterizing features is highly subjective, tedious, and slow. In this paper, we propose a 3-D image-based spatiotemporal feature extraction framework for fast content-based retrieval of focal liver lesions. All the features are computer generated and are extracted from four-phase abdominal CT images. Retrieval performance and query processing times for the proposed framework are evaluated on a database of 44 hepatic lesions comprising five pathological types. A Bull's eye percentage score above 85% is achieved for three out of the five lesion pathologies, and for 98% of query lesions, at least one lesion of the same type is ranked among the top two retrieved results. Experiments show that the proposed system's query processing is more than 20 times faster than other already published systems that use 2-D features. With fast computation time and high retrieval accuracy, the proposed system has the potential to be used as an assistant to radiologists for routine hepatic tumor diagnosis.
High-Resolution Forest Canopy Height Estimation in an African Blue Carbon Ecosystem
NASA Technical Reports Server (NTRS)
Lagomasino, David; Fatoyinbo, Temilola; Lee, Seung-Kuk; Simard, Marc
2015-01-01
Mangrove forests are one of the most productive and carbon-dense ecosystems that are only found at tidally inundated coastal areas. Forest canopy height is an important measure for modeling carbon and biomass dynamics, as well as land cover change. By taking advantage of the flat terrain and dense canopy cover, the present study derived digital surface models (DSMs) using stereophotogrammetric techniques on high-resolution spaceborne imagery (HRSI) for southern Mozambique. A mean-weighted ground surface elevation factor was subtracted from the HRSI DSM to accurately estimate the canopy height in mangrove forests in southern Mozambique. The mean and H100 tree height measured in both the field and with the digital canopy model provided the most accurate results with a vertical error of 1.18-1.84 m, respectively. Distinct patterns were identified in the HRSI canopy height map that could not be discerned from coarse shuttle radar topography mission canopy maps even though the mode and distribution of canopy heights were similar over the same area. Through further investigation, HRSI DSMs have the potential of providing a new type of three-dimensional dataset that could serve as calibration/validation data for other DSMs generated from spaceborne datasets with much larger global coverage. HRSI DSMs could be used in lieu of Lidar acquisitions for canopy height and forest biomass estimation, and be combined with passive optical data to improve land cover classifications.
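The core canopy-height step described above, subtracting a ground surface elevation estimate from the stereo-derived DSM, reduces to a simple array operation; the sketch below uses synthetic heights and an assumed ground elevation value, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(0)
dsm = 3.0 + 12.0 * rng.random((512, 512))   # stereo-derived surface heights (m), synthetic
ground = 2.8                                 # mean-weighted ground surface elevation (m), illustrative

# Canopy height model: subtract the ground elevation estimate from the DSM,
# clipping negative values that can arise over bare or water surfaces.
chm = np.clip(dsm - ground, 0.0, None)
print("mean canopy height (m):", chm.mean())
```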
magHD: a new approach to multi-dimensional data storage, analysis, display and exploitation
NASA Astrophysics Data System (ADS)
Angleraud, Christophe
2014-06-01
The ever-increasing amount of data and processing capability - following the well-known Moore's law - is challenging the way scientists and engineers are currently exploiting large datasets. Scientific visualization tools, although quite powerful, are often too generic and provide abstract views of phenomena, thus preventing cross-discipline fertilization. On the other hand, Geographic Information Systems allow nice and visually appealing maps to be built, but they often become confusing as more layers are added. Moreover, the introduction of time as a fourth analysis dimension, to allow analysis of time-dependent phenomena such as meteorological or climate models, is encouraging real-time data exploration techniques that allow spatio-temporal points of interest to be detected through the human brain's integration of moving images. Magellium has been involved in high-performance image processing chains for satellite image processing, as well as scientific signal analysis and geographic information management, since its creation (2003). We believe that recent work on big data, GPUs and peer-to-peer collaborative processing can open a new breakthrough in data analysis and display that will serve many new applications in collaborative scientific computing, environment mapping and understanding. The magHD (for Magellium Hyper-Dimension) project aims at developing software solutions that will bring highly interactive tools for complex dataset analysis and exploration to commodity hardware, targeting small to medium-scale clusters with expansion capabilities to large cloud-based clusters.
High-Dimensional Bayesian Geostatistics
Banerjee, Sudipto
2017-01-01
With the growing capabilities of Geographic Information Systems (GIS) and user-friendly software, statisticians today routinely encounter geographically referenced data containing observations from a large number of spatial locations and time points. Over the last decade, hierarchical spatiotemporal process models have become widely deployed statistical tools for researchers to better understand the complex nature of spatial and temporal variability. However, fitting hierarchical spatiotemporal models often involves expensive matrix computations with complexity increasing in cubic order for the number of spatial locations and temporal points. This renders such models unfeasible for large data sets. This article offers a focused review of two methods for constructing well-defined highly scalable spatiotemporal stochastic processes. Both these processes can be used as “priors” for spatiotemporal random fields. The first approach constructs a low-rank process operating on a lower-dimensional subspace. The second approach constructs a Nearest-Neighbor Gaussian Process (NNGP) that ensures sparse precision matrices for its finite realizations. Both processes can be exploited as a scalable prior embedded within a rich hierarchical modeling framework to deliver full Bayesian inference. These approaches can be described as model-based solutions for big spatiotemporal datasets. The models ensure that the algorithmic complexity is ~n floating point operations (flops), where n is the number of spatial locations (per iteration). We compare these methods and provide some insight into their methodological underpinnings. PMID:29391920
NASA Astrophysics Data System (ADS)
Hendrickson, Kelli; Yue, Dick
2016-11-01
This work presents the development and a priori testing of closure models for the incompressible highly-variable density turbulent (IHVDT) flow in the near wake region of a transom stern. This complex, three-dimensional flow includes three regions with distinctly different flow behavior: (i) the convergent corner waves that originate from the body and collide on the ship center plane; (ii) the "rooster tail" that forms from the collision; and (iii) the diverging wave train. The characteristics of these regions involve violent free-surface flows and breaking waves with significant turbulent mass flux (TMF) at Atwood number At = (ρ2 − ρ1)/(ρ2 + ρ1) ≈ 1, for which there is little guidance in turbulence closure modeling for the momentum and scalar transport along the wake. Utilizing datasets from high-resolution simulations of the near wake of a canonical three-dimensional transom stern using conservative Volume-of-Fluid (cVOF), implicit Large Eddy Simulation (iLES), and Boundary Data Immersion Method (BDIM), we develop explicit algebraic turbulent mass flux closure models that incorporate the most relevant physical processes. Performance of these models in predicting the turbulent mass flux in all three regions of the wake will be presented. Office of Naval Research.
QUADRO: A SUPERVISED DIMENSION REDUCTION METHOD VIA RAYLEIGH QUOTIENT OPTIMIZATION.
Fan, Jianqing; Ke, Zheng Tracy; Liu, Han; Xia, Lucy
We propose a novel Rayleigh quotient based sparse quadratic dimension reduction method-named QUADRO (Quadratic Dimension Reduction via Rayleigh Optimization)-for analyzing high-dimensional data. Unlike in the linear setting where Rayleigh quotient optimization coincides with classification, these two problems are very different under nonlinear settings. In this paper, we clarify this difference and show that Rayleigh quotient optimization may be of independent scientific interests. One major challenge of Rayleigh quotient optimization is that the variance of quadratic statistics involves all fourth cross-moments of predictors, which are infeasible to compute for high-dimensional applications and may accumulate too many stochastic errors. This issue is resolved by considering a family of elliptical models. Moreover, for heavy-tail distributions, robust estimates of mean vectors and covariance matrices are employed to guarantee uniform convergence in estimating non-polynomially many parameters, even though only the fourth moments are assumed. Methodologically, QUADRO is based on elliptical models which allow us to formulate the Rayleigh quotient maximization as a convex optimization problem. Computationally, we propose an efficient linearized augmented Lagrangian method to solve the constrained optimization problem. Theoretically, we provide explicit rates of convergence in terms of Rayleigh quotient under both Gaussian and general elliptical models. Thorough numerical results on both synthetic and real datasets are also provided to back up our theoretical results.
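In the classical linear, non-sparse setting, maximizing a Rayleigh quotient reduces to a generalized eigenproblem, as sketched below with synthetic symmetric positive-definite matrices; QUADRO's quadratic features, sparsity, and robust elliptical-model estimates are not reproduced here.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
p = 20
# Two synthetic symmetric positive-definite "scatter-like" matrices.
M = rng.standard_normal((p, p)); A = M @ M.T + np.eye(p)
N = rng.standard_normal((p, p)); B = N @ N.T + np.eye(p)

# Maximize the Rayleigh quotient w'Aw / w'Bw via the generalized eigenproblem A w = lambda B w.
vals, vecs = eigh(A, B)
w = vecs[:, -1]                      # eigenvector of the largest generalized eigenvalue
rq = (w @ A @ w) / (w @ B @ w)
print(np.isclose(rq, vals[-1]))
```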
Adamson, M W; Morozov, A Y; Kuzenkov, O A
2016-09-01
Mathematical models in biology are highly simplified representations of a complex underlying reality and there is always a high degree of uncertainty with regards to model function specification. This uncertainty becomes critical for models in which the use of different functions fitting the same dataset can yield substantially different predictions-a property known as structural sensitivity. Thus, even if the model is purely deterministic, then the uncertainty in the model functions carries through into uncertainty in model predictions, and new frameworks are required to tackle this fundamental problem. Here, we consider a framework that uses partially specified models in which some functions are not represented by a specific form. The main idea is to project infinite dimensional function space into a low-dimensional space taking into account biological constraints. The key question of how to carry out this projection has so far remained a serious mathematical challenge and hindered the use of partially specified models. Here, we propose and demonstrate a potentially powerful technique to perform such a projection by using optimal control theory to construct functions with the specified global properties. This approach opens up the prospect of a flexible and easy to use method to fulfil uncertainty analysis of biological models.
Classification of Microarray Data Using Kernel Fuzzy Inference System
Kumar Rath, Santanu
2014-01-01
The DNA microarray classification technique has gained more popularity in both research and practice. In real data analysis, such as microarray data, the dataset contains a huge number of insignificant and irrelevant features that tend to lose useful information. Classes with high relevance and feature sets with high significance are generally referred for the selected features, which determine the samples' classification into their respective classes. In this paper, the kernel fuzzy inference system (K-FIS) algorithm is applied to classify microarray data (leukemia) using the t-test as a feature selection method. Kernel functions are used to map original data points into a higher-dimensional (possibly infinite-dimensional) feature space defined by a (usually nonlinear) function ϕ through a mathematical process called the kernel trick. This paper also presents a comparative study for classification using K-FIS along with the support vector machine (SVM) for different sets of features (genes). Performance parameters available in the literature such as precision, recall, specificity, F-measure, ROC curve, and accuracy are considered to analyze the efficiency of the classification model. From the proposed approach, it is apparent that the K-FIS model obtains results similar to the SVM model. This is an indication that the proposed approach relies on the kernel function. PMID:27433543
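K-FIS itself is not shown here; the sketch below only illustrates the comparison baseline named in the abstract, an RBF-kernel SVM applied after a simple t-test feature ranking, on a synthetic matrix with leukemia-like dimensions (values and labels are random, and ranking features before cross-validation is done only for brevity).

```python
import numpy as np
from scipy import stats
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((72, 7129))          # leukemia-like dimensions (synthetic values)
y = rng.integers(0, 2, size=72)

# t-test feature ranking, keeping the top-k genes, as in the abstract's selection step.
t, _ = stats.ttest_ind(X[y == 0], X[y == 1], axis=0, equal_var=False)
top = np.argsort(-np.abs(t))[:50]

# The RBF kernel maps samples implicitly into a high-dimensional feature space (kernel trick).
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
print(cross_val_score(clf, X[:, top], y, cv=5).mean())
```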
High-Dimensional Bayesian Geostatistics.
Banerjee, Sudipto
2017-06-01
With the growing capabilities of Geographic Information Systems (GIS) and user-friendly software, statisticians today routinely encounter geographically referenced data containing observations from a large number of spatial locations and time points. Over the last decade, hierarchical spatiotemporal process models have become widely deployed statistical tools for researchers to better understand the complex nature of spatial and temporal variability. However, fitting hierarchical spatiotemporal models often involves expensive matrix computations with complexity increasing in cubic order for the number of spatial locations and temporal points. This renders such models unfeasible for large data sets. This article offers a focused review of two methods for constructing well-defined highly scalable spatiotemporal stochastic processes. Both these processes can be used as "priors" for spatiotemporal random fields. The first approach constructs a low-rank process operating on a lower-dimensional subspace. The second approach constructs a Nearest-Neighbor Gaussian Process (NNGP) that ensures sparse precision matrices for its finite realizations. Both processes can be exploited as a scalable prior embedded within a rich hierarchical modeling framework to deliver full Bayesian inference. These approaches can be described as model-based solutions for big spatiotemporal datasets. The models ensure that the algorithmic complexity is ~n floating point operations (flops), where n is the number of spatial locations (per iteration). We compare these methods and provide some insight into their methodological underpinnings.
Prabhu, David; Mehanna, Emile; Gargesha, Madhusudhana; Brandt, Eric; Wen, Di; van Ditzhuijzen, Nienke S; Chamie, Daniel; Yamamoto, Hirosada; Fujino, Yusuke; Alian, Ali; Patel, Jaymin; Costa, Marco; Bezerra, Hiram G; Wilson, David L
2016-04-01
Evidence suggests high-resolution, high-contrast, [Formula: see text] intravascular optical coherence tomography (IVOCT) can distinguish plaque types, but further validation is needed, especially for automated plaque characterization. We developed experimental and three-dimensional (3-D) registration methods to provide validation of IVOCT pullback volumes using microscopic, color, and fluorescent cryo-image volumes with optional registered cryo-histology. A specialized registration method matched IVOCT pullback images acquired in the catheter reference frame to a true 3-D cryo-image volume. Briefly, an 11-parameter registration model including a polynomial virtual catheter was initialized within the cryo-image volume, and perpendicular images were extracted, mimicking IVOCT image acquisition. Virtual catheter parameters were optimized to maximize cryo and IVOCT lumen overlap. Multiple assessments suggested that the registration error was better than the [Formula: see text] spacing between IVOCT image frames. Tests on a digital synthetic phantom gave a registration error of only [Formula: see text] (signed distance). Visual assessment of randomly presented nearby frames suggested registration accuracy within 1 IVOCT frame interval ([Formula: see text]). This would eliminate potential misinterpretations confronted by the typical histological approaches to validation, with estimated 1-mm errors. The method can be used to create annotated datasets and automated plaque classification methods and can be extended to other intravascular imaging modalities.
The MOSO field experiment - Overview of findings
NASA Astrophysics Data System (ADS)
Ólafsson, Haraldur; Jonassen, Marius O.; Ágústsson, Hálfdán; Rögnvaldsson, Ólafur; Hjarðar, Bjarni G. Þ.; Rasol, Dubravka; Reuder, Joachim; Jónsson, Sigurður; Líf Kristinsdóttir, Birta
2013-04-01
In 2009 and 2011, the MOSO I and MOSO II meteorological field experiments took place in SW-Iceland. The main objectives were to describe the low-level atmospheric coastal flows in the vicinity of mountains. The observations for the MOSO dataset were made using a large number of automatic weather stations, microbarographs, radiosoundings and a remotely piloted aircraft. The highlights of the findings include a four-dimensional description of the sea-breeze in Iceland, weak downslope acceleration, summer- and winter-time mountain wake flow, and the transition between wake flow and sea-breeze. The orographic drag force is explored and shown to be not particularly high most of the time in the predicted high-drag state. The observations from the remotely piloted aircraft have been used successfully to nudge simulations of the flow and are shown to be promising for operational use in numerical prediction of mesoscale coastal and orographic flows.
Yasukawa, Kazutaka; Nakamura, Kentaro; Fujinaga, Koichiro; Ikehara, Minoru; Kato, Yasuhiro
2017-09-12
Multiple transient global warming events occurred during the early Palaeogene. Although these events, called hyperthermals, have been reported from around the globe, geologic records for the Indian Ocean are limited. In addition, the recovery processes from relatively modest hyperthermals are less constrained than those from the severest and well-studied hothouse called the Palaeocene-Eocene Thermal Maximum. In this study, we constructed a new and high-resolution geochemical dataset of deep-sea sediments clearly recording multiple Eocene hyperthermals in the Indian Ocean. We then statistically analysed the high-dimensional data matrix and extracted independent components corresponding to the biogeochemical responses to the hyperthermals. The productivity feedback commonly controls and efficiently sequesters the excess carbon in the recovery phases of the hyperthermals via an enhanced biological pump, regardless of the magnitude of the events. Meanwhile, this negative feedback is independent of nannoplankton assemblage changes generally recognised in relatively large environmental perturbations.
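As a hedged illustration of extracting independent components from a high-dimensional geochemical data matrix, the sketch below applies FastICA to a synthetic samples-by-elements matrix; the number of components and the matrix itself are assumptions, not the paper's dataset or statistical procedure.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 20))   # synthetic stand-in: samples x geochemical variables

ica = FastICA(n_components=5, random_state=0)
S = ica.fit_transform(X)             # independent component scores per sample
A = ica.mixing_                      # how each component loads on the measured variables
print(S.shape, A.shape)
```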
Simulation of FIB-SEM images for analysis of porous microstructures.
Prill, Torben; Schladitz, Katja
2013-01-01
Focused ion beam nanotomography-scanning electron microscopy tomography yields high-quality three-dimensional images of materials microstructures at the nanometer scale combining serial sectioning using a focused ion beam with SEM. However, FIB-SEM tomography of highly porous media leads to shine-through artifacts preventing automatic segmentation of the solid component. We simulate the SEM process in order to generate synthetic FIB-SEM image data for developing and validating segmentation methods. Monte-Carlo techniques yield accurate results, but are too slow for the simulation of FIB-SEM tomography requiring hundreds of SEM images for one dataset alone. Nevertheless, a quasi-analytic description of the specimen and various acceleration techniques, including a track compression algorithm and an acceleration for the simulation of secondary electrons, cut down the computing time by orders of magnitude, allowing for the first time to simulate FIB-SEM tomography. © Wiley Periodicals, Inc.
An Experimental and Computational Analysis of Primary Cilia Deflection Under Fluid Flow
Downs, Matthew E.; Nguyen, An M.; Herzog, Florian A.; Hoey, David A.; Jacobs, Christopher R.
2013-01-01
In this work we have developed a novel model of the deflection of primary cilia experiencing fluid flow accounting for phenomena not previously considered. Specifically, we developed a large rotation formulation that accounts for rotation at the base of the cilium, the initial shape of the cilium and fluid drag at high deflection angles. We utilized this model to analyze full three dimensional datasets of primary cilia deflecting under fluid flow acquired with high-speed confocal microscopy. We found a wide variety of previously unreported bending shapes and behaviors. We also analyzed post-flow relaxation patterns. Results from our combined experimental and theoretical approach suggest that the average flexural rigidity of primary cilia might be higher than previously reported (Schwartz et al. 1997). In addition our findings indicate the mechanics of primary cilia are richly varied and mechanisms may exist to alter their mechanical behavior. PMID:22452422
Replicates in high dimensions, with applications to latent variable graphical models.
Tan, Kean Ming; Ning, Yang; Witten, Daniela M; Liu, Han
2016-12-01
In classical statistics, much thought has been put into experimental design and data collection. In the high-dimensional setting, however, experimental design has been less of a focus. In this paper, we stress the importance of collecting multiple replicates for each subject in this setting. We consider learning the structure of a graphical model with latent variables, under the assumption that these variables take a constant value across replicates within each subject. By collecting multiple replicates for each subject, we are able to estimate the conditional dependence relationships among the observed variables given the latent variables. To test the null hypothesis of conditional independence between two observed variables, we propose a pairwise decorrelated score test. Theoretical guarantees are established for parameter estimation and for this test. We show that our proposal is able to estimate latent variable graphical models more accurately than some existing proposals, and apply the proposed method to a brain imaging dataset.
NASA Astrophysics Data System (ADS)
Darmawan, Herlan; Walter, Thomas R.; Brotopuspito, Kirbani Sri; Subandriyo; I Gusti Made Agung Nandaka
2018-01-01
Dome-building volcanoes undergo rapid and profound topographic changes that are important to quantify for the purposes of hazard assessment. However, as hazardous lava domes often develop on high-altitude volcanoes that exhibit steep-sided topography, it is challenging to obtain direct field access and thus to analyze these morphological and structural changes. Merapi Volcano in Indonesia is a type example of such a volcano, as soon after its 2010 eruption, a new lava dome developed. This dome was partially destroyed during six distinct steam-driven explosions that occurred between 2012 and 2014. Here, we investigate the topographic and structural changes associated with these six steam-driven explosions by comparing close-range photogrammetric data obtained before and after these explosions. To accomplish this, we performed two UAV campaigns in 2012 and 2015. By applying the Structure from Motion (SfM) technique, we are able to construct three-dimensional point clouds, assess their quality by comparing them to a terrestrial laser scanning (TLS) dataset, and generate high-resolution Digital Elevation Models (DEMs) and photomosaics. The comparison of these two DEMs and photomosaics reveals changes in topography and the appearance of fractures. In the 2012 dataset, we find a dense fracture network striking to the NNW-SSE. In the post-eruptive 2015 dataset, we see that this NNW-SSE fracture trend is much more strongly expressed; we also detect the formation of aligned and elongated explosion craters, which are associated with the removal of over 200,000 m3 of dome material, most of which (~70%) was deposited outside the crater region. Therefore, this study suggests that the locations of the steam-driven explosions at Merapi Volcano were controlled by the reactivation of preexisting structures. Moreover, some of the newly developed and reactivated fractures delineate a block on the southern slope of the dome, which could become structurally unstable and potentially lead to rock avalanche hazards. This study therefore demonstrates the significance of characterizing structural fingerprints during the development of lava domes and exemplifies the value of topographic and fracture mapping, which is becoming increasingly feasible when using UAVs, even on high and steep stratovolcanoes.
An assessment of differences in gridded precipitation datasets in complex terrain
NASA Astrophysics Data System (ADS)
Henn, Brian; Newman, Andrew J.; Livneh, Ben; Daly, Christopher; Lundquist, Jessica D.
2018-01-01
Hydrologic modeling and other geophysical applications are sensitive to precipitation forcing data quality, and there are known challenges in spatially distributing gauge-based precipitation over complex terrain. We conduct a comparison of six high-resolution, daily and monthly gridded precipitation datasets over the Western United States. We compare the long-term average spatial patterns, and interannual variability of water-year total precipitation, as well as multi-year trends in precipitation across the datasets. We find that the greatest absolute differences among datasets occur in high-elevation areas and in the maritime mountain ranges of the Western United States, while the greatest percent differences among datasets relative to annual total precipitation occur in arid and rain-shadowed areas. Differences between datasets in some high-elevation areas exceed 200 mm yr-1 on average, and relative differences range from 5 to 60% across the Western United States. In areas of high topographic relief, true uncertainties and biases are likely higher than the differences among the datasets; we present evidence of this based on streamflow observations. Precipitation trends in the datasets differ in magnitude and sign at smaller scales, and are sensitive to how temporal inhomogeneities in the underlying precipitation gauge data are handled.
Urtnasan, Erdenebayar; Park, Jong-Uk; Joo, Eun-Yeon; Lee, Kyoung-Joung
2018-04-23
In this study, we propose a method for the automated detection of obstructive sleep apnea (OSA) from a single-lead electrocardiogram (ECG) using a convolutional neural network (CNN). A CNN model was designed with six optimized convolution layers including activation, pooling, and dropout layers. One-dimensional (1D) convolution, rectified linear units (ReLU), and max pooling were applied to the convolution, activation, and pooling layers, respectively. For training and evaluation of the CNN model, a single-lead ECG dataset was collected from 82 subjects with OSA and was divided into training (including data from 63 patients with 34,281 events) and testing (including data from 19 patients with 8571 events) datasets. Using this CNN model, a precision of 0.99, a recall of 0.99, and an F1-score of 0.99 were attained with the training dataset; these values were all 0.96 when the CNN was applied to the testing dataset. These results show that the proposed CNN model can be used to detect OSA accurately on the basis of a single-lead ECG. Ultimately, this CNN model may be used as a screening tool for those suspected to suffer from OSA.
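A hedged sketch of a 1D CNN of the kind described (six convolution blocks with ReLU, max pooling and dropout); the segment length, sampling rate and filter counts are assumptions for illustration, not the authors' exact configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers

SEGMENT_LEN = 6000   # e.g., 60 s of single-lead ECG at 100 Hz (assumed)

def build_osa_cnn():
    model = tf.keras.Sequential([tf.keras.Input(shape=(SEGMENT_LEN, 1))])
    for filters in (16, 32, 32, 64, 64, 128):        # six 1D convolution blocks
        model.add(layers.Conv1D(filters, kernel_size=5, activation="relu", padding="same"))
        model.add(layers.MaxPooling1D(pool_size=2))
        model.add(layers.Dropout(0.25))
    model.add(layers.Flatten())
    model.add(layers.Dense(1, activation="sigmoid"))  # apnea vs. normal segment
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
    return model

model = build_osa_cnn()
model.summary()
# model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=20)
```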
Jantarasaengaram, Surasak; Vairojanavong, Kittipong
2010-09-15
Theoretically, a cross-sectional image of any cardiac plane can be obtained from a STIC fetal heart volume dataset. We describe a method to display 11 fetal echocardiographic planes from STIC volumes. Fetal heart volume datasets were acquired by transverse acquisition from 200 normal fetuses at 15 to 40 weeks of gestation. Analysis of the volume datasets using the described technique to display 11 echocardiographic planes in the multiplanar display mode was performed offline. Volume datasets from 18 fetuses were excluded due to poor image resolution. The mean visualization rates for all echocardiographic planes for fetuses at 15-17, 18-22, 23-27, 28-32 and 33-40 weeks of gestation were 85.6% (range 45.2-96.8%, N = 31), 92.9% (range 64.0-100%, N = 64), 93.4% (range 51.4-100%, N = 37), 88.7% (range 54.5-100%, N = 33) and 81.8% (range 23.5-100%, N = 17), respectively. Overall, the applied technique can favorably display the pertinent echocardiographic planes. Description of the presented method provides a logical approach to explore the fetal heart volumes.
Whole lung morphometry with 3D multiple b-value hyperpolarized gas MRI and compressed sensing.
Chan, Ho-Fung; Stewart, Neil J; Parra-Robles, Juan; Collier, Guilhem J; Wild, Jim M
2017-05-01
To demonstrate three-dimensional (3D) multiple b-value diffusion-weighted (DW) MRI of hyperpolarized 3He gas for whole lung morphometry with compressed sensing (CS). A fully-sampled, two b-value, 3D hyperpolarized 3He DW-MRI dataset was acquired from the lungs of a healthy volunteer and retrospectively undersampled in the ky and kz phase-encoding directions for CS simulations. Optimal k-space undersampling patterns were determined by minimizing the mean absolute error between reconstructed and fully-sampled 3He apparent diffusion coefficient (ADC) maps. Prospective three-fold, undersampled, 3D multiple b-value 3He DW-MRI datasets were acquired from five healthy volunteers and one chronic obstructive pulmonary disease (COPD) patient, and the mean values of maps of ADC and mean alveolar dimension (LmD) were validated against two-dimensional (2D) and 3D fully-sampled 3He DW-MRI experiments. Reconstructed undersampled datasets showed no visual artifacts and good preservation of the main image features and quantitative information. A good agreement between fully-sampled and prospective undersampled datasets was found, with a mean difference of +3.4% and +5.1% observed in mean global ADC and LmD values, respectively. These differences were within the standard deviation range and consistent with values reported from healthy and COPD lungs. Accelerated CS acquisition has facilitated 3D multiple b-value 3He DW-MRI scans in a single breath-hold, enabling whole lung morphometry mapping. Magn Reson Med 77:1916-1925, 2017. © 2016 The Authors Magnetic Resonance in Medicine published by Wiley Periodicals, Inc. on behalf of International Society for Magnetic Resonance in Medicine. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
Accuracy and Specific Value of Cardiovascular 3D-Models in Pediatric CT-Angiography.
Hammon, Matthias; Rompel, Oliver; Seuss, Hannes; Dittrich, Sven; Uder, Michael; Rüffer, Andrè; Cesnjevar, Robert; Ehret, Nicole; Glöckler, Martin
2017-12-01
Computed tomography (CT)-angiography is routinely performed prior to catheter-based and surgical treatment of congenital heart disease. To date, little is known about the accuracy and advantage of different 3D-reconstructions of CT data, although exact anatomical information is crucial. We analyzed 35 consecutive CT-angiographies of infants with congenital heart disease. All datasets were reconstructed three-dimensionally using the volume rendering technique (VRT) and threshold-based segmentation (stereolithographic model, STL); additionally, two-dimensional maximum intensity projection (MIP) images were reconstructed. In each dataset and resulting image, vascular diameters of four different vessels were measured and compared to the reference standard, measured via multiplanar reformation (MPR). The measurements obtained from the STL-images, MIP-images, and VRT-images were compared with the reference standard, and a significant difference (p < 0.05) between measurements was found. The mean difference was 0.0 mm for STL-images, -0.1 mm for MIP-images, and -0.3 mm for VRT-images. The range of the differences was -0.7 to 1.0 mm for STL-images, -0.6 to 0.5 mm for MIP-images and -1.1 to 0.7 mm for VRT-images. There was an excellent correlation between the STL-, MIP-, VRT-measurements and the reference standard, and inter-reader reliability was excellent (p < 0.01). STL-models of cardiovascular structures are more accurate than the traditional VRT-models. Additionally, they can be standardized and are reproducible.
Extraction and LOD control of colored interval volumes
NASA Astrophysics Data System (ADS)
Miyamura, Hiroko N.; Takeshima, Yuriko; Fujishiro, Issei; Saito, Takafumi
2005-03-01
Interval volume serves as a generalized isosurface and represents a three-dimensional subvolume for which the associated scalar field values lie within a user-specified closed interval. In general, it is not an easy task for novices to specify the scalar field interval corresponding to their ROIs. In order to extract interval volumes from which desirable geometric features can be mined effectively, we propose a suggestive technique which extracts interval volumes automatically based on a global examination of the field contrast structure. Also proposed here is a simplification scheme for decimating the resultant triangle patches to realize efficient transmission and rendering of large-scale interval volumes. Color distributions as well as geometric features are taken into account to select the best edges to be collapsed. In addition, when a user wants to selectively display and analyze the original dataset, the simplified dataset is restructured to the original quality. Several simulated and acquired datasets are used to demonstrate the effectiveness of the present methods.
Characterizing Time Series Data Diversity for Wind Forecasting: Preprint
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hodge, Brian S; Chartan, Erol Kevin; Feng, Cong
Wind forecasting plays an important role in integrating variable and uncertain wind power into the power grid. Various forecasting models have been developed to improve forecasting accuracy. However, it is challenging to accurately compare the true forecasting performance of different methods and forecasters due to the lack of diversity in forecasting test datasets. This paper proposes a time series characteristic analysis approach to visualize and quantify wind time series diversity. The developed method first calculates six time series characteristic indices from various perspectives. Then principal component analysis is performed to reduce the data dimension while preserving the important information. The diversity of the time series dataset is visualized by the geometric distribution of the newly constructed principal component space. The volume of the 3-dimensional (3D) convex polytope (or the length of the 1D number axis, or the area of the 2D convex polygon) is used to quantify the time series data diversity. The method is tested with five datasets with various degrees of diversity.
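A hedged sketch of the diversity-quantification idea: compute a few time series characteristic indices, project them with PCA, and take the convex-hull volume of the leading components as the diversity measure. The specific indices and the toy series below are illustrative assumptions, not the six indices used in the paper:

```python
import numpy as np
from scipy.stats import skew, kurtosis
from scipy.spatial import ConvexHull
from sklearn.decomposition import PCA

def characteristics(ts):
    d = np.diff(ts)
    return [ts.mean(), ts.std(), skew(ts), kurtosis(ts), d.std(),
            np.corrcoef(ts[:-1], ts[1:])[0, 1]]      # lag-1 autocorrelation

rng = np.random.default_rng(0)
series = [rng.normal(size=1000).cumsum() for _ in range(50)]   # toy wind-like series
X = np.array([characteristics(s) for s in series])

pcs = PCA(n_components=3).fit_transform(X)           # reduce to 3 principal components
diversity = ConvexHull(pcs).volume                   # 3D convex-hull volume as diversity
print(f"dataset diversity (hull volume): {diversity:.3f}")
```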
Chen Peng; Ao Li
2017-01-01
The emergence of multi-dimensional data offers opportunities for more comprehensive analysis of the molecular characteristics of human diseases and therefore for improving diagnosis, treatment, and prevention. In this study, we propose a heterogeneous network based method integrating multi-dimensional data (HNMD) to identify GBM-related genes. The novelty of the method lies in the fact that the multi-dimensional GBM data from the TCGA dataset, which provide comprehensive information about genes, are combined with protein-protein interactions to construct a weighted heterogeneous network, which reflects both the general and disease-specific relationships between genes. In addition, a propagation algorithm with resistance is introduced to precisely score and rank GBM-related genes. The results of a comprehensive performance evaluation show that the proposed method significantly outperforms network based methods with single-dimensional data and other existing approaches. Subsequent analysis of the top ranked genes suggests they may be functionally implicated in GBM, which further corroborates the superiority of the proposed method. The source code and the results of HNMD can be downloaded from the following URL: http://bioinformatics.ustc.edu.cn/hnmd/ .
Spectral relative standard deviation: a practical benchmark in metabolomics.
Parsons, Helen M; Ekman, Drew R; Collette, Timothy W; Viant, Mark R
2009-03-01
Metabolomics datasets, by definition, comprise measurements of large numbers of metabolites. Both technical (analytical) and biological factors will induce variation within these measurements that is not consistent across all metabolites. Consequently, criteria are required to assess the reproducibility of metabolomics datasets that are derived from all the detected metabolites. Here we calculate spectrum-wide relative standard deviations (RSDs; also termed coefficients of variation, CVs) for ten metabolomics datasets, spanning a variety of sample types from mammals, fish, invertebrates and a cell line, and display them succinctly as boxplots. We demonstrate multiple applications of spectral RSDs for characterising technical as well as inter-individual biological variation: for optimising metabolite extractions, comparing analytical techniques, investigating matrix effects, and comparing biofluids and tissue extracts from single and multiple species for optimising experimental design. Technical variation within metabolomics datasets, recorded using one- and two-dimensional NMR and mass spectrometry, ranges from 1.6 to 20.6% (reported as the median spectral RSD). Inter-individual biological variation is typically larger, ranging from as low as 7.2% for tissue extracts from laboratory-housed rats to 58.4% for fish plasma. In addition, for some of the datasets we confirm that the spectral RSD values are largely invariant across different spectral processing methods, such as baseline correction, normalisation and binning resolution. In conclusion, we propose spectral RSDs and their median values contained herein as practical benchmarks for metabolomics studies.
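A minimal sketch of the spectral RSD benchmark, assuming a simulated matrix of replicate spectra (samples x spectral bins); the real datasets and bin definitions are of course those of the study:

```python
import numpy as np

rng = np.random.default_rng(1)
spectra = rng.lognormal(mean=2.0, sigma=0.3, size=(20, 500))  # 20 samples x 500 bins (toy data)

# Per-feature relative standard deviation (%) across samples, summarised by its median.
rsd = 100.0 * spectra.std(axis=0, ddof=1) / spectra.mean(axis=0)
print(f"median spectral RSD: {np.median(rsd):.1f}%")
# A boxplot of `rsd` (e.g., matplotlib's plt.boxplot) reproduces the succinct
# spectrum-wide summary display described above.
```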
NASA Astrophysics Data System (ADS)
Appel, Marius; Lahn, Florian; Buytaert, Wouter; Pebesma, Edzer
2018-04-01
Earth observation (EO) datasets are commonly provided as collection of scenes, where individual scenes represent a temporal snapshot and cover a particular region on the Earth's surface. Using these data in complex spatiotemporal modeling becomes difficult as soon as data volumes exceed a certain capacity or analyses include many scenes, which may spatially overlap and may have been recorded at different dates. In order to facilitate analytics on large EO datasets, we combine and extend the geospatial data abstraction library (GDAL) and the array-based data management and analytics system SciDB. We present an approach to automatically convert collections of scenes to multidimensional arrays and use SciDB to scale computationally intensive analytics. We evaluate the approach in three study cases on national scale land use change monitoring with Landsat imagery, global empirical orthogonal function analysis of daily precipitation, and combining historical climate model projections with satellite-based observations. Results indicate that the approach can be used to represent various EO datasets and that analyses in SciDB scale well with available computational resources. To simplify analyses of higher-dimensional datasets as from climate model output, however, a generalization of the GDAL data model might be needed. All parts of this work have been implemented as open-source software and we discuss how this may facilitate open and reproducible EO analyses.
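A hedged sketch of the first step in such a workflow: reading a collection of co-registered scenes with GDAL and stacking them into one multidimensional (time x rows x cols) array. The file names are placeholders, the scenes are assumed to share grid and projection, and the SciDB ingestion step is omitted:

```python
import numpy as np
from osgeo import gdal

scene_paths = ["scene_2016_01.tif", "scene_2016_02.tif", "scene_2016_03.tif"]

bands = []
for path in scene_paths:
    ds = gdal.Open(path)
    bands.append(ds.GetRasterBand(1).ReadAsArray())  # first band of each scene
    ds = None                                        # close the dataset

cube = np.stack(bands, axis=0)                       # shape: (time, rows, cols)
print(cube.shape)
```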
Abatzoglou, John T; Dobrowski, Solomon Z; Parks, Sean A; Hegewisch, Katherine C
2018-01-09
We present TerraClimate, a dataset of high-spatial resolution (1/24°, ~4-km) monthly climate and climatic water balance for global terrestrial surfaces from 1958-2015. TerraClimate uses climatically aided interpolation, combining high-spatial resolution climatological normals from the WorldClim dataset, with coarser resolution time varying (i.e., monthly) data from other sources to produce a monthly dataset of precipitation, maximum and minimum temperature, wind speed, vapor pressure, and solar radiation. TerraClimate additionally produces monthly surface water balance datasets using a water balance model that incorporates reference evapotranspiration, precipitation, temperature, and interpolated plant extractable soil water capacity. These data provide important inputs for ecological and hydrological studies at global scales that require high spatial resolution and time varying climate and climatic water balance data. We validated spatiotemporal aspects of TerraClimate using annual temperature, precipitation, and calculated reference evapotranspiration from station data, as well as annual runoff from streamflow gauges. TerraClimate datasets showed noted improvement in overall mean absolute error and increased spatial realism relative to coarser resolution gridded datasets.
NASA Astrophysics Data System (ADS)
Abatzoglou, John T.; Dobrowski, Solomon Z.; Parks, Sean A.; Hegewisch, Katherine C.
2018-01-01
We present TerraClimate, a dataset of high-spatial resolution (1/24°, ~4-km) monthly climate and climatic water balance for global terrestrial surfaces from 1958-2015. TerraClimate uses climatically aided interpolation, combining high-spatial resolution climatological normals from the WorldClim dataset, with coarser resolution time varying (i.e., monthly) data from other sources to produce a monthly dataset of precipitation, maximum and minimum temperature, wind speed, vapor pressure, and solar radiation. TerraClimate additionally produces monthly surface water balance datasets using a water balance model that incorporates reference evapotranspiration, precipitation, temperature, and interpolated plant extractable soil water capacity. These data provide important inputs for ecological and hydrological studies at global scales that require high spatial resolution and time varying climate and climatic water balance data. We validated spatiotemporal aspects of TerraClimate using annual temperature, precipitation, and calculated reference evapotranspiration from station data, as well as annual runoff from streamflow gauges. TerraClimate datasets showed noted improvement in overall mean absolute error and increased spatial realism relative to coarser resolution gridded datasets.
A Manual Segmentation Tool for Three-Dimensional Neuron Datasets.
Magliaro, Chiara; Callara, Alejandro L; Vanello, Nicola; Ahluwalia, Arti
2017-01-01
To date, automated or semi-automated software and algorithms for segmentation of neurons from three-dimensional imaging datasets have had limited success. The gold standard for neural segmentation is considered to be the manual isolation performed by an expert. To facilitate the manual isolation of complex objects from image stacks, such as neurons in their native arrangement within the brain, a new Manual Segmentation Tool (ManSegTool) has been developed. ManSegTool allows users to load an image stack, scroll through the images and manually draw the structures of interest stack-by-stack. Users can eliminate unwanted regions or split structures (i.e., branches from different neurons that are too close to each other but, to the experienced eye, clearly belong to a unique cell), view the object in 3D and save the results obtained. The tool can be used for testing the performance of a single-neuron segmentation algorithm or to extract complex objects where the available automated methods still fail. Here we describe the software's main features and then show an example of how ManSegTool can be used to segment neuron images acquired using a confocal microscope. In particular, expert neuroscientists were asked to segment different neurons from which morphometric variables were subsequently extracted as a benchmark for precision. In addition, a literature-defined index for evaluating the goodness of segmentation was used as a benchmark for accuracy. Neocortical layer axons from a DIADEM challenge dataset were also segmented with ManSegTool and compared with the manual "gold-standard" generated for the competition.
Baster, I.; Girardclos, S.; Pugin, A.; Wildi, W.
2003-01-01
A high-resolution seismic survey was conducted in western Lake Geneva on a small delta formed by the Promenthouse, the Asse and the Boiron rivers. This dataset provides information on changes in the geometry and sedimentation patterns of this delta from the Late-glacial to the Present. The geometry of the deposits of the lacustrine delta has been mapped using 300-m spaced grid lines acquired with a 12 kHz Echosounder subbottom profiler. A complete three-dimensional image of the sediment architecture was reconstructed through seismic stratigraphic analysis. Six different delta lobes have been recognized in the prodelta area. Depositional centers and the lateral extension of the delta have changed through time, indicating migration and fluctuation of river input as well as changes in lake currents and wind regime from the time of glacier retreat to the Present. The delta slope is characterized by high instability, which causes slumps to develop, and by the accumulation of biogenic gas that prevents seismic penetration.
Applying machine-learning techniques to Twitter data for automatic hazard-event classification.
NASA Astrophysics Data System (ADS)
Filgueira, R.; Bee, E. J.; Diaz-Doce, D.; Poole, J., Sr.; Singh, A.
2017-12-01
The constant flow of information offered by tweets provides valuable information about all sorts of events at high temporal and spatial resolution. Over the past year we have been analyzing geological hazards/phenomena, such as earthquakes, volcanic eruptions, landslides, floods or the aurora, in real time as part of the GeoSocial project, by geo-locating tweets filtered by keywords in a web map. However, not all the filtered tweets are related to hazard/phenomenon events. This work explores two classification techniques for automatic hazard-event categorization based on tweets about the "Aurora". First, tweets were filtered using aurora-related keywords, removing stop words and selecting the ones written in English. For classifying the remaining tweets into the "aurora-event" or "no-aurora-event" categories, we compared two state-of-the-art techniques: Support Vector Machine (SVM) and deep Convolutional Neural Network (CNN) algorithms. Both approaches belong to the family of supervised learning algorithms, which make predictions based on a labelled training dataset. Therefore, we created a training dataset by tagging 1200 tweets with these two categories. The general form of SVM is used to separate two classes by a function (kernel). We compared the performance of four different classifiers (Linear Regression, Logistic Regression, Multinomial Naïve Bayes and Stochastic Gradient Descent) provided by the Scikit-Learn library, using our training dataset to build the classifier. The results showed that Logistic Regression (LR) achieved the best accuracy (87%), so we selected the LR classifier to categorise a large collection of tweets using the "dispel4py" framework. Later, we developed a CNN classifier, where the first layer embeds words into low-dimensional vectors. The next layer performs convolutions over the embedded word vectors. Results from the convolutional layer are max-pooled into a long feature vector, which is classified using a softmax layer. The CNN's accuracy is lower (83%) than that of the LR classifier, since the algorithm needs a bigger training dataset to increase its accuracy. We used the TensorFlow framework for applying the CNN classifier to the same collection of tweets. In future work we will modify both classifiers to work with other geo-hazards, use larger training datasets and apply them in real time.
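A hedged sketch of the supervised branch of this workflow: a TF-IDF representation of tweets fed to two of the scikit-learn classifiers mentioned above (Logistic Regression and a linear SVM). The example tweets and labels are invented placeholders standing in for the ~1200 manually tagged aurora tweets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

tweets = ["amazing aurora over Tromso tonight", "aurora borealis forecast looks strong",
          "listening to the band Aurora", "my cat is called Aurora"]
labels = [1, 1, 0, 0]                                 # 1 = aurora-event, 0 = no-aurora-event

for clf in (LogisticRegression(max_iter=1000), LinearSVC()):
    pipe = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
    acc = cross_val_score(pipe, tweets, labels, cv=2).mean()  # tiny toy cross-validation
    print(type(clf).__name__, f"accuracy: {acc:.2f}")
```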
The Community Intercomparison Suite (CIS)
NASA Astrophysics Data System (ADS)
Watson-Parris, Duncan; Schutgens, Nick; Cook, Nick; Kipling, Zak; Kershaw, Phil; Gryspeerdt, Ed; Lawrence, Bryan; Stier, Philip
2017-04-01
Earth observations (both remote and in-situ) create vast amounts of data, providing invaluable constraints for the climate science community. Efficient exploitation of these complex and highly heterogeneous datasets has, however, been limited by the lack of suitable software tools, particularly for the comparison of gridded and ungridded data, thus reducing scientific productivity. CIS (http://cistools.net) is an open-source, command line tool and Python library which allows the straight-forward quantitative analysis, intercomparison and visualisation of remote sensing, in-situ and model data. CIS can read gridded and ungridded remote sensing, in-situ and model data, and many other data sources 'out-of-the-box', such as the ESA Aerosol and Cloud CCI products, MODIS, CloudSat and AERONET. Perhaps most importantly, however, CIS also employs a modular plugin architecture to allow for the reading of limitless different data types. Users are able to write their own plugins for reading the data sources with which they are familiar, and share them within the community, allowing all to benefit from their expertise. To enable the intercomparison of these data, CIS provides a number of operations including: the aggregation of ungridded and gridded datasets to coarser representations using a number of different built-in averaging kernels; the subsetting of data to reduce its extent or dimensionality; the co-location of two distinct datasets onto a single set of co-ordinates; the visualisation of the input or output data through a number of different plots and graphs; the evaluation of arbitrary mathematical expressions against any number of datasets; and a number of other supporting functions such as a statistical comparison of two co-located datasets. These operations can be performed efficiently on local machines or large computing clusters, and CIS is already available on the JASMIN computing facility. A case study using the GASSP collection of in-situ aerosol observations demonstrates the power of using CIS to perform model evaluations. The use of an open-source, community-developed tool in this way opens up a huge amount of data which would previously have been inaccessible to many users, while also providing replicable, repeatable analysis which scientists and policy-makers alike can trust and understand.
Jiang, Yueyang; Kim, John B.; Still, Christopher J.; Kerns, Becky K.; Kline, Jeffrey D.; Cunningham, Patrick G.
2018-01-01
Statistically downscaled climate data have been widely used to explore possible impacts of climate change in various fields of study. Although many studies have focused on characterizing differences in the downscaling methods, few studies have evaluated actual downscaled datasets being distributed publicly. Spatially focusing on the Pacific Northwest, we compare five statistically downscaled climate datasets distributed publicly in the US: ClimateNA, NASA NEX-DCP30, MACAv2-METDATA, MACAv2-LIVNEH and WorldClim. We compare the downscaled projections of climate change, and the associated observational data used as training data for downscaling. We map and quantify the variability among the datasets and characterize the spatio-temporal patterns of agreement and disagreement among the datasets. Pair-wise comparisons of datasets identify the coast and high-elevation areas as areas of disagreement for temperature. For precipitation, high-elevation areas, rainshadows and the dry, eastern portion of the study area have high dissimilarity among the datasets. By spatially aggregating the variability measures into watersheds, we develop guidance for selecting datasets within the Pacific Northwest climate change impact studies. PMID:29461513
Jiang, Yueyang; Kim, John B; Still, Christopher J; Kerns, Becky K; Kline, Jeffrey D; Cunningham, Patrick G
2018-02-20
Statistically downscaled climate data have been widely used to explore possible impacts of climate change in various fields of study. Although many studies have focused on characterizing differences in the downscaling methods, few studies have evaluated actual downscaled datasets being distributed publicly. Spatially focusing on the Pacific Northwest, we compare five statistically downscaled climate datasets distributed publicly in the US: ClimateNA, NASA NEX-DCP30, MACAv2-METDATA, MACAv2-LIVNEH and WorldClim. We compare the downscaled projections of climate change, and the associated observational data used as training data for downscaling. We map and quantify the variability among the datasets and characterize the spatio-temporal patterns of agreement and disagreement among the datasets. Pair-wise comparisons of datasets identify the coast and high-elevation areas as areas of disagreement for temperature. For precipitation, high-elevation areas, rainshadows and the dry, eastern portion of the study area have high dissimilarity among the datasets. By spatially aggregating the variability measures into watersheds, we develop guidance for selecting datasets within the Pacific Northwest climate change impact studies.
Conducting high-value secondary dataset analysis: an introductory guide and resources.
Smith, Alexander K; Ayanian, John Z; Covinsky, Kenneth E; Landon, Bruce E; McCarthy, Ellen P; Wee, Christina C; Steinman, Michael A
2011-08-01
Secondary analyses of large datasets provide a mechanism for researchers to address high impact questions that would otherwise be prohibitively expensive and time-consuming to study. This paper presents a guide to assist investigators interested in conducting secondary data analysis, including advice on the process of successful secondary data analysis as well as a brief summary of high-value datasets and online resources for researchers, including the SGIM dataset compendium ( www.sgim.org/go/datasets ). The same basic research principles that apply to primary data analysis apply to secondary data analysis, including the development of a clear and clinically relevant research question, study sample, appropriate measures, and a thoughtful analytic approach. A real-world case description illustrates key steps: (1) define your research topic and question; (2) select a dataset; (3) get to know your dataset; and (4) structure your analysis and presentation of findings in a way that is clinically meaningful. Secondary dataset analysis is a well-established methodology. Secondary analysis is particularly valuable for junior investigators, who have limited time and resources to demonstrate expertise and productivity.
[Parallel virtual reality visualization of extreme large medical datasets].
Tang, Min
2010-04-01
On the basis of a brief description of grid computing, the essence and critical techniques of parallel visualization of extremely large medical datasets are discussed in connection with the Intranet and commonly configured computers of hospitals. Several kernel techniques are introduced, including the hardware structure, software framework, load balancing and virtual reality visualization. The Maximum Intensity Projection algorithm is parallelized using a common PC cluster. In the virtual reality world, three-dimensional models can be rotated, zoomed, translated and cut interactively and conveniently through a control panel built on the virtual reality modeling language (VRML). Experimental results demonstrate that this method provides promising, real-time results and can play the role of a good assistant in making clinical diagnoses.
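A minimal sketch of the core Maximum Intensity Projection step, parallelized over slabs of the volume on a single machine (the volume is synthetic; the cluster-level distribution and VRML front end described above are omitted):

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def slab_mip(slab):
    return slab.max(axis=0)                   # project a slab along the viewing (z) axis

if __name__ == "__main__":
    volume = np.random.rand(256, 512, 512)    # synthetic CT-like volume (z, y, x)
    slabs = np.array_split(volume, 4, axis=0) # one slab per worker
    with ProcessPoolExecutor(max_workers=4) as pool:
        partials = list(pool.map(slab_mip, slabs))
    mip = np.max(np.stack(partials), axis=0)  # combine partial projections
    print(mip.shape)                          # (512, 512) projected image
```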
Visualising nursing data using correspondence analysis.
Kokol, Peter; Blažun Vošner, Helena; Železnik, Danica
2016-09-01
Digitally stored, large healthcare datasets enable nurses to use 'big data' techniques and tools in nursing research. Big data is complex and multi-dimensional, so visualisation may be a preferable approach for analysing and understanding it. The aim of this paper is to demonstrate the visualisation of big data using a technique called correspondence analysis. In the authors' study, relations among data in a nursing dataset were shown visually in graphs using correspondence analysis. The case presented demonstrates that correspondence analysis is easy to use, shows relations between data visually in a form that is simple to interpret, and can reveal hidden associations between data. Correspondence analysis supports the discovery of new knowledge. Implications for practice: knowledge obtained using correspondence analysis can be transferred immediately into practice or used to foster further research.
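A hedged sketch of simple correspondence analysis on a small contingency table; the counts below (wards by diagnosis categories) are invented for illustration. Plotting the resulting row and column coordinates together is what reveals the associations described above:

```python
import numpy as np

N = np.array([[20,  5,  2],      # rows: wards, columns: diagnosis categories (toy data)
              [ 4, 18,  6],
              [ 3,  7, 25]], dtype=float)

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = np.diag(1 / np.sqrt(r)) @ (P - np.outer(r, c)) @ np.diag(1 / np.sqrt(c))
U, s, Vt = np.linalg.svd(S)                 # SVD of the standardised residuals

row_coords = np.diag(1 / np.sqrt(r)) @ U[:, :2] * s[:2]   # principal row coordinates
col_coords = np.diag(1 / np.sqrt(c)) @ Vt.T[:, :2] * s[:2]
print(row_coords.round(2))
print(col_coords.round(2))
```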
Peleato, Nicolas M; Legge, Raymond L; Andrews, Robert C
2018-06-01
The use of fluorescence data coupled with neural networks for improved predictability of drinking water disinfection by-products (DBPs) was investigated. Novel application of autoencoders to process high-dimensional fluorescence data was related to common dimensionality reduction techniques of parallel factors analysis (PARAFAC) and principal component analysis (PCA). The proposed method was assessed based on component interpretability as well as for prediction of organic matter reactivity to formation of DBPs. Optimal prediction accuracies on a validation dataset were observed with an autoencoder-neural network approach or by utilizing the full spectrum without pre-processing. Latent representation by an autoencoder appeared to mitigate overfitting when compared to other methods. Although DBP prediction error was minimized by other pre-processing techniques, PARAFAC yielded interpretable components which resemble fluorescence expected from individual organic fluorophores. Through analysis of the network weights, fluorescence regions associated with DBP formation can be identified, representing a potential method to distinguish reactivity between fluorophore groupings. However, distinct results due to the applied dimensionality reduction approaches were observed, dictating a need for considering the role of data pre-processing in the interpretability of the results. In comparison to common organic measures currently used for DBP formation prediction, fluorescence was shown to improve prediction accuracies, with improvements to DBP prediction best realized when appropriate pre-processing and regression techniques were applied. The results of this study show promise for the potential application of neural networks to best utilize fluorescence EEM data for prediction of organic matter reactivity. Copyright © 2018 Elsevier Ltd. All rights reserved.
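A hedged sketch of the autoencoder-plus-regressor idea: compress flattened fluorescence EEM spectra into a low-dimensional latent representation, then regress DBP formation on the latent features. Layer sizes, latent dimension and the random stand-in data are assumptions for illustration, not the authors' configuration:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

n_features, latent_dim = 2000, 10            # flattened EEM size and latent size (assumed)
X = np.random.rand(200, n_features).astype("float32")   # placeholder EEM data
y = np.random.rand(200, 1).astype("float32")            # placeholder DBP concentrations

inp = layers.Input(shape=(n_features,))
encoded = layers.Dense(256, activation="relu")(inp)
encoded = layers.Dense(latent_dim, activation="relu")(encoded)
decoded = layers.Dense(256, activation="relu")(encoded)
decoded = layers.Dense(n_features, activation="linear")(decoded)

autoencoder = Model(inp, decoded)
encoder = Model(inp, encoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)   # learn the latent representation

regressor = tf.keras.Sequential([tf.keras.Input(shape=(latent_dim,)),
                                 layers.Dense(32, activation="relu"),
                                 layers.Dense(1)])
regressor.compile(optimizer="adam", loss="mse")
regressor.fit(encoder.predict(X), y, epochs=5, verbose=0)   # predict DBPs from latent features
```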
NASA Astrophysics Data System (ADS)
Takayama, T.; Iwasaki, A.
2016-06-01
Above-ground biomass prediction of tropical rain forest using remote sensing data is of paramount importance for continuous large-area forest monitoring. Hyperspectral data can provide rich spectral information for biomass prediction; however, prediction accuracy is affected by the small-sample-size problem, which commonly manifests as overfitting when the number of training samples is smaller than the dimensionality of the data, since the time, cost, and human resources available for field surveys are limited. A common approach to addressing this problem is reducing the dimensionality of the dataset. In addition, acquired hyperspectral data usually have a low signal-to-noise ratio due to narrow bandwidths, and exhibit local or global peak shifts caused by instrumental instability or small differences in practical measurement conditions. In this work, we propose a methodology based on fused lasso regression that selects optimal bands for the biomass prediction model by encouraging sparsity and grouping: sparsity reduces the dimensionality and thereby addresses the small-sample-size problem, while grouping mitigates the noise and peak-shift problems. The prediction model provided higher accuracy, with a root-mean-square error (RMSE) of 66.16 t/ha in cross-validation, than other methods (multiple linear regression, partial least squares regression, and lasso regression). Furthermore, fusion of spectral and spatial information derived from a texture index increased the prediction accuracy further, with an RMSE of 62.62 t/ha. This analysis demonstrates the efficiency of fused lasso and image texture in biomass estimation of tropical forests.
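The fused lasso objective assumed behind this band-selection strategy can be written as follows (notation introduced here for illustration: y is the biomass vector, X the spectra, and beta_j the coefficient of band j); the first penalty encourages sparsity, the second penalizes differences between adjacent bands and hence encourages grouping:

```latex
\hat{\beta} = \arg\min_{\beta}\;
  \tfrac{1}{2}\,\lVert y - X\beta \rVert_2^2
  + \lambda_1 \sum_{j=1}^{p} \lvert \beta_j \rvert
  + \lambda_2 \sum_{j=2}^{p} \lvert \beta_j - \beta_{j-1} \rvert
```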
FUn: a framework for interactive visualizations of large, high-dimensional datasets on the web.
Probst, Daniel; Reymond, Jean-Louis
2018-04-15
During the past decade, big data have become a major tool in scientific endeavors. Although statistical methods and algorithms are well-suited for analyzing and summarizing enormous amounts of data, the results do not allow for a visual inspection of the entire dataset. Current scientific software, including R packages and Python libraries such as ggplot2, matplotlib and plot.ly, does not support interactive visualizations of datasets exceeding 100 000 data points on the web. Other solutions enable the web-based visualization of big data only through data reduction or statistical representations. However, recent hardware developments, especially advancements in graphical processing units, allow for the rendering of millions of data points on a wide range of consumer hardware such as laptops, tablets and mobile phones. Similar to the challenges and opportunities brought to virtually every scientific field by big data, the visualization of and interaction with copious amounts of data are both demanding and hold great promise. Here we present FUn, a framework consisting of a client (Faerun) and server (Underdark) module, facilitating the creation of web-based, interactive 3D visualizations of large datasets and enabling record-level visual inspection. We also introduce a reference implementation providing access to SureChEMBL, a database containing patent information on more than 17 million chemical compounds. The source code and the most recent builds of Faerun and Underdark, Lore.js and the data preprocessing toolchain used in the reference implementation are available on the project website (http://doc.gdb.tools/fun/). daniel.probst@dcb.unibe.ch or jean-louis.reymond@dcb.unibe.ch.
Mapping the spatial distribution of Aedes aegypti and Aedes albopictus.
Ding, Fangyu; Fu, Jingying; Jiang, Dong; Hao, Mengmeng; Lin, Gang
2018-02-01
Mosquito-borne infectious diseases, such as Rift Valley fever, Dengue, Chikungunya and Zika, have caused mass human death, with their transnational expansion fueled by economic globalization. Simulating the distribution of the disease vectors is of great importance in formulating public health planning and disease control strategies. In the present study, we simulated the global distribution of Aedes aegypti and Aedes albopictus at a 5 × 5 km spatial resolution with high-dimensional multidisciplinary datasets and machine learning methods. Three relatively popular and robust machine learning models, including support vector machine (SVM), gradient boosting machine (GBM) and random forest (RF), were used. During the fine-tuning process based on training datasets of A. aegypti and A. albopictus, RF models achieved the highest performance with an area under the curve (AUC) of 0.973 and 0.974, respectively, followed by GBM (AUC of 0.971 and 0.972, respectively) and SVM (AUC of 0.963 and 0.964, respectively) models. The simulation difference between RF and GBM models was not statistically significant (p > 0.05) based on the validation datasets, whereas statistically significant differences (p < 0.05) were observed for RF and GBM simulations compared with SVM simulations. From the simulated maps derived from RF models, we observed that the distribution of A. albopictus was wider than that of A. aegypti along a latitudinal gradient. The discriminatory power of each factor in simulating the global distribution of the two species was also analyzed. Our results provide fundamental information for further study on disease transmission simulation and risk assessment. Copyright © 2017 Elsevier B.V. All rights reserved.
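A hedged sketch of the model-comparison step: fit RF, GBM and SVM classifiers to presence/background points with environmental covariates and compare them by AUC. The synthetic data stand in for the multidisciplinary covariates and occurrence records used in the study:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=12, random_state=0)  # toy covariates
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = {"RF": RandomForestClassifier(n_estimators=500, random_state=0),
          "GBM": GradientBoostingClassifier(random_state=0),
          "SVM": SVC(probability=True, random_state=0)}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUC = {auc:.3f}")
```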
Dong, Ni; Huang, Helai; Zheng, Liang
2015-09-01
In zone-level crash prediction, accounting for spatial dependence has become an extensively studied topic. This study proposes a Support Vector Machine (SVM) model to address complex, large and multi-dimensional spatial data in crash prediction. A Correlation-based Feature Selector (CFS) was applied to evaluate candidate factors possibly related to zonal crash frequency when handling high-dimensional spatial data. To demonstrate the proposed approaches and to compare them with the Bayesian spatial model with conditional autoregressive prior (i.e., CAR), a dataset from Hillsborough County, Florida was employed. The results showed that SVM models accounting for spatial proximity outperform the non-spatial model in terms of model fitting and predictive performance, which indicates the reasonableness of considering cross-zonal spatial correlations. The best predictive capability, relatively, is associated with the model considering proximity based on centroid distance, choosing the RBF kernel and setting 10% of the whole dataset as the testing data, which further exhibits the SVM models' capacity for addressing comparatively complex spatial data in regional crash prediction modeling. Moreover, SVM models exhibit better goodness-of-fit compared with CAR models when utilizing the whole dataset as the sample. A sensitivity analysis of the centroid-distance-based spatial SVM models was conducted to capture the impacts of explanatory variables on the mean predicted probabilities of crash occurrence. The results conform to the coefficient estimates in the CAR models, which supports the employment of the SVM model as an alternative in regional safety modeling. Copyright © 2015 Elsevier Ltd. All rights reserved.
Kilborn, Joshua P; Jones, David L; Peebles, Ernst B; Naar, David F
2017-04-01
Clustering data continues to be a highly active area of data analysis, and resemblance profiles are being incorporated into ecological methodologies as a hypothesis testing-based approach to clustering multivariate data. However, these new clustering techniques have not been rigorously tested to determine the performance variability based on the algorithm's assumptions or any underlying data structures. Here, we use simulation studies to estimate the statistical error rates for the hypothesis test for multivariate structure based on dissimilarity profiles (DISPROF). We concurrently tested a widely used algorithm that employs the unweighted pair group method with arithmetic mean (UPGMA) to estimate the proficiency of clustering with DISPROF as a decision criterion. We simulated unstructured multivariate data from different probability distributions with increasing numbers of objects and descriptors, and grouped data with increasing overlap, overdispersion for ecological data, and correlation among descriptors within groups. Using simulated data, we measured the resolution and correspondence of clustering solutions achieved by DISPROF with UPGMA against the reference grouping partitions used to simulate the structured test datasets. Our results highlight the dynamic interactions between dataset dimensionality, group overlap, and the properties of the descriptors within a group (i.e., overdispersion or correlation structure) that are relevant to resemblance profiles as a clustering criterion for multivariate data. These methods are particularly useful for multivariate ecological datasets that benefit from distance-based statistical analyses. We propose guidelines for using DISPROF as a clustering decision tool that will help future users avoid potential pitfalls during the application of methods and the interpretation of results.
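A hedged sketch of the clustering side of this approach: UPGMA (average-linkage) clustering of a Bray-Curtis dissimilarity matrix computed from a multivariate ecological table. The DISPROF significance test that decides where to stop dividing groups is not implemented here; an illustrative fixed cut is used instead, and the abundance data are simulated:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
abundances = rng.poisson(lam=5, size=(30, 12))        # 30 sites x 12 species (toy data)

d = pdist(abundances, metric="braycurtis")            # condensed dissimilarity matrix
Z = linkage(d, method="average")                      # UPGMA agglomeration
groups = fcluster(Z, t=3, criterion="maxclust")       # illustrative 3-group cut
print(groups)
```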
Collection of sequential imaging events for research in breast cancer screening
NASA Astrophysics Data System (ADS)
Patel, M. N.; Young, K.; Halling-Brown, M. D.
2016-03-01
Due to the huge amount of research involving medical images, there is a widely accepted need for comprehensive collections of medical images to be made available for research. This demand led to the design and implementation of a flexible image repository, which retrospectively collects images and data from multiple sites throughout the UK. The OPTIMAM Medical Image Database (OMI-DB) was created to provide a centralized, fully annotated dataset for research. The database contains both processed and unprocessed images, associated data, annotations and expert-determined ground truths. Collection has been ongoing for over three years, providing the opportunity to collect sequential imaging events. Extensive alterations to the identification, collection, processing and storage arms of the system have been undertaken to support the introduction of sequential events, including interval cancers. These updates to the collection systems allow the acquisition of many more images, but more importantly, allow one to build on the existing high-dimensional data stored in the OMI-DB. A research dataset of this scale, which includes original normal and subsequent malignant cases along with expert derived and clinical annotations, is currently unique. These data provide a powerful resource for future research and has initiated new research projects, amongst which, is the quantification of normal cases by applying a large number of quantitative imaging features, with a priori knowledge that eventually these cases develop a malignancy. This paper describes, extensions to the OMI-DB collection systems and tools and discusses the prospective applications of having such a rich dataset for future research applications.
Analysis of multispectral and hyperspectral longwave infrared (LWIR) data for geologic mapping
NASA Astrophysics Data System (ADS)
Kruse, Fred A.; McDowell, Meryl
2015-05-01
Multispectral MODIS/ASTER Airborne Simulator (MASTER) data and Hyperspectral Thermal Emission Spectrometer (HyTES) data covering the 8-12 μm spectral range (longwave infrared or LWIR) were analyzed for an area near Mountain Pass, California. Decorrelation-stretched images were initially used to highlight spectral differences between geologic materials. Both datasets were atmospherically corrected using the ISAC method, and the Normalized Emissivity approach was used to separate temperature and emissivity. The MASTER data had 10 LWIR spectral bands and approximately 35-meter spatial resolution and covered a larger area than the HyTES data, which were collected with 256 narrow (approximately 17 nm wide) spectral bands at approximately 2.3-meter spatial resolution. Spectra for key spatially coherent, spectrally determined geologic units in overlap areas were overlain and visually compared to determine similarities and differences. Endmember spectra were extracted from both datasets using n-dimensional scatterplotting and compared to emissivity spectral libraries for identification. Endmember distributions and abundances were then mapped using Mixture-Tuned Matched Filtering (MTMF), a partial unmixing approach. Multispectral results demonstrate separation of silica-rich versus non-silicate materials, with distinct mapping of carbonate areas and general correspondence to the regional geology. Hyperspectral results illustrate refined mapping of silicates with distinction between similar units based on the position, character, and shape of high-resolution emission minima near 9 μm. Calcite and dolomite were separated, identified, and mapped using HyTES based on a shift of the main carbonate emissivity minimum from approximately 11.3 to 11.2 μm, respectively. Both datasets demonstrate the utility of LWIR spectral remote sensing for geologic mapping.
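A hedged sketch of the basic matched-filter score underlying MTMF-style target mapping: each pixel spectrum is scored against a target (endmember) spectrum relative to the scene mean and covariance. The mixture-tuning (infeasibility) step of MTMF is omitted and the data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
pixels = rng.normal(size=(10000, 50))       # 10,000 pixel spectra, 50 bands (toy data)
target = rng.normal(size=50)                # endmember, e.g., a library emissivity spectrum

mu = pixels.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(pixels, rowvar=False))
w = cov_inv @ (target - mu)
scores = (pixels - mu) @ w / ((target - mu) @ w)   # ~1 for pure target, ~0 for background
print(scores.min(), scores.max())
```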
A study of metaheuristic algorithms for high dimensional feature selection on microarray data
NASA Astrophysics Data System (ADS)
Dankolo, Muhammad Nasiru; Radzi, Nor Haizan Mohamed; Sallehuddin, Roselina; Mustaffa, Noorfa Haszlinna
2017-11-01
Microarray systems enable experts to examine gene profiles at the molecular level using machine learning algorithms. They increase the potential for classification and diagnosis of many diseases at the gene expression level. However, numerous difficulties may affect the efficiency of machine learning algorithms, including the vast number of gene features contained in the original data, many of which may be unrelated to the intended analysis. Therefore, feature selection is necessary during data pre-processing. Many feature selection algorithms have been developed and applied to microarray data, including metaheuristic optimization algorithms. This paper discusses the application of metaheuristic algorithms for feature selection in microarray datasets. This study reveals that the algorithms yield interesting results with limited resources, thereby reducing the computational expense of machine learning algorithms.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Li, Xiang; Cox, Jonathan T.; Huang, Weiliang
2016-12-06
Reversible protein phosphorylation regulates essentially all cellular activities. Aberrant protein phosphorylation is an etiological factor in a wide array of diseases, including cancer [1], diabetes [2], and Alzheimer's [3]. Given the broad impact of protein phosphorylation on cellular biology and organismal health, understanding how protein phosphorylation is regulated and the consequences of gain and loss of phosphoryl moieties from proteins is of primary importance. Advances in instrumentation, particularly in mass spectrometry, coupled with high-throughput approaches have recently yielded large datasets cataloging tens of thousands of protein phosphorylation sites in multiple organisms [4-6]. While these studies are seminal in terms of data collection, our understanding of protein phosphorylation regulation remains largely one-dimensional.
Machine learning for neuroimaging with scikit-learn.
Abraham, Alexandre; Pedregosa, Fabian; Eickenberg, Michael; Gervais, Philippe; Mueller, Andreas; Kossaifi, Jean; Gramfort, Alexandre; Thirion, Bertrand; Varoquaux, Gaël
2014-01-01
Statistical machine learning methods are increasingly used for neuroimaging data analysis. Their main virtue is their ability to model high-dimensional datasets, e.g., multivariate analysis of activation images or resting-state time series. Supervised learning is typically used in decoding or encoding settings to relate brain images to behavioral or clinical observations, while unsupervised learning can uncover hidden structures in sets of images (e.g., resting state functional MRI) or find sub-populations in large cohorts. By considering different functional neuroimaging applications, we illustrate how scikit-learn, a Python machine learning library, can be used to perform some key analysis steps. Scikit-learn contains a very large set of statistical learning algorithms, both supervised and unsupervised, and its application to neuroimaging data provides a versatile tool to study the brain.
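A minimal sketch of a typical decoding analysis of the kind scikit-learn is used for in this setting: predicting a behavioural or clinical label from high-dimensional activation patterns with a cross-validated linear SVM. The simulated arrays stand in for masked fMRI voxel values and subject labels:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_subjects, n_voxels = 80, 5000
X = rng.normal(size=(n_subjects, n_voxels))            # voxel features per subject (toy data)
y = rng.integers(0, 2, size=n_subjects)                # condition / clinical label

decoder = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=5000))
scores = cross_val_score(decoder, X, y, cv=5)          # 5-fold cross-validated decoding
print(f"decoding accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```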
Machine learning for neuroimaging with scikit-learn
Abraham, Alexandre; Pedregosa, Fabian; Eickenberg, Michael; Gervais, Philippe; Mueller, Andreas; Kossaifi, Jean; Gramfort, Alexandre; Thirion, Bertrand; Varoquaux, Gaël
2014-01-01
Statistical machine learning methods are increasingly used for neuroimaging data analysis. Their main virtue is their ability to model high-dimensional datasets, e.g., multivariate analysis of activation images or resting-state time series. Supervised learning is typically used in decoding or encoding settings to relate brain images to behavioral or clinical observations, while unsupervised learning can uncover hidden structures in sets of images (e.g., resting state functional MRI) or find sub-populations in large cohorts. By considering different functional neuroimaging applications, we illustrate how scikit-learn, a Python machine learning library, can be used to perform some key analysis steps. Scikit-learn contains a very large set of statistical learning algorithms, both supervised and unsupervised, and its application to neuroimaging data provides a versatile tool to study the brain. PMID:24600388
Ji, Jiadong; He, Di; Feng, Yang; He, Yong; Xue, Fuzhong; Xie, Lei
2017-10-01
A complex disease is usually driven by a number of genes interwoven into networks, rather than a single gene product. Network comparison or differential network analysis has become an important means of revealing the underlying mechanism of pathogenesis and identifying clinical biomarkers for disease classification. Most studies, however, are limited to network correlations that mainly capture the linear relationship among genes, or rely on the assumption of a parametric probability distribution of gene measurements. They are restrictive in real application. We propose a new Joint density based non-parametric Differential Interaction Network Analysis and Classification (JDINAC) method to identify differential interaction patterns of network activation between two groups. At the same time, JDINAC uses the network biomarkers to build a classification model. The novelty of JDINAC lies in its potential to capture non-linear relations between molecular interactions using high-dimensional sparse data as well as to adjust confounding factors, without the need of the assumption of a parametric probability distribution of gene measurements. Simulation studies demonstrate that JDINAC provides more accurate differential network estimation and lower classification error than that achieved by other state-of-the-art methods. We apply JDINAC to a Breast Invasive Carcinoma dataset, which includes 114 patients who have both tumor and matched normal samples. The hub genes and differential interaction patterns identified were consistent with existing experimental studies. Furthermore, JDINAC discriminated the tumor and normal sample with high accuracy by virtue of the identified biomarkers. JDINAC provides a general framework for feature selection and classification using high-dimensional sparse omics data. R scripts available at https://github.com/jijiadong/JDINAC. lxie@iscb.org. Supplementary data are available at Bioinformatics online. © The Author (2017). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com